Natural Language Processing

Process text, extract meaning, and build language understanding

What is Natural Language Processing?

Natural Language Processing (NLP) is the bridge between human communication and machine understanding. It enables computers to read, interpret, and generate human language, powering everything from chatbots to translation services.

💡 The Core Challenge

📝
Text is Complex
Ambiguity, context, sarcasm, idioms - human language is incredibly nuanced
🔢
Computers Need Numbers
Machines work with vectors and matrices, not words and sentences
🧠
NLP Bridges the Gap
Transform text into mathematical representations while preserving meaning

Tokenization: From Text to Processable Units

🔤 Why Tokenization Matters

The Fundamental Problem

Computers process discrete units: Neural networks work with vectors and matrices, not raw strings
Text is continuous: "understanding" vs "understand" vs "understood" - same concept, different forms
Tokenization is the bridge: Split text into manageable pieces that can be mapped to numerical representations

⚖️ Three Approaches & Their Trade-offs

1. Word-Level Tokenization
"The cat sat" → ["The", "cat", "sat"]
Pros:
- Intuitive and human-readable
- Preserves semantic meaning (each word is a unit)
- Fast to tokenize (just split on spaces/punctuation)
Cons:
- Massive vocabulary: English has 170,000+ words
- Out-of-vocabulary (OOV) problem: "ChatGPT" wasn't in 2020 dictionaries
- Morphological blindness: "play", "playing", "played" are separate tokens
- Rare words: "antidisestablishmentarianism" gets same treatment as "the"
2. Character-Level Tokenization
"cat" → ["c", "a", "t"]
Pros:
- Tiny vocabulary: ~100 characters for English (a-z, A-Z, punctuation, digits); larger for scripts such as Chinese
- No OOV problem: Can represent any text, including typos and new words
- Multilingual friendly: Works with non-space-separated languages (Chinese)
Cons:
- Long sequences: "understanding" = 13 tokens (vs 1 in word-level)
- Loses semantics: Model must learn "c-a-t" means something
- Computationally expensive: 10× more tokens means at least 10× more compute (more for self-attention, whose cost grows quadratically with sequence length)
3. Subword Tokenization (Modern Standard)
"understanding" → ["under", "stand", "ing"]
🎯 Best of Both Worlds:
Advantages:
- Balanced vocabulary: 30K-50K tokens (vs 170K+ words or 100 chars)
- Handles rare words: Break into known parts → "antibiotic" = ["anti", "bio", "tic"]
- Learns morphology: "play", "playing" share "play" subword
- No OOV: Can decompose any word to characters if needed
- Efficient: Shorter sequences than char-level, broader coverage than word-level
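
To make the first two approaches concrete, here is a minimal sketch using only Python's standard library. The function names `word_tokenize` and `char_tokenize` are illustrative, and the regex is one simple choice among many; real tokenizers handle contractions, URLs, and Unicode far more carefully.

```python
import re

def word_tokenize(text):
    # A word is a run of letters/digits; each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Every character (including spaces) becomes a token.
    return list(text)

word_tokens = word_tokenize("The cat sat")
char_tokens = char_tokenize("understanding")

print(word_tokens)       # ['The', 'cat', 'sat']
print(len(char_tokens))  # 13
```

The same word costs 1 token at word level and 13 at character level, which is the sequence-length trade-off described above.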

🔧 Subword Algorithms

Byte-Pair Encoding (BPE)
Used by: GPT-2, GPT-3, RoBERTa
Algorithm:
1. Start with character vocabulary
2. Find most frequent pair of tokens
3. Merge pair into new token
4. Repeat until vocabulary size reached
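
The four steps above can be sketched in a few lines. This is a toy version under simplifying assumptions: the corpus is a dictionary from symbol tuples to frequencies, the helper names `count_pairs` and `merge_pair` are made up for illustration, and ties are broken by whichever pair was seen first. Production BPE implementations add end-of-word markers, byte-level fallback, and much faster data structures.

```python
from collections import Counter

def count_pairs(corpus):
    # corpus maps a tuple of symbols to how often that word occurs.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}
pairs = count_pairs(corpus)
best = max(pairs, key=pairs.get)      # ("l", "o") and ("o", "w") tie at 7; first seen wins
corpus_after = merge_pair(corpus, best)
print(best, corpus_after)
```

Running `count_pairs` and `merge_pair` in a loop until the vocabulary reaches the target size gives the full training procedure.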
// Example iteration:
"l o w" (freq: 5)
"l o w e r" (freq: 2)
→ merge "l o" to "lo"
"lo w", "lo w e r"
WordPiece
Used by: BERT, DistilBERT
Algorithm:
1. Similar to BPE, but with a different merge criterion
2. Merges the pair that most increases the likelihood of the training data
3. Uses "##" prefix for subwords
4. Greedy longest-match-first at inference
// Example:
"playing"
→ ["play", "##ing"]
// "##" = continuation
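
The greedy longest-match-first step used at inference can be sketched as follows. The vocabulary `toy_vocab` is a hypothetical five-entry set for illustration only (BERT's real vocabulary has 30,522 entries), and the function name is an assumption, not a library API.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"play", "##ing", "##ed", "un", "##play"}  # hypothetical vocabulary
print(wordpiece_tokenize("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_tokenize("played", toy_vocab))   # ['play', '##ed']
```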
SentencePiece (Unigram LM)
Used by: T5, ALBERT, XLNet
• Language-agnostic: Treats text as raw unicode (no pre-tokenization needed)
• Starts with a large candidate vocabulary and iteratively prunes the tokens whose removal increases the loss the least
• Handles spaces as special character "▁" → enables reversibility
• Ideal for multilingual models (100+ languages)

📊 Vocabulary Size Impact

Small (10K): Longer sequences, more generalization
Medium (30-50K): Sweet spot for most models
Large (100K+): Shorter sequences, more memorization

⚡ Modern Practice

• GPT-3: BPE with 50,257 tokens
• BERT: WordPiece, 30,522 tokens
• LLaMA: BPE, 32,000 tokens

🎯 Key Insight

Tokenization is learned from data, not hand-crafted. Train tokenizer on your corpus before training the model!

1. Tokenization: Breaking Text Apart

✂️ Interactive: Tokenize Text

Tokenization splits text into smaller units (tokens). Different methods serve different purposes!

Tokens (6), word tokenization: natural | language | processing | is | amazing | !
Word Tokenization
Splits by spaces/punctuation. Most common. Good for English.
Character Tokenization
Individual characters. Small vocabulary, handles any word.
Subword Tokenization
Balance of both. Used by BERT, GPT. Handles rare words well.

Word Embeddings: Capturing Semantic Meaning

🧠 Distributional Semantics

The Core Hypothesis

"You shall know a word by the company it keeps"

— J.R. Firth (1957)

Distributional hypothesis: Words appearing in similar contexts have similar meanings
Example: "dog" and "cat" both appear with: "pet", "animal", "fur", "tail" → similar vectors
Word embeddings: Dense, low-dimensional representations (typically 100-300d) that capture semantic relationships

📊 From One-Hot to Dense Vectors

One-Hot Encoding (Naive Approach)
// Vocabulary: [cat, dog, king, queen, apple]
cat: [1, 0, 0, 0, 0]
dog: [0, 1, 0, 0, 0]
king: [0, 0, 1, 0, 0]
queen: [0, 0, 0, 1, 0]
apple: [0, 0, 0, 0, 1]
Problems:
- Dimensionality = vocabulary size (170K+ for English!)
- Sparse vectors (99.999% zeros)
- No semantic information: "cat" is exactly as far from "dog" as it is from "king"
- Orthogonal vectors: the dot product of any two different words is always 0
Dense Embeddings (Modern Approach)
// 3D toy vectors for illustration (real embeddings are typically 100-300D)
cat: [0.2, 0.8, 0.3]
dog: [0.3, 0.7, 0.4] ← close to cat!
king: [0.8, 0.3, 0.6]
queen: [0.7, 0.4, 0.5] ← close to king!
apple: [0.1, 0.2, 0.9]
Advantages:
- Low dimensional (300D vs 170K)
- Dense (most values are non-zero; information is spread across all dimensions)
- Semantic similarity: Similar words have similar vectors (high cosine similarity)
- Analogical reasoning: king - man + woman ≈ queen
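
A quick way to see the difference is to compute cosine similarity on both representations. The sketch below reuses the toy vectors from this section; the `cosine` helper is written by hand here rather than taken from a library.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

one_hot = {"cat": [1, 0, 0, 0, 0], "dog": [0, 1, 0, 0, 0], "king": [0, 0, 1, 0, 0]}
dense = {"cat": [0.2, 0.8, 0.3], "dog": [0.3, 0.7, 0.4], "king": [0.8, 0.3, 0.6]}

# One-hot: every pair of different words has similarity 0.
print(cosine(one_hot["cat"], one_hot["dog"]), cosine(one_hot["cat"], one_hot["king"]))

# Dense: "cat" is much closer to "dog" than to "king".
print(round(cosine(dense["cat"], dense["dog"]), 2))   # 0.98
print(round(cosine(dense["cat"], dense["king"]), 2))  # 0.63
```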

🔬 Word2Vec: Learning Embeddings from Context

CBOW (Continuous Bag of Words)
Predict target word from context
Context: "The [___] sat on"
Target: "cat"
Architecture:
1. Input: Context words (one-hot)
2. Embedding layer: W (V×D)
3. Average context vectors
4. Output layer: W' (D×V)
5. Softmax → predict target
✓ Fast training (simpler)
✓ Good for frequent words
✗ Loses word order
Skip-gram
Predict context from target word
Target: "cat"
Context: "The", "sat", "on", "mat"
Architecture:
1. Input: Target word (one-hot)
2. Embedding layer: W (V×D)
3. Output layer: W' (D×V)
4. Softmax → predict each context word
5. Maximize P(context | target)
✓ Better for rare words
✓ More training data per word
✗ Slower (predicts C words)
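
The two objectives differ mainly in how training examples are cut from a sentence. The sketch below shows only that data-preparation step, assuming a symmetric context window of 2; the function names are illustrative, and the network and training loop are omitted.

```python
def skipgram_pairs(tokens, window=2):
    # One (target, context) pair per context word: predict context from target.
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    # One (context_words, target) example per position: predict target from context.
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = ["the", "cat", "sat", "on", "the", "mat"]
cat_pairs = [p for p in skipgram_pairs(tokens) if p[0] == "cat"]
print(cat_pairs)                 # [('cat', 'the'), ('cat', 'sat'), ('cat', 'on')]
print(cbow_examples(tokens)[1])  # (['the', 'sat', 'on'], 'cat')
```

Skip-gram yields several training pairs per position while CBOW yields one, which is why skip-gram sees more updates per word but trains more slowly.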
Training Tricks: Negative Sampling
Problem: Softmax over 170K vocab is computationally expensive
Solution: Instead of updating all words, sample k negative examples (words that shouldn't appear)
Positive: ("cat", "sat") → maximize P(sat | cat)
Negatives: ("cat", "airplane"), ("cat", "democracy") → minimize
• Typical k=5-20 for small datasets, k=2-5 for large
• Makes training much faster: each update touches k+1 output vectors instead of all V
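
As a rough sketch, the negative-sampling loss for one positive pair is -log σ(u_pos · v) minus the sum of log σ(-u_neg · v) over the sampled negatives. The 2D vectors below are made-up values chosen for illustration, not trained embeddings, and the sampling of negatives itself is left out.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(center, positive, negatives):
    # Pull the true context word toward the center word...
    loss = -math.log(sigmoid(dot(center, positive)))
    # ...and push each sampled negative word away from it.
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(center, neg)))
    return loss

# Hypothetical 2D vectors for illustration only.
cat, sat = [1.0, 0.5], [0.9, 0.4]
airplane, democracy = [-0.8, 0.1], [-0.5, -0.6]

good_loss = negative_sampling_loss(cat, sat, [airplane, democracy])
bad_loss = negative_sampling_loss(cat, airplane, [sat, democracy])
print(good_loss < bad_loss)  # True
```

Only the k negatives and the one positive appear in the loss, so no sum over the full vocabulary is needed.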

🌐 GloVe: Global Vectors for Word Representation

Key idea: Word2Vec captures local context, GloVe captures global co-occurrence statistics
Objective: w_iᵀ w̃_j + b_i + b̃_j ≈ log(X_ij)
// X_ij = how often words i and j co-occur; w̃_j and b̃_j are the context-word vector and bias
Process:
1. Build co-occurrence matrix X from entire corpus (global statistics)
2. Factorize matrix to learn word vectors that reconstruct X
3. Weighted least squares: frequent pairs get more weight, up to a cap, so very common pairs do not dominate
Advantage: Combines global statistics (like LSA) with local context (like Word2Vec)
Used by: Many pre-2018 NLP systems before contextual embeddings
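
The process above can be sketched in three small pieces: counting co-occurrences, the weighting function, and the per-pair loss. This is a simplified illustration: the window counts every neighbor equally (the original GloVe down-weights distant words), and the defaults `x_max=100` and `alpha=0.75` are the values reported in the GloVe paper. Function names are assumptions for this sketch.

```python
import math
from collections import Counter

def cooccurrence(tokens, window=2):
    # Count how often each ordered pair of words appears within the window.
    counts = Counter()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

def glove_weight(x, x_max=100.0, alpha=0.75):
    # Grows with the count, then is capped at 1 so frequent pairs do not dominate.
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    # Weighted squared error between the model's score and log(X_ij).
    prediction = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(x_ij) * (prediction - math.log(x_ij)) ** 2

counts = cooccurrence("the cat sat on the mat".split())
print(counts[("cat", "sat")], counts[("sat", "cat")])  # 1 1
```

Training sums `glove_pair_loss` over all non-zero entries of the matrix and minimizes it with gradient descent.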

⚡ Semantic Relationships & Vector Arithmetic

Famous Example: Analogical Reasoning
vec("king") - vec("man") + vec("woman") ≈ vec("queen")
vec("Paris") - vec("France") + vec("Italy") ≈ vec("Rome")
vec("walked") - vec("walk") + vec("swim") ≈ vec("swam")
Why this works: Embeddings capture semantic dimensions
- Gender axis: king↔queen, man↔woman, boy↔girl
- Capital axis: France↔Paris, Italy↔Rome, Japan↔Tokyo
- Tense axis: walk↔walked, swim↔swam, eat↔ate
Similarity metric: Cosine similarity
sim(u, v) = (u · v) / (||u|| ||v||) ∈ [-1, 1]
// 1 = same direction, 0 = orthogonal, -1 = opposite direction

📦 Static vs Contextual

Static (Word2Vec, GloVe):
"bank" always same vector (river vs financial?)
Contextual (BERT, ELMo):
"bank" vector changes based on sentence context!

🎯 Pre-trained Embeddings

• Word2Vec: Google News (100B words)
• GloVe: Wikipedia + Gigaword (6B tokens)
• FastText: Common Crawl (600B tokens)
→ Transfer learning for NLP!

💡 Key Insight

Embeddings are the "CNN for NLP" - they revolutionized the field by providing meaningful representations. Today: contextual embeddings dominate!

2. Word Embeddings: Meaning as Vectors

🧮 Interactive: Explore Word Vectors

Words with similar meanings have similar vectors. "King" and "Queen" are close in vector space!

Word: "king"
Vector representation (simplified): [0.8, 0.3, 0.6]
Similar words: queen (100%), prince (85%), monarch (70%)

🎯 Key Insight: Word2Vec, GloVe, and FastText learn these embeddings from massive text corpora. Similar contexts → similar vectors!

3. Sentiment Analysis: Understanding Emotion

😊 Interactive: Analyze Sentiment

🔍 How it Works: Sentiment analysis classifies the emotional tone of text. Approaches can be rule-based (sentiment lexicons), classical ML (Naive Bayes, SVM), or deep learning (BERT, RoBERTa).

4. Named Entity Recognition (NER)

🏷️ Interactive: Extract Entities

NER identifies and classifies named entities: people, organizations, locations, dates, etc.

Apple Inc. is located in Cupertino, California and was founded by Steve Jobs in 1976.

5. Text Classification

🗂️ Interactive: Classify Text

Automatically categorize text into predefined classes: spam/ham, topic, sentiment, intent, etc.

💡 Applications: Spam filtering, topic labeling, intent detection, priority routing, content moderation, language identification.

6. TF-IDF: Term Importance

📊 Interactive: Calculate TF-IDF

TF-IDF measures how important a word is to a document. High TF-IDF = distinctive term!

TF-IDF Score: 0.375 for term "machine"
Formula: TF-IDF = TF × IDF
TF (Term Frequency): How often the term appears in the document
IDF (Inverse Document Frequency): How rare the term is across all documents

🎯 Use Case: TF-IDF helps search engines rank documents. Terms with high TF-IDF are most relevant to that specific document!
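
Here is a minimal sketch of the formula. It assumes one common variant (TF as count divided by document length, IDF as the natural log of N over document frequency); libraries often add smoothing or normalization, so exact scores differ between implementations. The three-document corpus is made up for illustration and is not the one behind the 0.375 score shown above.

```python
import math

def tf(term, doc_tokens):
    # Share of the document's tokens that are this term.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # Terms found in fewer documents get a larger weight.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    "machine learning is fun".split(),
    "deep learning is powerful".split(),
    "cooking is fun".split(),
]
doc = corpus[0]
for term in ["machine", "learning", "is"]:
    print(term, round(tf_idf(term, doc, corpus), 3))
```

"is" appears in every document, so its IDF is log(1) = 0 and it scores zero, while "machine" appears only in the first document and scores highest.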

Attention Mechanism: Learning What Matters

🎯 Why Attention Revolutionized NLP

The RNN Bottleneck Problem

Sequential processing: RNNs process one word at a time → can't parallelize
Long-term dependency problem: Information from word 1 must pass through ~100 sequential steps to reach word 100, degrading along the way
Fixed-size bottleneck: Entire sentence compressed into single hidden state vector
"The cat that ate the mouse that lived in the house is sleeping"
→ What is sleeping? Information gets lost!
Attention Solution: Let model directly access ALL previous words, not just hidden state!

🧮 Self-Attention Mathematics

Core Concept: Query, Key, Value (QKV)
Think of it like a database lookup:
Query (Q): "What am I looking for?" (current word's question)
Key (K): "What do I contain?" (each word's description)
Value (V): "What information do I have?" (actual content)
Example: "The cat sat on the mat"
Query from "sat": "Who performed this action?"
Keys from all words: "cat" has high match!
Value from "cat": Return cat's information
Step-by-Step Computation
Step 1: Create Q, K, V matrices
Q = X W_Q // Transform input X with learned weights
K = X W_K // Each word gets 3 representations
V = X W_V // Learned during training
X: (seq_len, d_model) → Q, K, V: (seq_len, d_k)
Step 2: Compute attention scores
scores = (Q × Kᵀ) / √d_k
• Q × Kᵀ: dot product = similarity between each query and all keys
• √d_k: scaling factor (keeps large dot products from saturating the softmax, which would shrink gradients)
• Result: (seq_len, seq_len) matrix of attention scores
Step 3: Apply softmax
attention_weights = softmax(scores)
• Normalize each row of scores to values in [0, 1] that sum to 1
• High scores → high attention
• Each word now has a probability distribution over all words in the sequence (including itself)
Step 4: Weight the values
output = attention_weights × V
• Weighted sum of values
• Important words contribute more
• Result: context-aware representation
Complete Formula
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
• QKᵀ: Similarity scores
• √d_k: Scaling factor
• softmax(·) V: Weighted sum of values
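
Steps 2 to 4 can be written out in plain Python. This sketch assumes Q, K, and V have already been computed (the learned projections from Step 1 are skipped), uses nested lists instead of a tensor library, and the small matrices are arbitrary example values.

```python
import math

def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    # Step 2: similarity of every query with every key, scaled by sqrt(d_k).
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    # Step 3: each row becomes a probability distribution.
    weights = [softmax(row) for row in scores]
    # Step 4: weighted sum of the value vectors.
    return matmul(weights, v), weights

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = scaled_dot_product_attention(q, k, v)
print([round(sum(row), 6) for row in weights])  # [1.0, 1.0, 1.0]
```

The `weights` matrix is (seq_len, seq_len) and `output` is (seq_len, d_v), matching the shapes in the steps above.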

🎭 Multi-Head Attention

Problem: Single attention head captures one type of relationship
Solution: Run attention multiple times in parallel with different learned projections
MultiHead(Q, K, V) = Concat(head_1, head_2, ..., head_h) W_O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Why Multiple Heads?
Head 1: Might learn syntactic relationships (subject-verb)
Head 2: Might learn semantic relationships (co-reference)
Head 3: Might learn positional patterns (beginning-end)
Head 4: Might learn domain-specific patterns
BERT-base uses 12 heads per layer; the largest GPT-3 model uses 96 per layer. Each head can capture a different aspect.
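
The multi-head formula can be sketched for the self-attention case, where Q, K, and V all come from the same input X. Everything here is illustrative: the weights are random rather than learned, the sizes (4 tokens, d_model = 8, 2 heads) are tiny, and nested lists stand in for tensors.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul([softmax(row) for row in scores], v)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(x, heads, w_o):
    # heads holds one (W_q, W_k, W_v) projection triple per head.
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the heads position by position, then mix them with W_O.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)

rng = random.Random(0)
seq_len, d_model, num_heads = 4, 8, 2
d_k = d_model // num_heads
x = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(num_heads)]
w_o = random_matrix(num_heads * d_k, d_model, rng)
output = multi_head_attention(x, heads, w_o)
print(len(output), len(output[0]))  # 4 8
```

Each head works in a smaller d_k-dimensional space, so the total cost stays close to that of one full-width head.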

⚡ Why Attention Beats RNNs

RNNs / LSTMs
Sequential: Must process word-by-word
Slow: Can't parallelize training
Memory decay: Forgets long-term context
Fixed context: Hidden state size limits info
Gradient issues: Vanishing/exploding with long sequences
Transformers (Attention)
Parallel: All positions processed simultaneously
Fast: Matrix operations → GPU accelerated
Full-context memory: Direct access to every word in the context window
Dynamic context: Attention weights adapt per input
Stable gradients: Direct connections, no chains
Computational Complexity:
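RNN: O(n) sequential operations (can't parallelize across positions)
Attention: O(1) sequential ops per layer, O(n²) total work that runs in parallel
→ On modern GPUs, parallel O(n²) work usually beats sequential O(n) steps for typical sequence lengths, though the quadratic cost becomes the bottleneck for very long inputs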

🔍 Attention Variants

Self-attention: Attend to same sequence
Cross-attention: Attend to different sequence
Masked attention: Only attend to past (GPT)
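
Masked (causal) attention is usually implemented by setting the scores of future positions to negative infinity before the softmax, so they receive zero weight. A small sketch, assuming a square score matrix where row i holds the scores of position i against every position; the function name is illustrative.

```python
import math

def causal_attention_weights(scores):
    weights = []
    for i, row in enumerate(scores):
        # Position i may look at positions 0..i; later ones are masked out.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 0.0, 0.0] for _ in range(3)]
causal = causal_attention_weights(scores)
for row in causal:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.33, 0.33, 0.33]
```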

🎯 Real-World Scale

• BERT-base: 12 layers × 12 heads = 144 attention heads
• GPT-3 (175B): 96 layers × 96 heads = 9,216 attention heads
→ Learns incredibly complex patterns!

💡 Key Insight

"Attention is All You Need" (2017) - This single paper revolutionized NLP. Transformers now power GPT, BERT, ChatGPT, Claude!

7. Attention Mechanism

👁️ Interactive: Visualize Attention

Attention helps models focus on relevant words when processing text. Click a word to see what it attends to!

Attention Weights

Word "sat" pays attention to: The (60%), cat (80%), sat (100%), on (80%), the (60%), mat (40%)
(relative strengths as displayed; they are not normalized to sum to 100%)

🧠 Transformers: Models like BERT and GPT use multi-head attention to capture different types of relationships simultaneously!

Language Models: Predicting the Next Word

📚 What is a Language Model?

Core Definition

A language model assigns probabilities to sequences of words

P("The cat sat on the mat") = ?
Goal: Learn P(w_t | w_1, ..., w_{t-1}) - predict next word given context
Applications: Text generation, autocomplete, translation, speech recognition, code completion
Quality metric: Good LM assigns high probability to actual sentences, low to gibberish

📊 Evolution: From N-grams to Neural Networks

N-gram Models (Classical, pre-2010)
Idea: Predict next word based on previous n-1 words
Bigram (n=2): P(cat | the) = count("the cat") / count("the")
Trigram (n=3): P(sat | the cat) = count("the cat sat") / count("the cat")
✓ Simple, fast, interpretable
✗ Limited context (typically n≤5), sparse counts, no generalization
Sparsity problem: Most n-grams never seen in training → zero probability!
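
The bigram estimate and the sparsity problem both show up in a few lines of counting. A minimal sketch on a six-word toy corpus, with no smoothing; the function names are illustrative.

```python
from collections import Counter

def train_bigram(tokens):
    # Count each word as a context (everything except the final token) and each adjacent pair.
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts

def bigram_prob(prev, word, context_counts, bigram_counts):
    # P(word | prev) = count(prev word) / count(prev); 0 if prev was never seen.
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

tokens = "the cat sat on the mat".split()
context_counts, bigram_counts = train_bigram(tokens)

print(bigram_prob("the", "cat", context_counts, bigram_counts))  # 0.5
print(bigram_prob("the", "dog", context_counts, bigram_counts))  # 0.0
```

"the dog" never occurs in the training text, so it gets probability zero even though it is perfectly good English. Smoothing and back-off methods exist to patch this.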
Neural Language Models (2013-2017)
Architecture: Embeddings → RNN/LSTM → Softmax
h_t = LSTM(h_{t-1}, embed(w_{t-1}))
P(w_t | context) = softmax(W h_t + b)
✓ Dense representations, handles unseen n-grams, longer context
✗ Sequential processing (slow), still limited context (~100 tokens)
Transformer LMs (2018-present)
Breakthrough: GPT (Generative Pre-trained Transformer)
Architecture: Tokens → Embeddings → Multi-layer Transformer → Next token prediction
✓ Parallel training, very long context (GPT-4 Turbo: 128K tokens!), pre-training + fine-tuning
Scale: GPT-3 has 175B parameters, trained on 300B tokens

🎲 Sampling Strategies: Controlling Generation

1. Temperature Scaling
P'(w_i) = exp(logit_i / T) / Σ_j exp(logit_j / T)
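
The formula above can be sketched directly. The logits below are made-up numbers for three candidate words, so the resulting probabilities are illustrative only.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    scaled = [logit / temperature for logit in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate words
p_low = softmax_with_temperature(logits, 0.1)
p_default = softmax_with_temperature(logits, 1.0)
p_high = softmax_with_temperature(logits, 2.0)

for name, probs in [("T=0.1", p_low), ("T=1.0", p_default), ("T=2.0", p_high)]:
    print(name, [round(p, 3) for p in probs])
```

Dividing by a small T stretches the gaps between logits, so the top word takes almost all the probability; a large T shrinks the gaps and flattens the distribution.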
T = 0.1 (Low)
Focused, deterministic
"The" → 0.95, "A" → 0.04
T = 1.0 (Default)
Balanced sampling
"The" → 0.60, "A" → 0.25
T = 2.0 (High)
Creative, random
"The" → 0.35, "A" → 0.30
2. Top-k Sampling
Idea: Only sample from top k most likely words
Probs: [sunny: 0.4, cloudy: 0.25, rainy: 0.2, windy: 0.1, foggy: 0.05]
k=3 → Renormalize: [sunny: 0.47, cloudy: 0.29, rainy: 0.24]
✓ Avoids sampling very unlikely words
Common choices fall around k=40-100; smaller k gives safer output, larger k gives more variety
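
Top-k filtering is short enough to write out in full. This sketch reuses the weather distribution from the example above; `top_k_filter` is an illustrative name, and the final draw uses a seeded `random.Random` only so that runs are repeatable.

```python
import random

def top_k_filter(probs, k):
    # Keep the k most likely words and renormalize so they sum to 1.
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}

probs = {"sunny": 0.4, "cloudy": 0.25, "rainy": 0.2, "windy": 0.1, "foggy": 0.05}
filtered = top_k_filter(probs, 3)
print({word: round(p, 2) for word, p in filtered.items()})
# {'sunny': 0.47, 'cloudy': 0.29, 'rainy': 0.24}

rng = random.Random(0)
sampled = rng.choices(list(filtered), weights=list(filtered.values()))[0]
print(sampled)
```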
3. Nucleus (Top-p) Sampling
Idea: Sample from smallest set with cumulative probability ≥ p
p=0.9 → Keep the most likely words until their cumulative probability reaches 90%, then sample among them
Sometimes 2 words, sometimes 50 words (adapts to confidence!)
✓ Dynamic cutoff, better than fixed k
OpenAI API defaults: top_p=1.0, temperature=1.0
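
The adaptive cutoff is easiest to see by running the same filter on a flat and a peaked distribution. Both distributions below are invented for illustration, and `top_p_filter` is an illustrative name rather than a library function.

```python
def top_p_filter(probs, p):
    # Walk down the ranked words until the cumulative probability reaches p.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

uncertain = {"sunny": 0.4, "cloudy": 0.25, "rainy": 0.2, "windy": 0.1, "foggy": 0.05}
confident = {"the": 0.92, "a": 0.05, "an": 0.03}

nucleus_uncertain = top_p_filter(uncertain, 0.9)
nucleus_confident = top_p_filter(confident, 0.9)
print(len(nucleus_uncertain), len(nucleus_confident))  # 4 1
```

When the model is unsure, four words survive; when it is confident, only one does. A fixed k cannot adapt like this.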
4. Beam Search (Deterministic)
Idea: Keep top k sequences at each step, expand all
Use case: Machine translation (want best translation, not diverse)
✓ Finds high-probability sequences
✗ Can be repetitive, less creative
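
A toy beam search shows why keeping several candidates can beat the greedy choice. The "model" here is a hypothetical lookup table `next_probs` mapping the last token to next-token probabilities; a real language model would condition on the whole sequence.

```python
import math

def beam_search(next_probs, start, steps, beam_width):
    beams = [([start], 0.0)]  # (sequence, total log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            options = next_probs.get(seq[-1], {})
            if not options:
                candidates.append((seq, score))  # nothing to expand: carry over
                continue
            for token, p in options.items():
                candidates.append((seq + [token], score + math.log(p)))
        # Keep only the best `beam_width` sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

# Hypothetical next-token table for illustration.
next_probs = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.55, "y": 0.45},
    "b": {"z": 0.9, "w": 0.1},
}

greedy = beam_search(next_probs, "<s>", steps=2, beam_width=1)
beam = beam_search(next_probs, "<s>", steps=2, beam_width=2)
print(greedy[0][0])  # ['<s>', 'a', 'x']
print(beam[0][0])    # ['<s>', 'b', 'z']
```

Greedy decoding commits to "a" (0.6) and ends with probability 0.33, while a beam of width 2 keeps "b" alive and finds the sequence with probability 0.36.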

📈 Evaluation Metric: Perplexity

Perplexity = exp(-(1/N) Σ_i log P(w_i | context))
Intuition: How "surprised" is the model on average?
Lower = Better: Perplexity of 10 means model is as confused as choosing from 10 equally likely words
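Benchmarks (approximate; perplexity depends on the dataset and tokenizer, so compare only on the same benchmark):
- Uniform random baseline: Perplexity = vocabulary size (e.g., 50,000)
- Good trigram model: ~100-200
- Modern LSTM: ~40-60
- GPT-3 scale: ~20-30 (excellent!)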
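
The perplexity formula takes only the probabilities a model assigned to the tokens that actually occurred. A short sketch with made-up probabilities:

```python
import math

def perplexity(token_probs):
    # exp of the average negative log-probability of the observed tokens.
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

uniform_10 = perplexity([0.1] * 5)             # always 1-in-10 odds
uniform_vocab = perplexity([1 / 50000] * 3)    # uniform over a 50,000-word vocabulary
mixed = perplexity([0.5, 0.25, 0.9, 0.6])      # hypothetical model outputs

print(round(uniform_10, 6))     # 10.0
print(round(uniform_vocab, 3))  # 50000.0
print(mixed < uniform_10)       # True
```

This matches the intuition above: a model that is always choosing among 10 equally likely words has perplexity 10, and a uniform guess over the vocabulary has perplexity equal to the vocabulary size.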

🚀 Modern LLMs: The GPT Architecture

Two-Stage Training
Stage 1: Pre-training (Unsupervised)
• Train on massive unlabeled text (web crawl, books, code)
• Objective: Predict next token (autoregressive)
• GPT-3: 300B tokens, ~$4.6M compute cost (third-party estimate)
• Result: General language understanding
Stage 2: Fine-tuning (Supervised)
• Adapt to specific tasks (Q&A, summarization, chat)
• ChatGPT: RLHF (Reinforcement Learning from Human Feedback)
• Small dataset, quick training
• Result: Task-specific expert
Key Innovations
Causal (Masked) Attention
Can only attend to previous tokens → autoregressive generation
Enormous Scale
GPT-3: 175B params, GPT-4: ~1.7T params (rumored)
In-Context Learning
Few-shot prompting without parameter updates!
Emergent Abilities
Reasoning, math, code at sufficient scale

🎯 Autoregressive LMs

GPT series: Left-to-right generation
Use: Text generation, chat, completion
Fast inference, creative outputs

🔄 Masked LMs

BERT series: Bidirectional context
Use: Classification, NER, Q&A
Better understanding, no generation

💡 Key Insight

LLMs are often described as a "compressed internet": they learn statistical patterns from text rather than storing a reliable database of facts. Temperature controls the exploration vs. exploitation trade-off!

8. Language Model Prediction

🤖 Interactive: Generate Text

Language models predict the next word based on context. Temperature controls randomness!

Top Predictions: sunny (40.0%), cloudy (25.0%), rainy (20.0%), perfect (15.0%)
Low Temperature (0.0-0.3)
Focused, predictable, picks most likely word. Good for factual tasks.
High Temperature (0.7-1.0)
Creative, random, explores less likely words. Good for creative writing.

9. Build Your NLP Pipeline

🔧 Interactive: Design Processing Pipeline

Real NLP systems chain multiple steps. Build your pipeline by adding processing stages!

💡 Pipeline Flow: Text → Tokenization → Cleaning → Feature Extraction → Model → Output. Order matters! Preprocessing comes before modeling.

🎯 Key Takeaways

✂️

Tokenization is Foundation

Breaking text into tokens is the first critical step. Word, character, or subword - each method has trade-offs. Modern models use subword tokenization (BPE, WordPiece).

🧮

Words Become Vectors

Word embeddings (Word2Vec, GloVe, FastText) convert text to numbers while preserving semantic relationships. Similar words have similar vectors in the embedding space.

🎯

Multiple NLP Tasks

Sentiment analysis, NER, classification, translation, summarization - NLP covers diverse tasks. Each requires different architectures but shares preprocessing steps.

👁️

Attention is Key

Attention mechanism revolutionized NLP. Transformers (BERT, GPT) use self-attention to capture context and relationships between all words simultaneously, not sequentially.

🔧

Pipelines are Powerful

Real NLP systems chain preprocessing, feature extraction, and models into pipelines. spaCy, NLTK, and Hugging Face Transformers provide ready-to-use components.

🚀

Modern Era: LLMs

Large Language Models (GPT-4, Claude, LLaMA) are pre-trained on massive text corpora. They excel at few-shot learning, understanding context, and generating human-like text across domains.