Natural Language Processing

Process text, extract meaning, and build language understanding

What is Natural Language Processing?

Natural Language Processing (NLP) is the bridge between human communication and machine understanding. It enables computers to read, interpret, and generate human language, powering everything from chatbots to translation services.

💡 The Core Challenge

📝

Text is Complex

Ambiguity, context, sarcasm, idioms - human language is incredibly nuanced

🔢

Computers Need Numbers

Machines work with vectors and matrices, not words and sentences

🧠

NLP Bridges the Gap

Transform text into mathematical representations while preserving meaning

Tokenization: From Text to Processable Units

🔤 Why Tokenization Matters

The Fundamental Problem

• Computers process discrete units: Neural networks work with vectors and matrices, not raw strings

• Words vary in surface form: "understanding" vs "understand" vs "understood" - same concept, different forms

• Tokenization is the bridge: Split text into manageable pieces that can be mapped to numerical representations

⚖️ Three Approaches & Their Trade-offs

1. Word-Level Tokenization

"The cat sat" → ["The", "cat", "sat"]

✓ Pros:

- Intuitive and human-readable

- Preserves semantic meaning (each word is a unit)

- Fast to tokenize (just split on spaces/punctuation)

✗ Cons:

- Massive vocabulary: English has 170,000+ words

- Out-of-vocabulary (OOV) problem: "ChatGPT" wasn't in 2020 dictionaries

- Morphological blindness: "play", "playing", "played" are separate tokens

- Rare words: "antidisestablishmentarianism" takes a vocabulary slot just like "the", but appears too rarely for the model to learn it well

2. Character-Level Tokenization

"cat" → ["c", "a", "t"]

✓ Pros:

- Tiny vocabulary: ~100 characters for English (a-z, A-Z, punctuation, digits)

- No OOV problem: Can represent any text, including typos and new words

- Multilingual friendly: Works with non-space-separated languages (Chinese)

✗ Cons:

- Long sequences: "understanding" = 13 tokens (vs 1 in word-level)

- Loses semantics: Model must learn "c-a-t" means something

- Computationally expensive: several times more tokens for the same text, and self-attention cost grows quadratically with sequence length
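The word-level and character-level approaches above can be sketched in a few lines. This is a minimal illustration, assuming a simple regular-expression split for words; the function names `word_tokenize` and `char_tokenize` are made up for this example, and real tokenizers handle many more edge cases (contractions, URLs, Unicode).

```python
import re

def word_tokenize(text):
    """Split text into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    """Split text into individual characters (spaces included)."""
    return list(text)

print(word_tokenize("The cat sat"))         # ['The', 'cat', 'sat']
print(char_tokenize("cat"))                 # ['c', 'a', 't']
print(len(char_tokenize("understanding")))  # 13
```

The same word costs 1 token at word level and 13 at character level, which is the sequence-length trade-off described above.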

3. Subword Tokenization (Modern Standard)

"understanding" → ["under", "stand", "ing"]

🎯 Best of Both Worlds:

✓ Advantages:

- Balanced vocabulary: 30K-50K tokens (vs 170K+ words or 100 chars)

- Handles rare words: Break into known parts → "antibiotic" = ["anti", "bio", "tic"]

- Learns morphology: "play", "playing" share "play" subword

- No OOV: Can decompose any word to characters if needed

- Efficient: Shorter sequences than char-level, broader coverage than word-level

🔧 Subword Algorithms

Byte-Pair Encoding (BPE)

Used by: GPT-2, GPT-3, RoBERTa

Algorithm:

1. Start with character vocabulary

2. Find most frequent pair of tokens

3. Merge pair into new token

4. Repeat until vocabulary size reached
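The four steps can be sketched as a small training loop. This is a simplified illustration, not the implementation used by any particular model: it works on characters rather than bytes, omits end-of-word markers, and the helper names (`count_pairs`, `merge_pair`, `train_bpe`) are made up for this example. It uses the toy words "low" (freq 5) and "lower" (freq 2).

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word-frequency table."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # on ties, the first pair seen wins
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return merges, corpus

merges, corpus = train_bpe({"low": 5, "lower": 2}, num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(corpus)  # {('low',): 5, ('low', 'e', 'r'): 2}
```

The ordered list of merges is the learned tokenizer: new text is tokenized by applying the same merges in the same order.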

// Example iteration:

"l o w" (freq: 5)

"l o w e r" (freq: 2)

→ merge "l o" to "lo"

"lo w", "lo w e r"

WordPiece

Used by: BERT, DistilBERT

Algorithm:

1. Similar to BPE but different metric

2. Maximizes likelihood of training data

3. Uses "##" prefix for subwords

4. Greedy longest-match-first at inference

// Example:

"playing"

→ ["play", "##ing"]

// "##" = continuation
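Greedy longest-match-first inference can be sketched as follows. The vocabulary here is hypothetical and far smaller than a real one, and `wordpiece_tokenize` is an illustrative name rather than a library function; the sketch covers only the inference step, not how the vocabulary is learned.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word by repeatedly taking the longest matching vocabulary entry."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark continuation pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate from the right
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only
vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}
print(wordpiece_tokenize("playing", vocab))     # ['play', '##ing']
print(wordpiece_tokenize("unplayable", vocab))  # ['un', '##play', '##able']
```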

SentencePiece (Unigram LM)

Used by: T5, ALBERT, XLNet

• Language-agnostic: Treats text as raw unicode (no pre-tokenization needed)

• Starts with a large vocabulary, then repeatedly removes the tokens whose removal increases the loss least

• Handles spaces as special character "▁" → enables reversibility

• Ideal for multilingual models (100+ languages)

📊 Vocabulary Size Impact

Small (10K): Longer sequences, more generalization

Medium (30-50K): Sweet spot for most models

Large (100K+): Shorter sequences, more memorization

⚡ Modern Practice

• GPT-3: BPE with 50,257 tokens

• BERT: WordPiece, 30,522 tokens

• LLaMA: BPE, 32,000 tokens

🎯 Key Insight

Tokenization is learned from data, not hand-crafted. Train tokenizer on your corpus before training the model!

1. Tokenization: Breaking Text Apart

✂️ Interactive: Tokenize Text

Tokenization splits text into smaller units (tokens). Different methods serve different purposes!

Word Tokenization

Splits by spaces/punctuation. Most common. Good for English.

Character Tokenization

Individual characters. Small vocabulary, handles any word, but produces long sequences.

Subword Tokenization

Balance of both. Used by BERT, GPT. Handles rare words well.

Word Embeddings: Capturing Semantic Meaning

🧠 Distributional Semantics

The Core Hypothesis

"You shall know a word by the company it keeps"

— J.R. Firth (1957)

• Distributional hypothesis: Words appearing in similar contexts have similar meanings

• Example: "dog" and "cat" both appear with: "pet", "animal", "fur", "tail" → similar vectors

• Word embeddings: Dense, low-dimensional representations (typically 100-300d) that capture semantic relationships

📊 From One-Hot to Dense Vectors

One-Hot Encoding (Naive Approach)

// Vocabulary: [cat, dog, king, queen, apple]

cat: [1, 0, 0, 0, 0]

dog: [0, 1, 0, 0, 0]

king: [0, 0, 1, 0, 0]

queen: [0, 0, 0, 1, 0]

apple: [0, 0, 0, 0, 1]

✗ Problems:

- Dimensionality = vocabulary size (170K+ for English!)

- Sparse vectors (99.999% zeros)

- No semantic information: "cat" and "dog" are equally distant from each other as "cat" and "king"

- Orthogonal vectors: dot product between any two different words is always 0

Dense Embeddings (Modern Approach)

// 3D vectors (typically 100-300D)

cat: [0.2, 0.8, 0.3]

dog: [0.3, 0.7, 0.4] ← close to cat!

king: [0.8, 0.3, 0.6]

queen: [0.7, 0.4, 0.5] ← close to king!

apple: [0.1, 0.2, 0.9]

✓ Advantages:

- Low dimensional (300D vs 170K)

- Dense (mostly non-zero values, with meaning spread across all dimensions)

- Semantic similarity: Similar words have similar vectors (high cosine similarity)

- Analogical reasoning: king - man + woman ≈ queen

🔬 Word2Vec: Learning Embeddings from Context

CBOW (Continuous Bag of Words)

Predict target word from context

Context: "The [___] sat on"
Target: "cat"

Architecture:

1. Input: Context words (one-hot)

2. Embedding layer: W (V×D)

3. Average context vectors

4. Output layer: W' (D×V)

5. Softmax → predict target

✓ Fast training (simpler)

✓ Good for frequent words

✗ Loses word order

Skip-gram

Predict context from target word

Target: "cat"
Context: "The", "sat", "on", "mat"

Architecture:

1. Input: Target word (one-hot)

2. Embedding layer: W (V×D)

3. Output layer: W' (D×V)

4. Softmax → predict each context word

5. Maximize P(context | target)

✓ Better for rare words

✓ More training data per word

✗ Slower (predicts C words)
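To make the skip-gram setup concrete, here is a minimal sketch of how (target, context) training pairs could be generated from a sentence. The function name `skipgram_pairs` and the window size of 2 are assumptions for this example; Word2Vec implementations typically also subsample frequent words and vary the window size.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbors."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ["The", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(tokens, window=2)
print([context for target, context in pairs if target == "cat"])
# ['The', 'sat', 'on']
```

Each target word yields several training pairs, which is why skip-gram gets more training signal per word than CBOW, at the cost of more predictions.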

Training Tricks: Negative Sampling

• Problem: Softmax over 170K vocab is computationally expensive

• Solution: Instead of updating all words, sample k negative examples (words that shouldn't appear)

Positive: ("cat", "sat") → maximize P(sat | cat)
Negatives: ("cat", "airplane"), ("cat", "democracy") → minimize

• Typical k=5-20 for small datasets, k=2-5 for large

• Each update touches only k+1 output vectors instead of the whole vocabulary, which makes training far faster
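The negative-sampling objective for a single positive pair can be sketched as below. The vectors are hypothetical 2-D toy values chosen for illustration; in practice they are learned embeddings with 100-300 dimensions, and negatives are drawn from a smoothed unigram distribution.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Loss for one positive pair plus k sampled negatives (lower is better)."""
    # Pull the true context vector toward the target ...
    loss = -math.log(sigmoid(dot(target_vec, context_vec)))
    # ... and push each sampled negative away from it.
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(target_vec, neg)))
    return loss

# Hypothetical toy vectors
cat = [1.0, 0.0]
sat = [2.0, 0.0]        # true context word: points the same way as "cat"
airplane = [-2.0, 0.0]  # sampled negative: points the opposite way

good = negative_sampling_loss(cat, sat, [airplane])
bad = negative_sampling_loss(cat, airplane, [sat])
print(good < bad)  # True
```

Only k+1 dot products are computed per training pair, regardless of vocabulary size.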

🌐 GloVe: Global Vectors for Word Representation

• Key idea: Word2Vec captures local context, GloVe captures global co-occurrence statistics

Objective: w_iᵀ w_j + b_i + b_j = log(X_ij)
// X_ij = how often word i and j co-occur

• Process:

1. Build co-occurrence matrix X from entire corpus (global statistics)

2. Factorize matrix to learn word vectors that reconstruct X

3. Weighted least squares: a weighting function f(X_ij) down-weights rare pairs and caps the influence of very frequent ones

• Advantage: Combines global statistics (like LSA) with local context (like Word2Vec)

• Used by: Many pre-2018 NLP systems before contextual embeddings
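The pieces of the GloVe process can be sketched as follows: counting co-occurrences, the weighting function, and the weighted squared error for one word pair. The values x_max = 100 and α = 0.75 follow the GloVe paper; the function names are made up here, and the plain (not distance-weighted) window count is a simplifying assumption.

```python
import math
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count how often each ordered word pair appears within the window."""
    counts = Counter()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare pairs; cap the weight of very frequent pairs at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between the model's score and log(X_ij)."""
    prediction = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(x_ij) * (prediction - math.log(x_ij)) ** 2

counts = cooccurrence_counts("the cat sat on the mat".split(), window=2)
print(counts[("cat", "sat")])                 # 1
print(glove_weight(1) < glove_weight(10))     # True
print(glove_weight(1000))                     # 1.0
```

Training sums `glove_pair_loss` over all pairs with non-zero counts and minimizes it with gradient descent.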

⚡ Semantic Relationships & Vector Arithmetic

Famous Example: Analogical Reasoning

vec("king") - vec("man") + vec("woman") ≈ vec("queen")

vec("Paris") - vec("France") + vec("Italy") ≈ vec("Rome")

vec("walked") - vec("walk") + vec("swim") ≈ vec("swam")

• Why this works: Embeddings capture semantic dimensions

- Gender axis: king↔queen, man↔woman, boy↔girl

- Capital axis: France↔Paris, Italy↔Rome, Japan↔Tokyo

- Tense axis: walk↔walked, swim↔swam, eat↔ate

• Similarity metric: Cosine similarity

sim(u, v) = (u · v) / (||u|| ||v||) ∈ [-1, 1]
// 1 = identical, 0 = orthogonal, -1 = opposite
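The cosine similarity formula can be checked directly against the 3D example vectors used earlier on this page. This is a plain-Python sketch; numerical libraries offer faster equivalents.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# The simplified 3D vectors from the dense-embedding example
vectors = {
    "cat":   [0.2, 0.8, 0.3],
    "dog":   [0.3, 0.7, 0.4],
    "king":  [0.8, 0.3, 0.6],
    "queen": [0.7, 0.4, 0.5],
    "apple": [0.1, 0.2, 0.9],
}

for a, b in [("cat", "dog"), ("king", "queen"), ("cat", "king")]:
    print(a, b, round(cosine_similarity(vectors[a], vectors[b]), 2))
```

With these toy values, cat/dog and king/queen both score about 0.98 or higher, while cat/king scores about 0.63.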

📦 Static vs Contextual

Static (Word2Vec, GloVe):

"bank" always same vector (river vs financial?)

Contextual (BERT, ELMo):

"bank" vector changes based on sentence context!

🎯 Pre-trained Embeddings

• Word2Vec: Google News (100B words)

• GloVe: Wikipedia + Gigaword (6B tokens)

• FastText: Common Crawl (600B tokens)

→ Transfer learning for NLP!

💡 Key Insight

Embeddings are the "CNN for NLP" - they revolutionized the field by providing meaningful representations. Today: contextual embeddings dominate!

2. Word Embeddings: Meaning as Vectors

🧮 Interactive: Explore Word Vectors

Words with similar meanings have similar vectors. "King" and "Queen" are close in vector space!

🎯 Key Insight: Word2Vec, GloVe, and FastText learn these embeddings from massive text corpora. Similar contexts → similar vectors!

3. Sentiment Analysis: Understanding Emotion

😊 Interactive: Analyze Sentiment

🔍 How it Works: Sentiment analysis uses machine learning to classify emotional tone. Models can be rule-based, ML-based (Naive Bayes, SVM), or deep learning (BERT, RoBERTa).

4. Named Entity Recognition (NER)

🏷️ Interactive: Extract Entities

NER identifies and classifies named entities: people, organizations, locations, dates, etc.

Sample Text

Apple Inc. is located in Cupertino, California and was founded by Steve Jobs in 1976.

5. Text Classification

🗂️ Interactive: Classify Text

Automatically categorize text into predefined classes: spam/ham, topic, sentiment, intent, etc.

💡 Applications: Spam filtering, topic labeling, intent detection, priority routing, content moderation, language identification.

6. TF-IDF: Term Importance

📊 Interactive: Calculate TF-IDF

TF-IDF measures how important a word is to a document. High TF-IDF = distinctive term!

Formula:

TF-IDF = TF × IDF

TF (Term Frequency):

How often term appears in document

IDF (Inverse Doc Frequency):

How rare term is across all documents
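A minimal sketch of the formula, assuming TF is the term's share of the document's tokens and IDF is the natural log of (number of documents / documents containing the term). This is one common variant; libraries often add smoothing or normalization, so their scores will differ. The three-document corpus is made up for this example.

```python
import math

def tf(term, doc_tokens):
    """Fraction of the document's tokens that are `term`."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """log(N / number of documents containing the term); 0 if it never appears."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical toy corpus
corpus = [
    "machine learning is fun".split(),
    "deep learning uses neural networks".split(),
    "the cat sat on the mat".split(),
]

doc = corpus[0]
print(round(tf_idf("machine", doc, corpus), 3))   # appears in 1 of 3 docs
print(round(tf_idf("learning", doc, corpus), 3))  # appears in 2 of 3 docs
```

"machine" scores higher than "learning" in the first document because it is just as frequent there but rarer across the corpus.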

🎯 Use Case: TF-IDF helps search engines rank documents. Terms with high TF-IDF are most relevant to that specific document!

Attention Mechanism: Learning What Matters

🎯 Why Attention Revolutionized NLP

The RNN Bottleneck Problem

• Sequential processing: RNNs process one word at a time → can't parallelize

• Long-term dependency problem: Information from word 1 must travel through 100+ steps to reach word 100

• Fixed-size bottleneck: Entire sentence compressed into single hidden state vector

"The cat that ate the mouse that lived in the house is sleeping"
→ What is sleeping? Information gets lost!

Attention Solution: Let model directly access ALL previous words, not just hidden state!

🧮 Self-Attention Mathematics

Core Concept: Query, Key, Value (QKV)

Think of it like a database lookup:

• Query (Q): "What am I looking for?" (current word's question)

• Key (K): "What do I contain?" (each word's description)

• Value (V): "What information do I have?" (actual content)

Example: "The cat sat on the mat"
Query from "sat": "Who performed this action?"
Keys from all words: "cat" has high match!
Value from "cat": Return cat's information

Step-by-Step Computation

Step 1: Create Q, K, V matrices

Q = X W_Q // Transform input X with learned weights
K = X W_K // Each word gets 3 representations
V = X W_V // Learned during training

X: (seq_len, d_model) → Q,K,V: (seq_len, d_k)

Step 2: Compute attention scores

scores = (Q × Kᵀ) / √d_k

• Q × Kᵀ: dot product = similarity between query and all keys
• √d_k: scaling factor (keeps large dot products from saturating the softmax, which would shrink gradients)
• Result: (seq_len, seq_len) matrix of attention scores

Step 3: Apply softmax

attention_weights = softmax(scores)

• Normalize each row of scores to values in [0,1] that sum to 1
• High scores → high attention
• Each word now has a distribution over all words in the sequence (including itself)

Step 4: Weight the values

output = attention_weights × V

• Weighted sum of values
• Important words contribute more
• Result: context-aware representation

Complete Formula

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

QKᵀ
Similarity scores

√d_k
Scaling factor

softmax × V
Weighted sum
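The complete formula can be traced step by step in plain Python. This is a sketch for a single head with tiny hand-written matrices; it takes Q, K and V directly, skipping the learned projections W_Q, W_K and W_V, and real implementations use batched tensor libraries instead of nested lists.

```python
import math

def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                         # (seq_len, seq_len)
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]               # each row sums to 1
    return matmul(weights, v), weights

# Two positions, d_k = 2; Q = K = V for simplicity (hypothetical values)
x = [[1.0, 0.0], [0.0, 1.0]]
output, weights = attention(x, x, x)
print(weights)
print(output)
```

Each position attends most strongly to itself here because its query matches its own key best.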

🎭 Multi-Head Attention

• Problem: Single attention head captures one type of relationship

• Solution: Run attention multiple times in parallel with different learned projections

MultiHead(Q,K,V) = Concat(head₁, head₂, ..., head_h) W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

Why Multiple Heads?

• Head 1: Might learn syntactic relationships (subject-verb)

• Head 2: Might learn semantic relationships (co-reference)

• Head 3: Might learn positional patterns (beginning-end)

• Head 4: Might learn domain-specific patterns

BERT-base uses 12 heads per layer, GPT-3 uses 96 heads per layer! Each can capture different aspects.

⚡ Why Attention Beats RNNs

RNNs / LSTMs

✗ Sequential: Must process word-by-word

✗ Slow: Can't parallelize training

✗ Memory decay: Forgets long-term context

✗ Fixed context: Hidden state size limits info

✗ Gradient issues: Vanishing/exploding with long sequences

Transformers (Attention)

✓ Parallel: All positions processed simultaneously

✓ Fast: Matrix operations → GPU accelerated

✓ Direct access: Every position can attend to every other position within the context window

✓ Dynamic context: Attention weights adapt per input

✓ Stable gradients: Direct connections, no chains

Computational Complexity:

RNN: O(n) sequential operations (can't parallelize)
Attention: O(1) sequential ops, O(n²) parallel ops
→ On modern GPUs, parallel O(n²) work usually beats O(n) sequential steps, though the quadratic cost becomes the bottleneck for very long sequences

🔍 Attention Variants

Self-attention: Attend to same sequence

Cross-attention: Attend to different sequence

Masked attention: Only attend to past (GPT)
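Masked (causal) attention can be sketched by restricting each row of the score matrix to positions at or before the current one. The function name `causal_softmax` is made up for this example; implementations usually get the same effect by adding negative infinity to the masked scores before the softmax.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i can only see positions 0..i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions get exactly zero weight
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

# Hypothetical raw attention scores for a 3-token sequence
scores = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
]
for row in causal_softmax(scores):
    print([round(w, 3) for w in row])
```

The first row puts all its weight on the first token, since nothing earlier exists, and only the last row spreads weight over all three tokens.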

🎯 Real-World Scale

• BERT-base: 12 layers × 12 heads = 144 attention heads

• GPT-3: 96 layers × 96 heads = 9,216 attention heads

→ Learns incredibly complex patterns!

💡 Key Insight

"Attention is All You Need" (2017) - This single paper revolutionized NLP. Transformers now power GPT, BERT, ChatGPT, Claude!

7. Attention Mechanism

👁️ Interactive: Visualize Attention

Attention helps models focus on relevant words when processing text. Click a word to see what it attends to!

🧠 Transformers: Models like BERT and GPT use multi-head attention to capture different types of relationships simultaneously!

Language Models: Predicting the Next Word

📚 What is a Language Model?

Core Definition

A language model assigns probabilities to sequences of words

P("The cat sat on the mat") = ?

• Goal: Learn P(w_t | w₁, ..., w_{t-1}) - predict next word given context

• Applications: Text generation, autocomplete, translation, speech recognition, code completion

• Quality metric: Good LM assigns high probability to actual sentences, low to gibberish

📊 Evolution: From N-grams to Neural Networks

N-gram Models (Classical, pre-2010)

• Idea: Predict next word based on previous n-1 words

Bigram (n=2): P(cat | the) = count("the cat") / count("the")
Trigram (n=3): P(sat | the cat) = count("the cat sat") / count("the cat")

✓ Simple, fast, interpretable

✗ Limited context (typically n≤5), sparse counts, no generalization

Sparsity problem: Most n-grams never seen in training → zero probability!
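The bigram estimate and the sparsity problem can both be seen in a small count-based sketch. The one-sentence corpus is a toy example, and no smoothing is applied, so unseen pairs get probability zero.

```python
from collections import Counter

def train_bigram(tokens):
    """Count single words (as contexts) and adjacent word pairs."""
    unigrams = Counter(tokens[:-1])  # the last token has no successor
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """P(word | prev) = count(prev word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

tokens = "the cat sat on the mat".split()
unigrams, bigrams = train_bigram(tokens)

print(bigram_prob("the", "cat", unigrams, bigrams))  # 0.5
print(bigram_prob("the", "mat", unigrams, bigrams))  # 0.5
print(bigram_prob("the", "dog", unigrams, bigrams))  # 0.0 (never seen)
```

"the dog" is a perfectly plausible phrase, but the model assigns it zero probability because it never appeared in training. Smoothing techniques exist to patch this.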

Neural Language Models (2013-2017)

• Architecture: Embeddings → RNN/LSTM → Softmax

h_t = LSTM(h_t-1, embed(w_t-1))
P(w_t | context) = softmax(W h_t + b)

✓ Dense representations, handles unseen n-grams, longer context

✗ Sequential processing (slow), still limited context (~100 tokens)

Transformer LMs (2018-present)

• Breakthrough: GPT (Generative Pre-trained Transformer)

Architecture: Tokens → Embeddings → Multi-layer Transformer → Next token prediction

✓ Parallel training, very long context (GPT-4 Turbo: 128K tokens!), pre-training + fine-tuning

Scale: GPT-3 has 175B parameters, trained on 300B tokens

🎲 Sampling Strategies: Controlling Generation

1. Temperature Scaling

P'(w_i) = exp(logit_i / T) / Σ exp(logit_j / T)
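The temperature formula can be sketched as a softmax over logits divided by T. The three logits are hypothetical values chosen for illustration, not taken from a real model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lowering T concentrates probability on the top token; raising it flattens the distribution toward uniform.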

T = 0.1 (Low)
Focused, deterministic
"The" → 0.95, "A" → 0.04

T = 1.0 (Default)
Balanced sampling
"The" → 0.60, "A" → 0.25

T = 2.0 (High)
Creative, random
"The" → 0.35, "A" → 0.30

2. Top-k Sampling

• Idea: Only sample from top k most likely words

Probs: [sunny: 0.4, cloudy: 0.25, rainy: 0.2, windy: 0.1, foggy: 0.05]
k=3 → Renormalize: [sunny: 0.47, cloudy: 0.29, rainy: 0.24]

✓ Avoids sampling very unlikely words

A common default is k=50 (for example, in Hugging Face Transformers generation settings); larger k allows more varied output
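The renormalization in the example above can be reproduced in a few lines. The function name `top_k_filter` is made up for this sketch; the final line draws one word from the filtered distribution.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely words and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}

probs = {"sunny": 0.4, "cloudy": 0.25, "rainy": 0.2, "windy": 0.1, "foggy": 0.05}
filtered = top_k_filter(probs, k=3)
print({word: round(p, 2) for word, p in filtered.items()})
# {'sunny': 0.47, 'cloudy': 0.29, 'rainy': 0.24}

# Sample one word from the filtered distribution
sample = random.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(sample)
```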

3. Nucleus (Top-p) Sampling

• Idea: Sample from smallest set with cumulative probability ≥ p

p=0.9 → Sample until cumulative prob reaches 90%
Sometimes 2 words, sometimes 50 words (adapts to confidence!)

✓ Dynamic cutoff, better than fixed k

OpenAI API defaults: top_p=1.0, temperature=1.0
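A minimal sketch of the nucleus cutoff, showing how the number of kept words adapts to the model's confidence. Both distributions are hypothetical, and `top_p_filter` is an illustrative name.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top words whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

# Hypothetical distributions: one uncertain, one confident
uncertain = {"sunny": 0.4, "cloudy": 0.25, "rainy": 0.2, "windy": 0.1, "foggy": 0.05}
confident = {"Paris": 0.95, "Lyon": 0.03, "Nice": 0.02}

print(list(top_p_filter(uncertain, p=0.9)))  # ['sunny', 'cloudy', 'rainy', 'windy']
print(list(top_p_filter(confident, p=0.9)))  # ['Paris']
```

With the same p, the uncertain distribution keeps four candidates while the confident one keeps a single word, which a fixed k cannot do.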

4. Beam Search (Deterministic)

• Idea: Keep top k sequences at each step, expand all

• Use case: Machine translation (want best translation, not diverse)

✓ Finds high-probability sequences

✗ Can be repetitive, less creative

📈 Evaluation Metric: Perplexity

Perplexity = exp(-1/N Σ log P(w_i | context))

• Intuition: How "surprised" is the model on average?

• Lower = Better: Perplexity of 10 means model is as confused as choosing from 10 equally likely words

• Rough benchmarks (values vary with dataset and tokenization):

- Random baseline: Perplexity = vocabulary size (e.g., 50,000)

- Good trigram model: ~100-200

- Modern LSTM: ~40-60

- GPT-3 scale: ~20-30 (excellent!)
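The perplexity formula can be checked against the "10 equally likely words" intuition. The input is the probability the model assigned to each actual next token; the probabilities below are made up for illustration.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the observed tokens."""
    n = len(token_probs)
    avg_neg_log = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log)

# A model that always gives the correct token probability 1/10
print(round(perplexity([0.1] * 5), 6))         # 10.0
# A model that is usually more confident in the correct token
print(round(perplexity([0.5, 0.25, 0.5]), 3))
```

A uniform guess over a 50,000-word vocabulary would give each token probability 1/50,000 and therefore a perplexity of 50,000, matching the random baseline above.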

🚀 Modern LLMs: The GPT Architecture

Two-Stage Training

Stage 1: Pre-training (Unsupervised)

• Train on massive unlabeled text (web crawl, books, code)
• Objective: Predict next token (autoregressive)
• GPT-3: 300B tokens, compute cost estimated by third parties at ~$4.6M
• Result: General language understanding

Stage 2: Fine-tuning (Supervised)

• Adapt to specific tasks (Q&A, summarization, chat)
• ChatGPT: supervised fine-tuning followed by RLHF (Reinforcement Learning from Human Feedback)
• Small dataset, quick training
• Result: Task-specific expert

Key Innovations

Causal (Masked) Attention
Can only attend to previous tokens → autoregressive generation

Enormous Scale
GPT-3: 175B params, GPT-4: ~1.7T params (rumored)

In-Context Learning
Few-shot prompting without parameter updates!

Emergent Abilities
Reasoning, math, code at sufficient scale

🎯 Autoregressive LMs

GPT series: Left-to-right generation

Use: Text generation, chat, completion

Fast inference, creative outputs

🔄 Masked LMs

BERT series: Bidirectional context

Use: Classification, NER, Q&A

Better understanding, no generation

💡 Key Insight

LLMs are "compressed internet" - they memorize patterns, not facts. Temperature controls exploration vs exploitation trade-off!

8. Language Model Prediction

🤖 Interactive: Generate Text

Language models predict the next word based on context. Temperature controls randomness!

Low Temperature (0.0-0.3)

Focused, predictable, picks most likely word. Good for factual tasks.

High Temperature (0.7-1.0)

Creative, random, explores less likely words. Good for creative writing.

9. Build Your NLP Pipeline

🔧 Interactive: Design Processing Pipeline

Real NLP systems chain multiple steps. Build your pipeline by adding processing stages!

💡 Pipeline Flow: Text → Tokenization → Cleaning → Feature Extraction → Model → Output. Order matters! Preprocessing comes before modeling.

🎯 Key Takeaways

✂️

Tokenization is Foundation

Breaking text into tokens is the first critical step. Word, character, or subword - each method has trade-offs. Modern models use subword tokenization (BPE, WordPiece).

🧮

Words Become Vectors

Word embeddings (Word2Vec, GloVe, FastText) convert text to numbers while preserving semantic relationships. Similar words have similar vectors in high-dimensional space.

🎯

Multiple NLP Tasks

Sentiment analysis, NER, classification, translation, summarization - NLP covers diverse tasks. Each requires different architectures but shares preprocessing steps.

👁️

Attention is Key

Attention mechanism revolutionized NLP. Transformers (BERT, GPT) use self-attention to capture context and relationships between all words simultaneously, not sequentially.

🔧

Pipelines are Powerful

Real NLP systems chain preprocessing, feature extraction, and models into pipelines. spaCy, NLTK, and Hugging Face Transformers provide ready-to-use components.

🚀

Modern Era: LLMs

Large Language Models (GPT-4, Claude, LLaMA) are pre-trained on massive text corpora. They excel at few-shot learning, understanding context, and generating human-like text across domains.