Transformer Architecture Explained

Understand attention mechanisms and modern language models

What is Transformer Architecture?

The Transformer revolutionized AI by replacing recurrence with attention mechanisms. It's the architecture behind GPT, BERT, and modern language models.

Core Innovation: Attention Is All You Need

Self-Attention: Each word attends to all other words simultaneously
Parallelization: Process entire sequences at once, unlike RNNs
Multi-Head Attention: Learn different types of relationships in parallel

Self-Attention: The Core Innovation

Why Attention Revolutionized NLP

Before transformers, RNNs processed words sequentially, which made training slow and made early context easy to forget. Self-attention lets each word look directly at every other word in parallel, computing all pairwise relationships in a single step.

RNN/LSTM Problems

• Sequential bottleneck: Word 100 must wait for words 1-99 to process
• Vanishing gradients: Long-range dependencies fade (e.g., words 50 steps apart)
• No parallelization across time steps: Each step depends on the previous hidden state, so the positions of a sentence cannot all be computed at once
• Hidden state compression: All context is squeezed into a fixed-size vector

✓ Self-Attention Solutions

→ Parallel processing: All positions are computed simultaneously, which typically makes training much faster on parallel hardware
→ Direct connections: Word 1 ↔ Word 100 in one hop (O(1) path length)
→ GPU-friendly: The computation is mostly matrix multiplication, which modern hardware handles well
→ Dynamic attention: Focus adapts to context instead of relying on a fixed hidden state

Attention Score Computation (Simplified)

score(word_i, word_j) = similarity(word_i, word_j) / √d
Step 1: Compute the dot product between word vectors (measures similarity)
Step 2: Scale by √d, where d is the vector dimension (prevents softmax saturation for large dimensions)
Step 3: Apply softmax across all words (converts scores to a probability distribution)
Result: Each word gets attention weights that sum to 1.0, distributed across all positions based on relevance.
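
The three steps above can be sketched for a single query word in plain Python. This is a minimal illustration, not a production implementation: the function name `attention_weights` and the 4-dimensional vectors are made up for the example, and the max-subtraction inside the softmax is a common numerical-stability trick that the steps above do not mention.

```python
import math

def attention_weights(query, keys):
    """Return softmax(query . key_j / sqrt(d)) over all keys."""
    d = len(query)
    # Steps 1 and 2: dot-product similarity, scaled by sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Step 3: softmax (subtracting the max first avoids overflow)
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-dimensional word vectors, chosen only for illustration
keys = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]
weights = attention_weights(keys[0], keys)
print([round(w, 3) for w in weights])
print(sum(weights))
```

The first and third vectors have the same dot product with the query, so they receive equal weight, while the orthogonal second vector receives less.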

Computational Complexity

RNN: O(n) sequential steps → can't parallelize across positions → slow for long sequences
Self-Attention: O(n²) comparisons but fully parallel → fast on GPUs despite the quadratic cost
Trade-off: As a rule of thumb, for sequences up to roughly 10K tokens attention is usually faster in practice. For very long sequences (100K+ tokens) the quadratic cost dominates; variants such as Linformer and Longformer reduce it to O(n), and Reformer to O(n log n).

1. Self-Attention Mechanism

Interactive: Click Words to See Attention

Self-attention lets each word understand its relationship with every other word in the sentence.

Attention from "sat" to other words:

The: 5%
cat: 35%
sat: 30%
on: 15%
the: 5%
mat: 10%

Key Insight: "sat" pays most attention to "cat" (semantic relationship).

Query-Key-Value: The Retrieval Metaphor

Database-Inspired Attention Mechanism

Think of attention as a soft database lookup: Query searches, Keys match, Values return. This abstraction underlies all transformer attention.

The Analogy: Search Engine

Query (Q)
What it is: "What am I looking for?"
Example: Word "sat" asks: "Who/what is doing the sitting?"
Technical: Q = Input × W_Q
(Linear projection of input embedding)
Key (K)
What it is: "What do I offer?"
Example: Word "cat" advertises: "I'm a noun, an actor"
Technical: K = Input × W_K
(Different projection, same dimension as Q)
Value (V)
What it is: "What information do I provide?"
Example: "cat" returns semantic features about being feline
Technical: V = Input × W_V
(Actual content to return if matched)

The Math: Scaled Dot-Product Attention

Attention(Q, K, V) = softmax((Q × K^T) / √d_k) × V
Step 1: Q × K^T
• Compute all pairwise similarities
• Shape: [seq_len, seq_len]
• Example: "sat" query matches all keys
• Higher score = more relevant
Step 2: Scale by √d_k
• d_k = key dimension (e.g., 64)
• Without scaling: large dot products → saturated softmax
• With scaling: gradients flow better
• Critical for training stability
Step 3: Softmax
• Converts scores to probabilities
• Each row sums to 1.0
• softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
• Differentiable = backprop works
Step 4: × V
• Weighted sum of all values
• High attention → more influence
• Output shape: [seq_len, d_v]
• Each position = context-aware representation
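
The four steps can be put together in a few lines of NumPy. The sketch below assumes a single unbatched sequence; the sizes and the random matrices standing in for the learned W_Q, W_K, W_V are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # Steps 1-2: [seq_len, seq_len]
    weights = softmax(scores, axis=-1)   # Step 3: each row sums to 1
    return weights @ V, weights          # Step 4: [seq_len, d_v]

rng = np.random.default_rng(0)
seq_len, d_model, d_k, d_v = 6, 16, 8, 8    # illustrative sizes
X = rng.normal(size=(seq_len, d_model))     # stand-in for input embeddings
W_Q = rng.normal(size=(d_model, d_k))       # random stand-ins for learned weights
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

output, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(output.shape, weights.shape)
```

Row i of `weights` holds the attention distribution of position i over all positions, and row i of `output` is the corresponding weighted sum of value vectors.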

Why Three Separate Projections?

Q1: Why not just use raw embeddings?
Answer: Learned projections allow the model to transform embeddings into "search-friendly" and "content-friendly" spaces.
Q2: Why are Q and K different from V?
Answer: Q and K are optimized for matching (finding relevant positions). V is optimized for content (what to return).
Q3: Can Q, K, V have different dimensions?
Answer: Q and K must match (for dot product). V can differ, but typically d_q = d_k = d_v = d_model / num_heads.

2. Query, Key, Value Mechanism

Interactive: How Attention Computes

Step 1: Input Embeddings

Each word is represented as a vector (e.g., 512 dimensions).

"cat" → [0.2, -0.5, 0.8, ..., 0.1] (512 values)

Multi-Head Attention: Ensemble of Perspectives

Why Multiple Attention Heads?

A single attention head might miss nuances. Multiple heads let the model attend to different types of relationships simultaneously: syntax, semantics, position, and coreference.

What Different Heads Learn (Empirical Observations)

Head Type 1: Syntactic
Focus: Grammatical relationships
Examples:
• Subject → Verb connections
• Adjective → Noun modifications
• Preposition → Object dependencies
"The cat sat" (subject-verb)
Head Type 2: Semantic
Focus: Meaning and context
Examples:
• Words with similar meanings
• Contextual word sense
• Thematic relationships
"bank" → "river" vs "money" context
Head Type 3: Positional
Focus: Word order patterns
Examples:
• Attends to adjacent words
• Local n-gram patterns
• Relative position awareness
Word i → words i±1, i±2
Head Type 4: Long-Range
Focus: Distant dependencies
Examples:
• Coreference resolution
• Discourse coherence
• Sentence-level structure
"John ... he" (pronoun → antecedent)

The Mathematics: Splitting and Concatenating

Input Dimension Splitting:
d_model = 512 (the original Transformer base size; BERT-base and GPT-2 small use 768)
h = 8 heads
d_k = d_model / h = 512 / 8 = 64 per head
Each head operates in a 64-dimensional subspace, which gives 8 different "views" of the data.
Parallel Computation:
head_i = Attention(Q_i, K_i, V_i)
Each head computes its own Q, K, V projections and attention independently.
All heads run in parallel on the GPU, and because each head is smaller, the total cost is similar to that of one full-width head.
Concatenation and Final Projection:
MultiHead(Q,K,V) = Concat(head_1, ..., head_h) × W_O
• Concatenate all 8 heads: [batch, seq, 64×8] = [batch, seq, 512]
• Final linear projection W_O mixes information from all heads
• Output dimension = d_model (512), same as input
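
The split, attend, concatenate, project sequence can be sketched in NumPy as follows. This is a simplified illustration under a few assumptions: there is no batch dimension, the weights are random stand-ins for learned parameters, and the per-head projections are implemented as slices of one full-width projection, which is a common way to write it but not the only one.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    seq_len, d_model = X.shape
    d_k = d_model // num_heads

    def split_heads(M):
        # [seq, d_model] -> [heads, seq, d_k]
        return M.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # [heads, seq, seq]
    heads = softmax(scores) @ V                        # [heads, seq, d_k]
    # Concatenate heads back to [seq, d_model], then mix them with W_O
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O

rng = np.random.default_rng(0)
d_model, h, seq_len = 512, 8, 10
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (
    rng.normal(size=(d_model, d_model)) * d_model ** -0.5 for _ in range(4)
)
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape, d_model // h)
```

All heads are computed in one batched matrix multiplication, which is why adding heads at fixed d_model does not multiply the cost.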

Empirical Findings from Research

More heads = better? Not always. BERT uses 12 (base) or 16 (large), GPT-3 uses 96. The sweet spot is task-dependent, and too many heads leads to redundancy.
Head pruning: Studies report that a sizeable share of heads (on the order of 20-40%) can be removed with minimal performance loss, because many heads learn similar patterns.
Interpretability: Heads in different layers tend to specialize. Early layers lean toward syntax, middle layers toward semantics, and late layers toward task-specific patterns.

3. Multi-Head Attention

Interactive: Multiple Attention Heads

Head 1 Focus:

Syntactic: Subject-verb relationships, grammatical structure
Dimension per head: 128

Why Multiple Heads? Each head learns different types of relationships. They're concatenated and projected to form the final output.

Positional Encoding: Injecting Word Order

The Problem: Attention is Position-Agnostic

Self-attention on its own is permutation-equivariant: shuffling the input tokens just shuffles the outputs the same way, so without extra information it cannot tell "cat sat mat" from "mat sat cat". Positional encodings add a unique pattern for each position so the model can distinguish word order.
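
A quick numerical check makes the position-agnostic behavior concrete. The sketch below uses identity projections (Q = K = V = X) and random vectors as stand-ins for the three word embeddings, both of which are simplifying assumptions. Reordering the inputs only reorders the outputs; each word's own output vector is unchanged.

```python
import numpy as np

def self_attention(X):
    # Identity projections (Q = K = V = X) keep the sketch short
    scores = X @ X.T / np.sqrt(X.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))       # stand-ins for "cat", "sat", "mat"
perm = [2, 1, 0]                  # reorder to "mat", "sat", "cat"

out = self_attention(X)
out_perm = self_attention(X[perm])
print(np.allclose(out_perm, out[perm]))
```

Because the word vectors come out the same regardless of order, any information about position has to be added to the inputs separately.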

Why Sinusoidal Functions?

Design Requirements:
1. Uniqueness: Each position gets a different encoding
2. Bounded: Values stay in a reasonable range (no exploding)
3. Deterministic: Same position = same encoding every time
4. Generalizable: Can, in principle, extrapolate to sequences longer than those seen in training
5. Relative positioning: Model can learn that "k steps apart" is consistent

The Formula Explained

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Variables:
pos: Position in sequence (0, 1, 2, ...)
i: Dimension-pair index (0 to d_model/2 - 1)
d_model: Embedding dimension (e.g., 512)
2i, 2i+1: Even dimensions use sine, odd dimensions use cosine
Why 10000?
• Creates wavelengths from 2π to 10000·2π
• Low dimensions: high frequency (local patterns)
• High dimensions: low frequency (global patterns)
• Balances short and long-range position info
Intuition:
Think of it like a binary counter in continuous space. Each dimension oscillates at a different frequency.
Position 0:
Dim 0: sin(0) = 0, Dim 1: cos(0) = 1, Dim 2: sin(0) = 0, ...
Position 100:
Dim 0: sin(100) = ..., Dim 1: cos(100) = ..., different pattern!
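
The formula translates almost directly into NumPy. The sketch below assumes an even d_model; the function name and the `max_len` value are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]            # [max_len, 1]
    i = np.arange(d_model // 2)[None, :]         # [1, d_model/2]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)   # odd dimensions 2i+1
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=512)
print(pe.shape)
print(pe[0, :4])     # position 0: alternating sin(0) = 0 and cos(0) = 1
print(pe[100, :2])   # position 100: sin(100), cos(100)
```

The resulting matrix is added to the token embeddings row by row, so each position carries its own pattern.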

✓ Sinusoidal Advantages

• No learnable parameters (saves memory)
• Defined for any sequence length at inference
• Linear combinations give relative positions
• PE(pos+k) can be expressed as a linear function of PE(pos)
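
The last point follows from the angle-addition identities: for each sine/cosine pair, moving k positions forward is a 2×2 rotation whose entries depend only on k. A small check for one dimension pair, with the pair index and positions picked arbitrarily for illustration:

```python
import math

d_model, i = 512, 3                      # illustrative dimension pair (2i, 2i+1)
omega = 1.0 / 10000 ** (2 * i / d_model)

def pe_pair(pos):
    return (math.sin(omega * pos), math.cos(omega * pos))

def shift(pair, k):
    # 2x2 rotation that depends only on the offset k, not on pos
    s, c = pair
    return (s * math.cos(omega * k) + c * math.sin(omega * k),
            -s * math.sin(omega * k) + c * math.cos(omega * k))

shifted = shift(pe_pair(10), 5)   # PE(10) moved forward by k = 5
direct = pe_pair(15)              # PE(15) computed directly
print(shifted, direct)
```

Both tuples agree up to floating-point error, which is the sense in which a fixed offset corresponds to a fixed linear map.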

Alternatives Used in Practice

Learned PE: BERT uses learned embeddings (better for fixed max length)
Relative PE: T5, XLNet encode relative distances directly
RoPE: Rotary PE in modern models (LLaMA, GPT-NeoX)
ALiBi: Attention with Linear Biases (no explicit PE)

4. Positional Encoding

Interactive: Adding Position Information

Since transformers process all positions simultaneously, we add positional encodings to preserve word order.

Sinusoidal Pattern Visualization

Each column = position, each row = dimension
Formula
PE(pos, 2i) = sin(pos/10000^(2i/d))
PE(pos, 2i+1) = cos(pos/10000^(2i/d))
Purpose
Unique pattern for each position that the model can learn to use

Encoder vs Decoder: Architectural Paradigms

Three Architectural Families

Encoder-Only

Purpose: Understanding and representation
Attention: Bidirectional (sees entire input)
Examples: BERT, RoBERTa, ALBERT
Best For:
• Text classification
• Named entity recognition
• Question answering
• Semantic search

Decoder-Only

Purpose: Generation and prediction
Attention: Causal (can't see future)
Examples: GPT-3, GPT-4, LLaMA
Best For:
• Text generation
• Creative writing
• Code completion
• Conversational AI

Encoder-Decoder

Purpose: Sequence-to-sequence mapping
Attention: Both + cross-attention
Examples: T5, BART, mT5
Best For:
• Machine translation
• Summarization
• Paraphrasing
• Text-to-SQL

Encoder Architecture Deep Dive

Bidirectional Self-Attention:
Each token can attend to all other tokens in the input sequence (past and future).
Input: "The cat sat on the mat"
Token "cat" attends to: [The, cat, sat, on, the, mat] ← sees everything
Training Strategy:
• Masked Language Modeling (MLM): Select 15% of tokens and predict the originals (in BERT, 80% of the selected tokens become [MASK], 10% become a random token, and 10% stay unchanged)
Example: "The [MASK] sat on the mat" → predict "cat"
• Forces the model to build contextual representations
Advantages:
✓ Rich bidirectional context
✓ Better for understanding tasks
✓ Smaller models can achieve high accuracy
✓ Fine-tunes efficiently on downstream tasks

Decoder Architecture Deep Dive

Causal (Masked) Self-Attention:
Each token can only attend to previous tokens (and itself), not future tokens.
Input: "The cat sat on the mat"
Token "cat" attends to: [The, cat] ← can't see "sat", "on", "the", "mat"
Why? To prevent information leakage during autoregressive generation.
Training Strategy:
• Next Token Prediction: Given tokens 1..n, predict token n+1
Example: "The cat" → predict "sat"
• Trained on massive text corpora (roughly 300 billion tokens for GPT-3)
Advantages:
✓ Natural for generation tasks
✓ Scales to very large models (175B+ params)
✓ Few-shot learning via prompting
✓ Flexible zero-shot generalization
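
The causal masking described in this section is usually implemented by setting the scores of future positions to negative infinity before the softmax, so they receive zero weight. A minimal NumPy sketch, using uniform scores purely so the effect of the mask is easy to read:

```python
import numpy as np

def causal_attention_weights(scores):
    seq_len = scores.shape[-1]
    # True above the diagonal marks "future" positions a query must not see
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

tokens = ["The", "cat", "sat", "on", "the", "mat"]
scores = np.zeros((len(tokens), len(tokens)))   # uniform scores, for illustration
weights = causal_attention_weights(scores)
print(weights[1])   # row for "cat": weight only on "The" and "cat"
```

With uniform scores, the row for "cat" splits its attention equally between "The" and itself, and every entry above the diagonal is exactly zero.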

Cross-Attention: The Bridge Between Encoder & Decoder

In encoder-decoder models, each decoder layer has an additional attention sublayer, called cross-attention, that sits between masked self-attention and the feed-forward network and connects encoder outputs to the decoder.

Cross-Attention Mechanics:
1. Query (Q): Comes from the decoder's previous sublayer
2. Key (K) & Value (V): Come from the encoder's final output
CrossAttn(Q_decoder, K_encoder, V_encoder)
This allows each decoder position to attend over all encoder positions (the entire input sequence).
Example: Translation
English (Encoder):
"The cat sits"
French (Decoder generating):
"Le chat" → next word?
Cross-attention lets "chat" look back at the entire English sentence to decide that the next French word is "s'assoit".
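
A sketch of the cross-attention computation, assuming a single head, no batch dimension, and random matrices in place of trained weights. The only difference from self-attention is where Q, K, and V come from, which shows up in the non-square shape of the weight matrix.

```python
import numpy as np

def cross_attention(decoder_states, encoder_outputs, W_Q, W_K, W_V):
    Q = decoder_states @ W_Q      # queries from the decoder
    K = encoder_outputs @ W_K     # keys from the encoder
    V = encoder_outputs @ W_V     # values from the encoder
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # [tgt_len, src_len]
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model = 16                                       # illustrative size
encoder_outputs = rng.normal(size=(3, d_model))    # stand-in for "The cat sits"
decoder_states = rng.normal(size=(2, d_model))     # stand-in for "Le chat"
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_outputs, W_Q, W_K, W_V)
print(context.shape, weights.shape)
```

Each of the two decoder positions gets a distribution over the three encoder positions, and no causal mask is needed because the whole source sentence is already available.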
When to Use Each:
Classification: Encoder-only (BERT)
Generation: Decoder-only (GPT)
Translation/Summary: Encoder-Decoder (T5)
Modern trend: Decoder-only for everything with prompting!

5. Encoder vs Decoder Architecture

Interactive: Full Transformer Structure

Encoder (Understanding)

1. Multi-Head Self-Attention
Bidirectional - sees entire input
↓
2. Add & Normalize
Residual connection + Layer norm
↓
3. Feed-Forward Network
2 linear layers with activation
↓
4. Add & Normalize
Another residual + norm
Used for: BERT, encoding context

Decoder (Generation)

1. Masked Self-Attention
Causal - can't see future tokens
↓
2. Cross-Attention (if encoder exists)
Attends to encoder outputs
↓
3. Feed-Forward Network
Same as encoder FFN
↓
4. Output Projection
Linear + softmax over vocabulary (applied once, after the final decoder layer)
As in the encoder, each sublayer is followed by Add & Normalize.
Used for: GPT, autoregressive generation

6. Layer Normalization

Interactive: Normalize Activations

Before (Raw Values)

Val 1: 2
Val 2: 4
Val 3: 6
Val 4: 8

After Normalization (mean 5, standard deviation ≈ 2.24, assuming gain 1 and bias 0)

Val 1: -1.34
Val 2: -0.45
Val 3: 0.45
Val 4: 1.34
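
Layer normalization subtracts the mean of a vector and divides by its standard deviation, then applies a learned gain and bias. A minimal sketch for the raw values 2, 4, 6, 8, assuming gain 1, bias 0, and a small epsilon for numerical safety (the parameter names `gamma`, `beta`, and `eps` are conventional, not taken from this page):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)       # population variance
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([2.0, 4.0, 6.0, 8.0])
y = layer_norm(x)
print(np.round(y, 2))
```

The output has mean 0 and unit variance, roughly -1.34, -0.45, 0.45, 1.34, which keeps activations on a consistent scale from layer to layer.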

7. Feed-Forward Network

Interactive: Position-wise FFN

FFN Output: 0.75
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
ReLU (Rectified Linear Unit)
Simple: outputs 0 for negative inputs, x for positive. Fast and effective.
GELU (Gaussian Error Linear Unit)
Smooth alternative to ReLU. Used in BERT and GPT. Often gives better-behaved gradients.
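
The FFN formula and both activations can be sketched in NumPy as below. The sizes and random weights are illustrative, and the GELU shown is the widely used tanh approximation rather than the exact definition.

```python
import math
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def feed_forward(x, W1, b1, W2, b2, activation=relu):
    # FFN(x) = activation(x W1 + b1) W2 + b2, applied to every position alike
    return activation(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5           # d_ff = 4 * d_model is a common choice
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = feed_forward(x, W1, b1, W2, b2)                     # ReLU version
out_gelu = feed_forward(x, W1, b1, W2, b2, activation=gelu)
print(out.shape, out_gelu.shape)
```

"Position-wise" means the same weights are applied to each row independently, so running the network on a single position gives the same result as that row of the full output.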

8. Attention Masking

Interactive: Masking Patterns

Attention Matrix Visualization

Rows = queries, Columns = keys
✓ Full Attention (Encoder)
Every position can attend to all positions. Used in BERT for bidirectional understanding.

9. Transformer Applications

Interactive: Famous Models

GPT (Decoder-only)

Decoder-only with causal masking

Autoregressive language model. Predicts next token given previous context.

Use Cases:
Text generation, chat, code completion

10. Model Size Calculator

Interactive: Estimate Parameters

Estimated Parameters: 123.3M
Attention: 28.3M
Feed-Forward: 56.6M
Embeddings: 38.4M

Reference: GPT-2 small (117M as originally reported, 124M by later count), BERT-base (110M), GPT-3 (175B), GPT-4 (not disclosed; ~1.7T is an unofficial estimate)
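
A back-of-the-envelope version of such a calculator is sketched below. The settings d_model = 768, 12 layers, and a 50,000-token vocabulary are assumptions that happen to reproduce the figures shown above; the page does not state which inputs the calculator used. Biases, layer norm parameters, and positional embeddings are ignored.

```python
def estimate_parameters(d_model, num_layers, vocab_size, ffn_multiplier=4):
    # Attention: W_Q, W_K, W_V, W_O, each d_model x d_model, per layer
    attention = num_layers * 4 * d_model * d_model
    # Feed-forward: two matrices, d_model x d_ff and d_ff x d_model, per layer
    feed_forward = num_layers * 2 * d_model * (ffn_multiplier * d_model)
    # Token embedding table
    embeddings = vocab_size * d_model
    return {
        "attention": attention,
        "feed_forward": feed_forward,
        "embeddings": embeddings,
        "total": attention + feed_forward + embeddings,
    }

est = estimate_parameters(d_model=768, num_layers=12, vocab_size=50_000)
for name, count in est.items():
    print(f"{name}: {count / 1e6:.1f}M")
```

The feed-forward layers hold about twice as many parameters as attention at these settings, and the embedding table is a large share of the total for a model this small.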

Key Takeaways

Self-Attention is Key

Transformers replace recurrence with attention. Each position attends to all others simultaneously, enabling parallelization and long-range dependencies.

Multi-Head Attention

Multiple attention heads learn different relationships: syntax, semantics, position, long-range. They're concatenated for rich representations.

Positional Encoding

Since attention alone has no notion of word order, positional encodings inject position information, for example with sinusoidal patterns the model can learn to use.

Encoder-Decoder Structure

Encoder for understanding (BERT), decoder for generation (GPT), or both for translation (T5). Each serves different purposes.

Masking Mechanisms

Causal masking for autoregressive generation (GPT), padding masking for variable lengths, no masking for bidirectional understanding (BERT).

Scalability

Transformers have scaled well, from 110M (BERT) to 175B (GPT-3) to reported trillions of parameters. In practice, more compute and data have generally meant better performance.