Transformer Architecture Explained

Understand attention mechanisms and modern language models

What is Transformer Architecture?

The Transformer revolutionized AI by replacing recurrence with attention mechanisms. It's the architecture behind GPT, BERT, and modern language models.

💡 Core Innovation: Attention Is All You Need

🎯

Self-Attention

Each word attends to all other words simultaneously

⚡

Parallelization

Process entire sequences at once, unlike RNNs

🎭

Multi-Head Attention

Learn different types of relationships in parallel

Self-Attention: The Core Innovation

🧠 Why Attention Revolutionized NLP

Before transformers, RNNs processed words sequentially—slow and prone to forgetting. Self-attention lets each word directly look at every other word in parallel, computing relationships in one step.

❌ RNN/LSTM Problems

• Sequential bottleneck: Word 100 must wait for words 1-99 to be processed

• Vanishing gradients: Long-range dependencies fade (e.g., words 50 steps apart)

• No parallelization: Steps within a sequence can't be computed in parallel, even across multiple GPUs

• Hidden state compression: All context is squeezed into a fixed-size vector

✓ Self-Attention Solutions

→ Parallel processing: All positions are computed simultaneously, so training is typically much faster on parallel hardware (the exact speedup depends on sequence length and hardware)

→ Direct connections: Word 1 ↔ Word 100 in one hop (O(1) path length)

→ GPU-friendly: Matrix operations map well onto modern hardware

→ Dynamic attention: Focus adapts to context rather than relying on a fixed hidden state

🔬 Attention Score Computation (Simplified)

score(word_i, word_j) = similarity(word_i, word_j) / √d

Step 1: Compute dot product between word vectors (measures similarity)

Step 2: Scale by √d (prevents saturation in softmax for large dimensions)

Step 3: Softmax across all words (converts to probability distribution)

Result: Each word gets attention weights summing to 1.0, distributed across all positions based on relevance.
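The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `attention_weights` and the 4-dimensional word vectors are made up for the example.

```python
import math

def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and every key."""
    d = len(query)
    # Steps 1 and 2: dot product with each key, scaled by sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Step 3: softmax (subtracting the max keeps exp() numerically stable)
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-dimensional vectors for three words
keys = [[1.0, 0.0, 1.0, 0.0],
        [0.0, 1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0, 0.0]]
weights = attention_weights([1.0, 0.0, 1.0, 0.0], keys)
print(weights)  # three weights that sum to 1.0; the first key matches best
```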

📊 Computational Complexity

RNN: O(n) sequential steps → can't parallelize → slow for long sequences

Self-Attention: O(n²) comparisons but fully parallel → fast with GPUs despite quadratic cost

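Trade-off: For short and moderate sequence lengths, attention is typically faster in practice because the work parallelizes. For very long sequences (100K+ tokens), the quadratic cost dominates; variants like Linformer and Longformer reduce it to O(n), and Reformer to O(n log n).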

1. Self-Attention Mechanism

🎯 Interactive: Click Words to See Attention

Self-attention lets each word understand its relationship with every other word in the sentence.

Attention from "sat" to other words:

The: 5%

cat: 35%

sat: 30%

on: 15%

the: 5%

mat: 10%

💡 Key Insight: "sat" pays most attention to "cat" (semantic relationship).

Query-Key-Value: The Retrieval Metaphor

🔍 Database-Inspired Attention Mechanism

Think of attention as a soft database lookup: Query searches, Keys match, Values return. This elegant abstraction powers all transformer attention.

💡 The Analogy: Search Engine

🔎 Query (Q)

What it is: "What am I looking for?"

Example: Word "sat" asks: "Who/what is doing the sitting?"

Technical: Q = Input × W_Q
(Linear projection of input embedding)

🔑 Key (K)

What it is: "What do I offer?"

Example: Word "cat" advertises: "I'm a noun, an actor"

Technical: K = Input × W_K
(Different projection, same dimension as Q)

💎 Value (V)

What it is: "What information do I provide?"

Example: "cat" returns semantic features about being feline

Technical: V = Input × W_V
(Actual content to return if matched)
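As a rough sketch of the three projections, the snippet below multiplies a tiny input matrix by three weight matrices. The helper `matmul` and all numbers are hypothetical; in a real model W_Q, W_K, and W_V are learned during training.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical input: 2 tokens, embedding size 3
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]

# Hypothetical projection weights (3 -> 2); learned in a real model
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_K = [[0.5, 0.0], [0.0, 0.5], [0.0, 1.0]]
W_V = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

Q = matmul(X, W_Q)  # what each token is looking for
K = matmul(X, W_K)  # what each token offers for matching
V = matmul(X, W_V)  # the content each token returns
print(Q, K, V)
```

The same input produces three different views because each projection uses its own weights.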

🧮 The Math: Scaled Dot-Product Attention

Attention(Q, K, V) = softmax((Q × K^T) / √d_k) × V

Step 1: Q × K^T

• Compute all pairwise similarities
• Shape: [seq_len, seq_len]
• Example: "sat" query matches all keys
• Higher score = more relevant

Step 2: Scale by √d_k

• d_k = key dimension (e.g., 64)
• Without scaling: large dot products → saturated softmax
• With scaling: gradients flow better
• Critical for training stability

Step 3: Softmax

• Converts scores to probabilities
• Each row sums to 1.0
• Softmax(x_i) = e^(x_i) / Σe^(x_j)
• Differentiable = backprop works

Step 4: × V

• Weighted sum of all values
• High attention → more influence
• Output shape: [seq_len, d_v]
• Each position = context-aware representation
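The four steps can be put together in one function. This is a minimal pure-Python sketch under simplifying assumptions (no batching, no masking, no multiple heads); the names `softmax` and `attention` and the example matrices are illustrative only.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for matrices given as lists of rows."""
    d_k = len(K[0])
    # Steps 1 and 2: pairwise scores, shape [seq_len, seq_len], scaled by sqrt(d_k)
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    # Step 3: each row becomes a probability distribution
    weights = [softmax(row) for row in scores]
    # Step 4: weighted sum of value rows, shape [seq_len, d_v]
    d_v = len(V[0])
    output = [[sum(w * v_row[j] for w, v_row in zip(w_row, V)) for j in range(d_v)]
              for w_row in weights]
    return output, weights

# Hypothetical example: 3 tokens, d_k = d_v = 2
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
output, weights = attention(Q, K, V)
```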

🎯 Why Three Separate Projections?

Q1: Why not just use raw embeddings?
Answer: Learned projections allow the model to transform embeddings into "search-friendly" and "content-friendly" spaces.

Q2: Why are Q and K different from V?
Answer: Q and K are optimized for matching (finding relevant positions). V is optimized for content (what to return).

Q3: Can Q, K, V have different dimensions?
Answer: Q and K must match (for dot product). V can differ, but typically d_q = d_k = d_v = d_model / num_heads.

2. Query, Key, Value Mechanism

🔑 Interactive: How Attention Computes

Step 1: Input Embeddings

Each word is represented as a vector (e.g., 512 dimensions).

"cat" → [0.2, -0.5, 0.8, ..., 0.1] (512 values)

Multi-Head Attention: Ensemble of Perspectives

🎭 Why Multiple Attention Heads?

A single attention head might miss nuances. Multiple heads let the model attend to different types of relationships simultaneously—syntax, semantics, position, coreference.

🔬 What Different Heads Learn (Empirical Observations)

Head Type 1: Syntactic

Focus: Grammatical relationships

Examples:

• Subject → Verb connections
• Adjective → Noun modifications
• Preposition → Object dependencies

"The cat sat" (subject-verb)

Head Type 2: Semantic

Focus: Meaning and context

Examples:

• Words with similar meanings
• Contextual word sense
• Thematic relationships

"bank" → "river" vs "money" context

Head Type 3: Positional

Focus: Word order patterns

Examples:

• Attends to adjacent words
• Local n-gram patterns
• Relative position awareness

Word i → words i±1, i±2

Head Type 4: Long-Range

Focus: Distant dependencies

Examples:

• Coreference resolution
• Discourse coherence
• Sentence-level structure

"John ... he" (pronoun → antecedent)

🧮 The Mathematics: Splitting and Concatenating

Input Dimension Splitting:

d_model = 512 (original Transformer base size; BERT-base and GPT-2 small use 768)
h = 8 heads
d_k = d_model / h = 512 / 8 = 64 per head

Each head operates on 64-dimensional space (much smaller!). This allows 8 different "views" of the data.

Parallel Computation:

head_i = Attention(Q_i, K_i, V_i)

Each head computes its own Q, K, V projections and attention independently.

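All heads run in parallel on the GPU, and because each head works in a smaller 64-dimensional space, total compute is similar to a single full-width head.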

Concatenation and Final Projection:

MultiHead(Q,K,V) = Concat(head_1, ..., head_h) × W_O

• Concatenate all 8 heads: [batch, seq, 64×8] = [batch, seq, 512]
• Final linear projection W_O mixes information from all heads
• Output dimension = d_model (512), same as input
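The shape bookkeeping for splitting and concatenating can be sketched as follows. This example deliberately leaves out the per-head projections, the attention computation itself, and the final W_O projection; `split_heads` and `concat_heads` are hypothetical helper names used only to show how 512 dimensions become 8 chunks of 64 and back.

```python
def split_heads(x, num_heads):
    """Split each token vector of size d_model into num_heads chunks of size d_k."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    # result[h][t] is the slice of token t that head h works on
    return [[row[h * d_k:(h + 1) * d_k] for row in x] for h in range(num_heads)]

def concat_heads(heads):
    """Join per-head vectors back into d_model-sized vectors, one per token."""
    seq_len = len(heads[0])
    return [[value for head in heads for value in head[t]] for t in range(seq_len)]

# Hypothetical input: 3 tokens with d_model = 512
x = [[float(i) for i in range(512)] for _ in range(3)]
heads = split_heads(x, 8)       # 8 heads, each [3, 64]
restored = concat_heads(heads)  # back to [3, 512]
print(len(heads), len(heads[0]), len(heads[0][0]), len(restored[0]))
```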

💡 Empirical Findings from Research

More heads = better? Not always. BERT uses 12-16, GPT-3 uses 96. The sweet spot is task-dependent, and too many heads can be redundant.

Head pruning: Studies show 20-40% of heads can be removed with minimal performance loss. Many heads learn similar patterns.

Interpretability: Heads in different layers tend to specialize. Roughly: early layers capture syntax, middle layers semantics, and late layers task-specific patterns.

3. Multi-Head Attention

🎭 Interactive: Multiple Attention Heads

Number of Heads: 4

Head 1 Focus:

🔍 Syntactic: Subject-verb relationships, grammatical structure

Dimension per head:

128

💡 Why Multiple Heads? Each head learns different types of relationships. They're concatenated and projected to form the final output.

Positional Encoding: Injecting Word Order

📍 The Problem: Attention is Position-Agnostic

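Self-attention on its own has no notion of order: shuffling the input tokens only shuffles the outputs in the same way (strictly speaking, it is permutation-equivariant), so "cat sat mat" and "mat sat cat" yield the same per-word representations. Positional encodings add unique patterns for each position so the model can distinguish word order.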

🧠 Why Sinusoidal Functions?

Design Requirements:

Uniqueness: Each position gets a different encoding

Bounded: Values stay in reasonable range (not exploding)

Deterministic: Same position = same encoding every time

Generalizable: Can extrapolate to longer sequences than training

Relative positioning: Model can learn that "k steps apart" is consistent

🔢 The Formula Explained

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Variables:

pos: Position in sequence (0, 1, 2, ...)

i: Dimension-pair index (0 to d_model/2 − 1)

d_model: Embedding dimension (512 typical)

2i, 2i+1: Alternate sine and cosine

Why 10000?

• Creates wavelengths from 2π to 10000·2π

• Low dimensions: high frequency (local patterns)

• High dimensions: low frequency (global patterns)

• Balances short and long-range position info

Intuition:

Think of it like a binary counter in continuous space. Each dimension oscillates at a different frequency.

Position 0:

Dim 0: sin(0) = 0, Dim 1: cos(0) = 1, Dim 2: sin(0) = 0, ...

Position 100:

Dim 0: sin(100) = ..., Dim 1: cos(100) = ..., different pattern!
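The formula translates almost directly into code. The sketch below assumes an even d_model and uses a hypothetical function name, `positional_encoding`; it returns the encoding vector for a single position.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position; assumes d_model is even."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i+1
    return pe

pe0 = positional_encoding(0, 8)      # alternating 0.0 and 1.0
pe100 = positional_encoding(100, 8)  # a different pattern, every value within [-1, 1]
print(pe0)
print(pe100)
```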

✓ Sinusoidal Advantages

• No learnable parameters (saves memory)

• Can be computed for any sequence length at inference (quality on lengths far beyond training is not guaranteed)

• Linear combinations give relative positions

• PE(pos+k) can be expressed as linear function of PE(pos)
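The last point follows from the angle-addition identities: for each sine/cosine pair, moving from pos to pos+k is a rotation whose coefficients depend only on k. The snippet below checks this numerically for one pair. The function names and the chosen values (pos = 17, k = 5, pair index 1, d_model = 64) are arbitrary and only for illustration.

```python
import math

def pe_pair(pos, i, d_model):
    """The (sin, cos) pair at dimensions 2i and 2i+1 for a given position."""
    omega = 1.0 / (10000 ** (2 * i / d_model))
    return math.sin(omega * pos), math.cos(omega * pos)

def shift_pair(pair, k, i, d_model):
    """Linear map that depends only on the offset k, not on the position."""
    omega = 1.0 / (10000 ** (2 * i / d_model))
    s, c = pair
    # sin(a + b) = sin(a)cos(b) + cos(a)sin(b)
    # cos(a + b) = cos(a)cos(b) - sin(a)sin(b)
    return (s * math.cos(omega * k) + c * math.sin(omega * k),
            c * math.cos(omega * k) - s * math.sin(omega * k))

direct = pe_pair(17 + 5, 1, 64)                     # PE(pos + k) computed directly
shifted = shift_pair(pe_pair(17, 1, 64), 5, 1, 64)  # PE(pos) pushed through the linear map
print(direct, shifted)  # the two pairs agree up to floating-point error
```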

🔄 Alternatives Used in Practice

Learned PE: BERT uses learned embeddings (better for fixed max length)

Relative PE: T5, XLNet encode relative distances directly

RoPE: Rotary PE in modern models (LLaMA, GPT-NeoX)

ALiBi: Attention with Linear Biases (no explicit PE)

4. Positional Encoding

📍 Interactive: Adding Position Information

Since transformers process all positions simultaneously, we add positional encodings to preserve word order.

Sequence Length: 8

Embedding Dimension: 64

Sinusoidal Pattern Visualization

Each column = position, each row = dimension

Formula

PE(pos, 2i) = sin(pos/10000^(2i/d))

PE(pos, 2i+1) = cos(pos/10000^(2i/d))

Purpose

Unique pattern for each position that the model can learn to use

Encoder vs Decoder: Architectural Paradigms

🔀 Three Architectural Families

📖

Encoder-Only

Purpose: Understanding and representation

Attention: Bidirectional (sees entire input)

Examples: BERT, RoBERTa, ALBERT

Best For:

• Text classification

• Named entity recognition

• Question answering

• Semantic search

✍️

Decoder-Only

Purpose: Generation and prediction

Attention: Causal (can't see future)

Examples: GPT-3, GPT-4, LLaMA

Best For:

• Text generation

• Creative writing

• Code completion

• Conversational AI

🔄

Encoder-Decoder

Purpose: Sequence-to-sequence mapping

Attention: Both + cross-attention

Examples: T5, BART, mT5

Best For:

• Machine translation

• Summarization

• Paraphrasing

• Text-to-SQL

🔍 Encoder Architecture Deep Dive

Bidirectional Self-Attention:

Each token can attend to all other tokens in the input sequence (past and future).

Input: "The cat sat on the mat"
Token "cat" attends to: [The, cat, sat, on, the, mat] ← sees everything

Training Strategy:

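• Masked Language Modeling (MLM): Select about 15% of tokens and predict the originals; in BERT, most of the selected tokens are replaced with [MASK] (80%), and the rest are swapped for a random token (10%) or left unchanged (10%)

Example: "The [MASK] sat on the mat" → predict "cat"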

• Forces model to build contextual representations

Advantages:

✓ Rich bidirectional context

✓ Better for understanding tasks

✓ Smaller models can achieve high accuracy

✓ Fine-tunes efficiently on downstream tasks

✨ Decoder Architecture Deep Dive

Causal (Masked) Self-Attention:

Each token can only attend to previous tokens (and itself), not future tokens.

Input: "The cat sat on the mat"
Token "cat" attends to: [The, cat] ← can't see "sat", "on", "the", "mat"

Why? To prevent information leakage during autoregressive generation.
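A causal mask can be sketched as a lower-triangular matrix of booleans. The example below is a simplified illustration with hypothetical helper names; real implementations usually add a large negative value to the masked scores before the softmax, whereas this version zeroes the masked weights directly, which gives the same result.

```python
import math

def causal_mask(seq_len):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which disallowed positions receive zero weight."""
    result = []
    for score_row, mask_row in zip(scores, mask):
        allowed = [s for s, m in zip(score_row, mask_row) if m]
        peak = max(allowed)
        exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(score_row, mask_row)]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result

tokens = ["The", "cat", "sat", "on", "the", "mat"]
mask = causal_mask(len(tokens))
visible_to_cat = [t for t, m in zip(tokens, mask[1]) if m]

# With all-zero scores, each row spreads its weight evenly over the visible tokens
weights = masked_softmax([[0.0] * 6 for _ in range(6)], mask)
print(visible_to_cat)
print(weights[1])
```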

Training Strategy:

• Next Token Prediction: Given tokens 1..n, predict token n+1

Example: "The cat" → predict "sat"

• Trained on massive text corpora (roughly 300 billion tokens for GPT-3)

Advantages:

✓ Natural for generation tasks

✓ Scales to very large models (175B+ params)

✓ Few-shot learning via prompting

✓ Flexible zero-shot generalization

🔗 Cross-Attention: The Bridge Between Encoder & Decoder

In encoder-decoder models, the decoder has a third attention sublayer called cross-attention that connects encoder outputs to the decoder.

Cross-Attention Mechanics:

1. Query (Q): Comes from decoder's previous layer

2. Key (K) & Value (V): Come from encoder's final output

CrossAttn(Q_decoder, K_encoder, V_encoder)

This allows each decoder position to attend over all encoder positions (entire input sequence).
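The only change from self-attention is where the inputs come from, which a small sketch can make concrete. The function name `cross_attention` and the toy matrices are assumptions for illustration; here two decoder positions attend over three encoder positions.

```python
import math

def cross_attention(Q_decoder, K_encoder, V_encoder):
    """Each decoder query attends over all encoder keys and values."""
    d_k = len(K_encoder[0])
    d_v = len(V_encoder[0])
    output = []
    for q in Q_decoder:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K_encoder]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        output.append([sum(w * v[j] for w, v in zip(weights, V_encoder)) for j in range(d_v)])
    return output

# Hypothetical: 2 decoder positions, 3 encoder positions, d_k = d_v = 2
Q_decoder = [[1.0, 0.0], [0.0, 1.0]]
K_encoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V_encoder = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = cross_attention(Q_decoder, K_encoder, V_encoder)
print(len(out), len(out[0]))  # one output row per decoder position
```

The output length follows the decoder sequence, while the attention weights span the encoder sequence.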

Example: Translation

English (Encoder):

"The cat sits"

French (Decoder generating):

"Le chat" → next word?

Cross-attention lets the decoder position at "chat" look back at the entire English sentence to decide that the next French word is "s'assoit".

When to Use Each:

Classification: Encoder-only (BERT)

Generation: Decoder-only (GPT)

Translation/Summary: Encoder-Decoder (T5)

Modern trend: Decoder-only for everything with prompting!

5. Encoder vs Decoder Architecture

🏗️ Interactive: Full Transformer Structure

📥 Encoder (Understanding)

1. Multi-Head Self-Attention

Bidirectional - sees entire input

↓

2. Add & Normalize

Residual connection + Layer norm

↓

3. Feed-Forward Network

2 linear layers with activation

↓

4. Add & Normalize

Another residual + norm

Used for: BERT, encoding context

📤 Decoder (Generation)

1. Masked Self-Attention

Causal - can't see future tokens

↓

2. Cross-Attention (if encoder exists)

Attends to encoder outputs

↓

3. Feed-Forward Network

Same as encoder FFN

↓

4. Output Projection

Linear + softmax over vocabulary

Used for: GPT, autoregressive generation

6. Layer Normalization

📊 Interactive: Normalize Activations

[Interactive demo: four slider values (Value 1 to Value 4, here 2.00, 4.00, 6.00, 8.00) shown before and after normalization.]
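Layer normalization rescales the features of one token to zero mean and unit variance, then applies a learned scale and shift. The sketch below is a simplified version: `gamma` and `beta` are scalars here (per-feature vectors in real models), `eps` is an assumed small constant for numerical safety, and the inputs 2, 4, 6, 8 are illustrative.

```python
import math

def layer_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one vector to zero mean and unit variance, then scale and shift."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta for v in values]

normalized = layer_norm([2.0, 4.0, 6.0, 8.0])
print([round(v, 2) for v in normalized])  # symmetric around 0
```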

7. Feed-Forward Network

🧮 Interactive: Position-wise FFN

Input Value: 0.50

FFN Output

0.75

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

ReLU (Rectified Linear)

Simple: outputs 0 for negative, x for positive. Fast and effective.

GELU (Gaussian Error Linear)

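Smooth alternative to ReLU that weights each input by the Gaussian CDF: GELU(x) = x·Φ(x). Used in BERT and GPT; gradients stay nonzero for small negative inputs.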
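A position-wise FFN for a single token vector can be sketched as below, with the activation passed in so ReLU and GELU can be swapped. The weights are hypothetical (d_model = 2, hidden size = 3), and `gelu` uses the exact erf-based definition rather than the tanh approximation some libraries use.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def ffn(x, W1, b1, W2, b2, activation=relu):
    """FFN(x) = activation(x W1 + b1) W2 + b2 for one token vector x."""
    hidden = [activation(sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

# Hypothetical weights: d_model = 2, hidden size = 3
W1 = [[1.0, -1.0, 0.5], [0.0, 1.0, 0.5]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]

out = ffn([1.0, 2.0], W1, b1, W2, b2)                        # ReLU version
out_gelu = ffn([1.0, 2.0], W1, b1, W2, b2, activation=gelu)  # GELU version
print(out, out_gelu)
```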

8. Attention Masking

🎭 Interactive: Masking Patterns

Attention Matrix Visualization

Rows = queries, Columns = keys

✓ Full Attention (Encoder)

Every position can attend to all positions. Used in BERT for bidirectional understanding.

9. Transformer Applications

🌍 Interactive: Famous Models

📝

GPT (Decoder-only)

Decoder-only with causal masking

Autoregressive language model. Predicts next token given previous context.

Use Cases:

Text generation, chat, code completion

10. Model Size Calculator

📐 Interactive: Estimate Parameters

Layers: 12

Hidden Dimension: 768

Vocabulary Size: 50,000

Estimated Parameters

123.3M

Attention

28.3M

Feed-Forward

56.6M

Embeddings

38.4M

💡 Reference: GPT-2 (117M), BERT-base (110M), GPT-3 (175B), GPT-4 (size not publicly disclosed; ~1.7T is an unconfirmed estimate)
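The calculator's numbers can be reproduced with a rough formula. The sketch below is an assumption about how such an estimate is commonly made: four d×d attention matrices per layer, a feed-forward block with inner size 4×d, and a single token embedding table. It ignores biases, layer norm parameters, and positional embeddings.

```python
def estimate_parameters(layers, hidden, vocab):
    """Rough parameter count; ignores biases, layer norms, and positional embeddings."""
    attention = 4 * hidden * hidden * layers             # W_Q, W_K, W_V, W_O per layer
    feed_forward = 2 * hidden * (4 * hidden) * layers    # two matrices, inner size 4 * hidden
    embeddings = vocab * hidden                          # token embedding table
    return {
        "attention": attention,
        "feed_forward": feed_forward,
        "embeddings": embeddings,
        "total": attention + feed_forward + embeddings,
    }

est = estimate_parameters(layers=12, hidden=768, vocab=50000)
for name, count in est.items():
    print(f"{name}: {count / 1e6:.1f}M")
```

With 12 layers, hidden dimension 768, and a 50,000-token vocabulary, this matches the figures shown above (28.3M, 56.6M, 38.4M, 123.3M).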

🎯 Key Takeaways

🎯

Self-Attention is Key

Transformers replace recurrence with attention. Each position attends to all others simultaneously, enabling parallelization and long-range dependencies.

🎭

Multi-Head Attention

Multiple attention heads learn different relationships: syntax, semantics, position, long-range. They're concatenated for rich representations.

📍

Positional Encoding

Since attention by itself ignores token order, positional encodings inject word-order information, for example with sinusoidal patterns the model can learn to use.

🏗️

Encoder-Decoder Structure

Encoder for understanding (BERT), decoder for generation (GPT), or both for translation (T5). Each serves different purposes.

🎭

Masking Mechanisms

Causal masking for autoregressive generation (GPT), padding masking for variable lengths, no masking for bidirectional understanding (BERT).

⚡

Scalability

Transformers scale well: from 110M parameters (BERT-base) to 175B (GPT-3) and beyond. Empirically, more compute and data have tended to yield better performance.