🎯 Multi-Head Attention
Understanding the core mechanism powering modern transformers
Introduction to Attention
🎯 What is Attention?
Attention is a mechanism that allows models to focus on relevant parts of the input when processing each element. Instead of treating all inputs equally, attention weighs their importance dynamically.
When reading "The cat sat on the mat," we naturally pay more attention to "cat" and "mat" when understanding "sat." Attention mechanisms replicate this in neural networks.
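The standard way to compute these weights is scaled dot-product attention: each query is compared against every key, the similarities are normalized with a softmax, and the result weights the values. A minimal NumPy sketch (function and variable names here are illustrative, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of every query to every key
    # Numerically stable softmax over each row of scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

# Toy self-attention: 3 tokens, embedding dimension 4, Q = K = V = x
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(x, x, x)
```

Each row of `w` is a probability distribution saying how much that token attends to every other token, which is exactly the dynamic weighting described above.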
🔄 Why Multi-Head?
A single attention operation can only capture one kind of relationship at a time. Multi-head attention runs several attention operations in parallel, each free to learn a different pattern:
Head 1: Syntax
Learns grammatical relationships like subject-verb agreement
Head 2: Semantics
Captures meaning and context between related concepts
Head 3: Long-range
Connects distant words that reference each other
Head 4: Local
Focuses on adjacent words and immediate context
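Mechanically, "multiple heads" means splitting the model dimension into smaller per-head subspaces, attending in each subspace independently, then concatenating the results and mixing them with an output projection. A minimal NumPy sketch, assuming learned weight matrices `W_q`, `W_k`, `W_v`, `W_o` (random here for illustration):

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Minimal multi-head self-attention over a (seq_len, d_model) input."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ W_q, x @ W_k, x @ W_v

    # Split (seq_len, d_model) into (n_heads, seq_len, d_head)
    def split(t):
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                     # softmax per head
    heads = w @ Vh                                         # (n_heads, seq_len, d_head)

    # Concatenate heads back to (seq_len, d_model), then mix with W_o
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(1)
d_model, n_heads, seq_len = 8, 2, 5
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
y = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
```

Because each head works in its own `d_head`-dimensional subspace, the total cost is comparable to one full-width attention, yet each head can specialize in a different relationship like the four described above.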
📊 Historical Context
2014: Bahdanau et al. introduce attention for seq2seq models
2017: Vaswani et al. introduce multi-head self-attention in the transformer
2018 onward: BERT, GPT, T5, and modern LLMs all build on multi-head attention
✨ Key Advantages
Parallelization: all positions are processed simultaneously, enabling efficient training
Long-range dependencies: direct connections between any two positions, regardless of distance
Interpretability: attention weights reveal what the model focuses on
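The last point can be made concrete: because each row of the attention-weight matrix is a probability distribution over input positions, it can be inspected directly. A toy sketch using random, untrained embeddings for the example sentence from earlier, so the specific weights are meaningless and only the inspection pattern matters:

```python
import numpy as np

# Hypothetical embeddings for "The cat sat on the mat"; in a real model
# these would come from a trained embedding layer.
rng = np.random.default_rng(42)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
x = rng.normal(size=(len(tokens), 4))

# Self-attention weights: softmax(x x^T / sqrt(d))
scores = x @ x.T / np.sqrt(x.shape[-1])
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w = w / w.sum(axis=-1, keepdims=True)

# Each row is a distribution over positions; argmax shows the strongest link
for tok, row in zip(tokens, w):
    print(f"{tok:>4} attends most to {tokens[int(row.argmax())]!r}")
```

In a trained model, inspecting `w` this way is what lets you ask questions like "when processing 'sat', which words does the model look at?"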