📊 Layer Normalization
Stabilize training and improve transformer performance
Why Layer Normalization?
🎯 The Problem
Deep neural networks suffer from internal covariate shift – layer input distributions change during training, slowing convergence. Normalization techniques stabilize activations, enabling faster training and deeper architectures.
💡
Key Insight
Layer normalization normalizes across features (per sample), while batch normalization normalizes across the batch dimension. This makes LayerNorm ideal for transformers and RNNs.
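The axis distinction can be made concrete with a small NumPy sketch (illustrative only, not a framework API): both functions apply the same mean/variance normalization, differing only in the axis they reduce over.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each sample across its feature dimension (last axis).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    # Normalize each feature across the batch dimension (first axis).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(4, 8)  # (batch, features)
ln = layer_norm(x)  # each row now has ~zero mean, unit variance
bn = batch_norm(x)  # each column now has ~zero mean, unit variance
```

Because LayerNorm's statistics depend only on a single sample, it behaves identically at batch size 1 and at inference time, which is why it suits autoregressive transformers and variable-length RNN sequences.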
⚠️ Without Normalization
- Vanishing/exploding gradients – activations grow unbounded or shrink to zero
- Slow convergence – requires careful learning rate tuning and initialization
- Training instability – loss spikes and divergence common in deep networks
- Limited depth – difficult to train networks beyond 10-20 layers effectively
✅ With Layer Normalization
- Stable gradients – normalized activations keep gradients in a healthy range
- Faster training – 2-3x speedups are common, and higher learning rates become viable
- Reduced sensitivity – less dependent on initialization and hyperparameters
- Enables depth – transformers with 100+ layers train successfully
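A full LayerNorm also includes a learnable scale (gamma) and shift (beta) so the network can undo the normalization where useful: y = gamma * (x - mean) / sqrt(var + eps) + beta. A minimal sketch (class and parameter names are illustrative, not a library API):

```python
import numpy as np

class LayerNorm:
    """y = gamma * (x - mean) / sqrt(var + eps) + beta, over the last axis."""

    def __init__(self, dim, eps=1e-5):
        self.gamma = np.ones(dim)   # learnable scale, initialized to 1
        self.beta = np.zeros(dim)   # learnable shift, initialized to 0
        self.eps = eps              # avoids division by zero for tiny variances

    def __call__(self, x):
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta

norm = LayerNorm(8)
y = norm(np.random.randn(4, 8))  # each row ~zero mean, ~unit variance
```

At initialization (gamma=1, beta=0) the output is exactly the normalized input; during training these parameters let each feature recover its own scale and offset.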
🏗️ Where LayerNorm Appears
🔄
Transformers
After multi-head attention and feedforward layers in every block
📝
RNNs/LSTMs
Within recurrent cells to stabilize hidden state evolution
🎨
GANs
Generator and discriminator networks for training stability
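The transformer placement described above (normalize after each sublayer's residual add, the original "post-LN" arrangement) can be sketched as follows; the toy `attn` and `ffn` lambdas are stand-ins for real attention and feedforward sublayers:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature (last) axis, per position.
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def transformer_block(x, attn, ffn):
    # Post-LN (original Transformer): residual add, then normalize.
    x = layer_norm(x + attn(x))
    x = layer_norm(x + ffn(x))
    return x

# (batch, sequence, features) input with placeholder sublayers:
out = transformer_block(np.random.randn(2, 4, 8),
                        attn=lambda x: 0.1 * x,
                        ffn=lambda x: np.tanh(x))
```

Many recent models instead use the "pre-LN" variant, `x + attn(layer_norm(x))`, which tends to train more stably at large depth; the key point either way is one LayerNorm per sublayer in every block.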