Home/Concepts/Artificial Intelligence/Machine Learning Optimization

Machine Learning Optimization

Explore gradient descent variants and hyperparameter tuning

⏱️ 22 min20 interactions

What is ML Optimization?

Machine Learning Optimization is the art and science of training models efficiently. It's about finding the best parameters (weights) that minimize your loss function, using clever algorithms and hyperparameter tuning.

💡 The Core Goal

🎯
Minimize Loss
Find weights that make predictions close to actual values
Train Efficiently
Converge quickly without wasting compute resources
🎚️
Tune Hyperparameters
Learning rate, batch size, momentum - get them right!

Gradient Descent: The Optimization Workhorse

📐 The Mathematics of Learning

Why Gradients Point Uphill

The optimization problem: Find parameters θ that minimize loss J(θ)
minθ J(θ) = minθ (1/n) Σ L(f(xi; θ), yi)
Gradient ∇J(θ): Vector of partial derivatives - shows direction of steepest ascent
∇J(θ) = [∂J/∂θ₁, ∂J/∂θ₂, ..., ∂J/∂θn]
Key insight: To minimize (go downhill), move opposite to gradient: θ = θ - η∇J(θ)

🔄 Three Flavors of Gradient Descent

1. Batch Gradient Descent (BGD)
θ = θ - η × (1/n) Σi=1..n ∇L(f(xi; θ), yi)
Uses: Entire dataset for each update
✓ Accurate gradient, stable convergence, guaranteed convergence for convex functions
✗ Slow for large datasets, requires entire dataset in memory, stuck in local minima
Example: 1M samples → compute 1M gradients before single update!
2. Stochastic Gradient Descent (SGD)
θ = θ - η × ∇L(f(xi; θ), yi) // single random sample
Uses: One random sample per update
✓ Fast updates, can escape local minima (noisy), online learning possible
✗ Noisy gradients, never truly converges (oscillates), hard to parallelize
Example: 1M samples → 1M updates per epoch (vs 1 for batch)
3. Mini-Batch Gradient Descent (Standard)
θ = θ - η × (1/B) Σi∈batch ∇L(f(xi; θ), yi)
Uses: Small batch (typically 32-256 samples)
✓ Best of both worlds: faster than batch, less noisy than SGD
✓ Vectorization on GPUs, efficient memory usage, some exploration
🎯 This is the standard in deep learning!

🎯 Convergence & Stopping Criteria

When to Stop?
1. Gradient norm is small: ||∇J(θ)|| < ε (e.g., ε = 10⁻⁵)
→ We're at (local) minimum, gradient ≈ 0
2. Loss change is tiny: |Jt - Jt-1| < ε
→ Not improving anymore
3. Max iterations: Reached predefined epoch limit
→ Time/compute budget exhausted
4. Validation loss increases: Overfitting detected
→ Early stopping (best practice!)
Convergence Rate
Convex functions: O(1/k) convergence for GD, O(1/√k) for SGD
Non-convex (neural nets): No global optimum guarantees
Practical: Loss decreases exponentially at first, then plateaus
Typical curve: J(t) = J* + (J₀ - J*)e-λt

⚡ Gradient Computation in Practice

Backpropagation: Efficient algorithm to compute ∇J(θ) in O(n) time
Chain Rule Application:
∂J/∂wl = ∂J/∂zl × ∂zl/∂wl × ... (chain through layers)
Forward pass: compute outputs and cache activations
Backward pass: compute gradients layer-by-layer from output to input
Automatic differentiation: PyTorch, TensorFlow handle this automatically
loss.backward() # Magic! Computes all gradients
optimizer.step() # Updates all parameters

🎯 Learning Rate Impact

Too small: Slow convergence, stuck in plateaus
Too large: Divergence, oscillation around minimum
Just right: Fast, stable convergence

📊 Batch Size Impact

Small (1-32): Noisy, explores more, generalizes better
Large (128-512): Stable, faster per epoch, may overfit
Sweet spot: 32-128 for most tasks

💡 Key Insight

Gradient descent is iterative refinement: each step moves slightly toward better parameters. The "descent" comes from following negative gradient!

1. Gradient Descent: The Foundation

⛰️ Interactive: Descend the Loss Surface

Position: 5.00Loss: 25.00Steps: 0

📐 Formula: New Weight = Old Weight - Learning Rate × Gradient. The gradient points uphill, so we subtract to go downhill!

2. Learning Rate: The Most Important Hyperparameter

🎚️ Interactive: Compare Learning Rates

Convergence Speed

Speed80%
Stability85%

Perfect balance - fast and stable convergence

3. Batch Size: Speed vs Accuracy Trade-off

📦 Interactive: Adjust Batch Size

1 (SGD)32 (Mini-batch)256 (Large)
Convergence
50%
Memory Usage
12%
Updates/Epoch
312
Small Batch (1-32)
Noisy gradients, better generalization, slower convergence
Large Batch (128-512)
Stable gradients, faster training, may overfit, needs more memory

Advanced Optimizers: Beyond Vanilla SGD

🚀 Why We Need Better Optimizers

SGD's Limitations

Problem 1 - Same learning rate for all parameters: Some dimensions need large steps, others small
Problem 2 - Stuck in ravines: Oscillates across narrow valleys while making slow progress along the valley
Problem 3 - Local minima & saddle points: Gets trapped, especially in high-dimensional spaces
Problem 4 - Learning rate sensitivity: Must carefully tune η or training fails
Solution: Adaptive optimizers that adjust learning rates per-parameter based on history

🔄 Momentum: Adding Velocity

Mathematical Formulation
vt = β vt-1 + ∇J(θt) // accumulate velocity
θt+1 = θt - η vt // update with velocity
β (momentum coefficient): Typically 0.9 (retains 90% of previous velocity)
Exponentially weighted average: Recent gradients weighted more, but history matters
Physical Intuition: Ball Rolling Down Hill
Without momentum: Ball stops at each step, chooses new direction
With momentum: Ball builds speed, continues rolling even uphill (inertia)
Result: Escapes local minima, dampens oscillations in ravines, accelerates in consistent directions
Nesterov Momentum (NAG) - Look Ahead
θlookahead = θt - β vt-1 // anticipate future position
vt = β vt-1 + ∇J(θlookahead) // compute gradient there
θt+1 = θt - η vt
Smarter: evaluates gradient at where we're about to be, not where we are!

📊 Adaptive Learning Rates

AdaGrad (2011) - Per-Parameter Adaptation
Gt = Gt-1 + gt ⊙ gt // accumulate squared gradients
θt+1 = θt - (η / √(Gt + ε)) ⊙ gt
Key idea: Smaller updates for frequently updated parameters, larger for rare ones
✓ Good for sparse data (NLP, embeddings)
✗ Learning rate decays to zero too quickly (Gt grows monotonically)
RMSprop (2012) - Exponential Moving Average
E[g²]t = β E[g²]t-1 + (1-β) gt² // decay old gradients
θt+1 = θt - (η / √(E[g²]t + ε)) gt
Fix for AdaGrad: Uses exponential moving average instead of sum
β typically 0.9: Recent gradients matter more
✓ Doesn't decay to zero, works well for RNNs, non-stationary problems
Hinton introduced in Coursera lecture (unpublished!)
Adam (2014) - Combining Momentum + RMSprop
mt = β₁ mt-1 + (1-β₁) gt // 1st moment (mean)
vt = β₂ vt-1 + (1-β₂) gt² // 2nd moment (variance)
t = mt / (1-β₁ᵗ) // bias correction
t = vt / (1-β₂ᵗ) // bias correction
θt+1 = θt - η (m̂t / (√v̂t + ε))
mt: Momentum-like, smooths gradient direction
vt: RMSprop-like, adapts learning rate per parameter
Bias correction: Accounts for initialization at zero (important early training!)
Default hyperparameters: β₁=0.9, β₂=0.999, η=0.001, ε=10⁻⁸
✓ Robust, fast convergence, works well with default params, most popular optimizer
✗ Can generalize slightly worse than SGD+momentum with extensive tuning

⚖️ Optimizer Comparison

OptimizerSpeedFinal AccTuningBest For
SGDSlowBest*HardConvNets with time
SGD+MomMediumExcellentMediumComputer Vision
RMSpropFastGoodEasyRNNs, non-stationary
AdamFastVery GoodEasiestDefault choice, NLP
AdamWFastBestEasyTransformers, SOTA
*SGD+momentum can achieve slightly better generalization than Adam with extensive tuning, but Adam is more robust to hyperparameters

🎯 When to Use What?

Adam: Default choice, works 90% of time
SGD+Mom: Best final accuracy with tuning (CV)
AdamW: Transformers, modern best practice

⚡ Modern Variants

AdamW: Decoupled weight decay
RAdam: Rectified Adam, warm-up free
Ranger: RAdam + LookAhead

💡 Key Insight

Adam = Momentum (direction) + RMSprop (step size). It adapts both where to go and how far to step for each parameter!

4. Optimizer Algorithms

🚀 Interactive: Compare Optimizers

🚀

Adam

Adaptive Moment Estimation - combines momentum + RMSprop

Update Rule:
m = β₁m + (1-β₁)g; v = β₂v + (1-β₂)g²
✓ Advantages
Fast convergence, robust, default choice
✗ Limitations
Can generalize worse than SGD sometimes

5. Momentum: Accelerating Convergence

💨 Interactive: Feel the Momentum

No momentumTypical: 0.9High momentum

Effect Visualization

Velocity Accumulation
0.0
🚀 Fast - strong momentum, typical choice
How It Works
Momentum accumulates past gradients like a ball rolling down a hill, building velocity and smoothing out noisy updates.
Why Use It?
Escapes local minima, dampens oscillations, accelerates convergence in relevant directions.
Typical Value
β = 0.9 is most common (90% of past velocity retained)

6. Loss Landscape Navigation

🗺️ Interactive: Explore 2D Loss Surface

Current Loss
0.000
🎯 Near optimal!
🟢 Optimal point (0, 0) • 🔵 Your position

7. Early Stopping: Preventing Overfitting

⏱️ Interactive: Training with Patience

Epoch: 0/50Val Loss: 0.572

💡 Early Stopping: Stop training when validation loss doesn't improve for 5 consecutive epochs. Prevents overfitting and saves compute!

Learning Rate Schedules: Dynamic Adaptation

📉 Why Decay the Learning Rate?

The Training Journey

Early Training: Large Steps
Far from optimum: Loss is high, gradients are large
Need speed: Large learning rate to make rapid progress
Risk: Some instability okay, exploring loss landscape
Late Training: Fine-tuning
Near optimum: Loss plateauing, in basin of attraction
Need precision: Small learning rate to settle into minimum
Risk: Large LR causes oscillation around minimum, never converges!
Analogy: Like driving - speed up on highway (early), slow down near destination (late)

📊 Common Decay Schedules

1. Constant Learning Rate (Baseline)
ηt = η₀ // same throughout training
✓ Simple, no hyperparameters, good for short training
✗ Suboptimal for long training, oscillates near minimum
Use when: Training ≤ 10 epochs, or with Adam (has adaptive rates)
2. Step Decay (Staircase)
ηt = η₀ × γ⌊epoch/drop_every⌋
Example: η₀=0.1, γ=0.5, drop_every=30
Epochs 0-29: η=0.1
Epochs 30-59: η=0.05
Epochs 60-89: η=0.025
✓ Simple, interpretable, works well in practice
✗ Requires knowing when to drop (domain knowledge)
Classic choice: Divide by 10 every 30-50 epochs
3. Exponential Decay
ηt = η₀ × e-λt or ηt = η₀ × γt
Smooth decay: Gradual reduction every epoch
Typical γ: 0.95-0.99 (smaller = faster decay)
✓ Smooth, no sudden changes, mathematically elegant
✗ Decays too fast or too slow if λ not tuned
Use when: Want gradual refinement, have budget for tuning λ
4. Polynomial Decay
ηt = η₀ × (1 - t/T)power
T: Total epochs
power: Typically 1.0 (linear) or 2.0 (quadratic)
✓ Reaches near-zero at end of training, controlled decay rate
Use when: You know total training time in advance
5. Cosine Annealing
ηt = ηmin + (η₀ - ηmin)/2 × (1 + cos(πt/T))
Smooth decay: Follows cosine curve from η₀ to ηmin
Variant - With Restarts (SGDR):
Periodically reset to η₀ and decay again (escape local minima!)
✓ Smooth, widely used, restarts help exploration
✓ SOTA in many tasks (ResNet, Transformers)
Popular in modern deep learning (2017+)

🔥 Warmup: Starting Carefully

Problem: Large learning rate at start can cause divergence (especially with Adam, batch norm)
Solution: Linearly increase LR from small value to target over first N epochs
ηt = η₀ × min(1, t/Twarmup) // first Twarmup epochs
Example: Transformer Training
1. Warmup: Linearly increase LR for 4K steps (0 → 0.001)
2. Decay: Inverse square root decay after warmup
ηt = dmodel-0.5 × min(t-0.5, t × warmup_steps-1.5)
BERT, GPT use warmup! Typical: 1-10% of total steps

🔄 Cyclical Learning Rates

Idea: Vary LR between min and max in cycles (triangle, cosine waves)
ηt cycles between [ηmin, ηmax] every C epochs
Benefits:
Escape saddle points: High LR periods help jump out
Faster convergence: Can traverse loss landscape efficiently
Better generalization: Regularization effect from oscillation
Less common than monotonic decay, but powerful tool (Leslie Smith, 2017)

🎯 Practical Recommendations

For Computer Vision (CNNs)
Optimizer: SGD + momentum
Schedule: Step decay (÷10 at epochs 30, 60, 90)
Initial LR: 0.1
Classic ImageNet recipe
For NLP (Transformers)
Optimizer: Adam or AdamW
Schedule: Warmup + cosine decay
Initial LR: 1e-4 to 5e-4
BERT, GPT standard
Quick Experiment / Baseline
Optimizer: Adam
Schedule: Constant LR
Initial LR: 1e-3
Works 80% of the time
SOTA / Competition
Optimizer: AdamW or SGD+momentum
Schedule: Cosine with warmup + restarts
Initial LR: Grid search [1e-5, 1e-2]
Max performance, more tuning

📉 When to Decay?

Always for long training (>50 epochs)
Optional for Adam with early stopping
Essential for SGD to converge

🎛️ Tuning Priority

1. Initial LR: Most important!
2. Schedule type: Step vs cosine
3. Decay rate/timing: Fine-tune if needed

💡 Key Insight

LR schedules are about exploration vs exploitation: explore early (high LR), exploit late (low LR). Like simulated annealing!

8. Learning Rate Schedules

📉 Interactive: Decay Strategies

constant Schedule

Same LR throughout training

✓ When to Use
Simple tasks, short training
⚙️ Implementation
lr = initial_lr

9. L2 Regularization (Weight Decay)

⚖️ Interactive: Balance Fitting vs Simplicity

No regularization0.01 (typical)Strong penalty
Data Loss
0.500
Reg Loss
0.500
Total Loss
1.000
Formula:
Total Loss = Data Loss + λ × Σ(weights²)

🎯 Effect: Balanced - prevents overfitting while allowing flexibility

10. Hyperparameter Overview

🎛️ Interactive: Essential Hyperparameters

🎯 Key Takeaways

🎚️

Learning Rate is Critical

The most important hyperparameter. Too high = divergence, too low = slow training. Start with 0.001 for Adam, 0.01 for SGD. Use learning rate schedules for long training.

🚀

Adam is Usually Best

Adam combines momentum and adaptive learning rates. It's the default choice for most tasks. Use SGD with momentum for better generalization if you have time to tune.

📦

Batch Size Trade-offs

32-128 is typical. Larger batches = faster training, more memory, less noise. Smaller batches = better generalization, slower, less memory. Balance based on your GPU.

⏱️

Early Stopping Prevents Overfitting

Monitor validation loss. Stop when it stops improving for N epochs (patience). Saves compute and prevents overfitting. Keep checkpoints of best model.

⚖️

Regularization for Generalization

L2 (weight decay), dropout, and data augmentation prevent overfitting. L2 λ = 0.0001-0.01 typical. Don't over-regularize or you'll underfit.

🎛️

Tune Systematically

Start with learning rate, then batch size, then others. Use grid search or random search. Modern: Bayesian optimization, hyperband. Log everything!