Reinforcement Learning Basics
Train an agent to navigate environments and maximize rewards
What is Reinforcement Learning?
Reinforcement Learning (RL) is learning through interaction. An agent learns to make decisions by receiving rewards or penalties from its environment.
💡 The RL Framework
1. Agent-Environment Interaction
🎮 Interactive: Navigate to Goal
Move the agent (🤖) to the goal (⭐). Each step costs -1 reward. Reaching the goal gives +100!
💡 The Loop: Agent observes state → takes action → receives reward → environment transitions to new state → repeat!
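The loop above can be sketched in a few lines of code. Below is a minimal, hypothetical 1-D gridworld (states 0–4, goal at state 4) that mirrors the demo's rewards (−1 per step, +100 at the goal), driven by a random agent; the class and names are illustrative, not a standard API:

```python
import random

class LineWorld:
    """Hypothetical 1-D gridworld: states 0..4, goal at state 4."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = left, 1 = right
        self.state = max(0, min(4, self.state + (1 if action == 1 else -1)))
        done = self.state == 4
        reward = 100 if done else -1   # each step costs -1; the goal pays +100
        return self.state, reward, done

# The loop: observe state -> take action -> receive reward -> new state -> repeat
env = LineWorld()
state, done, total = env.reset(), False, 0
while not done:
    action = random.choice([0, 1])          # a random policy, for now
    state, reward, done = env.step(action)
    total += reward
print("episode return:", total)
```

With a random policy the episode still terminates eventually, but the return is far below the optimum of 97 (four steps right); learning a better policy is the point of everything that follows.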
Reward Functions: The Art of Incentive Design
🎯 Why Reward Design is Critical
💡 The Reward Hypothesis
"All goals can be described by maximizing expected cumulative reward." The reward function is the only way you communicate the task to the agent. Get it wrong, and the agent will learn perfectly... to do the wrong thing!
🎨 Reward Shaping Strategies
⚠️ Reward Design Pitfalls
2. Reward Function Design
🎯 Interactive: Shape Agent Behavior
⚠️ Reward Shaping: A small step penalty encourages efficiency; a large goal reward motivates completion. Balance is key!
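As a concrete illustration, here is a sparse reward and a shaped variant for a 1-D grid agent. The function names, the 0.5 progress weight, and the distance measure are illustrative choices, not a prescribed recipe:

```python
def sparse_reward(state, goal):
    """Only the goal gives any signal -- hard to learn from."""
    return 100.0 if state == goal else 0.0

def shaped_reward(state, goal, prev_state):
    """Small step penalty plus a bonus for moving closer to the goal."""
    if state == goal:
        return 100.0
    step_penalty = -1.0                           # encourages short paths
    progress = abs(prev_state - goal) - abs(state - goal)
    return step_penalty + 0.5 * progress          # progress is -1, 0, or +1
```

Shaping terms like the progress bonus must be chosen carefully: a bonus the agent can farm without ever reaching the goal is exactly the kind of pitfall warned about above.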
3. Exploration vs Exploitation
🎲 Interactive: The Epsilon-Greedy Dilemma
Q-Learning: The Foundation of Value-Based RL
📊 Understanding Q-Values
🎯 What is Q(s, a)?
Q(state, action) is the expected cumulative discounted reward from taking action a in state s and then following the optimal policy. It answers: "How good is this action in this situation?"
🧮 The Bellman Equation
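The Bellman equation yields the update rule at the heart of Q-learning: Q(s, a) ← Q(s, a) + α[r + γ·max over a' of Q(s', a') − Q(s, a)]. A minimal sketch of one such backup, with Q stored as a list of per-state action-value lists (the defaults α = 0.1 and γ = 0.9 are illustrative):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning backup: nudge Q(s, a) toward the Bellman target
    r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]
```

Because the target bootstraps from the current estimate of the next state's best Q-value, repeated updates propagate reward information backward through the state space.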
📋 Q-Table Representation
✓ Q-Learning Advantages
⚠️ Challenges
4. Q-Values: Action-Value Function
📊 Interactive: Q-Table Visualization
Q(state, action) represents the expected cumulative future reward for taking an action in a state.
💡 Goal: Learn Q-values for all state-action pairs. Then choose action with highest Q-value in each state!
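A common representation is a table with one row per state and one entry per action. The sizes and example values below are illustrative:

```python
n_states, n_actions = 5, 2   # hypothetical sizes: states 0-4, actions {left, right}

# One row per state, one entry per action, initialized to zero:
Q = [[0.0] * n_actions for _ in range(n_states)]

# Suppose training has produced these values for state 0:
Q[0] = [-2.0, 4.5]           # Q(0, left) = -2.0, Q(0, right) = 4.5

# Acting greedily is then just a table lookup:
best_action = max(range(n_actions), key=lambda a: Q[0][a])   # -> 1 ("right")
```

This is why tabular methods are so simple, and also why they break down when the state space is too large to enumerate.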
5. Hyperparameters: α and γ
⚙️ Interactive: Tune Learning
Learning Rate (α)
Discount Factor (γ)
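γ controls how much future rewards count toward the current decision, while α is the step size of each Q-update. A small sketch of the discounted return G = r0 + γ·r1 + γ²·r2 + … shows γ's effect directly (the reward sequence is made up for illustration):

```python
def discounted_return(rewards, gamma):
    """G = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

rewards = [0, 0, 0, 100]                  # the reward arrives three steps from now
print(discounted_return(rewards, 0.9))    # ~72.9: far-sighted, still values it
print(discounted_return(rewards, 0.5))    # ~12.5: myopic, heavily discounts it
```

A γ near 1 makes the agent patient; a γ near 0 makes it chase immediate reward.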
6. Policy: Decision Strategy
🎯 Interactive: Compare Policies
⚖️ Epsilon-Greedy Policy
With probability ε, explore with a random action; otherwise exploit the best-known action. The best of both worlds!
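A minimal epsilon-greedy selector (the function name is illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action (explore),
    otherwise pick the highest-valued action (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In practice ε is often decayed over training, e.g. `epsilon = max(0.05, epsilon * 0.995)` per episode, so the agent explores broadly early on and exploits its knowledge later.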
7. Training Progress
📈 Interactive: Watch Agent Learn
8. Bellman Equation
🧮 Interactive: Q-Value Calculator
Q-Value Calculation
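A worked example with assumed numbers (α = 0.5, γ = 0.9, current Q(s, a) = 2.0, step reward −1, best next-state Q-value 10):

```python
alpha, gamma = 0.5, 0.9      # assumed hyperparameters for this example
q_sa = 2.0                   # current estimate of Q(s, a)
r = -1.0                     # step penalty received
max_q_next = 10.0            # best Q-value available in the next state

target = r + gamma * max_q_next          # -1 + 0.9 * 10 = 8.0
new_q = q_sa + alpha * (target - q_sa)   # 2.0 + 0.5 * (8.0 - 2.0) = 5.0
print(new_q)                             # 5.0
```

The estimate moves halfway (α = 0.5) from 2.0 toward the Bellman target of 8.0.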
RL Algorithms: From Tabular to Deep RL
🧠 The Algorithm Landscape
📊 Two Main Families
🔬 On-Policy vs Off-Policy
🚀 Modern Breakthroughs
9. Popular RL Algorithms
🧠 Interactive: Algorithm Comparison
Q-Learning
Off-policy TD learning. Learns optimal Q-values directly.
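Putting the pieces together, here is a sketch of tabular Q-learning on a tiny 1-D gridworld (−1 per step, +100 at the goal, matching the demo above). Hyperparameters are illustrative, not tuned:

```python
import random

# Tabular Q-learning on a 1-D gridworld: states 0..4, goal at state 4.
N_STATES, GOAL = 5, 4
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.1

def step(s, a):
    """a: 0 = left, 1 = right. Each step costs -1; the goal pays +100."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (100 if s2 == GOAL else -1), s2 == GOAL

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]
for _ in range(500):                                # episodes
    s, done = 0, False
    while not done:
        if random.random() < EPS:
            a = random.randrange(2)                 # explore
        else:
            a = max((0, 1), key=lambda x: Q[s][x])  # exploit
        s2, r, done = step(s, a)
        # Off-policy: the target uses max over next actions, regardless
        # of which action the behavior policy actually takes next.
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

greedy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(N_STATES)]
print(greedy)   # the learned greedy policy moves right toward the goal
```

The `max` in the target is what makes Q-learning off-policy: it learns about the greedy policy while behaving epsilon-greedily.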
10. Real-World Applications
🌍 Interactive: RL in Action
Game Playing
AlphaGo defeated world champions. RL agents master Atari, Dota 2, StarCraft.
🎯 Key Takeaways
Learning Through Interaction
RL agents learn by trial and error. They interact with environments, receive rewards, and improve their behavior over time without explicit supervision.
Reward is the Signal
Design rewards carefully—they define agent behavior. Sparse rewards are hard to learn from. Reward shaping guides agents toward goals efficiently.
Exploration-Exploitation Tradeoff
Balance exploring new strategies against exploiting known good ones. Epsilon-greedy is simple and effective: start with high exploration and decay it over time.
Q-Learning Foundation
Q-values represent expected future rewards. Q-learning uses the Bellman equation to learn optimal Q-values. Deep Q-Networks scale to complex state spaces with neural networks.
Hyperparameters Matter
Learning rate (α) controls update speed. Discount factor (γ) balances immediate vs future rewards. Tune these carefully—they dramatically affect learning.
State-of-the-Art Algorithms
DQN for Atari games, PPO for robotics, AlphaGo for strategic games. Modern RL combines deep learning with classical algorithms for superhuman performance.