
Reinforcement Learning Basics

Train an agent to navigate environments and maximize rewards

⏱️ 19 min · 20 interactions

What is Reinforcement Learning?

Reinforcement Learning (RL) is learning through interaction. An agent learns to make decisions by receiving rewards or penalties from its environment.

💡 The RL Framework

🤖
Agent
The learner/decision maker (e.g., robot, game player)
🌍
Environment
Everything the agent interacts with (game world, physical space)
🎁
Reward
Feedback signal indicating how good an action was

1. Agent-Environment Interaction

🎮 Interactive: Navigate to Goal

Move the agent (🤖) to the goal (⭐). Each step costs -1 reward. Reaching the goal gives +100!


💡 The Loop: Agent observes state → takes action → receives reward → environment transitions to new state → repeat!
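The loop above can be sketched in a few lines of Python. The corridor environment, its reward values, and the random policy below are illustrative stand-ins, not from any particular library:

```python
import random

class Corridor:
    """Tiny environment: agent starts at cell 0, goal is cell 4.
    Each step costs -1; reaching the goal gives +100 and ends the episode."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):              # action: -1 (left) or +1 (right)
        self.pos = max(0, min(4, self.pos + action))
        done = self.pos == 4
        reward = 100 if done else -1
        return self.pos, reward, done

env = Corridor()
state = env.reset()
total_reward, done = 0, False
while not done:                          # observe -> act -> reward -> new state
    action = random.choice([-1, 1])      # random policy, just for illustration
    state, reward, done = env.step(action)
    total_reward += reward
```

A learning agent would replace `random.choice` with a policy that improves from the rewards it collects; everything else in the loop stays the same.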

Reward Functions: The Art of Incentive Design

🎯 Why Reward Design is Critical

💡 The Reward Hypothesis

"All goals can be described by maximizing expected cumulative reward." The reward function is the only way you communicate the task to the agent. Get it wrong, and the agent will learn perfectly... to do the wrong thing!

Classic Example: Reward Hacking (OpenAI's CoastRunners)
Intended goal: Race boat to finish line quickly
Reward given: Points for hitting targets along track
What agent learned: Drive in circles hitting same 3 targets repeatedly!
❌ Agent never crossed finish line—wasn't rewarded for it
✓ Fix: Reward finish line heavily, remove target points

🎨 Reward Shaping Strategies

Sparse Rewards (Hard Mode):
• Goal reached: +100
• Everything else: 0
Problem:
Agent wanders randomly for thousands of episodes before accidentally finding goal
No gradient to follow—like searching for a needle in a haystack blindfolded
Use when: Simple environments, short episodes (<100 steps)
Dense Rewards (Guided Learning):
• Goal reached: +100
• Step toward goal: +1
• Step away: -1
• Hit wall: -10
Benefit:
Agent gets feedback every step—learns much faster
Clear gradient pointing toward improvement
Use when: Complex environments, long horizons
Common Reward Components:
1. Goal reward: Large positive (e.g., +100) when task completed
2. Step penalty: Small negative (e.g., -1) to encourage efficiency
3. Failure penalty: Moderate negative (e.g., -50) for dangerous actions
4. Progress reward: Incremental positive for moving toward goal
Balance is key: Too much step penalty → agent rushes and fails. Too little → agent wanders aimlessly.
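The four components can be combined in a single function. The constants and the Manhattan-distance progress check below are illustrative choices, not a canonical recipe:

```python
def reward(old_pos, new_pos, goal, hit_wall=False, fell=False):
    """Dense reward combining goal, step, failure, and progress terms."""
    def dist(a, b):                      # Manhattan distance to the goal
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    if new_pos == goal:
        return 100                       # goal reward: dominates everything else
    if fell:
        return -50                       # failure penalty for dangerous outcomes
    r = -1                               # step penalty: encourages efficiency
    if hit_wall:
        r -= 10
    r += 1 if dist(new_pos, goal) < dist(old_pos, goal) else -1  # progress term
    return r
```

With these magnitudes the goal reward dominates the per-step penalties by 100×, in line with the balance rule above.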

⚠️ Reward Design Pitfalls

Pitfall 1: Specifying Means Instead of Ends
❌ Bad: "Move right 10 steps, then up 5 steps, then..."
✓ Good: "Reach the goal location by any path"
Let agent discover optimal strategy—don't micromanage!
Pitfall 2: Misaligned Proxy Metrics
❌ Example: Reward "distance reduced to goal" in maze
→ Agent tries to break through walls (shortest distance!) instead of finding path
✓ Better: Reward "valid steps toward goal" or simply "reaching goal"
Pitfall 3: Reward Magnitude Imbalance
❌ Goal: +1, Step: -0.1 → Agent ignores goal (not worth the journey!)
✓ Goal: +100, Step: -1 → Clear value in reaching goal
Goal reward should dominate step penalties by 10-100×
Pitfall 4: Hidden Terminal States
If episode ends on failure (e.g., falling off cliff), agent might not realize it's bad!
✓ Solution: Give explicit negative reward before termination
if fall_off_cliff: reward = -100; done = True

2. Reward Function Design

🎯 Interactive: Shape Agent Behavior

⭐ Goal: +100
👣 Each Step: -1
🚫 Obstacle: -50

⚠️ Reward Shaping: Small step penalty encourages efficiency. Large goal reward motivates completion. Balance is key!

3. Exploration vs Exploitation

🎲 Interactive: The Epsilon-Greedy Dilemma

🔍 Exploration: try new actions
🎯 Exploitation: use the best known action
• ε = 0 (pure exploitation): may miss better strategies
• ε = 1 (pure exploration): never uses learned knowledge
• ε = 0.1 (recommended): good balance of 90% exploitation, 10% exploration
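Epsilon-greedy is only a few lines. A minimal sketch, where `q_values` maps each action to its current Q-value for the state at hand:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon; otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))      # explore: any action at random
    return max(q_values, key=q_values.get)        # exploit: highest Q-value

q = {"up": 5.2, "down": -2.1, "left": 0.0, "right": 8.7}
action = epsilon_greedy(q, epsilon=0.1)           # usually "right"
```

In practice epsilon is often decayed over training: start near 1 (explore widely), end near 0.05-0.1 (mostly exploit).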

Q-Learning: The Foundation of Value-Based RL

📊 Understanding Q-Values

🎯 What is Q(s, a)?

Q(state, action) is the expected cumulative reward from taking action a in state s, then following the optimal policy thereafter. It answers: "How good is this action in this situation?"

Intuitive Example: Navigation
Imagine you're at position (1,1) in a 5×5 grid, goal at (5,5)
Q((1,1), right) = 42
→ Moving right is good path
Q((1,1), left) = -5
→ Moving left goes wrong way
Q((1,1), up) = 38
→ Moving up is okay
Q((1,1), down) = -10
→ Moving down wrong way
✓ Optimal policy: Choose action with highest Q-value → "right" (42)

🧮 The Bellman Equation

Core Insight: Recursive Decomposition
Q(s, a) = r + γ × max Q(s', a')
The value of taking action a in state s equals:
r: Immediate reward you get right now
γ: Discount factor (0 to 1) - how much you value future
max Q(s', a'): Best possible value from next state s'
Q-Learning Update Rule:
Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]
α: Learning rate (0.1 typical)
[...]: Temporal difference error
Update moves current Q toward observed reward + future value
Why "max Q(s', a')"?
Q-learning is off-policy:
• Assumes agent will act optimally in future
• Uses max even if agent explored suboptimally
• Learns optimal Q-values regardless of exploration
Contrast: SARSA uses actual next action (on-policy)
Example Update Calculation:
State s: position (2,3), Action a: "right"
Current Q(s,a) = 15.2
Take action → receive reward r = -1 (step penalty)
Arrive at s' = (3,3), max Q(s',•) = 20.5 (best action from new state)
Update: Q(s,a) = 15.2 + 0.1 × [-1 + 0.9×20.5 - 15.2]
= 15.2 + 0.1 × [2.25]
= 15.2 + 0.225 = 15.425
Q-value increased! Agent learned this action leads to valuable future states.
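The update rule is one line of code. This sketch reproduces the calculation with the same numbers:

```python
def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """One Q-learning update: move Q(s,a) toward reward + discounted future value."""
    td_error = reward + gamma * max_q_next - q_sa   # temporal difference error
    return q_sa + alpha * td_error

# Same numbers as the worked example: Q=15.2, r=-1, max Q(s',.)=20.5
new_q = q_update(q_sa=15.2, reward=-1, max_q_next=20.5)
```

The learning rate `alpha` controls how far the estimate moves toward the TD target each step; with `alpha=1` the old value would be replaced outright.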

📋 Q-Table Representation

Tabular Q-Learning (Discrete Spaces):
Store Q-value for every (state, action) pair in a table/dictionary
Q_table = {
    (0,0): {"up": 5.2, "down": -2.1, "left": 0.0, "right": 8.7},
    (0,1): {"up": 7.5, "down": 1.3, "left": 4.2, "right": 12.1},
    ...
}
Works for: Small discrete state spaces (e.g., 10×10 grid, 4 actions = 400 values)
Fails for: Large/continuous spaces (e.g., Atari frames are 210×160×3 pixels—astronomically more distinct states than any table could hold)
Deep Q-Networks (Continuous/Large Spaces):
Use neural network to approximate Q(s,a) instead of storing table
Input: state (e.g., game pixels)
Network: Conv layers → FC layers
Output: Q-value for each action
DQN innovations: Experience replay, target network, reward clipping
Success: Human-level Atari, Go, StarCraft

✓ Q-Learning Advantages

Off-policy: Learn optimal policy while exploring
Model-free: No need to know environment dynamics
Proven convergence: Guaranteed to find optimal Q* (with conditions)
Simple to implement: One equation, one table/network
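That simplicity is real: a complete tabular Q-learning agent for a 5×5 grid world fits in about 25 lines. The grid size, rewards, and hyperparameters below are illustrative choices:

```python
import random

random.seed(0)
SIZE, GOAL = 5, (4, 4)
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
# Q-table: one value per (state, action) pair, initialized to zero
Q = {((x, y), a): 0.0 for x in range(SIZE) for y in range(SIZE) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def step(state, action):
    """Move within grid bounds; +100 at the goal, -1 per step otherwise."""
    dx, dy = ACTIONS[action]
    nx = max(0, min(SIZE - 1, state[0] + dx))
    ny = max(0, min(SIZE - 1, state[1] + dy))
    next_state = (nx, ny)
    if next_state == GOAL:
        return next_state, 100, True
    return next_state, -1, False

for episode in range(500):
    state, done = (0, 0), False
    while not done:
        if random.random() < epsilon:                         # explore
            action = random.choice(list(ACTIONS))
        else:                                                 # exploit
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        best_next = max(Q[(next_state, a)] for a in ACTIONS)  # off-policy max
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
```

After training, acting greedily (always taking the arg-max action) from (0,0) reaches the goal; the high-value actions trace a shortest path.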

⚠️ Challenges

State explosion: Q-table grows exponentially
Exploration needed: Must visit states multiple times
Convergence slow: Can take millions of samples
Discrete actions only: a DQN limitation; policy-gradient methods like PPO handle continuous actions

4. Q-Values: Action-Value Function

📊 Interactive: Q-Table Visualization

Q(state, action) represents expected future reward for taking an action in a state.

Select State
Select Action

Q(State 0, right)

27.3
Expected cumulative reward
Q-Learning Update Rule:
Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]

💡 Goal: Learn Q-values for all state-action pairs. Then choose action with highest Q-value in each state!

5. Hyperparameters: α and γ

⚙️ Interactive: Tune Learning

Slow learning ↔ Fast learning
Myopic (short-term) ↔ Far-sighted (long-term)

Learning Rate (α)

Value: 0.10
Effect: Stable updates
Controls how much new information overrides old. α=1 means completely replace, α=0 means never learn.

Discount Factor (γ)

Value: 0.90
Planning: Medium-term
How much agent values future rewards. γ=0 only cares about immediate, γ→1 values distant future equally.
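The effect of γ is easiest to see on a fixed reward stream. A small sketch computing the discounted return G = r₀ + γ·r₁ + γ²·r₂ + … for the illustrative sequence "three steps, then the goal":

```python
def discounted_return(rewards, gamma):
    """Sum rewards back-to-front: G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [-1, -1, -1, 100]                        # three -1 steps, then +100
myopic = discounted_return(rewards, gamma=0.0)     # sees only the first -1
farsighted = discounted_return(rewards, gamma=0.9) # values the +100 at 0.9^3 = 72.9
```

With γ=0 the agent judges this trajectory as worthless (-1); with γ=0.9 the goal reward still dominates, so the agent will happily pay the step penalties.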

6. Policy: Decision Strategy

🎯 Interactive: Compare Policies

⚖️ Epsilon-Greedy Policy

With probability ε explore randomly, otherwise exploit best action. Best of both worlds!

Exploration: 10% (ε)
Exploitation: 90% (1-ε)
Performance: Optimal balance

7. Training Progress

📈 Interactive: Watch Agent Learn

Training Episodes: 0/100
Avg Reward: 0.0
Success Rate: 0%
Avg Steps: 10

8. Bellman Equation

🧮 Interactive: Q-Value Calculator

Q-Value Calculation

Q(s,a) = r + γ × max Q(s',a')
The value of taking action a in state s
Immediate reward (r): 10
+ Future (γ × max Q(s',a') = 0.9 × 50): 45.0
= Total Q-Value: 55.0

RL Algorithms: From Tabular to Deep RL

🧠 The Algorithm Landscape

📊 Two Main Families

Value-Based Methods
• Learn Q(s,a) or V(s) functions
• Derive policy from values (greedy/ε-greedy)
Examples: Q-Learning, SARSA, DQN
Best for: Discrete action spaces
Limitation: Hard with continuous actions
Policy-Based Methods
• Learn policy π(a|s) directly
• Optimize policy parameters via gradient ascent
Examples: REINFORCE, PPO, A3C
Best for: Continuous actions (robotics)
Bonus: Can learn stochastic policies
Actor-Critic (Hybrid)
Critic: Learns value function V(s) or Q(s,a)
Actor: Learns policy π(a|s)
• Critic guides Actor's learning (reduces variance)
Examples: A3C, PPO, SAC, TD3
Why best: Combines stability (critic) + flexibility (actor)

🔬 On-Policy vs Off-Policy

On-Policy (SARSA, PPO):
Learn about and improve the same policy you're using to act
Agent explores with policy π
→ Collects data using π
→ Updates π based on that data
→ Throw away old data!
Pro: More stable, safer exploration
Con: Sample inefficient (can't reuse old data)
Off-Policy (Q-Learning, DQN):
Learn about optimal policy while following exploratory policy
Agent explores with ε-greedy
→ Store experiences in replay buffer
→ Learn optimal Q-values from buffer
→ Reuse data many times!
Pro: Sample efficient (experience replay)
Con: Can be unstable, needs tricks (target networks)
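The off-policy recipe above hinges on the replay buffer: store transitions as they happen, then train on random minibatches drawn from it. A minimal sketch, with illustrative capacity and transition format:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall out

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Random minibatch: breaks temporal correlation between samples."""
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
buf.push((0, 0), "right", -1, (1, 0), False)
```

Sampling uniformly at random is what breaks the correlation between consecutive transitions, which is one of the main reasons DQN trains stably at all.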

🚀 Modern Breakthroughs

DQN (2013-2015): Deep RL Era Begins
• First to master Atari games from pixels (49 games, human-level in 29)
Key innovations:
1. Experience Replay: Store (s,a,r,s') transitions, sample random batches
2. Target Network: Separate network for computing targets (updated every N steps)
3. Frame Stacking: Input last 4 frames to capture motion
Result: Stable training of CNNs for Q-learning
PPO (2017): Policy Gradient Workhorse
Proximal Policy Optimization - limits policy updates per step
• Clips probability ratio to prevent destructive updates:
L = min(r(θ)A, clip(r(θ), 1-ε, 1+ε)A)
Why dominant: Simple, stable, works for continuous/discrete actions
Used in: OpenAI Five (Dota 2), robotics, ChatGPT RLHF
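The clipped objective above, written out for a single (ratio, advantage) pair (a minimal sketch of the math, not a full PPO implementation):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """L = min(r*A, clip(r, 1-eps, 1+eps)*A), with r = pi_new(a|s) / pi_old(a|s)."""
    clipped = max(1 - eps, min(1 + eps, ratio))   # clip ratio into [1-eps, 1+eps]
    return min(ratio * advantage, clipped * advantage)

# A positive advantage can't push the objective past (1+eps)*A:
capped = ppo_clip_objective(ratio=1.5, advantage=2.0)
```

The `min` makes the clipping one-sided in the pessimistic direction: the objective never rewards moving the policy further than the trust region allows, which is what prevents destructive updates.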
AlphaGo/AlphaZero (2016-2017): Self-Play Mastery
• Defeated world champion Lee Sedol 4-1 in Go
Approach: Combines Monte Carlo Tree Search + deep RL + self-play
• AlphaZero: Mastered Go, Chess, Shogi from scratch (no human data!)
Trained via pure self-play for hours (chess: ~9 hours) to days (Go), it surpassed centuries of accumulated human knowledge
Recent Trends (2020-2025):
Offline RL: Learn from fixed datasets (no environment interaction)
Model-Based RL: Learn environment model, plan with it (Dreamer, MuZero)
Multi-Agent RL: Agents learning together/competitively
RLHF: Reinforcement Learning from Human Feedback (GPT-4, Claude)

9. Popular RL Algorithms

🧠 Interactive: Algorithm Comparison

📊

Q-Learning

Off-policy TD learning. Learns optimal Q-values directly.

Update Formula:
Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]
✓ Pros
Simple, proven, works well for discrete spaces
✗ Cons
Struggles with continuous actions, can be sample inefficient

10. Real-World Applications

🌍 Interactive: RL in Action

🎮

Game Playing

AlphaGo defeated world champions. RL agents master Atari, Dota 2, StarCraft.

Examples:
Chess, Go, Poker, Video games

🎯 Key Takeaways

🔄

Learning Through Interaction

RL agents learn by trial and error. They interact with environments, receive rewards, and improve their behavior over time without explicit supervision.

🎁

Reward is the Signal

Design rewards carefully—they define agent behavior. Sparse rewards are hard to learn from. Reward shaping guides agents toward goals efficiently.

⚖️

Exploration-Exploitation Tradeoff

Balance exploring new strategies vs exploiting known good ones. Epsilon-greedy is simple and effective. Start high exploration, decay over time.

📊

Q-Learning Foundation

Q-values represent expected future rewards. Q-learning uses Bellman equation to learn optimal Q-values. Deep Q-Networks scale to complex states with neural nets.

⚙️

Hyperparameters Matter

Learning rate (α) controls update speed. Discount factor (γ) balances immediate vs future rewards. Tune these carefully—they dramatically affect learning.

🚀

State-of-the-Art Algorithms

DQN for Atari games, PPO for robotics, AlphaGo for strategic games. Modern RL combines deep learning with classical algorithms for superhuman performance.