
Reinforcement Learning Basics

Train an agent to navigate environments and maximize rewards

⏱️ 19 min · 20 interactions

What is Reinforcement Learning?

Reinforcement Learning (RL) is learning through interaction. An agent learns to make decisions by receiving rewards or penalties from its environment.

💡 The RL Framework

🤖
Agent
The learner/decision maker (e.g., robot, game player)
🌍
Environment
Everything the agent interacts with (game world, physical space)
🎁
Reward
Feedback signal indicating how good an action was

1. Agent-Environment Interaction

🎮 Interactive: Navigate to Goal

Move the agent (🤖) to the goal (⭐). Each step costs -1 reward. Reaching the goal gives +100!


💡 The Loop: Agent observes state → takes action → receives reward → environment transitions to new state → repeat!
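The loop above can be sketched in a few lines of Python. The corridor environment, its reward values, and the random policy below are illustrative stand-ins, not from any particular library:

```python
import random

class Corridor:
    """Tiny environment: agent starts at cell 0, goal is cell 4.
    Each step costs -1; reaching the goal gives +100 and ends the episode."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):              # action: -1 (left) or +1 (right)
        self.pos = max(0, min(4, self.pos + action))
        done = self.pos == 4
        reward = 100 if done else -1
        return self.pos, reward, done

env = Corridor()
state = env.reset()
total_reward, done = 0, False
while not done:                          # observe -> act -> reward -> new state
    action = random.choice([-1, 1])      # random policy, just for illustration
    state, reward, done = env.step(action)
    total_reward += reward
```

A learning agent would replace `random.choice` with a policy that improves from the rewards it collects; everything else in the loop stays the same.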

Reward Functions: The Art of Incentive Design

🎯 Why Reward Design is Critical

💡 The Reward Hypothesis

"All goals can be described by maximizing expected cumulative reward." The reward function is the only way you communicate the task to the agent. Get it wrong, and the agent will learn perfectly... to do the wrong thing!

Classic Example: Reward Hacking (OpenAI's CoastRunners)
Intended goal: Race boat to finish line quickly
Reward given: Points for hitting targets along track
What agent learned: Drive in circles hitting same 3 targets repeatedly!
❌ Agent never crossed finish line—wasn't rewarded for it
✓ Fix: Reward finish line heavily, remove target points

🎨 Reward Shaping Strategies

Sparse Rewards (Hard Mode):
• Goal reached: +100
• Everything else: 0
Problem:
Agent wanders randomly for thousands of episodes before accidentally finding goal
No gradient to follow—like searching for a needle in a haystack blindfolded
Use when: Simple environments, short episodes (<100 steps)
Dense Rewards (Guided Learning):
• Goal reached: +100
• Step toward goal: +1
• Step away: -1
• Hit wall: -10
Benefit:
Agent gets feedback every step—learns much faster
Clear gradient pointing toward improvement
Use when: Complex environments, long horizons
Common Reward Components:
1. Goal reward: Large positive (e.g., +100) when task completed
2. Step penalty: Small negative (e.g., -1) to encourage efficiency
3. Failure penalty: Moderate negative (e.g., -50) for dangerous actions
4. Progress reward: Incremental positive for moving toward goal
Balance is key: Too much step penalty → agent rushes and fails. Too little → agent wanders aimlessly.
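The four components can be combined in a single function. The constants and the Manhattan-distance progress check below are illustrative choices, not a canonical recipe:

```python
def reward(old_pos, new_pos, goal, hit_wall=False, fell=False):
    """Dense reward combining goal, step, failure, and progress terms."""
    def dist(a, b):                      # Manhattan distance to the goal
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    if new_pos == goal:
        return 100                       # goal reward: dominates everything else
    if fell:
        return -50                       # failure penalty for dangerous outcomes
    r = -1                               # step penalty: encourages efficiency
    if hit_wall:
        r -= 10
    r += 1 if dist(new_pos, goal) < dist(old_pos, goal) else -1  # progress term
    return r
```

With these magnitudes the goal reward dominates the per-step penalties by 100×, in line with the balance rule above.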

⚠️ Reward Design Pitfalls

Pitfall 1: Specifying Means Instead of Ends
❌ Bad: "Move right 10 steps, then up 5 steps, then..."
✓ Good: "Reach the goal location by any path"
Let agent discover optimal strategy—don't micromanage!
Pitfall 2: Misaligned Proxy Metrics
❌ Example: Reward "distance reduced to goal" in maze
→ Agent tries to break through walls (shortest distance!) instead of finding path
✓ Better: Reward "valid steps toward goal" or simply "reaching goal"
Pitfall 3: Reward Magnitude Imbalance
❌ Goal: +1, Step: -0.1 → Agent ignores goal (not worth the journey!)
✓ Goal: +100, Step: -1 → Clear value in reaching goal
Goal reward should dominate step penalties by 10-100×
Pitfall 4: Hidden Terminal States
If episode ends on failure (e.g., falling off cliff), agent might not realize it's bad!
✓ Solution: Give explicit negative reward before termination
if fall_off_cliff: reward = -100; done = True

2. Reward Function Design

🎯 Interactive: Shape Agent Behavior

⭐ Goal: +100
👣 Each Step: -1
🚫 Obstacle: -50

⚠️ Reward Shaping: Small step penalty encourages efficiency. Large goal reward motivates completion. Balance is key!

3. Exploration vs Exploitation

🎲 Interactive: The Epsilon-Greedy Dilemma

🔍 Exploration: try new actions
🎯 Exploitation: use the best known action
• ε = 0 (pure exploitation): may miss better strategies
• ε = 1 (pure exploration): never uses learned knowledge
• ε = 0.1 (recommended): good balance of 90% exploitation, 10% exploration
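Epsilon-greedy is only a few lines. A minimal sketch, where `q_values` maps each action to its current Q-value for the state at hand:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon; otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))      # explore: any action at random
    return max(q_values, key=q_values.get)        # exploit: highest Q-value

q = {"up": 5.2, "down": -2.1, "left": 0.0, "right": 8.7}
action = epsilon_greedy(q, epsilon=0.1)           # usually "right"
```

In practice epsilon is often decayed over training: start near 1 (explore widely), end near 0.05-0.1 (mostly exploit).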

Q-Learning: The Foundation of Value-Based RL

📊 Understanding Q-Values

🎯 What is Q(s, a)?

Q(state, action) is the expected cumulative reward from taking action a in state s, then following the optimal policy thereafter. It answers: "How good is this action in this situation?"

Intuitive Example: Navigation
Imagine you're at position (1,1) in a 5×5 grid, goal at (5,5)
Q((1,1), right) = 42
→ Moving right is good path
Q((1,1), left) = -5
→ Moving left goes wrong way
Q((1,1), up) = 38
→ Moving up is okay
Q((1,1), down) = -10
→ Moving down wrong way
✓ Optimal policy: Choose action with highest Q-value → "right" (42)

🧮 The Bellman Equation

Core Insight: Recursive Decomposition
Q(s, a) = r + γ × max Q(s', a')
The value of taking action a in state s equals:
r: Immediate reward you get right now
γ: Discount factor (0 to 1) - how much you value future
max Q(s', a'): Best possible value from next state s'
Q-Learning Update Rule:
Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]
α: Learning rate (0.1 typical)
[...]: Temporal difference error
Update moves current Q toward observed reward + future value
Why "max Q(s', a')"?
Q-learning is off-policy:
• Assumes agent will act optimally in future
• Uses max even if agent explored suboptimally
• Learns optimal Q-values regardless of exploration
Contrast: SARSA uses actual next action (on-policy)
Example Update Calculation:
State s: position (2,3), Action a: "right"
Current Q(s,a) = 15.2
Take action → receive reward r = -1 (step penalty)
Arrive at s' = (3,3), max Q(s',•) = 20.5 (best action from new state)
Update: Q(s,a) = 15.2 + 0.1 × [-1 + 0.9×20.5 - 15.2]
= 15.2 + 0.1 × [2.25]
= 15.2 + 0.225 = 15.425
Q-value increased! Agent learned this action leads to valuable future states.
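The update rule is one line of code. This sketch reproduces the calculation with the same numbers:

```python
def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """One Q-learning update: move Q(s,a) toward reward + discounted future value."""
    td_error = reward + gamma * max_q_next - q_sa   # temporal difference error
    return q_sa + alpha * td_error

# Same numbers as the worked example: Q=15.2, r=-1, max Q(s',.)=20.5
new_q = q_update(q_sa=15.2, reward=-1, max_q_next=20.5)
```

The learning rate `alpha` controls how far the estimate moves toward the TD target each step; with `alpha=1` the old value would be replaced outright.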

📋 Q-Table Representation

Tabular Q-Learning (Discrete Spaces):
Store Q-value for every (state, action) pair in a table/dictionary
Q_table = {
    (0,0): {"up": 5.2, "down": -2.1, "left": 0.0, "right": 8.7},
    (0,1): {"up": 7.5, "down": 1.3, "left": 4.2, "right": 12.1},
    ...
}
Works for: Small discrete state spaces (e.g., 10×10 grid, 4 actions = 400 values)
Fails for: Large/continuous spaces (e.g., Atari frames are 210×160×3 pixels—astronomically more distinct states than any table could hold)
Deep Q-Networks (Continuous/Large Spaces):
Use neural network to approximate Q(s,a) instead of storing table
Input: state (e.g., game pixels)
Network: Conv layers → FC layers
Output: Q-value for each action
DQN innovations: Experience replay, target network, reward clipping
Success: Human-level Atari, Go, StarCraft

✓ Q-Learning Advantages

Off-policy: Learn optimal policy while exploring
Model-free: No need to know environment dynamics
Proven convergence: Guaranteed to find optimal Q* (with conditions)
Simple to implement: One equation, one table/network
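That simplicity is real: a complete tabular Q-learning agent for a 5×5 grid world fits in about 25 lines. The grid size, rewards, and hyperparameters below are illustrative choices:

```python
import random

random.seed(0)
SIZE, GOAL = 5, (4, 4)
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
# Q-table: one value per (state, action) pair, initialized to zero
Q = {((x, y), a): 0.0 for x in range(SIZE) for y in range(SIZE) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def step(state, action):
    """Move within grid bounds; +100 at the goal, -1 per step otherwise."""
    dx, dy = ACTIONS[action]
    nx = max(0, min(SIZE - 1, state[0] + dx))
    ny = max(0, min(SIZE - 1, state[1] + dy))
    next_state = (nx, ny)
    if next_state == GOAL:
        return next_state, 100, True
    return next_state, -1, False

for episode in range(500):
    state, done = (0, 0), False
    while not done:
        if random.random() < epsilon:                         # explore
            action = random.choice(list(ACTIONS))
        else:                                                 # exploit
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        best_next = max(Q[(next_state, a)] for a in ACTIONS)  # off-policy max
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
```

After training, acting greedily (always taking the arg-max action) from (0,0) reaches the goal; the high-value actions trace a shortest path.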

⚠️ Challenges

State explosion: Q-table grows exponentially
Exploration needed: Must visit states multiple times
Convergence slow: Can take millions of samples
Discrete actions only: a DQN limitation; policy-gradient methods like PPO handle continuous actions

4. Q-Values: Action-Value Function

📊 Interactive: Q-Table Visualization

Q(state, action) represents expected future reward for taking an action in a state.

Select State
Select Action

Q(State 0, right)

27.3
Expected cumulative reward
Q-Learning Update Rule:
Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]

💡 Goal: Learn Q-values for all state-action pairs. Then choose action with highest Q-value in each state!

5. Hyperparameters: α and γ

⚙️ Interactive: Tune Learning

Slow learning ↔ Fast learning
Myopic (short-term) ↔ Far-sighted (long-term)

Learning Rate (α)

Value: 0.10
Effect: Stable updates
Controls how much new information overrides old. α=1 means completely replace, α=0 means never learn.

Discount Factor (γ)

Value: 0.90
Planning: Medium-term
How much agent values future rewards. γ=0 only cares about immediate, γ→1 values distant future equally.
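The effect of γ is easiest to see on a fixed reward stream. A small sketch computing the discounted return G = r₀ + γ·r₁ + γ²·r₂ + … for the illustrative sequence "three steps, then the goal":

```python
def discounted_return(rewards, gamma):
    """Sum rewards back-to-front: G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [-1, -1, -1, 100]                        # three -1 steps, then +100
myopic = discounted_return(rewards, gamma=0.0)     # sees only the first -1
farsighted = discounted_return(rewards, gamma=0.9) # values the +100 at 0.9^3 = 72.9
```

With γ=0 the agent judges this trajectory as worthless (-1); with γ=0.9 the goal reward still dominates, so the agent will happily pay the step penalties.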

6. Policy: Decision Strategy

🎯 Interactive: Compare Policies

⚖️ Epsilon-Greedy Policy

With probability ε explore randomly, otherwise exploit best action. Best of both worlds!

Exploration: 10% (ε)
Exploitation: 90% (1-ε)
Performance: Optimal balance

7. Training Progress

📈 Interactive: Watch Agent Learn

Training Episodes: 0/100
Avg Reward: 0.0
Success Rate: 0%
Avg Steps: 10

8. Bellman Equation

🧮 Interactive: Q-Value Calculator

Q-Value Calculation

Q(s,a) = r + γ × max Q(s',a')
The value of taking action a in state s
Immediate reward (r): 10
+ Future (γ × max Q(s',a') = 0.9 × 50): 45.0
= Total Q-Value: 55.0

RL Algorithms: From Tabular to Deep RL

🧠 The Algorithm Landscape

📊 Two Main Families

Value-Based Methods
• Learn Q(s,a) or V(s) functions
• Derive policy from values (greedy/ε-greedy)
Examples: Q-Learning, SARSA, DQN
Best for: Discrete action spaces
Limitation: Hard with continuous actions
Policy-Based Methods
• Learn policy π(a|s) directly
• Optimize policy parameters via gradient ascent
Examples: REINFORCE, PPO, A3C
Best for: Continuous actions (robotics)
Bonus: Can learn stochastic policies
Actor-Critic (Hybrid)
Critic: Learns value function V(s) or Q(s,a)
Actor: Learns policy π(a|s)
• Critic guides Actor's learning (reduces variance)
Examples: A3C, PPO, SAC, TD3
Why best: Combines stability (critic) + flexibility (actor)

🔬 On-Policy vs Off-Policy

On-Policy (SARSA, PPO):
Learn about and improve the same policy you're using to act
Agent explores with policy π
→ Collects data using π
→ Updates π based on that data
→ Throw away old data!
Pro: More stable, safer exploration
Con: Sample inefficient (can't reuse old data)
Off-Policy (Q-Learning, DQN):
Learn about optimal policy while following exploratory policy
Agent explores with ε-greedy
→ Store experiences in replay buffer
→ Learn optimal Q-values from buffer
→ Reuse data many times!
Pro: Sample efficient (experience replay)
Con: Can be unstable, needs tricks (target networks)
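The off-policy recipe above hinges on the replay buffer: store transitions as they happen, then train on random minibatches drawn from it. A minimal sketch, with illustrative capacity and transition format:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall out

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Random minibatch: breaks temporal correlation between samples."""
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
buf.push((0, 0), "right", -1, (1, 0), False)
```

Sampling uniformly at random is what breaks the correlation between consecutive transitions, which is one of the main reasons DQN trains stably at all.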

🚀 Modern Breakthroughs

DQN (2013-2015): Deep RL Era Begins
• First to master Atari games from pixels (49 games, human-level in 29)
Key innovations:
1. Experience Replay: Store (s,a,r,s') transitions, sample random batches
2. Target Network: Separate network for computing targets (updated every N steps)
3. Frame Stacking: Input last 4 frames to capture motion
Result: Stable training of CNNs for Q-learning
PPO (2017): Policy Gradient Workhorse
Proximal Policy Optimization - limits policy updates per step
• Clips probability ratio to prevent destructive updates:
L = min(r(θ)A, clip(r(θ), 1-ε, 1+ε)A)
Why dominant: Simple, stable, works for continuous/discrete actions
Used in: OpenAI Five (Dota 2), robotics, ChatGPT RLHF
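The clipped objective above, written out for a single (ratio, advantage) pair (a minimal sketch of the math, not a full PPO implementation):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """L = min(r*A, clip(r, 1-eps, 1+eps)*A), with r = pi_new(a|s) / pi_old(a|s)."""
    clipped = max(1 - eps, min(1 + eps, ratio))   # clip ratio into [1-eps, 1+eps]
    return min(ratio * advantage, clipped * advantage)

# A positive advantage can't push the objective past (1+eps)*A:
capped = ppo_clip_objective(ratio=1.5, advantage=2.0)
```

The `min` makes the clipping one-sided in the pessimistic direction: the objective never rewards moving the policy further than the trust region allows, which is what prevents destructive updates.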
AlphaGo/AlphaZero (2016-2017): Self-Play Mastery
• Defeated world champion Lee Sedol 4-1 in Go
Approach: Combines Monte Carlo Tree Search + deep RL + self-play
• AlphaZero: Mastered Go, Chess, Shogi from scratch (no human data!)
Trained via pure self-play for hours (chess: ~9 hours) to days (Go), it surpassed centuries of accumulated human knowledge
Recent Trends (2020-2025):
Offline RL: Learn from fixed datasets (no environment interaction)
Model-Based RL: Learn environment model, plan with it (Dreamer, MuZero)
Multi-Agent RL: Agents learning together/competitively
RLHF: Reinforcement Learning from Human Feedback (GPT-4, Claude)

9. Popular RL Algorithms

🧠 Interactive: Algorithm Comparison

📊

Q-Learning

Off-policy TD learning. Learns optimal Q-values directly.

Update Formula:
Q(s,a) ← Q(s,a) + α[r + γ max Q(s',a') - Q(s,a)]
✓ Pros
Simple, proven, works well for discrete spaces
✗ Cons
Struggles with continuous actions, can be sample inefficient

10. Real-World Applications

🌍 Interactive: RL in Action

🎮

Game Playing

AlphaGo defeated world champions. RL agents master Atari, Dota 2, StarCraft.

Examples:
Chess, Go, Poker, Video games

🎯 Key Takeaways

🔄

Learning Through Interaction

RL agents learn by trial and error. They interact with environments, receive rewards, and improve their behavior over time without explicit supervision.

🎁

Reward is the Signal

Design rewards carefully—they define agent behavior. Sparse rewards are hard to learn from. Reward shaping guides agents toward goals efficiently.

⚖️

Exploration-Exploitation Tradeoff

Balance exploring new strategies vs exploiting known good ones. Epsilon-greedy is simple and effective. Start high exploration, decay over time.

📊

Q-Learning Foundation

Q-values represent expected future rewards. Q-learning uses Bellman equation to learn optimal Q-values. Deep Q-Networks scale to complex states with neural nets.

⚙️

Hyperparameters Matter

Learning rate (α) controls update speed. Discount factor (γ) balances immediate vs future rewards. Tune these carefully—they dramatically affect learning.

🚀

State-of-the-Art Algorithms

DQN for Atari games, PPO for robotics, AlphaGo for strategic games. Modern RL combines deep learning with classical algorithms for superhuman performance.