Home/Concepts/Artificial Intelligence/Decision Trees and Random Forests

Decision Trees and Random Forests

Build ensemble models and visualize decision boundaries

⏱️ 26 min⚡ 20 interactions

What are Decision Trees?

Decision Trees are intuitive ML models that make decisions by asking a series of yes/no questions. Think of it like a flowchart - you start at the top and follow branches until you reach a decision. Random Forests combine many trees to create more powerful, accurate predictions.

💡 The Core Idea

🌳

Single Tree

One flowchart, clear logic, but prone to overfitting

🌲

Random Forest

Many trees voting together, more robust and accurate

🎯

Ensemble Power

Wisdom of crowds - average many predictions for better results

Decision Tree Construction: The CART Algorithm

🌳 How Trees Grow: Recursive Partitioning

The Greedy Algorithm

• Goal: Split data into increasingly pure subsets until stopping criteria met

• Approach: Greedy - choose best split at each node (no backtracking)

• Result: Tree structure where each leaf represents a prediction

function BuildTree(data, depth):

if stopping_condition(data, depth):

return LeafNode(majority_class(data))

best_split = find_best_split(data) // key step!

left_data, right_data = split(data, best_split)

left_subtree = BuildTree(left_data, depth+1)

right_subtree = BuildTree(right_data, depth+1)

return DecisionNode(best_split, left_subtree, right_subtree)

🔍 Finding the Best Split

Step 1: Evaluate All Possible Splits

• For each feature: Try every possible threshold value

- Numerical: Age ≥ 30? Income ≥ $50K? (many thresholds)

- Categorical: City = "NY"? Color ∈ {"Red", "Blue"}?

• Calculate impurity reduction:

ΔImpurity = I(parent) - (N_left/N) I(left) - (N_right/N) I(right)

Where I() is Gini or Entropy, N is number of samples

Step 2: Choose Split with Maximum Gain

• Information Gain: How much does this split reduce uncertainty?

Gain = H(parent) - Σ (N_child/N_total) H(child)

• Pick feature and threshold with highest gain

• Tie-breaking: First encountered or random

Example: Loan Approval Dataset

100 samples: 60 approved, 40 denied

Split 1: Age ≥ 30

Left (Age ≥ 30): 50 approved, 10 denied → Gini = 0.267

Right (Age < 30): 10 approved, 30 denied → Gini = 0.375

Weighted Gini = 0.60 × 0.267 + 0.40 × 0.375 = 0.310

Split 2: Income ≥ $50K

Left (Income ≥ $50K): 55 approved, 15 denied → Gini = 0.347

Right (Income < $50K): 5 approved, 25 denied → Gini = 0.278

Weighted Gini = 0.70 × 0.347 + 0.30 × 0.278 = 0.326

→ Choose Age ≥ 30 (lower impurity = better split!)

🛑 Stopping Criteria

Stop splitting when ANY of these conditions met:

1. Maximum Depth Reached

• depth ≥ max_depth (e.g., 10)
• Prevents overly complex trees
• Typical: 3-10 for good generalization

2. Too Few Samples

• N_node < min_samples_split (e.g., 10)
• Not enough data to split reliably
• Prevents fitting noise

3. Pure Node

• All samples same class (Gini=0)
• Perfect classification achieved
• No benefit from further splitting

4. No Information Gain

• Best split doesn't reduce impurity
• Features exhausted or uninformative
• Can't improve predictions

⏱️ Computational Complexity

Training Complexity

O(n × m × log n)

where n = samples, m = features

• For each feature (m):

- Sort samples by feature value: O(n log n)

- Find best threshold: O(n) linear scan

• Tree depth: O(log n) on average, O(n) worst case

• Total nodes: ~2^depth, each does O(n×m) work

Prediction Complexity

O(depth) = O(log n) average, O(n) worst

Very fast! Just traverse tree from root to leaf

• Balanced tree: log₂(n) comparisons

• Example: 1M samples → ~20 comparisons

📚 Classic Algorithms

ID3 (1986)

• Quinlan's original algorithm

• Uses entropy and information gain

• Categorical features only

• No pruning

C4.5 (1993)

• Successor to ID3

• Handles numerical features

• Gain ratio (normalizes gain)

• Post-pruning with error-based

CART (1984)

• Classification & Regression

• Uses Gini impurity

• Binary splits only

• sklearn implementation

✅ Advantages

• Interpretable (visual flowchart)

• No feature scaling needed

• Handles mixed data types

• Fast prediction (O(log n))

✗ Disadvantages

• Prone to overfitting

• High variance (unstable)

• Greedy (locally optimal)

• Bias toward many-level features

💡 Key Insight

Decision trees are greedy - they choose the best split NOW without considering future splits. This makes them fast but suboptimal!

1. Build Your Decision Tree

🌳 Interactive: Construct a Tree

Tree Depth: 2

Shallow (1)Medium (3)Deep (5)

Root

Age ≥ 30?

Yes

Income ≥ $50K?

Credit ≥ 650?

🌳 How it works: Start at the root, ask a question about a feature, branch left or right based on the answer, repeat until reaching a leaf node (final decision).

Splitting Criteria: Gini vs Entropy

📊 Measuring Impurity: How Mixed Is This Node?

🎯 What Is Impurity?

Core Concept

Impurity measures how mixed the class labels are in a node:

• Pure node (impurity = 0): All samples same class

Example: [100 Yes, 0 No] → Perfect! No mixing

• Mixed node (impurity > 0): Multiple classes present

Example: [50 Yes, 50 No] → Maximum mixing

• Goal: Find splits that maximize purity (minimize impurity)

Why We Need Impurity Measures

• Quantify split quality: Which split reduces uncertainty most?

• Compare across features: Age ≥ 30 vs Income ≥ $50K?

• Stopping criterion: Stop when impurity ≤ threshold

• Guide greedy algorithm: Choose split with max impurity reduction

📐 Gini Impurity: Probability of Misclassification

Formula (Binary Classification)

Gini(p) = 2p(1 - p)

where p = proportion of positive class

Intuition:

If you randomly label a sample from this node based on class distribution,

what's the probability of being wrong?

Step-by-Step:

1. Pick random sample: Probability of positive = p

2. Guess label randomly: Guess positive with prob = p

3. Calculate error:

• Sample is positive, guess negative: p × (1-p)

• Sample is negative, guess positive: (1-p) × p

• Total error: p(1-p) + (1-p)p = 2p(1-p)

Gini Values (Binary)

Pure (p=0 or 1)

Gini = 0

[100,0] or [0,100]

Moderate (p=0.3)

Gini = 0.42

[30,70]

Max (p=0.5)

Gini = 0.5

[50,50]

Multi-class Extension

Gini = 1 - Σ p_i²

Sum over all classes i

Example: 3 classes [40, 30, 30] samples

p₁=0.4, p₂=0.3, p₃=0.3

Gini = 1 - (0.4² + 0.3² + 0.3²) = 1 - 0.34 = 0.66

🌀 Entropy: Information-Theoretic Uncertainty

Formula (Binary Classification)

H(p) = -p log₂(p) - (1-p) log₂(1-p)

where p = proportion of positive class

Intuition from Information Theory:

Entropy = expected number of bits needed to encode the class

Higher entropy = more uncertainty = more bits needed

Why log₂?

• Rare events carry more information: log converts probabilities to bits

If p=1 (certain) → log₂(1) = 0 bits needed

If p=0.5 (50/50) → log₂(0.5) = -1, so H = 1 bit needed

• Negative sign: Makes entropy positive (log of prob is negative)

Entropy Values (Binary)

Pure (p=0 or 1)

H = 0 bits

No uncertainty

Moderate (p=0.3)

H = 0.88 bits

Some uncertainty

Max (p=0.5)

H = 1 bit

Maximum uncertainty

Multi-class Extension

H = -Σ p_i log₂(p_i)

Sum over all classes i

Example: 3 classes [40, 30, 30] samples

H = -(0.4×log₂0.4 + 0.3×log₂0.3 + 0.3×log₂0.3)

H = -(0.4×(-1.32) + 0.3×(-1.74) + 0.3×(-1.74))

H = 1.57 bits

⚖️ Gini vs Entropy: Side-by-Side

Gini Impurity

✅ Faster to compute: No log calculations

✅ Range [0, 0.5]: Easier to interpret

✅ sklearn default: Industry standard

✅ Favors larger partitions: More balanced trees

⚠️ Approximates entropy: 2p(1-p) ≈ H(p)

Use when: Speed matters, default choice

Entropy

✅ Information theory: Principled approach

✅ Range [0, 1]: Bits of information

✅ Better for imbalanced: log handles rare classes

✅ More sensitive to changes: Sharper splits

⚠️ Slower: log₂ computation cost

Use when: Theoretical preference, ID3/C4.5

Practical Truth

In practice, Gini and Entropy produce very similar trees! Differences are minor (<1-2% accuracy). Choose Gini for speed, Entropy for theory. Most important: tree depth, min_samples, and pruning.

📈 Information Gain: The Reduction in Impurity

Formula

IG = I(parent) - Σ (N_child/N_parent) × I(child)

where I() is Gini or Entropy

Interpretation:

• Before split: Parent node impurity I(parent)

• After split: Weighted average of children impurities

• Information Gain: How much purity increased

• Goal: Maximize IG (pick split with highest gain)

Complete Example

Dataset: 100 loan applications [60 approved, 40 denied]

Parent Node:

Gini(0.6) = 2 × 0.6 × 0.4 = 0.48

Split: Age ≥ 30

• Left (≥30): 60 samples [50 approved, 10 denied]

Gini = 2 × (50/60) × (10/60) = 0.278

• Right (<30): 40 samples [10 approved, 30 denied]

Gini = 2 × (10/40) × (30/40) = 0.375

• Weighted avg: (60/100)×0.278 + (40/100)×0.375 = 0.317

• Information Gain = 0.48 - 0.317 = 0.163 ✓

Split: Income ≥ $50K

• Left (≥$50K): 70 samples [55 approved, 15 denied]

Gini = 2 × (55/70) × (15/70) = 0.337

• Right (<$50K): 30 samples [5 approved, 25 denied]

Gini = 2 × (5/30) × (25/30) = 0.278

• Weighted avg: (70/100)×0.337 + (30/100)×0.278 = 0.319

• Information Gain = 0.48 - 0.319 = 0.161 ✗

→ Choose Age ≥ 30 (higher IG: 0.163 > 0.161)

Gain Ratio (C4.5 Improvement)

GainRatio = IG / SplitInfo

SplitInfo = -Σ (N_i/N) log₂(N_i/N)

Problem: IG favors features with many values

Example: "CustomerID" has perfect IG (1 sample per value!)

Solution: Normalize by split entropy (SplitInfo)

Many-valued splits have high SplitInfo → lower GainRatio

C4.5 algorithm uses GainRatio instead of IG

🔑 Key Takeaway

Both Gini and Entropy measure "impurity" (class mixing). Split quality = reduction in impurity. Choose the split with maximum Information Gain!

💡 Practical Tip

Don't overthink Gini vs Entropy - they give similar results. Focus on tree complexity (depth, pruning) which has 10× more impact on performance!

2. Splitting Criteria: Gini vs Entropy

📊 Interactive: Measure Impurity

Class Distribution: 50% / 50%

Gini Impurity

0.500

Formula: 2p(1-p)

Entropy

1.000

Formula: -Σ p·log₂(p)

⚖️ Maximum impurity - completely mixed classes

3. Feature Importance

⭐ Interactive: Which Features Matter Most?

Age35%

Income28%

Credit Score22%

Employment15%

📊 How It's Calculated

Based on how much each feature reduces impurity when used for splitting. Higher = more predictive power.

💡 Why It Matters

Helps with feature selection, interpretability, and understanding your data. Drop low-importance features to simplify.

4. Preventing Overfitting

✂️ Interactive: Prune Your Tree

Max Depth: 5

Min Samples per Leaf: 2

Overfitting Risk59%

Train Accuracy

82%

Test Accuracy

71%

Gap

12%

⚖️ Moderate complexity. Consider slightly more constraints.

Random Forests: Ensemble Learning Power

🌳 From One Tree to a Forest

🎯 The Problem with Single Decision Trees

High Variance = Unstable Predictions

• Small data changes → Big tree changes:

Add one sample → completely different tree structure!

• Overfitting: Learns noise and specific patterns from training data

Perfect on training (100%) but poor on test (70%)

• Greedy algorithm: Locally optimal splits ≠ globally optimal tree

First split locks in future structure

Example: Loan Prediction

Training set: 100 samples

• Tree A: Splits on Age first → 85% test accuracy

• Tree B: Add 5 samples → now splits on Income first → 78% test accuracy

• Tree C: Remove 3 samples → splits on Credit score → 82% test accuracy

Problem: Which tree is "right"? Predictions vary wildly!

🌲 Random Forest Algorithm

Core Idea: Wisdom of Crowds

Train many diverse trees, then vote on final prediction

for i = 1 to N_trees:

1. Sample data with replacement (bootstrap)

2. At each split, sample random subset of features

3. Train tree on bootstrapped data + random features

Final prediction = majority vote (classification)

or average (regression)

Step 1: Bootstrap Aggregating (Bagging)

What is Bootstrap?

Sample N observations from dataset with replacement

Example: Dataset = [A, B, C, D, E]

• Bootstrap sample 1: [A, A, C, D, E] ← A appears twice, B missing

• Bootstrap sample 2: [A, B, B, D, E] ← B appears twice, C missing

• Bootstrap sample 3: [B, C, C, C, E] ← C appears 3×, A and D missing

Result: Each tree sees different data → diverse trees!

On average, each bootstrap sample contains ~63.2% unique observations
(Probability of being selected at least once: 1 - (1-1/N)^N → 1 - 1/e ≈ 0.632)

Step 2: Random Feature Subsampling

Key Innovation: At each split, only consider random subset of features

Why? Prevents strong features from dominating all trees

• If "Income" is strongest feature, ALL trees would split on it first

• Trees become correlated → ensemble doesn't help much

• Random subsampling forces trees to use different features

Typical values:

• Classification: √m features (e.g., 10 features → sample 3)
• Regression: m/3 features (e.g., 12 features → sample 4)

Step 3: Voting / Averaging

Classification

• Each tree votes for a class

• Majority vote wins

• Example: 100 trees

65 say "Approve"

35 say "Deny"

→ Predict "Approve"

Regression

• Each tree predicts a value

• Average all predictions

• Example: 100 trees

Tree outputs: [45K, 48K, 43K, ...]

→ Predict mean = $46K

📊 Why Random Forests Work: Bias-Variance Tradeoff

The Fundamental Equation

Error = Bias² + Variance + Noise

• Bias: How wrong is the model on average?

Low bias = complex model fits training data well

• Variance: How much do predictions change with different training data?

Low variance = stable, consistent predictions

• Noise: Irreducible error in data (can't fix)

Decision Tree: Low Bias, High Variance

• Low Bias: Deep trees can fit any pattern (very flexible)

• High Variance: Small data changes → big tree changes (unstable)

Perfect candidate for variance reduction via averaging!

Random Forest: Reduces Variance Without Increasing Bias

Var(average of N trees) = Var(single tree) / N

(assumes trees are independent)

Mathematical Intuition:

If you average N independent predictors, variance drops by factor of N

100 trees → variance reduced 100×!

In Practice:

Trees aren't fully independent (same dataset), so reduction is less

But still get major variance reduction: ~10-20× typical

Bias stays same:

Each tree is still fully grown (low bias)

Average of low-bias predictors = low bias

Visualization

Single Tree:

High variance (±15%), Low bias

Test accuracy: 70-85% (unstable!)

Random Forest (100 trees):

Low variance (±2%), Low bias

Test accuracy: 82-84% (stable!)

📈 Out-of-Bag (OOB) Error Estimation

Free Validation Set!

• Recall: Each bootstrap sample contains ~63% of data

• The remaining ~37% are "out-of-bag" (OOB) samples

• Key insight: These OOB samples weren't used to train that tree!

They act like a validation set for that tree

OOB Score Calculation

For each sample in dataset:

1. Find all trees where this sample was OOB (~37 trees out of 100)

2. Aggregate predictions from those trees only

3. Compare aggregate prediction to true label

OOB error = average error across all samples

Example: Sample #42 (true label = "Approve")

• OOB in trees: #3, #5, #12, #18, ... (40 trees total)

• These 40 trees vote: 28 say "Approve", 12 say "Deny"

• Prediction: "Approve" ✓ Correct!

Why OOB Is Amazing

✅ No separate validation set needed: Save data for training

✅ Unbiased estimate: As accurate as cross-validation

✅ Free: No extra computation (trees already trained)

✅ Hyperparameter tuning: Use OOB score to pick n_trees, max_features

Typical: OOB error ≈ test error (within 1-2%)

⚔️ Random Forest vs Other Ensembles

Random Forest (Bagging)

• Parallel training (fast!)

• Bootstrap + random features

• Reduces variance

• Each tree independent

• Robust to overfitting

Use when: Default choice, fast, interpretable

Gradient Boosting (XGBoost)

• Sequential training (slower)

• Each tree fixes previous errors

• Reduces bias + variance

• Trees are dependent

• Can overfit if not tuned

Use when: Need max accuracy, have tuning time

AdaBoost

• Sequential training

• Re-weights misclassified samples

• Uses shallow trees (stumps)

• Sensitive to outliers

• Less popular now

Use when: Simple baselines, historical code

Benchmark (Typical Results)

Random Forest: 85% accuracy, 2 min training

XGBoost: 87% accuracy, 10 min training

AdaBoost: 83% accuracy, 5 min training

RF wins on: Speed, robustness, ease of use. XGBoost wins on: Peak accuracy (Kaggle competitions).

🔑 Key Insight

Random Forests are "ensemble learning" - combining many weak learners (trees) into a strong learner. Diversity (bootstrap + random features) is the secret sauce!

💡 Practical Tip

Start with 100 trees, √m features. Rarely need tuning - RF is remarkably robust. Use OOB error instead of cross-validation to save time!

5. Build a Random Forest

🌲 Interactive: Grow Your Forest

Number of Trees: 10

1 tree10 (typical min)100+ (better)

🌲

Accuracy

85%

Training Time

1.0s

Variance

32%

Bootstrap Sampling(Each tree trains on random sample with replacement)

6. Bagging: Bootstrap Aggregating

🎒 Interactive: Sample & Aggregate

Sample Size: 70% of data

Tree 1

Sample 1

70% of 1000 = 700 samples

Prediction: ✓ Class A

Tree 2

Sample 2

70% of 1000 = 700 samples

Prediction: ✗ Class B

Tree 3

Sample 3

70% of 1000 = 700 samples

Prediction: ✓ Class A

🗳️ Final Ensemble Prediction

Class A (by majority vote)

7. Make a Prediction

🔮 Interactive: Loan Approval Predictor

Age: 35 years

Income: $50,000

✅

Approved

Meets all criteria for loan approval

💡 Decision Path: Age ≥ 25 ✓ → Income ≥ $30K ✓ → Approved!

8. Ensemble Voting Mechanism

🗳️ Interactive: Democracy of Trees

Tree 1

Tree 2

Tree 3

Final Prediction

Class A

By majority vote

9. Random Feature Selection

🎲 Interactive: Feature Subsampling

Each tree in a Random Forest considers only a random subset of features at each split. This creates diverse trees and prevents overfitting. Typical: √n features for classification, n/3 for regression.

Features Selected2 / 6

Recommended: √6 ≈ 3 features per split

10. Single Tree vs Random Forest

⚖️ Interactive: Compare Performance

Accuracy

94%

Overfitting

Low

Training Time

Interpretability

Medium

🌲 Random Forest

Slower to train but much more accurate and robust. Industry standard for tabular data. Lower variance through ensemble averaging.

🎯 Key Takeaways

🌳

Decision Trees are Intuitive

Easy to understand and visualize. Great for interpretability. But single trees overfit easily - they memorize training data noise.

🌲

Random Forests Fix Overfitting

Combine many trees trained on random subsets of data and features. Ensemble voting reduces variance. Industry standard for tabular data!

⭐

Feature Importance is Gold

Random Forests tell you which features matter most. Use this for feature selection, data understanding, and stakeholder communication.

✂️

Pruning Prevents Overfitting

Limit max_depth (3-10 typical), set min_samples_leaf (5-20), or use pre-pruning. Don't let trees grow too deep or they'll memorize noise.

🎒

Bagging Builds Better Models

Bootstrap Aggregating: train each tree on random sample with replacement. Reduces variance by √n. The foundation of Random Forests.

🚀

When to Use Random Forests

Tabular data, classification or regression, when accuracy matters more than speed. Alternatives: XGBoost/LightGBM (faster, often better), Neural Nets (unstructured data).