Home/Concepts/Artificial Intelligence/Decision Trees and Random Forests

Decision Trees and Random Forests

Build ensemble models and visualize decision boundaries

⏱️ 26 min20 interactions

What are Decision Trees?

Decision Trees are intuitive ML models that make decisions by asking a series of yes/no questions. Think of it like a flowchart - you start at the top and follow branches until you reach a decision. Random Forests combine many trees to create more powerful, accurate predictions.

💡 The Core Idea

🌳
Single Tree
One flowchart, clear logic, but prone to overfitting
🌲
Random Forest
Many trees voting together, more robust and accurate
🎯
Ensemble Power
Wisdom of crowds - average many predictions for better results

Decision Tree Construction: The CART Algorithm

🌳 How Trees Grow: Recursive Partitioning

The Greedy Algorithm

Goal: Split data into increasingly pure subsets until stopping criteria met
Approach: Greedy - choose best split at each node (no backtracking)
Result: Tree structure where each leaf represents a prediction
function BuildTree(data, depth):
if stopping_condition(data, depth):
return LeafNode(majority_class(data))
best_split = find_best_split(data) // key step!
left_data, right_data = split(data, best_split)
left_subtree = BuildTree(left_data, depth+1)
right_subtree = BuildTree(right_data, depth+1)
return DecisionNode(best_split, left_subtree, right_subtree)

🔍 Finding the Best Split

Step 1: Evaluate All Possible Splits
For each feature: Try every possible threshold value
- Numerical: Age ≥ 30? Income ≥ $50K? (many thresholds)
- Categorical: City = "NY"? Color ∈ {"Red", "Blue"}?
Calculate impurity reduction:
ΔImpurity = I(parent) - (Nleft/N) I(left) - (Nright/N) I(right)
Where I() is Gini or Entropy, N is number of samples
Step 2: Choose Split with Maximum Gain
Information Gain: How much does this split reduce uncertainty?
Gain = H(parent) - Σ (Nchild/Ntotal) H(child)
Pick feature and threshold with highest gain
Tie-breaking: First encountered or random
Example: Loan Approval Dataset
100 samples: 60 approved, 40 denied
Split 1: Age ≥ 30
Left (Age ≥ 30): 50 approved, 10 denied → Gini = 0.267
Right (Age < 30): 10 approved, 30 denied → Gini = 0.375
Weighted Gini = 0.60 × 0.267 + 0.40 × 0.375 = 0.310
Split 2: Income ≥ $50K
Left (Income ≥ $50K): 55 approved, 15 denied → Gini = 0.347
Right (Income < $50K): 5 approved, 25 denied → Gini = 0.278
Weighted Gini = 0.70 × 0.347 + 0.30 × 0.278 = 0.326
→ Choose Age ≥ 30 (lower impurity = better split!)

🛑 Stopping Criteria

Stop splitting when ANY of these conditions met:
1. Maximum Depth Reached
• depth ≥ max_depth (e.g., 10)
• Prevents overly complex trees
• Typical: 3-10 for good generalization
2. Too Few Samples
• Nnode < min_samples_split (e.g., 10)
• Not enough data to split reliably
• Prevents fitting noise
3. Pure Node
• All samples same class (Gini=0)
• Perfect classification achieved
• No benefit from further splitting
4. No Information Gain
• Best split doesn't reduce impurity
• Features exhausted or uninformative
• Can't improve predictions

⏱️ Computational Complexity

Training Complexity
O(n × m × log n)
where n = samples, m = features
For each feature (m):
- Sort samples by feature value: O(n log n)
- Find best threshold: O(n) linear scan
Tree depth: O(log n) on average, O(n) worst case
Total nodes: ~2depth, each does O(n×m) work
Prediction Complexity
O(depth) = O(log n) average, O(n) worst
Very fast! Just traverse tree from root to leaf
• Balanced tree: log₂(n) comparisons
• Example: 1M samples → ~20 comparisons

📚 Classic Algorithms

ID3 (1986)
• Quinlan's original algorithm
• Uses entropy and information gain
• Categorical features only
• No pruning
C4.5 (1993)
• Successor to ID3
• Handles numerical features
• Gain ratio (normalizes gain)
• Post-pruning with error-based
CART (1984)
• Classification & Regression
• Uses Gini impurity
• Binary splits only
• sklearn implementation

✅ Advantages

• Interpretable (visual flowchart)
• No feature scaling needed
• Handles mixed data types
• Fast prediction (O(log n))

✗ Disadvantages

• Prone to overfitting
• High variance (unstable)
• Greedy (locally optimal)
• Bias toward many-level features

💡 Key Insight

Decision trees are greedy - they choose the best split NOW without considering future splits. This makes them fast but suboptimal!

1. Build Your Decision Tree

🌳 Interactive: Construct a Tree

Shallow (1)Medium (3)Deep (5)
Root
Age ≥ 30?
Yes
Income ≥ $50K?
No
Credit ≥ 650?

🌳 How it works: Start at the root, ask a question about a feature, branch left or right based on the answer, repeat until reaching a leaf node (final decision).

Splitting Criteria: Gini vs Entropy

📊 Measuring Impurity: How Mixed Is This Node?

🎯 What Is Impurity?

Core Concept
Impurity measures how mixed the class labels are in a node:
Pure node (impurity = 0): All samples same class
Example: [100 Yes, 0 No] → Perfect! No mixing
Mixed node (impurity > 0): Multiple classes present
Example: [50 Yes, 50 No] → Maximum mixing
Goal: Find splits that maximize purity (minimize impurity)
Why We Need Impurity Measures
Quantify split quality: Which split reduces uncertainty most?
Compare across features: Age ≥ 30 vs Income ≥ $50K?
Stopping criterion: Stop when impurity ≤ threshold
Guide greedy algorithm: Choose split with max impurity reduction

📐 Gini Impurity: Probability of Misclassification

Formula (Binary Classification)
Gini(p) = 2p(1 - p)
where p = proportion of positive class
Intuition:
If you randomly label a sample from this node based on class distribution,
what's the probability of being wrong?
Step-by-Step:
1. Pick random sample: Probability of positive = p
2. Guess label randomly: Guess positive with prob = p
3. Calculate error:
• Sample is positive, guess negative: p × (1-p)
• Sample is negative, guess positive: (1-p) × p
• Total error: p(1-p) + (1-p)p = 2p(1-p)
Gini Values (Binary)
Pure (p=0 or 1)
Gini = 0
[100,0] or [0,100]
Moderate (p=0.3)
Gini = 0.42
[30,70]
Max (p=0.5)
Gini = 0.5
[50,50]
Multi-class Extension
Gini = 1 - Σ pi²
Sum over all classes i
Example: 3 classes [40, 30, 30] samples
p₁=0.4, p₂=0.3, p₃=0.3
Gini = 1 - (0.4² + 0.3² + 0.3²) = 1 - 0.34 = 0.66

🌀 Entropy: Information-Theoretic Uncertainty

Formula (Binary Classification)
H(p) = -p log₂(p) - (1-p) log₂(1-p)
where p = proportion of positive class
Intuition from Information Theory:
Entropy = expected number of bits needed to encode the class
Higher entropy = more uncertainty = more bits needed
Why log₂?
Rare events carry more information: log converts probabilities to bits
If p=1 (certain) → log₂(1) = 0 bits needed
If p=0.5 (50/50) → log₂(0.5) = -1, so H = 1 bit needed
Negative sign: Makes entropy positive (log of prob is negative)
Entropy Values (Binary)
Pure (p=0 or 1)
H = 0 bits
No uncertainty
Moderate (p=0.3)
H = 0.88 bits
Some uncertainty
Max (p=0.5)
H = 1 bit
Maximum uncertainty
Multi-class Extension
H = -Σ pi log₂(pi)
Sum over all classes i
Example: 3 classes [40, 30, 30] samples
H = -(0.4×log₂0.4 + 0.3×log₂0.3 + 0.3×log₂0.3)
H = -(0.4×(-1.32) + 0.3×(-1.74) + 0.3×(-1.74))
H = 1.57 bits

⚖️ Gini vs Entropy: Side-by-Side

Gini Impurity
Faster to compute: No log calculations
Range [0, 0.5]: Easier to interpret
sklearn default: Industry standard
Favors larger partitions: More balanced trees
⚠️ Approximates entropy: 2p(1-p) ≈ H(p)
Use when: Speed matters, default choice
Entropy
Information theory: Principled approach
Range [0, 1]: Bits of information
Better for imbalanced: log handles rare classes
More sensitive to changes: Sharper splits
⚠️ Slower: log₂ computation cost
Use when: Theoretical preference, ID3/C4.5
Practical Truth
In practice, Gini and Entropy produce very similar trees! Differences are minor (<1-2% accuracy). Choose Gini for speed, Entropy for theory. Most important: tree depth, min_samples, and pruning.

📈 Information Gain: The Reduction in Impurity

Formula
IG = I(parent) - Σ (Nchild/Nparent) × I(child)
where I() is Gini or Entropy
Interpretation:
Before split: Parent node impurity I(parent)
After split: Weighted average of children impurities
Information Gain: How much purity increased
Goal: Maximize IG (pick split with highest gain)
Complete Example
Dataset: 100 loan applications [60 approved, 40 denied]
Parent Node:
Gini(0.6) = 2 × 0.6 × 0.4 = 0.48
Split: Age ≥ 30
• Left (≥30): 60 samples [50 approved, 10 denied]
Gini = 2 × (50/60) × (10/60) = 0.278
• Right (<30): 40 samples [10 approved, 30 denied]
Gini = 2 × (10/40) × (30/40) = 0.375
• Weighted avg: (60/100)×0.278 + (40/100)×0.375 = 0.317
Information Gain = 0.48 - 0.317 = 0.163
Split: Income ≥ $50K
• Left (≥$50K): 70 samples [55 approved, 15 denied]
Gini = 2 × (55/70) × (15/70) = 0.337
• Right (<$50K): 30 samples [5 approved, 25 denied]
Gini = 2 × (5/30) × (25/30) = 0.278
• Weighted avg: (70/100)×0.337 + (30/100)×0.278 = 0.319
Information Gain = 0.48 - 0.319 = 0.161
→ Choose Age ≥ 30 (higher IG: 0.163 > 0.161)
Gain Ratio (C4.5 Improvement)
GainRatio = IG / SplitInfo
SplitInfo = -Σ (Ni/N) log₂(Ni/N)
Problem: IG favors features with many values
Example: "CustomerID" has perfect IG (1 sample per value!)
Solution: Normalize by split entropy (SplitInfo)
Many-valued splits have high SplitInfo → lower GainRatio
C4.5 algorithm uses GainRatio instead of IG

🔑 Key Takeaway

Both Gini and Entropy measure "impurity" (class mixing). Split quality = reduction in impurity. Choose the split with maximum Information Gain!

💡 Practical Tip

Don't overthink Gini vs Entropy - they give similar results. Focus on tree complexity (depth, pruning) which has 10× more impact on performance!

2. Splitting Criteria: Gini vs Entropy

📊 Interactive: Measure Impurity

Gini Impurity

0.500
Formula: 2p(1-p)

Entropy

1.000
Formula: -Σ p·log₂(p)

⚖️ Maximum impurity - completely mixed classes

3. Feature Importance

⭐ Interactive: Which Features Matter Most?

Age35%
Income28%
Credit Score22%
Employment15%
📊 How It's Calculated
Based on how much each feature reduces impurity when used for splitting. Higher = more predictive power.
💡 Why It Matters
Helps with feature selection, interpretability, and understanding your data. Drop low-importance features to simplify.

4. Preventing Overfitting

✂️ Interactive: Prune Your Tree

Overfitting Risk59%
Train Accuracy
82%
Test Accuracy
71%
Gap
12%

⚖️ Moderate complexity. Consider slightly more constraints.

Random Forests: Ensemble Learning Power

🌳 From One Tree to a Forest

🎯 The Problem with Single Decision Trees

High Variance = Unstable Predictions
Small data changes → Big tree changes:
Add one sample → completely different tree structure!
Overfitting: Learns noise and specific patterns from training data
Perfect on training (100%) but poor on test (70%)
Greedy algorithm: Locally optimal splits ≠ globally optimal tree
First split locks in future structure
Example: Loan Prediction
Training set: 100 samples
Tree A: Splits on Age first → 85% test accuracy
Tree B: Add 5 samples → now splits on Income first → 78% test accuracy
Tree C: Remove 3 samples → splits on Credit score → 82% test accuracy
Problem: Which tree is "right"? Predictions vary wildly!

🌲 Random Forest Algorithm

Core Idea: Wisdom of Crowds
Train many diverse trees, then vote on final prediction
for i = 1 to N_trees:
1. Sample data with replacement (bootstrap)
2. At each split, sample random subset of features
3. Train tree on bootstrapped data + random features
Final prediction = majority vote (classification)
or average (regression)
Step 1: Bootstrap Aggregating (Bagging)
What is Bootstrap?
Sample N observations from dataset with replacement
Example: Dataset = [A, B, C, D, E]
• Bootstrap sample 1: [A, A, C, D, E] ← A appears twice, B missing
• Bootstrap sample 2: [A, B, B, D, E] ← B appears twice, C missing
• Bootstrap sample 3: [B, C, C, C, E] ← C appears 3×, A and D missing
Result: Each tree sees different data → diverse trees!
On average, each bootstrap sample contains ~63.2% unique observations
(Probability of being selected at least once: 1 - (1-1/N)^N → 1 - 1/e ≈ 0.632)
Step 2: Random Feature Subsampling
Key Innovation: At each split, only consider random subset of features
Why? Prevents strong features from dominating all trees
• If "Income" is strongest feature, ALL trees would split on it first
• Trees become correlated → ensemble doesn't help much
• Random subsampling forces trees to use different features
Typical values:
• Classification: √m features (e.g., 10 features → sample 3)
• Regression: m/3 features (e.g., 12 features → sample 4)
Step 3: Voting / Averaging
Classification
• Each tree votes for a class
• Majority vote wins
• Example: 100 trees
65 say "Approve"
35 say "Deny"
→ Predict "Approve"
Regression
• Each tree predicts a value
• Average all predictions
• Example: 100 trees
Tree outputs: [45K, 48K, 43K, ...]
→ Predict mean = $46K

📊 Why Random Forests Work: Bias-Variance Tradeoff

The Fundamental Equation
Error = Bias² + Variance + Noise
Bias: How wrong is the model on average?
Low bias = complex model fits training data well
Variance: How much do predictions change with different training data?
Low variance = stable, consistent predictions
Noise: Irreducible error in data (can't fix)
Decision Tree: Low Bias, High Variance
Low Bias: Deep trees can fit any pattern (very flexible)
High Variance: Small data changes → big tree changes (unstable)
Perfect candidate for variance reduction via averaging!
Random Forest: Reduces Variance Without Increasing Bias
Var(average of N trees) = Var(single tree) / N
(assumes trees are independent)
Mathematical Intuition:
If you average N independent predictors, variance drops by factor of N
100 trees → variance reduced 100×!
In Practice:
Trees aren't fully independent (same dataset), so reduction is less
But still get major variance reduction: ~10-20× typical
Bias stays same:
Each tree is still fully grown (low bias)
Average of low-bias predictors = low bias
Visualization
Single Tree:
High variance (±15%), Low bias
Test accuracy: 70-85% (unstable!)
Random Forest (100 trees):
Low variance (±2%), Low bias
Test accuracy: 82-84% (stable!)

📈 Out-of-Bag (OOB) Error Estimation

Free Validation Set!
Recall: Each bootstrap sample contains ~63% of data
The remaining ~37% are "out-of-bag" (OOB) samples
Key insight: These OOB samples weren't used to train that tree!
They act like a validation set for that tree
OOB Score Calculation
For each sample in dataset:
1. Find all trees where this sample was OOB (~37 trees out of 100)
2. Aggregate predictions from those trees only
3. Compare aggregate prediction to true label
OOB error = average error across all samples
Example: Sample #42 (true label = "Approve")
• OOB in trees: #3, #5, #12, #18, ... (40 trees total)
• These 40 trees vote: 28 say "Approve", 12 say "Deny"
• Prediction: "Approve" ✓ Correct!
Why OOB Is Amazing
No separate validation set needed: Save data for training
Unbiased estimate: As accurate as cross-validation
Free: No extra computation (trees already trained)
Hyperparameter tuning: Use OOB score to pick n_trees, max_features
Typical: OOB error ≈ test error (within 1-2%)

⚔️ Random Forest vs Other Ensembles

Random Forest (Bagging)
• Parallel training (fast!)
• Bootstrap + random features
• Reduces variance
• Each tree independent
• Robust to overfitting
Use when: Default choice, fast, interpretable
Gradient Boosting (XGBoost)
• Sequential training (slower)
• Each tree fixes previous errors
• Reduces bias + variance
• Trees are dependent
• Can overfit if not tuned
Use when: Need max accuracy, have tuning time
AdaBoost
• Sequential training
• Re-weights misclassified samples
• Uses shallow trees (stumps)
• Sensitive to outliers
• Less popular now
Use when: Simple baselines, historical code
Benchmark (Typical Results)
Random Forest: 85% accuracy, 2 min training
XGBoost: 87% accuracy, 10 min training
AdaBoost: 83% accuracy, 5 min training
RF wins on: Speed, robustness, ease of use. XGBoost wins on: Peak accuracy (Kaggle competitions).

🔑 Key Insight

Random Forests are "ensemble learning" - combining many weak learners (trees) into a strong learner. Diversity (bootstrap + random features) is the secret sauce!

💡 Practical Tip

Start with 100 trees, √m features. Rarely need tuning - RF is remarkably robust. Use OOB error instead of cross-validation to save time!

5. Build a Random Forest

🌲 Interactive: Grow Your Forest

1 tree10 (typical min)100+ (better)
🌲
🌲
🌲
🌲
🌲
🌲
🌲
🌲
🌲
🌲
Accuracy
85%
Training Time
1.0s
Variance
32%

6. Bagging: Bootstrap Aggregating

🎒 Interactive: Sample & Aggregate

Tree 1
Sample 1
70% of 1000 = 700 samples
Prediction: ✗ Class B
Tree 2
Sample 2
70% of 1000 = 700 samples
Prediction: ✗ Class B
Tree 3
Sample 3
70% of 1000 = 700 samples
Prediction: ✗ Class B
🗳️ Final Ensemble Prediction
Class A (by majority vote)

7. Make a Prediction

🔮 Interactive: Loan Approval Predictor

Approved
Meets all criteria for loan approval

💡 Decision Path: Age ≥ 25 ✓ → Income ≥ $30K ✓ → Approved!

8. Ensemble Voting Mechanism

🗳️ Interactive: Democracy of Trees

Tree 1
Tree 2
Tree 3
Final Prediction
Class A
By majority vote

9. Random Feature Selection

🎲 Interactive: Feature Subsampling

Each tree in a Random Forest considers only a random subset of features at each split. This creates diverse trees and prevents overfitting. Typical: √n features for classification, n/3 for regression.

Features Selected2 / 6
Recommended:63 features per split

10. Single Tree vs Random Forest

⚖️ Interactive: Compare Performance

Accuracy
94%
Overfitting
Low
Training Time
8s
Interpretability
Medium
🌲 Random Forest
Slower to train but much more accurate and robust. Industry standard for tabular data. Lower variance through ensemble averaging.

🎯 Key Takeaways

🌳

Decision Trees are Intuitive

Easy to understand and visualize. Great for interpretability. But single trees overfit easily - they memorize training data noise.

🌲

Random Forests Fix Overfitting

Combine many trees trained on random subsets of data and features. Ensemble voting reduces variance. Industry standard for tabular data!

Feature Importance is Gold

Random Forests tell you which features matter most. Use this for feature selection, data understanding, and stakeholder communication.

✂️

Pruning Prevents Overfitting

Limit max_depth (3-10 typical), set min_samples_leaf (5-20), or use pre-pruning. Don't let trees grow too deep or they'll memorize noise.

🎒

Bagging Builds Better Models

Bootstrap Aggregating: train each tree on random sample with replacement. Reduces variance by √n. The foundation of Random Forests.

🚀

When to Use Random Forests

Tabular data, classification or regression, when accuracy matters more than speed. Alternatives: XGBoost/LightGBM (faster, often better), Neural Nets (unstructured data).