Alerting & Debugging

Alerts wake you up at 3am, so make them count. Alert on symptoms (an error rate spike), not causes (high CPU). Set thresholds based on user impact: if users aren't affected, don't page. Use severity levels: INFO (log only), WARN (investigate tomorrow), ERROR (page immediately).

When debugging: 1) Check the dashboard (what's broken?), 2) Grep logs by trace ID (what happened?), 3) Examine traces (where's the bottleneck?). A 5-minute diagnosis beats 5 hours of guesswork.
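The severity levels above can be sketched as a small routing table. This is only an illustration: the action names (`log_only`, `ticket_for_business_hours`, `page_on_call`) and the `users_affected` flag are assumptions made for the example, not part of any particular alerting tool.

```python
from enum import Enum


class Severity(Enum):
    INFO = "info"
    WARN = "warn"
    ERROR = "error"


# Hypothetical action names, chosen for this sketch only.
ROUTES = {
    Severity.INFO: "log_only",
    Severity.WARN: "ticket_for_business_hours",
    Severity.ERROR: "page_on_call",
}


def route_alert(severity: Severity, users_affected: bool) -> str:
    """Pick an action for an alert; never page unless users are affected."""
    if severity is Severity.ERROR and not users_affected:
        # "If users aren't affected, don't page": downgrade to the WARN route.
        return ROUTES[Severity.WARN]
    return ROUTES[severity]
```

Keeping the mapping in one table makes it easy to review which conditions can actually wake someone up.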

Interactive: Alert Management Simulator

The simulator watches four alert rules, each with a severity, a trigger condition, and an action:

Error Rate Spike (severity: high)
Condition: Error rate >2% for 5 minutes
Action: Page on-call engineer

High Latency (severity: medium)
Condition: P95 latency >1000ms for 10 minutes
Action: Slack notification

Cost Spike (severity: critical)
Condition: Hourly cost >$500
Action: Page team lead + pause non-critical agents

Queue Congestion (severity: medium)
Condition: Queue depth >1000 requests
Action: Auto-scale workers
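Conditions such as "error rate >2% for 5 minutes" fire only when the threshold stays breached for the whole window. A minimal sketch of that logic follows; the `AlertRule` class and its field names are hypothetical, and timestamps are plain seconds passed in by the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    name: str
    threshold: float
    for_seconds: int  # how long the breach must last; 0 fires immediately
    action: str
    breach_started: Optional[float] = None  # when the value first exceeded the threshold

    def evaluate(self, value: float, now: float) -> bool:
        """Return True once value has stayed above threshold for for_seconds."""
        if value <= self.threshold:
            self.breach_started = None  # condition cleared, reset the timer
            return False
        if self.breach_started is None:
            self.breach_started = now
        return now - self.breach_started >= self.for_seconds


error_rate = AlertRule("Error Rate Spike", threshold=2.0, for_seconds=300,
                       action="Page on-call engineer")
fired_at_start = error_rate.evaluate(3.0, now=0)       # breach just began
fired_after_5m = error_rate.evaluate(3.5, now=300)     # still breached 5 minutes later
fired_after_drop = error_rate.evaluate(1.0, now=360)   # back under the threshold
```

The duration window is what keeps a single bad minute from paging anyone; a rule with no window, like the cost spike, would use `for_seconds=0`.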

Debugging Workflow

1) Check Dashboard: Which metric is abnormal? Error rate? Latency? Cost?

2) Find Trace ID: Pick a failing request from logs, grab its trace_id

3) Follow The Trace: See request journey across services, find bottleneck

4) Grep Logs: Search all logs for that trace_id, read error messages

5) Fix & Verify: Deploy fix, watch metrics return to normal, document incident
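The "grep logs by trace_id" step can be sketched as a filter over structured logs. This assumes logs are written as one JSON object per line with a `trace_id` field; the other field names and the sample records are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def logs_for_trace(lines: Iterable[str], trace_id: str) -> Iterator[dict]:
    """Yield parsed log records whose trace_id matches, in input order."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines rather than fail mid-incident
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            yield record


# Made-up sample data: two services handling trace "abc123", plus noise.
sample_logs = [
    '{"ts": 1, "service": "gateway", "trace_id": "abc123", "level": "INFO", "msg": "request received"}',
    '{"ts": 2, "service": "planner", "trace_id": "zzz999", "level": "INFO", "msg": "ok"}',
    "not json at all",
    '{"ts": 3, "service": "worker", "trace_id": "abc123", "level": "ERROR", "msg": "upstream timeout"}',
]

matches = list(logs_for_trace(sample_logs, "abc123"))
errors = [r["msg"] for r in matches if r.get("level") == "ERROR"]
```

With plain-text logs, an ordinary `grep` for the trace ID across all log files does the same job; the point is that every service must log the same ID.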

Alert Fatigue Is Real

Too many alerts = ignored alerts. If you page for every warning, engineers will ignore pages. Only alert on user-impacting issues. ERROR alert = immediate response required. WARN alert = investigate during business hours. INFO = just log it. Review alert history monthly: Which alerts were false positives? Which real incidents had no alert? Tune thresholds. Good alerting means 95% of pages are real problems needing immediate action.
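The monthly review can be made concrete by measuring what fraction of pages were real problems and which alerts produced the false positives. The sketch below assumes each past page has been labeled by hand as actionable or not; `PageRecord` and both function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PageRecord:
    alert_name: str
    actionable: bool  # True if the page was a real problem needing immediate action


def page_precision(history: List[PageRecord]) -> Optional[float]:
    """Fraction of pages that were real problems, or None with no pages."""
    if not history:
        return None
    return sum(r.actionable for r in history) / len(history)


def noisy_alerts(history: List[PageRecord]) -> List[Tuple[str, int]]:
    """False-positive page counts per alert name, noisiest first."""
    counts = Counter(r.alert_name for r in history if not r.actionable)
    return counts.most_common()


# Made-up review data for one month.
history = [
    PageRecord("Error Rate Spike", True),
    PageRecord("Error Rate Spike", True),
    PageRecord("High Latency", False),
    PageRecord("Cost Spike", True),
]
precision = page_precision(history)
noisiest = noisy_alerts(history)
```

A precision below the 95% target points at the alerts in `noisiest` as the first thresholds to tune. Incidents that had no alert at all will not appear in this data and need a separate check against the incident log.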
