Building Effective Alert Systems

Good alerting is an art. Alert too aggressively and your team develops alert fatigue and starts ignoring notifications. Alert too conservatively and critical issues go unnoticed until users complain.

The Golden Rules of Alerting

✓ Alert on symptoms, not causes: Alert when users are impacted, not when internal metrics fluctuate.

✓ Every alert must be actionable: If there's nothing to do, it shouldn't page anyone.

✓ Use appropriate severity levels: Critical = wake someone up, Warning = investigate during work hours, Info = track for trends.

✓ Include context in alerts: Link to runbooks, show recent trends, suggest next steps.
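To make the "include context" rule concrete, here is a minimal Python sketch of an alert that carries its own context. The `Alert` class, the `format_alert` helper, the runbook URL, the trend values, and the suggested next step are all hypothetical and chosen for illustration; they do not come from any particular alerting tool.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Hypothetical alert record that bundles context with the notification."""
    title: str
    severity: str          # "critical", "warning", or "info"
    metric: str
    current_value: str
    threshold: str
    runbook_url: str       # link to the runbook for this alert
    recent_values: list[str] = field(default_factory=list)  # oldest first
    next_steps: list[str] = field(default_factory=list)


def format_alert(alert: Alert) -> str:
    """Render the alert as plain text, with context after the headline."""
    lines = [
        f"[{alert.severity.upper()}] {alert.title}",
        f"{alert.metric}: {alert.current_value} (threshold {alert.threshold})",
    ]
    if alert.recent_values:
        lines.append("Recent trend: " + " -> ".join(alert.recent_values))
    lines.append(f"Runbook: {alert.runbook_url}")
    for step in alert.next_steps:
        lines.append(f"Next step: {step}")
    return "\n".join(lines)


# Illustrative values only; the URL and the trend numbers are made up.
example = Alert(
    title="High Error Rate Detected",
    severity="critical",
    metric="Error Rate",
    current_value="4.7%",
    threshold="< 2%",
    runbook_url="https://example.com/runbooks/high-error-rate",
    recent_values=["1.8%", "3.1%", "4.7%"],
    next_steps=["Check the most recent deployment"],
)
print(format_alert(example))
```

The point of the design is that whoever receives the alert can start acting from the notification itself, without first searching for the runbook or a dashboard.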

Three Severity Levels

🚨 Critical

Service degraded or down. User-facing impact. Requires immediate action.

Examples: Error rate > 5%, complete service outage, data loss

⚠️ Warning

Potential issue developing. May impact users soon. Investigate during work hours.

Examples: Latency trending up, queue growing, approaching capacity limits

ℹ️ Info

Notable event occurred. No immediate action needed. Good to know for context.

Examples: Deployment completed, configuration changed, traffic spike
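The three levels map naturally onto three different responses. The sketch below shows one possible way to encode that mapping in Python; the `Severity` enum, the `ROUTES` table, and the action names (`page_on_call`, `create_ticket`, `log_only`) are assumptions made for illustration, not part of any specific paging or ticketing product.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Hypothetical action names; a real system would call a pager,
# a ticketing system, or a logging pipeline here.
ROUTES = {
    Severity.CRITICAL: "page_on_call",   # wake someone up
    Severity.WARNING: "create_ticket",   # investigate during work hours
    Severity.INFO: "log_only",           # track for trends
}


def route(severity: str) -> str:
    """Return the action for a severity string such as "critical".

    Raises ValueError if the string is not a known severity.
    """
    return ROUTES[Severity(severity)]


for level in ("critical", "warning", "info"):
    print(level, "->", route(level))
```

Keeping the mapping in one table makes it easy to audit that only critical alerts can page someone.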

Interactive: Alert Dashboard

Manage active alerts and configure alerting rules. Acknowledge an alert to clear it, and toggle a rule to enable or disable it:

Active Alerts (2 active)

🚨 High Error Rate Detected (critical, 2 min ago)

Metric: Error Rate | Current Value: 4.7% | Threshold: < 2%

⚠️ Elevated Latency (warning, 8 min ago)

Metric: P95 Latency | Current Value: 723ms | Threshold: < 500ms

Recently Acknowledged

⚠️ Queue Depth Increasing (15 min ago): ✓ Acknowledged

Alert Rules Configuration

Toggle rules on/off to control which conditions trigger alerts:

Error Rate > 2% (critical, ✓ Active)
error_rate > 0.02 for 5 minutes

P95 Latency > 500ms (warning, ✓ Active)
p95_latency > 500 for 10 minutes

Success Rate < 95% (critical, ✓ Active)
success_rate < 0.95 for 5 minutes

Queue Depth > 50 (warning, ✓ Active)
queue_depth > 50

Token Usage Spike (info, ○ Disabled)
tokens_per_workflow > 10000
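Rule expressions such as `error_rate > 0.02 for 5 minutes` combine a threshold with a duration, so that a brief blip does not trigger an alert. The Python sketch below shows one way such a rule could be evaluated. It assumes that "for N minutes" means the condition has held continuously for at least N minutes up to the latest sample; the `Rule` class and the `is_firing` function are hypothetical, and real alerting systems may define the duration clause differently.

```python
import operator
from dataclasses import dataclass

_OPS = {">": operator.gt, "<": operator.lt}


@dataclass(frozen=True)
class Rule:
    """Hypothetical in-memory form of one alerting rule."""
    name: str
    metric: str
    op: str               # ">" or "<"
    threshold: float
    for_seconds: int = 0  # 0 means fire on the first breaching sample
    severity: str = "warning"
    enabled: bool = True


def is_firing(rule: Rule, samples: list[tuple[float, float]]) -> bool:
    """Decide whether a rule fires for (timestamp_seconds, value) samples.

    Samples must be in ascending time order. The rule fires when the
    latest sample breaches the threshold and the unbroken run of
    breaching samples spans at least rule.for_seconds.
    """
    if not rule.enabled or not samples:
        return False
    breaches = _OPS[rule.op]
    streak_start = None
    for timestamp, value in samples:
        if breaches(value, rule.threshold):
            if streak_start is None:
                streak_start = timestamp
        else:
            streak_start = None  # a healthy sample resets the streak
    if streak_start is None:
        return False
    return samples[-1][0] - streak_start >= rule.for_seconds


error_rate_rule = Rule(
    name="Error Rate > 2%",
    metric="error_rate",
    op=">",
    threshold=0.02,
    for_seconds=5 * 60,
    severity="critical",
)

# Illustrative samples, one per minute, all above the threshold.
sustained = [(t, 0.03) for t in range(0, 301, 60)]
print(is_firing(error_rate_rule, sustained))
```

Toggling a rule off corresponds to setting `enabled=False`, in which case `is_firing` returns `False` regardless of the samples.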

💡 Alert Fatigue Prevention

Review your alerts quarterly. If an alert fires frequently but never leads to action, either fix the underlying issue or remove the alert. Every alert should earn its place by providing genuine value.
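A quarterly review like this can be partly automated. The following Python sketch flags alerts that fired often but never led to action. It assumes you can export a history of firings as `(alert_name, led_to_action)` pairs; the `review_candidates` function and the `min_firings` cutoff of 10 are hypothetical and should be tuned to your own alert volume.

```python
from collections import defaultdict


def review_candidates(
    firings: list[tuple[str, bool]],
    min_firings: int = 10,
) -> list[str]:
    """Return names of alerts that fired at least min_firings times
    in the review period and never led to any action.

    firings holds one (alert_name, led_to_action) pair per firing.
    """
    fired = defaultdict(int)
    acted = defaultdict(int)
    for name, led_to_action in firings:
        fired[name] += 1
        if led_to_action:
            acted[name] += 1
    return sorted(
        name
        for name, count in fired.items()
        if count >= min_firings and acted[name] == 0
    )


# Illustrative history only; the counts are made up.
history = (
    [("Queue Depth > 50", False)] * 12
    + [("Error Rate > 2%", True)] * 3
    + [("Token Usage Spike", False)] * 2
)
print(review_candidates(history))
```

Each flagged alert is a candidate for one of the two outcomes described above: fix the underlying issue, or remove the alert.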
