Workflow Monitoring

Build observability systems to track, analyze, and optimize your agentic AI workflows in production

Building Effective Alert Systems

Good alerting is an art. Alert too aggressively and your team develops alert fatigue, ignoring notifications. Alert too conservatively and critical issues go unnoticed until users complain.

The Golden Rules of Alerting

✓ Alert on symptoms, not causes: Alert when users are impacted, not when internal metrics fluctuate
✓ Every alert must be actionable: If there's nothing to do, it shouldn't page anyone
✓ Use appropriate severity levels: Critical = wake someone up, Warning = investigate during work hours, Info = track for trends
✓ Include context in alerts: Link to runbooks, show recent trends, suggest next steps
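
As an illustration of the last two rules, here is a minimal Python sketch of an alert payload that carries its own context. The `Alert` class, its field names, and the `example.com` runbook URL are assumptions made for this example, not part of any particular alerting tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Alert:
    """Hypothetical alert payload; the field names are illustrative."""
    title: str
    severity: str                      # "critical", "warning", or "info"
    symptom: str                       # user-facing impact, not an internal cause
    runbook_url: str = ""              # where the responder should start
    recent_values: List[float] = field(default_factory=list)
    next_steps: List[str] = field(default_factory=list)


def is_actionable(alert: Alert) -> bool:
    """Treat an alert as actionable only if it tells the responder what to do."""
    return bool(alert.runbook_url) and bool(alert.next_steps)


def render(alert: Alert) -> str:
    """Build the notification text, including the trend and suggested steps."""
    lines = [f"[{alert.severity.upper()}] {alert.title}",
             f"Impact: {alert.symptom}"]
    if alert.recent_values:
        trend = " -> ".join(f"{v:.1%}" for v in alert.recent_values)
        lines.append(f"Recent trend: {trend}")
    if alert.runbook_url:
        lines.append(f"Runbook: {alert.runbook_url}")
    lines.extend(f"Next step: {step}" for step in alert.next_steps)
    return "\n".join(lines)


alert = Alert(
    title="High Error Rate Detected",
    severity="critical",
    symptom="Workflows are failing for end users",
    runbook_url="https://example.com/runbooks/error-rate",  # placeholder URL
    recent_values=[0.012, 0.047],
    next_steps=["Check the most recent deployment"],
)
print(render(alert))
```

A check like `is_actionable` could run when a rule is created, so that an alert without a runbook or next steps is rejected before it can page anyone.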

Three Severity Levels

🚨 Critical

Service degraded or down. User-facing impact. Requires immediate action.

Examples: Error rate > 5%, complete service outage, data loss

⚠️ Warning

Potential issue developing. May impact users soon. Investigate during work hours.

Examples: Latency trending up, queue growing, approaching capacity limits

ℹ️ Info

Notable event occurred. No immediate action needed. Good to know for context.

Examples: Deployment completed, configuration changed, traffic spike
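
One way to encode the three levels is as an enum with a routing table. This is only a sketch: the channel names (`page_on_call`, `ticket_queue`, `event_log`) are hypothetical placeholders for whatever paging, ticketing, and logging tools you use.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Hypothetical destinations; substitute your own paging and ticketing tools.
ROUTES = {
    Severity.CRITICAL: "page_on_call",   # wake someone up
    Severity.WARNING: "ticket_queue",    # investigate during work hours
    Severity.INFO: "event_log",          # track for trends
}


def route(severity: Severity) -> str:
    """Return the destination channel for an alert of this severity."""
    return ROUTES[severity]


def should_interrupt(severity: Severity) -> bool:
    """Only critical alerts are allowed to interrupt a person immediately."""
    return severity is Severity.CRITICAL
```

Keeping the mapping in one table makes it easy to audit which conditions can page a person.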

Alert Dashboard

A sample dashboard showing active alerts and the rules that trigger them. Acknowledging an alert clears it from the active list; toggling a rule enables or disables it:

Active Alerts (2 active)

🚨 High Error Rate Detected (critical, 2 min ago)
Metric: Error Rate
Current Value: 4.7%
Threshold: < 2%

⚠️ Elevated Latency (warning, 8 min ago)
Metric: P95 Latency
Current Value: 723ms
Threshold: < 500ms

Recently Acknowledged

⚠️ Queue Depth Increasing (15 min ago): ✓ Acknowledged

Alert Rules Configuration

Toggle rules on/off to control which conditions trigger alerts:

Error Rate > 2%
Condition: error_rate > 0.02 for 5 minutes
Severity: critical
Status: ✓ Active

P95 Latency > 500ms
Condition: p95_latency > 500 for 10 minutes
Severity: warning
Status: ✓ Active

Success Rate < 95%
Condition: success_rate < 0.95 for 5 minutes
Severity: critical
Status: ✓ Active

Queue Depth > 50
Condition: queue_depth > 50
Severity: warning
Status: ✓ Active

Token Usage Spike
Condition: tokens_per_workflow > 10000
Severity: info
Status: ○ Disabled
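
The rules above pair a threshold with an optional "for N minutes" duration, so a brief blip does not fire an alert. The Python sketch below shows one plausible way to evaluate such rules. The `Rule` class, `is_firing`, and the sample format are assumptions for illustration; the dashboard's actual evaluation logic is not shown in this section.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


@dataclass
class Rule:
    name: str
    metric: str
    breach: Callable[[float, float], bool]  # operator.gt means "value > threshold"
    threshold: float
    for_seconds: float = 0.0   # how long the condition must hold; 0 fires at once
    severity: str = "warning"
    enabled: bool = True


# The five rules listed above, expressed with the hypothetical Rule class.
RULES = [
    Rule("Error Rate > 2%", "error_rate", operator.gt, 0.02, 5 * 60, "critical"),
    Rule("P95 Latency > 500ms", "p95_latency", operator.gt, 500, 10 * 60, "warning"),
    Rule("Success Rate < 95%", "success_rate", operator.lt, 0.95, 5 * 60, "critical"),
    Rule("Queue Depth > 50", "queue_depth", operator.gt, 50, 0, "warning"),
    Rule("Token Usage Spike", "tokens_per_workflow", operator.gt, 10000, 0,
         "info", enabled=False),
]


def is_firing(rule: Rule, samples: List[Sample], now: float) -> bool:
    """Return True if the condition has held continuously for `for_seconds`.

    Assumes `samples` is sorted by timestamp, oldest first.
    """
    if not rule.enabled:
        return False
    breach_start = None
    # Walk back from the newest sample to find where the current breach began.
    for timestamp, value in reversed(samples):
        if not rule.breach(value, rule.threshold):
            break
        breach_start = timestamp
    if breach_start is None:
        return False
    return now - breach_start >= rule.for_seconds
```

With one sample per minute, an error rate that stays at 4.7% from t=0 to t=300 fires the first rule at `now=300`, while a single healthy sample inside that window resets the clock.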
💡 Alert Fatigue Prevention

Review your alerts quarterly. If an alert fires frequently but never leads to action, either fix the underlying issue or remove the alert. Every alert should earn its place by providing genuine value.
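
A quarterly review can start from a simple tally of how often each alert fired and whether anyone acted on it. The sketch below assumes a hypothetical history format of `(alert name, action taken)` pairs and an arbitrary cutoff of 10 firings; both are illustrative choices, not recommendations from this guide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (alert name, whether a responder took action): a hypothetical log format
Firing = Tuple[str, bool]


def noisy_alerts(history: Iterable[Firing], min_firings: int = 10) -> List[str]:
    """Return alerts that fired at least `min_firings` times and never led to action."""
    fired: Dict[str, int] = defaultdict(int)
    acted: Dict[str, int] = defaultdict(int)
    for name, action_taken in history:
        fired[name] += 1
        if action_taken:
            acted[name] += 1
    return sorted(
        name for name, count in fired.items()
        if count >= min_firings and acted[name] == 0
    )


history = (
    [("Queue Depth > 50", False)] * 12    # fires often, nobody acts
    + [("Error Rate > 2%", True)] * 3     # fires rarely, always acted on
)
print(noisy_alerts(history))
```

Alerts returned by `noisy_alerts` are candidates for either fixing the underlying issue or removing the rule.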
