Building Your Own Framework
Master designing and building custom agentic AI frameworks from scratch
Production-Ready Patterns
Taking your custom framework to production requires robust observability, error handling, and deployment strategies. Here are essential patterns.
Structured Logging
Production Logger
```python
import json
import logging
from datetime import datetime, timezone
from typing import Any, Dict


class JsonFormatter(logging.Formatter):
    """Format log records as structured JSON lines."""

    def __init__(self, agent_id: str):
        super().__init__()
        self.agent_id = agent_id

    def format(self, record: logging.LogRecord) -> str:
        log_data = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "agent_id": self.agent_id,
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(log_data)


class AgentLogger:
    def __init__(self, agent_id: str):
        self.agent_id = agent_id
        self.logger = logging.getLogger(f"agent.{agent_id}")
        self.logger.setLevel(logging.INFO)
        # JSON formatter for structured logs
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(agent_id))
        self.logger.addHandler(handler)

    def log_action(self, action: str, tool: str, result: Any):
        """Log agent actions with context"""
        self.logger.info(
            "Agent action",
            extra={
                "action": action,
                "tool": tool,
                "result_type": type(result).__name__,
                "success": not isinstance(result, dict) or "error" not in result,
            },
        )

    def log_error(self, error: Exception, context: Dict):
        """Log errors with full context, including the traceback"""
        self.logger.error(
            f"Error: {error}",
            extra={"error_type": type(error).__name__, "context": context},
            exc_info=True,
        )
```

📊 Monitoring & Metrics
- Latency: track time per agent loop and per tool execution
- Token usage: monitor LLM costs per request
- Error rate: track tool failures and LLM errors
- Loop iterations: detect infinite loops early
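The metrics above can be collected with a minimal in-process recorder. This is a sketch; `MetricsRecorder` and its method names are illustrative, not part of any specific library:

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class MetricsRecorder:
    """Minimal in-process metrics: counters plus latency samples."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.latencies = defaultdict(list)

    def incr(self, name: str, amount: int = 1):
        self.counters[name] += amount

    @contextmanager
    def timed(self, name: str):
        """Record elapsed seconds of the enclosed block under `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.latencies[name].append(time.perf_counter() - start)


metrics = MetricsRecorder()
metrics.incr("tokens_used", 1200)   # token usage per request
metrics.incr("loop_iterations")     # catch runaway loops
with metrics.timed("tool.search"):  # latency per tool execution
    pass  # tool call would go here
```

In production you would export these counters to a real backend (Prometheus, StatsD), but the same interface works for both.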
🛡️ Error Handling
- Retry with backoff: exponential retry for transient errors
- Circuit breakers: stop calling failing services
- Fallback responses: default behavior when tools fail
- Graceful degradation: partial results are better than none
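Retry with exponential backoff plus a fallback can be sketched with a small helper (illustrative, not part of the framework above):

```python
import random
import time


def retry_with_backoff(fn, max_attempts=4, base_delay=0.5, fallback=None):
    """Retry `fn` on exception with exponential backoff plus jitter.

    Returns `fallback` if every attempt fails (graceful degradation).
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                return fallback
            # 0.5s, 1s, 2s, ... plus jitter to avoid thundering herds
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)


# Example: a tool call that fails twice, then succeeds
calls = []

def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("transient")
    return "ok"

result = retry_with_backoff(flaky, base_delay=0.01)  # → "ok" on the third attempt
```

A circuit breaker adds one more layer: after N consecutive failures it returns the fallback immediately instead of retrying, giving the failing service time to recover.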
Rate Limiting & Cost Control
Token Budget Management
```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict


class CostController:
    def __init__(self, daily_token_limit: int = 1_000_000):
        self.daily_token_limit = daily_token_limit
        self.usage: Dict[str, int] = defaultdict(int)
        self.last_reset = datetime.now()

    def check_budget(self, estimated_tokens: int) -> bool:
        """Check whether a request fits within today's budget"""
        self._prune_old_days()
        today = datetime.now().strftime("%Y-%m-%d")
        return self.usage[today] + estimated_tokens <= self.daily_token_limit

    def record_usage(self, tokens: int):
        """Record token usage"""
        today = datetime.now().strftime("%Y-%m-%d")
        self.usage[today] += tokens

    def _prune_old_days(self):
        """Keep only the last 7 days of usage"""
        now = datetime.now()
        if (now - self.last_reset).days >= 1:
            cutoff = (now - timedelta(days=7)).strftime("%Y-%m-%d")
            # Rebuild as a defaultdict so missing days still read as 0
            self.usage = defaultdict(
                int, {k: v for k, v in self.usage.items() if k >= cutoff}
            )
            self.last_reset = now

    def get_usage_stats(self) -> Dict:
        """Get current usage statistics"""
        today = datetime.now().strftime("%Y-%m-%d")
        used = self.usage[today]
        return {
            "today_usage": used,
            "limit": self.daily_token_limit,
            "remaining": self.daily_token_limit - used,
            "percent_used": (used / self.daily_token_limit) * 100,
        }
```

Deployment Strategies
🐳 Docker
Benefits:
- Consistent environments
- Easy scaling
- Portable deployment
Use for: microservices, cloud platforms
☁️ Serverless
Benefits:
- Auto-scaling
- Pay per use
- No server management
Use for: burst workloads, low traffic
🖥️ VMs/K8s
Benefits:
- Full control
- Complex orchestration
- High availability
Use for: production systems, large scale
Configuration Management
Environment-Based Config
```python
# Pydantic v1 import; on Pydantic v2, use `from pydantic_settings import BaseSettings`
from pydantic import BaseSettings


class AgentConfig(BaseSettings):
    # LLM configuration
    model: str = "gpt-4"
    max_tokens: int = 2000
    temperature: float = 0.7

    # Agent configuration
    max_iterations: int = 10
    timeout_seconds: int = 30

    # Cost controls
    daily_token_limit: int = 1_000_000
    max_cost_per_request: float = 1.0

    # Observability
    log_level: str = "INFO"
    enable_tracing: bool = True

    # API keys (loaded from the environment)
    openai_api_key: str

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"


# Load configuration
config = AgentConfig()

# Use in agent
agent = Agent(
    model=config.model,
    max_iterations=config.max_iterations,
    timeout=config.timeout_seconds,
)
```

🚀 Production Checklist
Must Have:
- ✓ Structured logging with trace IDs
- ✓ Error handling and retries
- ✓ Token/cost budgets
- ✓ Health check endpoints
Should Have:
- + Distributed tracing (OpenTelemetry)
- + Metrics dashboard (Grafana)
- + Alerts on error rates and costs
- + Load testing and benchmarks
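A health check endpoint from the must-have list can be as small as a stdlib HTTP handler. This is a sketch; the `/healthz` route and the payload shape are illustrative conventions, not requirements:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_payload(agent_ok: bool = True) -> dict:
    """Payload served at /healthz; extend with real dependency checks
    (LLM reachability, tool availability, budget remaining)."""
    return {"status": "ok" if agent_ok else "degraded"}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            body = json.dumps(health_payload()).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep load-balancer probes out of the logs


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

Orchestrators (Kubernetes liveness probes, load balancers) poll this endpoint; returning a non-200 status when a dependency check fails is what lets them restart or route around an unhealthy agent.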