Short-Term Memory

Master how AI agents manage conversation context and working memory

Understanding Context Windows

A context window is the maximum amount of text (measured in tokens) that an LLM can process in a single forward pass. It acts as the agent's immediate working memory—everything the model needs to consider when generating its next response.

Think of it like RAM in a computer: larger context windows allow agents to hold more information simultaneously, but come at the cost of slower processing and higher computational expense.
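To reason about a context budget you need an actual token count rather than a word count. A minimal sketch using OpenAI's tiktoken tokenizer (the cl100k_base encoding and the 8,192-token window are assumptions; other models ship their own tokenizers and limits):

```python
# pip install tiktoken
import tiktoken

# Assumption: cl100k_base is the encoding used by the target model.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Return how many tokens this text occupies in the context window."""
    return len(enc.encode(text))

def fits_in_window(messages: list[str], window_size: int = 8_192) -> bool:
    """Check whether a list of message strings fits within the window."""
    return sum(count_tokens(m) for m in messages) <= window_size

print(count_tokens("Short-term memory is the agent's working memory."))
```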

Interactive: Window Size Simulator

[Simulator snapshot at the 8,192-token setting (slider range 2K to 128K), labeled "✓ Standard Context": roughly 54 messages at ~150 tokens each, a word capacity of about 6,144 words (~3/4 of the token count), and a 2.0x cost multiplier versus a 4K baseline. Good for most chat applications; can handle detailed multi-turn conversations.]
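The simulator's figures follow from simple arithmetic. A sketch that reproduces them, where the ~150 tokens per message and ~0.75 words per token are the simulator's own rules of thumb, not fixed properties of any model:

```python
def window_stats(window_tokens: int,
                 tokens_per_message: int = 150,
                 words_per_token: float = 0.75,
                 baseline_tokens: int = 4_096) -> dict:
    """Rough capacity and cost figures for a given context window size."""
    return {
        "max_messages": window_tokens // tokens_per_message,    # ~54 at 8,192
        "word_capacity": int(window_tokens * words_per_token),  # ~6,144 at 8,192
        "cost_multiplier": window_tokens / baseline_tokens,     # 2.0x at 8,192
    }

print(window_stats(8_192))
```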

Interactive: Message Buffer Visualization

[Buffer snapshot: 20 of 54 message slots filled (37% buffer usage), alternating User and Agent turns of ~150 tokens each, with the newest message at the top and the oldest at the bottom.]
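A message buffer like the one visualized above can be a plain list plus a token budget. A minimal sketch, assuming a flat 150-token estimate per message as the visualization does (a real implementation would count tokens with the tokenizer shown earlier):

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    role: str          # "user" or "agent"
    content: str
    tokens: int = 150  # assumption: flat estimate, as in the visualization

@dataclass
class MessageBuffer:
    max_tokens: int = 8_192
    messages: list[Message] = field(default_factory=list)

    def total_tokens(self) -> int:
        return sum(m.tokens for m in self.messages)

    def usage(self) -> float:
        """Fraction of the window currently occupied (~0.37 for 20 messages)."""
        return self.total_tokens() / self.max_tokens

    def add(self, message: Message) -> None:
        self.messages.append(message)
```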

How Models Handle Context Limits

🔄 Sliding Window

Drop oldest messages first (FIFO). Simple but loses early context. Used by most chat apps.
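A sliding window can be as simple as dropping from the front of the list until the conversation fits. A sketch, assuming each message dict carries a precomputed token count:

```python
def sliding_window(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest messages (FIFO) until the total fits in max_tokens.

    Each message is assumed to look like {"role": ..., "content": ..., "tokens": int}.
    """
    kept = list(messages)
    while kept and sum(m["tokens"] for m in kept) > max_tokens:
        kept.pop(0)  # evict the oldest message first
    return kept
```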

📝 Summarization

Compress old messages into summaries. Retains key info but loses details. Better for long sessions.
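Summarization replaces the oldest turns with a single compressed summary message. A sketch where summarize_fn is a hypothetical callable (in practice an LLM call) that turns a list of messages into a short paragraph:

```python
from typing import Callable

def compress_history(messages: list[dict],
                     max_tokens: int,
                     summarize_fn: Callable[[list[dict]], str],
                     keep_recent: int = 6) -> list[dict]:
    """Summarize everything except the most recent turns when over budget."""
    if sum(m["tokens"] for m in messages) <= max_tokens:
        return messages

    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary_text = summarize_fn(old)  # hypothetical LLM-backed summarizer
    summary = {"role": "system",
               "content": f"Summary of earlier conversation: {summary_text}",
               "tokens": len(summary_text.split())}  # crude token estimate
    return [summary] + recent
```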

⭐ Importance Filtering

Keep messages with high relevance scores. Requires extra computation but preserves critical context.
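Importance filtering keeps the highest-scoring messages instead of the newest ones. A sketch, assuming each message already has a relevance score attached (for example, embedding similarity to the current query):

```python
def filter_by_importance(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most relevant messages that fit, preserving original order.

    Each message is assumed to carry {"tokens": int, "score": float}.
    """
    budget, chosen = max_tokens, set()
    # Greedily take messages from most to least important.
    for i in sorted(range(len(messages)),
                    key=lambda i: messages[i]["score"], reverse=True):
        if messages[i]["tokens"] <= budget:
            chosen.add(i)
            budget -= messages[i]["tokens"]
    return [m for i, m in enumerate(messages) if i in chosen]
```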

🔀 Hybrid Approach

Combine strategies: summarize middle, keep recent and important. Best results but most complex.
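A hybrid policy composes the pieces above: always keep the most recent turns, keep older turns only if they score well, and fold the rest into a summary. A sketch that reuses the earlier sketches (filter_by_importance and the summarize_fn callable are assumptions carried over from them):

```python
def hybrid_context(messages: list[dict],
                   max_tokens: int,
                   summarize_fn,
                   keep_recent: int = 4) -> list[dict]:
    """Summarize the middle, keep recent turns, and rescue important older ones."""
    recent = messages[-keep_recent:]
    older = messages[:-keep_recent]
    recent_cost = sum(m["tokens"] for m in recent)

    # Give half of the remaining budget to important older messages...
    important = filter_by_importance(older, max(0, max_tokens - recent_cost) // 2)
    # ...and fold everything else into a single summary message.
    leftovers = [m for m in older if m not in important]
    summary_text = summarize_fn(leftovers) if leftovers else ""
    summary = ([{"role": "system", "content": summary_text,
                 "tokens": len(summary_text.split())}] if summary_text else [])

    return summary + important + recent
```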

Key Insights

  • Larger windows ≠ better: They're slower, more expensive, and can dilute attention
  • Hard limit: Unlike the gradual degradation of attention, the context window is a strict boundary; anything beyond it is simply dropped
  • Planning matters: Design conversations to fit within limits (chunking, summaries, retrieval)
  • Cost scales linearly: 2x the tokens = 2x the cost per request (see the sketch below)
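Because cost scales linearly with tokens, a budget check is a single multiplication. A sketch with placeholder per-million-token prices (real prices vary by model and provider and are assumptions here):

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 price_per_million_input: float = 3.00,
                 price_per_million_output: float = 15.00) -> float:
    """Estimate the dollar cost of one request; the prices are placeholders."""
    return (prompt_tokens * price_per_million_input
            + completion_tokens * price_per_million_output) / 1_000_000

print(request_cost(8_192, 512))  # doubling prompt tokens doubles the input cost
```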