Short-Term Memory

Master how AI agents manage conversation context and working memory

Understanding Context Windows

A context window is the maximum amount of text (measured in tokens) that an LLM can process in a single forward pass. It acts as the agent's immediate working memory—everything the model needs to consider when generating its next response.

Think of it like RAM in a computer: larger context windows allow agents to hold more information simultaneously, but come at the cost of slower processing and higher computational expense.
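To reason about a context budget you need an actual token count rather than a word count. A minimal sketch using OpenAI's tiktoken tokenizer (the cl100k_base encoding and the 8,192-token window are assumptions; other models ship their own tokenizers and limits):

```python
# pip install tiktoken
import tiktoken

# Assumption: cl100k_base is the encoding used by the target model.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Return how many tokens this text occupies in the context window."""
    return len(enc.encode(text))

def fits_in_window(messages: list[str], window_size: int = 8_192) -> bool:
    """Check whether a list of message strings fits within the window."""
    return sum(count_tokens(m) for m in messages) <= window_size

print(count_tokens("Short-term memory is the agent's working memory."))
```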

Interactive: Window Size Simulator

[Simulator snapshot at the 8,192-token setting (slider range 2K to 128K), labeled "✓ Standard Context": roughly 54 messages at ~150 tokens each, a word capacity of about 6,144 words (~3/4 of the token count), and a 2.0x cost multiplier versus a 4K baseline. Good for most chat applications; can handle detailed multi-turn conversations.]
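The simulator's figures follow from simple arithmetic. A sketch that reproduces them, where the ~150 tokens per message and ~0.75 words per token are the simulator's own rules of thumb, not fixed properties of any model:

```python
def window_stats(window_tokens: int,
                 tokens_per_message: int = 150,
                 words_per_token: float = 0.75,
                 baseline_tokens: int = 4_096) -> dict:
    """Rough capacity and cost figures for a given context window size."""
    return {
        "max_messages": window_tokens // tokens_per_message,    # ~54 at 8,192
        "word_capacity": int(window_tokens * words_per_token),  # ~6,144 at 8,192
        "cost_multiplier": window_tokens / baseline_tokens,     # 2.0x at 8,192
    }

print(window_stats(8_192))
```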

Interactive: Message Buffer Visualization

[Buffer snapshot: 20 of 54 message slots filled (37% buffer usage), alternating User and Agent turns of ~150 tokens each, with the newest message at the top and the oldest at the bottom.]
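A message buffer like the one visualized above can be a plain list plus a token budget. A minimal sketch, assuming a flat 150-token estimate per message as the visualization does (a real implementation would count tokens with the tokenizer shown earlier):

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    role: str          # "user" or "agent"
    content: str
    tokens: int = 150  # assumption: flat estimate, as in the visualization

@dataclass
class MessageBuffer:
    max_tokens: int = 8_192
    messages: list[Message] = field(default_factory=list)

    def total_tokens(self) -> int:
        return sum(m.tokens for m in self.messages)

    def usage(self) -> float:
        """Fraction of the window currently occupied (~0.37 for 20 messages)."""
        return self.total_tokens() / self.max_tokens

    def add(self, message: Message) -> None:
        self.messages.append(message)
```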

How Models Handle Context Limits

🔄 Sliding Window

Drop oldest messages first (FIFO). Simple but loses early context. Used by most chat apps.
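A sliding window can be as simple as dropping from the front of the list until the conversation fits. A sketch, assuming each message dict carries a precomputed token count:

```python
def sliding_window(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest messages (FIFO) until the total fits in max_tokens.

    Each message is assumed to look like {"role": ..., "content": ..., "tokens": int}.
    """
    kept = list(messages)
    while kept and sum(m["tokens"] for m in kept) > max_tokens:
        kept.pop(0)  # evict the oldest message first
    return kept
```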

📝 Summarization

Compress old messages into summaries. Retains key info but loses details. Better for long sessions.
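Summarization replaces the oldest turns with a single compressed summary message. A sketch where summarize_fn is a hypothetical callable (in practice an LLM call) that turns a list of messages into a short paragraph:

```python
from typing import Callable

def compress_history(messages: list[dict],
                     max_tokens: int,
                     summarize_fn: Callable[[list[dict]], str],
                     keep_recent: int = 6) -> list[dict]:
    """Summarize everything except the most recent turns when over budget."""
    if sum(m["tokens"] for m in messages) <= max_tokens:
        return messages

    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary_text = summarize_fn(old)  # hypothetical LLM-backed summarizer
    summary = {"role": "system",
               "content": f"Summary of earlier conversation: {summary_text}",
               "tokens": len(summary_text.split())}  # crude token estimate
    return [summary] + recent
```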

⭐ Importance Filtering

Keep messages with high relevance scores. Requires extra computation but preserves critical context.
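Importance filtering keeps the highest-scoring messages instead of the newest ones. A sketch, assuming each message already has a relevance score attached (for example, embedding similarity to the current query):

```python
def filter_by_importance(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most relevant messages that fit, preserving original order.

    Each message is assumed to carry {"tokens": int, "score": float}.
    """
    budget, chosen = max_tokens, set()
    # Greedily take messages from most to least important.
    for i in sorted(range(len(messages)),
                    key=lambda i: messages[i]["score"], reverse=True):
        if messages[i]["tokens"] <= budget:
            chosen.add(i)
            budget -= messages[i]["tokens"]
    return [m for i, m in enumerate(messages) if i in chosen]
```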

🔀 Hybrid Approach

Combine strategies: summarize middle, keep recent and important. Best results but most complex.
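A hybrid policy composes the pieces above: always keep the most recent turns, keep older turns only if they score well, and fold the rest into a summary. A sketch that reuses the earlier sketches (filter_by_importance and the summarize_fn callable are assumptions carried over from them):

```python
def hybrid_context(messages: list[dict],
                   max_tokens: int,
                   summarize_fn,
                   keep_recent: int = 4) -> list[dict]:
    """Summarize the middle, keep recent turns, and rescue important older ones."""
    recent = messages[-keep_recent:]
    older = messages[:-keep_recent]
    recent_cost = sum(m["tokens"] for m in recent)

    # Give half of the remaining budget to important older messages...
    important = filter_by_importance(older, max(0, max_tokens - recent_cost) // 2)
    # ...and fold everything else into a single summary message.
    leftovers = [m for m in older if m not in important]
    summary_text = summarize_fn(leftovers) if leftovers else ""
    summary = ([{"role": "system", "content": summary_text,
                 "tokens": len(summary_text.split())}] if summary_text else [])

    return summary + important + recent
```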

Key Insights

  • Larger windows ≠ better: They're slower, more expensive, and can dilute attention
  • Hard limit: Unlike the gradual degradation of attention, the context window is a strict boundary; anything beyond it is simply dropped
  • Planning matters: Design conversations to fit within limits (chunking, summaries, retrieval)
  • Cost scales linearly: 2x the tokens = 2x the cost per request (see the sketch below)
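Because cost scales linearly with tokens, a budget check is a single multiplication. A sketch with placeholder per-million-token prices (real prices vary by model and provider and are assumptions here):

```python
def request_cost(prompt_tokens: int, completion_tokens: int,
                 price_per_million_input: float = 3.00,
                 price_per_million_output: float = 15.00) -> float:
    """Estimate the dollar cost of one request; the prices are placeholders."""
    return (prompt_tokens * price_per_million_input
            + completion_tokens * price_per_million_output) / 1_000_000

print(request_cost(8_192, 512))  # doubling prompt tokens doubles the input cost
```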