
Short-Term Memory

Master how AI agents manage conversation context and working memory

Attention: The Focus Mechanism

While context windows define what information is available, attention mechanisms determine what information matters. Think of attention as a spotlight that can illuminate different parts of the context with varying intensity.

Instead of treating all tokens equally, attention assigns weights to each token based on its relevance to the current generation step. This allows models to "focus" on important words while ignoring filler.

Interactive: Attention Weight Visualization

Watch how attention highlights relevant words. Token brightness = attention weight.

Tokens: The cat sat on the mat and slept
High attention words: "cat" (0.85), "mat" (0.75), "slept" (0.70); these carry the core meaning.
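To make the "brightness = weight" idea concrete outside the interactive demo, here is a minimal Python sketch that renders attention weights as text bars. The weights for "cat", "mat", and "slept" are taken from the example above; the values for the remaining tokens are placeholders chosen for illustration.

# Illustrative only: "cat", "mat", and "slept" use the weights from the example
# above; the other values are made-up filler weights.
tokens_and_weights = [
    ("The", 0.05), ("cat", 0.85), ("sat", 0.20), ("on", 0.05),
    ("the", 0.05), ("mat", 0.75), ("and", 0.05), ("slept", 0.70),
]

# Render each token with a bar whose length is proportional to its weight,
# mimicking the brightness encoding of the visualization.
for token, weight in tokens_and_weights:
    bar = "#" * int(weight * 20)
    print(f"{token:>6}  {weight:.2f}  {bar}")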

Interactive: Types of Attention

Self-Attention

Each token attends to all other tokens in the same sequence. Used within a single input (e.g., understanding a sentence).

Purpose: Contextual understanding
Example: "bank" → river or money?
Used In: Transformer encoders (BERT) and decoders (GPT)

Analogy: Reading a sentence and understanding each word based on the surrounding words.

How Attention Works (Simplified)

Step 1: Compute Scores

For each token pair, calculate a similarity score (how related they are).

score(Q, K) = Q · K^T / √d_k
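As a rough NumPy sketch of Step 1 (not a production implementation), with random matrices standing in for the learned query and key projections:

import numpy as np

# Toy setup: 4 tokens, each with an 8-dimensional query and key vector.
# In a real model, Q and K come from learned projections of the token embeddings.
rng = np.random.default_rng(0)
n_tokens, d_k = 4, 8
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))

# score(Q, K) = Q · K^T / √d_k: one raw similarity score per token pair
scores = Q @ K.T / np.sqrt(d_k)
print(scores.shape)  # (4, 4): row i holds token i's scores against every token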

Step 2: Softmax Normalization

Convert scores to weights that sum to 1 (probability distribution).

weights = softmax(scores)
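A minimal softmax sketch in NumPy (subtracting the row maximum first keeps the exponentials numerically stable without changing the result):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Each row of raw scores becomes a probability distribution over the tokens.
scores = np.array([[2.0, 0.5, -1.0, 0.0]])
weights = softmax(scores)
print(weights)               # roughly [[0.71, 0.16, 0.04, 0.10]]
print(weights.sum(axis=-1))  # [1.]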

Step 3: Weighted Sum

Multiply each token's value by its weight and sum them up.

output = Σ (weights[i] × V[i])

Result: Each token gets a representation that's a weighted combination of all tokens, with relevant ones contributing more.
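Putting the three steps together, a compact single-head sketch of scaled dot-product attention (no masking, batching, or multiple heads, and random toy inputs in place of learned projections) could look like this:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # Step 1: similarity score for every token pair
    weights = softmax(scores, axis=-1)  # Step 2: normalize each row into a distribution
    return weights @ V, weights         # Step 3: weighted combination of the value vectors

# Toy inputs: 5 tokens with 8-dimensional query/key/value vectors.
rng = np.random.default_rng(42)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)   # (5, 8): one blended representation per token
print(weights.shape)  # (5, 5): how strongly each token attends to every other

Real implementations add masking, multiple heads, and batching, but the core computation is exactly these three steps.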

Why Attention Matters for Agents

✅ Enables

  • Selective focus: Ignore filler, focus on key info
  • Long-range dependencies: Connect distant tokens
  • Parallel processing: All tokens computed at once
  • Interpretability: Visualize what the model focuses on

⚠️ Limitations

  • Quadratic complexity: O(n²) for sequence length n (see the estimate below)
  • Memory intensive: Stores the full n × n attention matrix
  • Still bounded: Can't attend beyond the context window
  • Computational cost: Expensive for very long contexts
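To see why the quadratic term matters in practice, here is a back-of-the-envelope estimate of the attention weight matrix alone, assuming a single head and 4-byte floats (real systems vary, and optimized kernels often avoid materializing the full matrix):

# Memory for the n × n attention weight matrix, single head, float32.
def attention_matrix_mb(n_tokens, bytes_per_value=4):
    return n_tokens * n_tokens * bytes_per_value / (1024 ** 2)

for n in (1_000, 8_000, 32_000, 128_000):
    print(f"{n:>7} tokens -> {attention_matrix_mb(n):>9.1f} MB per head")

At 128,000 tokens this is roughly 60 GB for a single head, which is why long-context systems rely on specialized attention kernels and approximations.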