Learn how to give AI agents the ability to remember recent conversations, handle follow-up questions, and manage conversation history across multiple interactions.

This article is part of the free-to-read AI Agent Handbook
Short-Term Conversation Memory
You've built an agent that can reason, use tools, and solve complex problems. But there's a fundamental limitation: it forgets everything after each interaction. Ask it a question, get an answer, then ask a follow-up, and it has no idea what you're talking about.
Imagine calling a help desk where the representative forgets your conversation every 30 seconds. You'd have to re-explain your problem constantly. Frustrating, right? That's exactly what happens with an agent that lacks memory.
In this chapter, we'll give your personal assistant the ability to remember recent conversations. You'll learn how to maintain context across multiple interactions, handle follow-up questions, and manage conversation history as it grows. By the end, your agent will feel less like a stateless question-answering machine and more like an assistant that actually remembers what you've been discussing.
The Problem: Stateless Interactions
Let's see what happens without memory. You ask the agent, "What's the capital of France?" and it correctly tells you the capital is Paris. Then you follow up with, "What's the population?" and it responds with something like, "The population of what? Could you tell me which place you mean?"
The agent answered the first question perfectly. But when you asked the follow-up, it had no memory of discussing France. Each interaction is isolated, like talking to someone with amnesia.
This breaks down quickly in real conversations. People naturally ask follow-up questions, refer to previous topics, and build on earlier context. Without memory, your agent can't handle these basic conversational patterns.
Example (Claude Sonnet 4.5)
Let's see this in code:
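(This sketch uses the Anthropic Python SDK and an assumed model name string; swap in whichever client and model you're actually using.)

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from your environment

def ask_without_memory(question: str) -> str:
    """Send a single question with no conversation history attached."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model name
        max_tokens=1024,
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

print(ask_without_memory("What's the capital of France?"))
# Likely answer: "The capital of France is Paris."

print(ask_without_memory("What's the population?"))
# Likely answer: a request for clarification -- the model has no idea what "the" refers to
```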
Each call to ask_without_memory creates a fresh conversation. The agent has no context from previous questions. This is the default behavior when you don't explicitly manage conversation history.
The Solution: Conversation History
The fix is straightforward: keep track of the conversation and send it with each new message. Instead of just sending the latest question, you send the entire dialogue history.
Here's what that looks like:
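(A minimal sketch reusing the client from above; the helper name and model string are illustrative.)

```python
def ask_with_memory(history: list[dict], question: str) -> str:
    """Append the new question to the history, send the whole thread, and record the reply."""
    history.append({"role": "user", "content": question})
    response = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model name
        max_tokens=1024,
        messages=history,  # the entire conversation so far, not just the latest question
    )
    answer = response.content[0].text
    history.append({"role": "assistant", "content": answer})
    return answer
```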
Now let's try the same conversation with memory:
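(The responses shown in the comments are illustrative; yours will vary.)

```python
history = []

print(ask_with_memory(history, "What's the capital of France?"))
# "The capital of France is Paris."

print(ask_with_memory(history, "What's the population?"))
# "Paris has a population of roughly 2 million people, with over 10 million in the metro area."
```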
The agent understood "the population" refers to Paris because it remembered the previous exchange. This is the foundation of conversational AI: maintaining context across turns.
How Conversation History Works
Let's look at what's actually happening behind the scenes. After the two-question exchange above, our history list contains:
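(Assistant replies abbreviated; the exact wording will differ from run to run.)

```python
[
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What's the population?"},
    {"role": "assistant", "content": "Paris has a population of roughly 2 million people..."},
]
```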
Each message is stored with its role (either "user" or "assistant") and content. When you ask a new question, the entire list goes to the model. The model sees the full conversation thread and can reference earlier messages.
Think of it like showing someone a chat transcript. They can read the whole conversation and understand the context of the latest message. That's exactly what we're doing with the language model.
Building a Conversational Agent
Let's create a more complete conversational agent that manages memory automatically:
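(A sketch building on the same assumed SDK; the class name and defaults are illustrative.)

```python
class ConversationalAgent:
    """A simple agent that manages its own conversation history."""

    def __init__(self, model: str = "claude-sonnet-4-5"):  # assumed model name
        self.client = anthropic.Anthropic()
        self.model = model
        self.history: list[dict] = []

    def chat(self, message: str) -> str:
        """Send a message in the context of everything said so far."""
        self.history.append({"role": "user", "content": message})
        response = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            messages=self.history,
        )
        answer = response.content[0].text
        self.history.append({"role": "assistant", "content": answer})
        return answer

    def reset(self) -> None:
        """Forget everything and start a fresh conversation."""
        self.history = []
```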
Now we can have natural, multi-turn conversations:
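(The comments summarize plausible responses rather than exact output.)

```python
agent = ConversationalAgent()

agent.chat("I'm planning a trip to Japan in the spring.")
# Suggests cherry blossom season, asks what you're interested in, etc.

agent.chat("Which cities should I visit?")
# Recommends Tokyo, Kyoto, Osaka... framed around a spring trip to Japan

agent.chat("How many days should I spend in the second one?")
# Understands "the second one" means Kyoto, because its own earlier answer is in the history
```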
Notice how each response builds on the previous context. The agent remembers we're discussing a Japan trip and tailors its answers accordingly.
Handling Follow-Up Questions
One of the most powerful aspects of conversation memory is handling follow-ups. People rarely ask perfectly self-contained questions. They say "What about that?", "Can you explain more?", or "Why is that?"
Let's see this in action:
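(Again, the responses described in the comments are illustrative.)

```python
agent = ConversationalAgent()

agent.chat("What are the main differences between Python and JavaScript?")

agent.chat("Which one is easier for beginners?")
# Typically answers Python, and explains why

agent.chat("Why is that?")
# Expands on its previous answer about Python being easier

agent.chat("You mentioned readability. Can you give an example?")
# Picks up a specific point from its own earlier response
```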
Each follow-up question would be impossible to answer without context. "Which one" refers to Python and JavaScript. "Why is that" refers to Python being easier. "You mentioned" explicitly references the earlier response.
The agent handles all of this naturally because it has the full conversation in memory.
Memory with Tool Use
Memory becomes even more important when your agent uses tools. The agent needs to remember:
- What tools it has called
- What results it received
- How those results relate to the user's questions
Let's extend our conversational agent to support tools:
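(A sketch of one way to do this with the Anthropic Messages API's tool-calling support; the calculator tool, class name, and model string are illustrative, and eval is used only to keep the toy example short.)

```python
def run_calculator(expression: str) -> str:
    """Toy calculator tool. A real agent should parse and validate instead of calling eval."""
    return str(eval(expression))

CALCULATOR_TOOL = {
    "name": "calculator",
    "description": "Evaluate a basic arithmetic expression and return the result.",
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string", "description": "e.g. '127 * 49'"}},
        "required": ["expression"],
    },
}

class ToolUsingAgent:
    """A conversational agent whose history also records tool calls and tool results."""

    def __init__(self, model: str = "claude-sonnet-4-5"):  # assumed model name
        self.client = anthropic.Anthropic()
        self.model = model
        self.history: list[dict] = []

    def chat(self, message: str) -> str:
        self.history.append({"role": "user", "content": message})
        while True:
            response = self.client.messages.create(
                model=self.model,
                max_tokens=1024,
                messages=self.history,
                tools=[CALCULATOR_TOOL],
            )
            # Keep the model's turn, including any tool_use blocks, in the history.
            self.history.append({"role": "assistant", "content": response.content})
            if response.stop_reason != "tool_use":
                return "".join(b.text for b in response.content if b.type == "text")
            # Run each requested tool and feed the results back as a user message.
            results = []
            for block in response.content:
                if block.type == "tool_use":
                    output = run_calculator(block.input["expression"])
                    results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": output,
                    })
            self.history.append({"role": "user", "content": results})
```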
Now watch how memory works with tools:
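(An illustrative run; the comments describe roughly what you'd see.)

```python
agent = ToolUsingAgent()

agent.chat("What's 127 times 49?")
# Calls the calculator tool and answers: 6223

agent.chat("Double that.")
# Knows "that" is 6223 from the history and calls the tool again: 12446

agent.chat("What was the original number, before doubling?")
# Answers 6223 straight from the conversation history -- no tool call needed
```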
The agent remembers both the conversation and the tool results. "Double that" refers to the previous calculation. "The original number" refers to the first result, not the doubled one.
This is powerful. The agent can build on its own work, reference previous calculations, and maintain context across multiple tool uses.
The Context Window Challenge
Here's a problem: conversation history grows with every exchange. After 50 exchanges, you're sending 100 messages (50 user, 50 assistant) with every new question. This creates two issues:
Cost: Most API providers charge per token. Sending the entire history every time gets expensive.
Context limits: Models have maximum context windows. Claude Sonnet 4.5 supports a 200K-token context window, but a long enough conversation will eventually hit that limit.
Let's see this problem in practice:
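(A rough, purely illustrative estimate; real token counts depend on the tokenizer and on what your users actually say.)

```python
def rough_token_count(history: list[dict]) -> int:
    """Very rough estimate: ~4 characters per token for English text."""
    return sum(len(str(m["content"])) for m in history) // 4

# Simulate a growing conversation with made-up message sizes.
history = []
for exchange in range(1, 51):
    history.append({"role": "user", "content": "x" * 200})        # ~50-token question
    history.append({"role": "assistant", "content": "x" * 1200})  # ~300-token answer
    if exchange % 10 == 0:
        print(f"After {exchange} exchanges: {len(history)} messages, "
              f"~{rough_token_count(history)} tokens sent per request")

# After 10 exchanges: 20 messages, ~3500 tokens sent per request
# ...
# After 50 exchanges: 100 messages, ~17500 tokens sent per request
```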
In a long conversation, you might be sending thousands of tokens with each request, and most of them belong to old messages that may no longer be relevant.
Solution 1: Sliding Window Memory
The simplest solution: only keep the most recent N messages. This is called a "sliding window" because you keep a fixed-size window that slides forward as the conversation progresses.
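Here's a minimal sketch built on the ConversationalAgent from earlier (the window size and trimming logic are illustrative):

```python
class SlidingWindowAgent(ConversationalAgent):
    """Keeps only the most recent messages; older ones are discarded."""

    def __init__(self, max_messages: int = 20, **kwargs):
        super().__init__(**kwargs)
        self.max_messages = max_messages

    def chat(self, message: str) -> str:
        self.history.append({"role": "user", "content": message})
        # Discard older messages so only the most recent ones are kept and sent.
        if len(self.history) > self.max_messages:
            self.history = self.history[-self.max_messages:]
            # Keep the thread starting on a user turn (some chat APIs require this).
            while self.history and self.history[0]["role"] == "assistant":
                self.history.pop(0)
        response = self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            messages=self.history,
        )
        answer = response.content[0].text
        self.history.append({"role": "assistant", "content": answer})
        return answer
```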
With a sliding window of 20 messages, the agent remembers the last 10 exchanges (10 user messages + 10 assistant responses). Older messages are discarded.
This works well for:
- Casual conversations where old context isn't needed
- Cost-sensitive applications
- Very long conversations that would exceed context limits
The tradeoff: the agent forgets older parts of the conversation. If you reference something from 15 exchanges ago, the agent won't remember it.
Solution 2: Conversation Summarization
A more sophisticated approach: periodically summarize old messages and replace them with the summary. This preserves important information while reducing token count.
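Here's a sketch of one way to do it, again building on the ConversationalAgent (the thresholds and the summarization prompt are illustrative):

```python
class SummarizingAgent(ConversationalAgent):
    """Folds older messages into a summary once the history grows past a threshold."""

    def __init__(self, max_messages: int = 30, keep_recent: int = 10, **kwargs):
        super().__init__(**kwargs)
        self.max_messages = max_messages   # when to trigger summarization
        self.keep_recent = keep_recent     # how many recent messages to keep verbatim

    def _compress_history(self) -> None:
        if len(self.history) <= self.max_messages:
            return
        old, recent = self.history[:-self.keep_recent], self.history[-self.keep_recent:]
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
        # One extra model call to condense the older part of the conversation.
        response = self.client.messages.create(
            model=self.model,
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": "Summarize the key facts, decisions, and open questions "
                           "from this conversation:\n\n" + transcript,
            }],
        )
        summary = response.content[0].text
        self.history = [
            {"role": "user", "content": f"Summary of our conversation so far: {summary}"},
            {"role": "assistant", "content": "Got it. I'll keep that context in mind."},
        ] + recent

    def chat(self, message: str) -> str:
        self._compress_history()
        return super().chat(message)
```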
This approach:
- Keeps recent messages in full detail
- Summarizes older messages to preserve important context
- Reduces token count while maintaining continuity
The tradeoff: summarization costs an extra API call, and some details might be lost in the summary.
Choosing a Memory Strategy
Which approach should you use? It depends on your use case:
Full history (no limits):
- Best for: Short conversations, when cost isn't a concern
- Pros: Perfect memory, no information loss
- Cons: Expensive for long conversations, can hit context limits
Sliding window:
- Best for: Casual chat, when only recent context matters
- Pros: Simple, predictable cost, never hits context limits
- Cons: Forgets older information completely
Summarization:
- Best for: Long conversations where older context matters
- Pros: Preserves important information, manages token count
- Cons: More complex, costs extra for summarization, might lose details
For our personal assistant, a hybrid approach often works best:
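(A sketch of how that might look using the summarizing agent above; the thresholds are illustrative.)

```python
assistant = SummarizingAgent(
    max_messages=30,  # leave the history untouched until it passes ~15 exchanges
    keep_recent=10,   # always keep the last 5 exchanges word-for-word
)

assistant.chat("Let's get back to planning that Japan trip.")
# Short conversations behave exactly like full history. Once the thread grows,
# older turns are folded into a summary while the recent ones stay intact.
```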
This combines both strategies: keep recent messages in full, summarize older ones.
Practical Considerations
As you build memory into your agent, keep these points in mind:
Start simple: Begin with full history. Only add complexity (windowing, summarization) when you actually need it.
Monitor costs: Track how many tokens you're sending. If costs are high, implement a sliding window.
Test edge cases: What happens when the user references something from 20 messages ago? Does your memory strategy handle it?
Consider the domain: A customer service bot might need full history. A casual chatbot might work fine with a small window.
Provide memory controls: Let users start fresh conversations when needed. Sometimes they want to change topics completely.
Key Takeaways
Let's review what we've learned about short-term conversation memory:
Memory is essential for conversation. Without it, agents can't handle follow-up questions or maintain context across turns.
Implementation is straightforward. Store messages in a list and send them with each new request. The model handles the rest.
Memory grows over time. Long conversations create large histories that cost more and can hit context limits.
Multiple strategies exist. Full history, sliding windows, and summarization each have their place.
Choose based on your needs. Consider conversation length, cost constraints, and how much context you need to preserve.
Your personal assistant now has short-term memory. It can maintain context across a conversation, handle follow-ups, and build on previous exchanges. This is a fundamental capability that makes agents feel natural and useful.
In the next chapter, we'll explore long-term memory: how to store information across sessions, remember user preferences, and retrieve relevant facts from a knowledge base. Combined with short-term memory, this will give your agent a complete memory system.
Glossary
Conversation History: The list of messages exchanged between the user and agent, stored in order and sent with each new request to provide context.
Stateless Interaction: A request-response pattern where each interaction is independent, with no memory of previous exchanges.
Sliding Window Memory: A memory strategy that keeps only the most recent N messages, discarding older ones to manage context size and cost.
Context Window: The maximum number of tokens a language model can process in a single request, including both the conversation history and the new message.
Conversation Summarization: A technique that condenses older messages into a brief summary to preserve important context while reducing token count.
Token: The basic unit of text that language models process, roughly equivalent to a word or word fragment. Models charge per token and have maximum token limits.