Learn practical techniques to make AI agents respond faster, including model selection strategies, response caching, streaming, parallel execution, and prompt optimization for reduced latency.

This article is part of the free-to-read AI Agent Handbook
Speeding Up the Agent
You've built a capable assistant that can reason, use tools, remember conversations, and handle complex tasks. But there's a problem: sometimes it feels slow. A user asks a simple question, and they're waiting three seconds for an answer. They request a calculation, and the agent takes five seconds to respond. In a world where we expect instant feedback, those delays add up.
Speed matters. A fast agent feels responsive and natural to use. A slow one frustrates users and breaks the flow of conversation. The good news is that you can make your agent significantly faster without sacrificing much capability. This chapter shows you how.
Why Speed Matters
Let's start with a scenario. Your assistant is deployed, and a user asks: "What's 47 times 83?"
The agent springs into action. It sends the query to Claude Sonnet 4.5, which thinks about the problem, decides to use the calculator tool, performs the calculation, and generates a response. Total time: 4.2 seconds.
Now imagine the user asks ten questions in a row. That's 42 seconds of waiting. The user gets impatient. They start to wonder if something's broken. They might even give up and use a different tool.
Speed isn't just about user experience, though that's important. It's also about cost. Most language model APIs charge per token, for both input and output. A slower agent that generates verbose responses costs more to run: if it produces twice as many output tokens, you're paying roughly twice as much per interaction and waiting roughly twice as long for the answer.
The challenge is balancing speed with capability. You want your agent to be fast, but not at the expense of accuracy or usefulness. The techniques in this chapter help you find that balance.
Understanding Where Time Goes
Before we optimize, we need to understand where the time goes. Let's break down a typical agent interaction:
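Here's a rough sketch of that breakdown, assuming the Anthropic Python SDK and a placeholder model ID; swap in whatever client and model your agent actually uses. It times the model call separately from the local post-processing:

```python
import time
import anthropic  # pip install anthropic; assumes ANTHROPIC_API_KEY is set

client = anthropic.Anthropic()

query = "What's 47 times 83?"

# Time the model call itself.
start = time.perf_counter()
response = client.messages.create(
    model="claude-sonnet-4-5",  # assumed model ID; use whichever model you deploy
    max_tokens=256,
    messages=[{"role": "user", "content": query}],
)
model_time = time.perf_counter() - start

# Time the local work: extracting and formatting the answer.
start = time.perf_counter()
answer = response.content[0].text.strip()
processing_time = time.perf_counter() - start

print(f"Model call:  {model_time:.3f}s")
print(f"Processing:  {processing_time:.6f}s")
```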
When you run this, you'll see output along these lines (exact numbers vary with the model and your network, but the shape is consistent):
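```
Model call:  2.137s
Processing:  0.000021s
```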
The vast majority of time is spent waiting for the model to generate a response. Processing the result is nearly instantaneous. This tells us where to focus our optimization efforts: the model call itself.
Choosing the Right Model for the Task
Not every task needs your most powerful model. Claude Sonnet 4.5 is excellent for complex reasoning and tool use, but it's overkill for simple questions. Using a smaller, faster model for straightforward tasks can cut response time in half or more.
Think of it like transportation. You wouldn't take a semi-truck to pick up groceries. A car works fine. Similarly, you don't need your most capable model for every query.
Example: Model Selection Strategy (GPT-5)
Let's build a simple router that chooses the right model based on the query complexity:
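Here's a minimal sketch of that router. The complexity check is a crude heuristic, and the model ID strings ("gpt-5" and "claude-sonnet-4-5") are assumptions; replace them with whatever identifiers your providers expose:

```python
from openai import OpenAI   # pip install openai; assumes OPENAI_API_KEY is set
import anthropic            # pip install anthropic; assumes ANTHROPIC_API_KEY is set

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

# Crude heuristic: long queries, or ones that hint at multi-step work,
# get routed to the more capable (and slower) model.
COMPLEX_HINTS = ("analyze", "compare", "plan", "step by step", "write a")

def is_complex(query: str) -> bool:
    lowered = query.lower()
    return len(query.split()) > 30 or any(hint in lowered for hint in COMPLEX_HINTS)

def answer(query: str) -> str:
    if is_complex(query):
        # Complex query: use Claude Sonnet 4.5 for full reasoning capability.
        response = anthropic_client.messages.create(
            model="claude-sonnet-4-5",  # assumed model ID
            max_tokens=1024,
            messages=[{"role": "user", "content": query}],
        )
        return response.content[0].text
    # Simple query: use the faster model.
    response = openai_client.chat.completions.create(
        model="gpt-5",  # assumed model ID
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

print(answer("What's 47 times 83?"))  # routed to the fast model
print(answer("Compare three strategies for caching model responses, step by step."))
```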
This approach gives you speed where you can afford it and power where you need it. The simple query gets a fast response from GPT-5, while the complex one gets the full reasoning capability of Claude Sonnet 4.5.
Limiting Response Length
Every token the model generates takes time. If your agent produces 500-word responses when 100 words would suffice, you're wasting time and money.
You can control this with the max_tokens parameter, but there's a better way: prompt engineering. Tell the model explicitly to be concise.
Example: Concise Responses (Claude Sonnet 4.5)
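A minimal sketch with the Anthropic SDK, assuming a "claude-sonnet-4-5" model ID. The system prompt does most of the work; max_tokens acts as a backstop:

```python
import anthropic  # pip install anthropic; assumes ANTHROPIC_API_KEY is set

client = anthropic.Anthropic()

CONCISE_SYSTEM = (
    "Answer in one or two sentences. "
    "Do not add background, caveats, or examples unless asked."
)

def ask(question: str, concise: bool = True) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",            # assumed model ID
        max_tokens=100 if concise else 1024,  # hard cap as a backstop
        system=CONCISE_SYSTEM if concise else "You are a helpful assistant.",
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

print(ask("What is Python?"))                 # short, fast answer
print(ask("What is Python?", concise=False))  # longer, slower answer
```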
The concise version might respond: "Python is a high-level programming language known for its readability and versatility." That's a dozen words instead of a 200-word explanation. The user gets their answer faster, and you save on API costs.
Caching Responses
If users frequently ask the same questions, why recompute the answer every time? Cache the response and serve it instantly on subsequent requests.
Example: Simple Response Cache (Gemini 2.5 Flash)
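A minimal sketch using the google-genai SDK, assuming a "gemini-2.5-flash" model ID. The cache here is just an in-memory dictionary keyed by the normalized query:

```python
from google import genai  # pip install google-genai; assumes GEMINI_API_KEY is set

client = genai.Client()
_cache: dict[str, str] = {}

def cached_answer(query: str) -> str:
    # Normalize the query so trivially different phrasings hit the same entry.
    key = query.strip().lower()
    if key in _cache:
        return _cache[key]  # served instantly, no API call
    response = client.models.generate_content(
        model="gemini-2.5-flash",  # assumed model ID
        contents=query,
    )
    _cache[key] = response.text
    return response.text

print(cached_answer("What is the capital of France?"))  # hits the API
print(cached_answer("What is the capital of France?"))  # served from the cache
```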
The first call takes the normal time (maybe 1.5 seconds). The second call is instant, returning in milliseconds. For a production system, you'd use a more sophisticated cache with expiration times and size limits, but this shows the basic idea.
Streaming Responses
Sometimes you can't make the agent faster, but you can make it feel faster. Streaming responses show results as they're generated, rather than waiting for the complete answer.
Example: Streaming for Perceived Speed (Claude Sonnet 4.5)
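A minimal sketch using the Anthropic SDK's streaming helper (model ID assumed), printing each chunk of text as soon as it arrives:

```python
import anthropic  # pip install anthropic; assumes ANTHROPIC_API_KEY is set

client = anthropic.Anthropic()

# Print tokens as they arrive instead of waiting for the full response.
with client.messages.stream(
    model="claude-sonnet-4-5",  # assumed model ID
    max_tokens=512,
    messages=[{"role": "user", "content": "Explain what an AI agent is."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
print()
```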
The total time to generate the response doesn't change, but the user sees words appearing immediately. This makes the agent feel much more responsive. Instead of staring at a blank screen for three seconds, they see the answer forming in real time.
Parallel Tool Calls
If your agent needs to use multiple tools, doing them sequentially wastes time. Run them in parallel when possible.
Example: Sequential vs Parallel Tool Execution (Claude Sonnet 4.5)
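Here's a sketch using asyncio with three stand-in tools that each simulate one second of I/O latency; in a real agent these would be API calls, database lookups, or similar:

```python
import asyncio
import time

# Three illustrative tools, each simulating one second of I/O latency.
async def check_weather(city: str) -> str:
    await asyncio.sleep(1)
    return f"Weather in {city}: sunny"

async def check_calendar(day: str) -> str:
    await asyncio.sleep(1)
    return f"Calendar for {day}: 2 meetings"

async def convert_currency(amount: float) -> str:
    await asyncio.sleep(1)
    return f"{amount} USD = {amount * 0.92:.2f} EUR"

async def sequential() -> None:
    # Run the tools one after another.
    start = time.perf_counter()
    await check_weather("Paris")
    await check_calendar("Monday")
    await convert_currency(100)
    print(f"Sequential: {time.perf_counter() - start:.1f}s")

async def parallel() -> None:
    # Run all three tools at the same time.
    start = time.perf_counter()
    await asyncio.gather(
        check_weather("Paris"),
        check_calendar("Monday"),
        convert_currency(100),
    )
    print(f"Parallel:   {time.perf_counter() - start:.1f}s")

asyncio.run(sequential())
asyncio.run(parallel())
```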
Output:
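```
Sequential: 3.0s
Parallel:   1.0s
```

(Timings are rounded; real runs add a few milliseconds of overhead.)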
The parallel version is three times faster because all three tools run simultaneously. For an agent that frequently uses multiple tools, this can dramatically improve response time.
Optimizing Prompt Size
Large prompts take longer to process. Every token in your prompt adds a small amount of latency. If you're including long system messages, conversation history, or retrieved documents, consider trimming them.
Here's a practical approach:
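One simple version is a helper that keeps the system prompt (if any) plus only the most recent messages; the function name and the cutoff are illustrative:

```python
MAX_HISTORY_MESSAGES = 5  # keep only the five most recent messages

def trim_history(messages: list[dict]) -> list[dict]:
    """Keep the system prompt (if any) plus the most recent messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-MAX_HISTORY_MESSAGES:]

# Example: a long conversation gets cut down before each model call.
history = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(20):
    history.append({"role": "user", "content": f"Question {i}"})
    history.append({"role": "assistant", "content": f"Answer {i}"})

trimmed = trim_history(history)
print(len(history), "->", len(trimmed))  # 41 -> 6
```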
By keeping only the five most recent messages, you reduce the prompt size and speed up processing. The agent loses some context, but for many conversations, recent messages are all that matter.
Precomputing When Possible
If your agent does the same computation repeatedly, precompute it. For example, if your agent frequently needs to know the current date, time zone conversions, or common calculations, compute these once and reuse them.
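For example, here's a sketch of a small lookup table built once at startup; the specific values are illustrative:

```python
from datetime import datetime, timezone

# Precompute values the agent needs constantly, instead of asking a model.
STARTUP_CONTEXT = {
    "today": datetime.now(timezone.utc).date().isoformat(),
    "km_per_mile": 1.60934,
    "seconds_per_day": 24 * 60 * 60,
}

def miles_to_km(miles: float) -> float:
    # Plain arithmetic using a precomputed constant.
    return miles * STARTUP_CONTEXT["km_per_mile"]

print(STARTUP_CONTEXT["today"])  # instant lookup, no model call
print(miles_to_km(26.2))         # instant arithmetic, no model call
```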
These operations are instant because the values are precomputed. Compare this to calling a model to do unit conversions or date formatting, which would take seconds.
Measuring the Impact
As you apply these optimizations, measure their impact. Here's a simple benchmarking approach:
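A minimal sketch: wrap whatever callable runs your agent (here a placeholder named agent_fn) and report latency statistics over a fixed set of test queries:

```python
import statistics
import time

def benchmark(agent_fn, queries: list[str], runs: int = 3) -> None:
    """Time an agent function over a set of queries and report latency stats."""
    latencies = []
    for _ in range(runs):
        for query in queries:
            start = time.perf_counter()
            agent_fn(query)
            latencies.append(time.perf_counter() - start)
    print(f"mean:   {statistics.mean(latencies):.2f}s")
    print(f"median: {statistics.median(latencies):.2f}s")
    print(f"max:    {max(latencies):.2f}s")

# Usage: compare the baseline agent against the optimized one
# (baseline_agent, fast_agent, and TEST_QUERIES are whatever you've built).
# benchmark(baseline_agent, TEST_QUERIES)
# benchmark(fast_agent, TEST_QUERIES)
```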
This gives you concrete numbers to evaluate your optimizations. You might find that limiting tokens saves 20% on response time, or that caching cuts average latency by 40%.
Putting It All Together
Let's build a fast agent that combines several of these techniques:
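Here's one way it might look, again assuming the Anthropic SDK and a "claude-sonnet-4-5" model ID; the query-length threshold and token budgets are arbitrary starting points you'd tune for your own workload:

```python
import anthropic  # pip install anthropic; assumes ANTHROPIC_API_KEY is set

client = anthropic.Anthropic()

class FastAgent:
    """A sketch of an agent combining caching, concise prompts, and token limits."""

    def __init__(self):
        self.cache: dict[str, str] = {}

    def ask(self, query: str) -> str:
        key = query.strip().lower()

        # 1. Serve repeated questions from the cache.
        if key in self.cache:
            return self.cache[key]

        # 2. Give simple-looking queries a tighter token budget.
        simple = len(query.split()) < 20
        max_tokens = 150 if simple else 1024

        # 3. Ask for concise answers by default.
        response = client.messages.create(
            model="claude-sonnet-4-5",  # assumed model ID
            max_tokens=max_tokens,
            system="Answer as briefly as possible while staying accurate.",
            messages=[{"role": "user", "content": query}],
        )
        answer = response.content[0].text
        self.cache[key] = answer
        return answer

agent = FastAgent()
print(agent.ask("What's 47 times 83?"))  # first call hits the API
print(agent.ask("What's 47 times 83?"))  # second call is served from the cache
```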
This agent combines caching, concise prompts, and smart token limits to deliver fast responses. The first query might take 1.5 seconds, but the cached version is instant. Simple queries use fewer tokens, saving time and money.
When Speed Isn't Everything
Before we wrap up, a word of caution: don't optimize prematurely. Speed is important, but accuracy matters more. If your agent gives wrong answers quickly, that's worse than giving right answers slowly.
Start by building a correct agent. Then measure where the bottlenecks are. Apply optimizations strategically, and always verify that accuracy doesn't suffer. Sometimes the best answer requires the most capable model and a longer response time. That's okay.
The goal isn't to make every response instant. It's to make the agent as fast as possible while maintaining the quality users expect.
Glossary
Caching: Storing the results of expensive operations so they can be reused without recomputation. For agents, this typically means saving model responses for repeated queries.
Latency: The time delay between when a user makes a request and when they receive a response. Lower latency means a faster, more responsive agent.
Max Tokens: A parameter that limits how many tokens (words or word pieces) a language model can generate in a single response. Lower values produce shorter, faster responses.
Parallel Execution: Running multiple operations simultaneously rather than one after another. This can significantly reduce total execution time when operations don't depend on each other.
Streaming: Sending response data to the user incrementally as it's generated, rather than waiting for the complete response. This improves perceived speed even if total generation time is unchanged.
Token: The basic unit of text that language models process. A token is roughly equivalent to a word or word piece. Both input and output are measured in tokens.
Quiz
Ready to test your understanding? Take this quick quiz to reinforce what you've learned about speeding up AI agents.





