Learn practical techniques to make AI agents respond faster, including model selection strategies, response caching, streaming, parallel execution, and prompt optimization for reduced latency.

This article is part of the free-to-read AI Agent Handbook
Speeding Up the Agent
You've built a capable assistant that can reason, use tools, remember conversations, and handle complex tasks. But there's a problem: sometimes it feels slow. A user asks a simple question, and they're waiting three seconds for an answer. They request a calculation, and the agent takes five seconds to respond. In a world where we expect instant feedback, those delays add up.
Speed matters. A fast agent feels responsive and natural to use. A slow one frustrates users and breaks the flow of conversation. The good news is that you can make your agent significantly faster without sacrificing much capability. This chapter shows you how.
Why Speed Matters
Let's start with a scenario. Your assistant is deployed, and a user asks: "What's 47 times 83?"
The agent springs into action. It sends the query to Claude Sonnet 4.5, which thinks about the problem, decides to use the calculator tool, performs the calculation, and generates a response. Total time: 4.2 seconds.
Now imagine the user asks ten questions in a row. That's 42 seconds of waiting. The user gets impatient. They start to wonder if something's broken. They might even give up and use a different tool.
Speed isn't just about user experience, though that's important. It's also about cost. Most language model APIs charge per token, and output tokens are the expensive ones. A slower agent that generates verbose responses costs more to run: if it generates twice as many tokens, you're paying roughly twice as much per interaction, and your users are waiting roughly twice as long.
The challenge is balancing speed with capability. You want your agent to be fast, but not at the expense of accuracy or usefulness. The techniques in this chapter help you find that balance.
Understanding Where Time Goes
Before we optimize, we need to understand where the time goes. Let's break down a typical agent interaction:
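One way to see this is to time each phase separately. In the sketch below the model call is stubbed with a short sleep so the example runs standalone; swap in your real SDK call to measure actual latency:

```python
import time

def call_model(prompt: str) -> str:
    """Stand-in for the real API call; the sleep simulates network + generation time."""
    time.sleep(0.05)  # replace with your actual SDK call
    return f"Answer to: {prompt}"

def process(response: str) -> str:
    """Local post-processing: trivially fast compared to the model call."""
    return response.strip()

start = time.perf_counter()
response = call_model("What's 47 times 83?")
model_time = time.perf_counter() - start

start = time.perf_counter()
result = process(response)
processing_time = time.perf_counter() - start

print(f"Model call: {model_time:.2f}s")
print(f"Processing: {processing_time:.2f}s")
print(f"Total: {model_time + processing_time:.2f}s")
```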
When you run this, you'll see something like:

Model call: 2.24s
Processing: 0.00s
Total: 2.24s
The vast majority of time is spent waiting for the model to generate a response. Processing the result is nearly instantaneous. This tells us where to focus our optimization efforts: the model call itself.
Choosing the Right Model for the Task
Not every task needs your most powerful model. Claude Sonnet 4.5 is excellent for complex reasoning and tool use, but it's overkill for simple questions. Using a smaller, faster model for straightforward tasks can cut response time in half or more.
Think of it like transportation. You wouldn't take a semi-truck to pick up groceries. A car works fine. Similarly, you don't need your most capable model for every query.
Example: Model Selection Strategy (GPT-5)
Let's build a simple router that chooses the right model based on the query complexity:
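Here's one way such a router might look. The keyword heuristic and the model identifiers are illustrative placeholders, not a fixed recipe:

```python
def choose_model(query: str) -> str:
    """Pick a model based on rough query complexity.

    Short factual questions go to a fast model; long or open-ended
    questions go to the more capable (and slower) model.
    """
    complex_markers = ("explain", "why", "compare", "analyze", "design", "write")
    needs_reasoning = any(marker in query.lower() for marker in complex_markers)
    if needs_reasoning or len(query.split()) > 20:
        return "claude-sonnet-4-5"  # full reasoning capability
    return "gpt-5"                  # fast response for simple lookups

print(choose_model("What is the capital of France?"))
print(choose_model("Why did Paris become France's capital?"))
```

In a real agent you'd pass the chosen model name to the corresponding API client. A misrouted query still gets an answer, just from a slower model than necessary, or a weaker one than ideal.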
Simple query: Paris.

Complex query:

# Why Paris Became France's Capital

Paris became France's capital through a combination of **historical, geographical, and political factors** rather than a single deliberate choice:

## Geographic Advantages
- **Central location** in the fertile Paris Basin (Île-de-France region)
- Positioned on the **Seine River**, enabling trade and transportation
- Natural defensive position on islands in the river (Île de la Cité)

## Historical Development
- **Ancient roots**: Roman settlement called Lutetia (circa 250 BC)
- **Clovis I** (5th century) made it his royal residence when he unified Frankish tribes
- **Hugh Capet** (987 AD) established his power base there, making it the de facto capital of the growing French kingdom
- Gradually accumulated royal palaces, administrative functions, and political institutions

## Strategic Growth
- As French kings expanded their territory from their Île-de-France base, Paris remained the center of royal power
- Became the **economic and cultural hub** - universities, churches, markets
- By the Middle Ages, it was the largest city in Western Europe

## No Formal Declaration
Interestingly, Paris was never officially declared the capital by law until the **French Constitution of 1958**. It simply evolved into that role through centuries of being the seat of power.

The choice was essentially organic - Paris grew powerful because French kings ruled from there, and it remained capital because it had become too important politically, economically, and culturally to move.
This approach gives you speed when you need it and power when you need it. The simple query gets a fast response from GPT-5, while the complex one gets the full reasoning capability of Claude Sonnet 4.5.
Limiting Response Length
Every token the model generates takes time. If your agent produces 500-word responses when 100 words would suffice, you're wasting time and money.
You can control this with the max_tokens parameter, but there's a better way: prompt engineering. Tell the model explicitly to be concise.
Example: Concise Responses (Claude Sonnet 4.5)
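A sketch of the request setup. The fields follow the Anthropic-style messages format, and the exact system wording and token cap are just one reasonable choice:

```python
def build_concise_request(question: str) -> dict:
    """Request parameters that push the model toward short answers.

    The system prompt does most of the work; max_tokens is a hard
    safety cap in case the model rambles anyway.
    """
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 150,  # hard ceiling on output length
        "system": "Answer concisely, in two to three sentences at most.",
        "messages": [{"role": "user", "content": question}],
    }

request = build_concise_request("What is Python?")
print(request["system"])
```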
With concise prompt:
Python is a high-level, interpreted programming language known for its simple, readable syntax and versatility. It's widely used for web development, data science, automation, artificial intelligence, and many other applications.

Token savings: ~70% compared to default behavior
The concise version might respond: "Python is a high-level programming language known for its readability and versatility." That's a dozen words instead of a 200-word explanation. The user gets their answer faster, and you save on API costs.
Caching Responses
If users frequently ask the same questions, why recompute the answer every time? Cache the response and serve it instantly on subsequent requests.
Example: Simple Response Cache (Gemini 2.5 Flash)
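A minimal in-memory cache keyed on the exact prompt text. The model call is stubbed with a sleep so the example runs standalone:

```python
import hashlib
import time

cache: dict[str, str] = {}

def call_model(prompt: str) -> str:
    """Stand-in for the real API call; the sleep simulates generation latency."""
    time.sleep(0.05)
    return f"Answer to: {prompt}"

def cached_query(prompt: str) -> str:
    """Serve repeated prompts from the cache instead of re-calling the model."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in cache:
        print("Cache hit! Instant response.")
        return cache[key]
    print("Cache miss. Calling model...")
    response = call_model(prompt)
    cache[key] = response
    return response

first = cached_query("Explain machine learning")   # miss: pays full model latency
second = cached_query("Explain machine learning")  # hit: returns immediately
```

Note that this only helps for exact repeats; a semantically similar question ("what is ML?") misses the cache unless you normalize or embed queries first.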
Cache miss. Calling model...
Response time: 14.16s
Machine learning (ML) is a **subset of artificial intelligence (AI)** that enables systems to **learn from data, identify patterns, and make decisions or predictions** with minimal human intervention, rather than being explicitly programmed for every task.
Think of it like teaching a child:
* You don't give them a detailed rulebook for every situation.
* Instead, you show them many examples (data).
* They learn to recognize patterns and make their own judgments based on those examples.
* Over time, with more experience and feedback, they get better at it.
Machine learning works in a very similar way, but for computers.
---
Here's a breakdown of the key concepts:
1. **Learning from Data:**
* Instead of being explicitly coded with a set of if-then rules, ML algorithms are fed large amounts of data (training data).
* This data contains examples of the task the machine needs to learn, often including both inputs and desired outputs.
2. **Pattern Recognition:**
* The algorithms analyze this data to find statistical relationships, correlations, and hidden patterns.
* They "learn" a model that represents these patterns.
3. **Making Predictions or Decisions:**
* Once trained, the ML model can then be used on new, unseen data.
* It applies the patterns it learned from the training data to make predictions, classify new inputs, or make decisions.
4. **Iterative Improvement:**
* Machine learning models can continuously improve their performance over time as they are exposed to more data and feedback.
---
### How it Generally Works (Simplified):
1. **Data Collection:** Gather relevant data (e.g., images, text, numbers, sensor readings).
2. **Feature Engineering:** Select and transform the most important characteristics (features) from the data that the model will learn from.
3. **Algorithm Selection:** Choose a suitable machine learning algorithm (e.g., linear regression, decision trees, neural networks).
4. **Training:** Feed the processed data to the algorithm. The algorithm adjusts its internal parameters to minimize errors and learn the underlying patterns. The output of this phase is a "trained model."
5. **Evaluation:** Test the trained model on new, unseen data to see how well it performs and generalize.
6. **Deployment:** Once the model is satisfactory, it can be put into production to make real-time predictions or decisions.
---
### Main Types of Machine Learning:
1. **Supervised Learning:**
* **Concept:** Learning from labeled data, where both input and the correct output are provided. Like learning with a teacher.
* **Tasks:**
* **Classification:** Predicting a category (e.g., spam or not spam, cat or dog).
* **Regression:** Predicting a continuous value (e.g., house prices, temperature).
* **Examples:** Spam detection, image recognition, medical diagnosis.
2. **Unsupervised Learning:**
* **Concept:** Learning from unlabeled data, finding hidden structures or patterns without explicit guidance. Like exploring on your own.
* **Tasks:**
* **Clustering:** Grouping similar data points together (e.g., customer segmentation).
* **Dimensionality Reduction:** Simplifying data while retaining important information.
* **Examples:** Recommender systems, anomaly detection, topic modeling.
3. **Reinforcement Learning:**
* **Concept:** An agent learns to make a sequence of decisions in an environment by performing actions and receiving rewards or penalties. Like learning through trial and error.
* **Tasks:** Finding an optimal strategy to achieve a goal.
* **Examples:** Game playing (AlphaGo), robotics, self-driving cars (parts of it).
---
### Why is Machine Learning Important?
* **Automation:** Automates tasks that are complex or impossible to program manually.
* **Scalability:** Can process and learn from vast amounts of data that humans cannot.
* **Discovery:** Uncovers insights and patterns that might be hidden within data.
* **Adaptability:** Models can adapt and improve over time with new data.
* **Personalization:** Powers customized experiences (e.g., recommendation engines).
---
### Common Applications:
* **Recommendation Systems:** (Netflix, Amazon, YouTube)
* **Spam Filters:** (Email services)
* **Fraud Detection:** (Banks, credit card companies)
* **Facial Recognition:** (Phone unlocks, security systems)
* **Speech Recognition:** (Siri, Alexa, Google Assistant)
* **Medical Diagnosis:** (Identifying diseases from scans)
* **Natural Language Processing:** (Translation, sentiment analysis)
* **Self-Driving Cars:** (Object detection, path planning)
In essence, machine learning is about empowering computers to **learn from experience** (data) and make intelligent decisions, much like humans do, but at an unprecedented scale and speed.
Cache hit! Instant response.
[The same response as above, returned from the cache in milliseconds.]
The first call takes the full generation time (about 14 seconds in this run, for a long answer). The second call is instant, returning in milliseconds. For a production system, you'd use a more sophisticated cache with expiration times and size limits, but this shows the basic idea.
Streaming Responses
Sometimes you can't make the agent faster, but you can make it feel faster. Streaming responses show results as they're generated, rather than waiting for the complete answer.
Example: Streaming for Perceived Speed (Claude Sonnet 4.5)
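Here's the shape of a streaming loop. The generator below stands in for a real streaming API (the Anthropic SDK, for instance, exposes a stream of text chunks); the chunk contents and delays are simulated:

```python
import time

def stream_model(prompt: str):
    """Stand-in for a streaming API: yields text chunks as they're generated."""
    chunks = ["An API is ", "a messenger that lets ", "programs talk ", "to each other."]
    for chunk in chunks:
        time.sleep(0.05)  # simulated per-chunk generation delay
        yield chunk

print("Streaming response:")
pieces = []
for chunk in stream_model("What is an API?"):
    print(chunk, end="", flush=True)  # the user sees text immediately
    pieces.append(chunk)
print()
answer = "".join(pieces)  # full response, assembled once the stream ends
```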
Streaming response:
# What is an API?

An **API (Application Programming Interface)** is like a messenger that lets different software programs talk to each other.

## Simple Analogy

Think of a restaurant:
- **You** (the customer) are one program
- **The kitchen** (where food is made) is another program
- **The waiter** is the API

You don't go into the kitchen to cook. Instead, you tell the waiter what you want, the waiter takes your order to the kitchen, and then brings back your food. The waiter is the go-between that makes everything work smoothly.

## Real-World Example

When you use a weather app on your phone:
- The app doesn't store all the weather data itself
- It uses an API to ask a weather service "What's the weather in New York?"
- The API sends back the information
- Your app displays it nicely for you

## Why APIs Matter

They let developers:
- Use features from other services without rebuilding them
- Connect different apps and services together
- Save time and effort

**Bottom line:** APIs are the behind-the-scenes connectors that make modern apps and websites work together.

(The text above arrived in small chunks over the stream; it's assembled here for readability.)
The total time to generate the response doesn't change, but the user sees words appearing immediately. This makes the agent feel much more responsive. Instead of staring at a blank screen for three seconds, they see the answer forming in real time.
Parallel Tool Calls
If your agent needs to use multiple tools, doing them sequentially wastes time. Run them in parallel when possible.
Example: Sequential vs Parallel Tool Execution (Claude Sonnet 4.5)
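A sketch using asyncio, with three stand-in tools that each take one simulated second. Sequential awaits pay for each tool in turn, while asyncio.gather runs them concurrently:

```python
import asyncio
import time

# Stand-in tools: each sleeps for 1 second to simulate I/O-bound work
async def get_weather(city: str) -> str:
    await asyncio.sleep(1)
    return f"Weather in {city}: Sunny, 72°F"

async def get_news(topic: str) -> str:
    await asyncio.sleep(1)
    return f"Latest news on {topic}: [News headlines...]"

async def get_stock(symbol: str) -> str:
    await asyncio.sleep(1)
    return f"Stock price for {symbol}: $150.25"

async def run_sequential():
    # Each await blocks until the previous tool finishes: ~3s total
    return (await get_weather("San Francisco"),
            await get_news("technology"),
            await get_stock("AAPL"))

async def run_parallel():
    # All three tools run concurrently: ~1s total
    return await asyncio.gather(get_weather("San Francisco"),
                                get_news("technology"),
                                get_stock("AAPL"))

print("Testing sequential execution:")
start = time.perf_counter()
asyncio.run(run_sequential())
print(f"Sequential execution: {time.perf_counter() - start:.2f}s")

print("Testing parallel execution:")
start = time.perf_counter()
results = asyncio.run(run_parallel())
print(f"Parallel execution: {time.perf_counter() - start:.2f}s")
print(results)
```

This only works when the tool calls are independent of each other; if one tool needs another's output, they have to run in order.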
Testing sequential execution:
Sequential execution: 3.01s

Testing parallel execution:
Parallel execution: 1.01s

('Weather in San Francisco: Sunny, 72°F',
 'Latest news on technology: [News headlines...]',
 'Stock price for AAPL: $150.25')
The parallel version is three times faster because all three tools run simultaneously. For an agent that frequently uses multiple tools, this can dramatically improve response time.
Optimizing Prompt Size
Large prompts take longer to process. Every token in your prompt adds a small amount of latency. If you're including long system messages, conversation history, or retrieved documents, consider trimming them.
Here's a practical approach:
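One simple version: keep the system message and drop everything but the most recent turns. The message format mirrors the usual role/content dicts:

```python
def trim_history(messages: list[dict], keep: int = 5) -> list[dict]:
    """Keep any system messages plus the `keep` most recent turns."""
    system = [m for m in messages if m["role"] == "system"]
    recent = [m for m in messages if m["role"] != "system"][-keep:]
    return system + recent

history = [{"role": "system", "content": "You are a helpful assistant."}]
history += [{"role": "user", "content": f"Question {i}"} for i in range(12)]

trimmed = trim_history(history)
print(len(trimmed))  # 6: the system message plus the 5 most recent turns
```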
By keeping only the five most recent messages, you reduce the prompt size and speed up processing. The agent loses some context, but for many conversations, recent messages are all that matter.
Precomputing When Possible
If your agent does the same computation repeatedly, precompute it. For example, if your agent frequently needs to know the current date, time zone conversions, or common calculations, compute these once and reuse them.
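For example, compute these once at startup and reuse them on every request (the helper names here are just illustrative):

```python
from datetime import date

# Computed once at startup, reused on every request
TODAY = date.today().isoformat()
MILES_TO_KM = 1.60934  # standard conversion factor

def miles_to_km(miles: float) -> float:
    """Instant local arithmetic: no model call needed."""
    return miles * MILES_TO_KM

print(f"Today's date: {TODAY}")
print(f"10 miles = {miles_to_km(10):.2f} km")
```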
Today's date: 2025-12-07
10 miles = 16.09 km
These operations are instant because the values are precomputed. Compare this to calling a model to do unit conversions or date formatting, which would take seconds.
Measuring the Impact
As you apply these optimizations, measure their impact. Here's a simple benchmarking approach:
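The sketch below times the same set of queries before and after a change. The model call is stubbed so it runs standalone; replace call_model with your real agent call when benchmarking:

```python
import statistics
import time

def call_model(prompt: str) -> str:
    """Stand-in for the real agent call being benchmarked."""
    time.sleep(0.02)
    return "..."

def benchmark(queries: list[str], label: str) -> float:
    """Run each query, print average and total latency, return the average."""
    times = []
    for query in queries:
        start = time.perf_counter()
        call_model(query)
        times.append(time.perf_counter() - start)
    average = statistics.mean(times)
    print(f"{label} Average: {average:.2f}s Total: {sum(times):.2f}s")
    return average

queries = ["What is Python?", "What's 47 times 83?", "Define an API."]
baseline = benchmark(queries, "Before optimization:")
optimized = benchmark(queries, "After optimization (concise responses):")
print(f"Speed improvement: {(baseline - optimized) / baseline * 100:.1f}%")
```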
Before optimization:
Average: 2.92s Total: 14.58s

After optimization (concise responses):
Average: 3.36s Total: 16.78s

Speed improvement: -15.1%

Notice that in this particular run the "optimized" version actually came back slower. API latency is noisy, so average over many trials before drawing conclusions.
This gives you concrete numbers to evaluate your optimizations. You might find that limiting tokens saves 20% on response time, or that caching cuts average latency by 40%.
Putting It All Together
Let's build a fast agent that combines several of these techniques:
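A sketch that layers a cache on top of concise prompting and a query-dependent token budget, again with the model call stubbed so the example runs standalone:

```python
import time

def call_model(question: str, max_tokens: int) -> str:
    """Stand-in for the real API call (a concise system prompt is assumed)."""
    time.sleep(0.05)
    return f"Concise answer to: {question}"

class FastAgent:
    """Combines response caching, concise prompting, and smart token limits."""

    def __init__(self):
        self.cache: dict[str, str] = {}

    def ask(self, question: str) -> tuple[str, str]:
        """Return (answer, source), where source is "cache" or "model"."""
        if question in self.cache:
            return self.cache[question], "cache"
        # Short questions get a tight token budget; open-ended ones get more room
        max_tokens = 150 if len(question.split()) < 10 else 500
        answer = call_model(question, max_tokens)
        self.cache[question] = answer
        return answer, "model"

agent = FastAgent()
answer1, source1 = agent.ask("What is Python?")  # first ask: hits the model
answer2, source2 = agent.ask("What is Python?")  # repeat: served from cache
```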
Q: What is Python?
A: Python is a high-level, interpreted programming language known for its clear syntax and readability. It's widely used for web development, data science, artificial intelligence, automation, and general-purpose programming.
Source: model (3.05s)

Q: What is Python?
A: Python is a high-level, interpreted programming language known for its clear syntax and readability. It's widely used for web development, data science, artificial intelligence, automation, and general-purpose programming.
Source: cache

Q: Explain object-oriented programming
A: **Object-Oriented Programming (OOP)** is a programming paradigm that organizes code around "objects" – data structures containing both data (attributes) and functions (methods) that operate on that data. The four core principles are **encapsulation** (bundling data with methods), **inheritance** (creating new classes from existing ones), **polymorphism** (objects taking multiple forms), and **abstraction** (hiding complex implementation details).
Source: model (3.42s)
This agent combines caching, concise prompts, and smart token limits to deliver fast responses. The first query takes about three seconds, but the cached repeat is instant. Simple queries use fewer tokens, saving time and money.
When Speed Isn't Everything
Before we wrap up, a word of caution: don't optimize prematurely. Speed is important, but accuracy matters more. If your agent gives wrong answers quickly, that's worse than giving right answers slowly.
Start by building a correct agent. Then measure where the bottlenecks are. Apply optimizations strategically, and always verify that accuracy doesn't suffer. Sometimes the best answer requires the most capable model and a longer response time. That's okay.
The goal isn't to make every response instant. It's to make the agent as fast as possible while maintaining the quality users expect.
Glossary
Caching: Storing the results of expensive operations so they can be reused without recomputation. For agents, this typically means saving model responses for repeated queries.
Latency: The time delay between when a user makes a request and when they receive a response. Lower latency means a faster, more responsive agent.
Max Tokens: A parameter that limits how many tokens (words or word pieces) a language model can generate in a single response. Lower values produce shorter, faster responses.
Parallel Execution: Running multiple operations simultaneously rather than one after another. This can significantly reduce total execution time when operations don't depend on each other.
Streaming: Sending response data to the user incrementally as it's generated, rather than waiting for the complete response. This improves perceived speed even if total generation time is unchanged.
Token: The basic unit of text that language models process. A token is roughly equivalent to a word or word piece. Both input and output are measured in tokens.
Quiz
Ready to test your understanding? Take this quick quiz to reinforce what you've learned about speeding up AI agents.
About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.