
Speeding Up AI Agents: Performance Optimization Techniques for Faster Response Times

Michael Brenndoerfer • November 10, 2025

Learn practical techniques to make AI agents respond faster, including model selection strategies, response caching, streaming, parallel execution, and prompt optimization for reduced latency.

This article is part of the free-to-read AI Agent Handbook.

Speeding Up the Agent

You've built a capable assistant that can reason, use tools, remember conversations, and handle complex tasks. But there's a problem: sometimes it feels slow. A user asks a simple question, and they're waiting three seconds for an answer. They request a calculation, and the agent takes five seconds to respond. In a world where we expect instant feedback, those delays add up.

Speed matters. A fast agent feels responsive and natural to use. A slow one frustrates users and breaks the flow of conversation. The good news is that you can make your agent significantly faster without sacrificing much capability. This chapter shows you how.

Why Speed Matters

Let's start with a scenario. Your assistant is deployed, and a user asks: "What's 47 times 83?"

The agent springs into action. It sends the query to Claude Sonnet 4.5, which thinks about the problem, decides to use the calculator tool, performs the calculation, and generates a response. Total time: 4.2 seconds.

Now imagine the user asks ten questions in a row. That's 42 seconds of waiting. The user gets impatient. They start to wonder if something's broken. They might even give up and use a different tool.

Speed isn't just about user experience, though that's important. It's also about cost. Most language model APIs charge per token, so a verbose agent costs more to run: if your agent generates twice as many tokens per response, you're paying roughly twice as much per interaction, and your users are waiting longer for answers they could have gotten in a sentence.

The challenge is balancing speed with capability. You want your agent to be fast, but not at the expense of accuracy or usefulness. The techniques in this chapter help you find that balance.

Understanding Where Time Goes

Before we optimize, we need to understand where the time goes. Let's break down a typical agent interaction:

import time
from anthropic import Anthropic

# Using Claude Sonnet 4.5 for its agent capabilities
client = Anthropic(api_key="YOUR_API_KEY")

def timed_agent_call(user_query):
    """Track timing for each step of agent execution."""
    timings = {}

    # Step 1: Send request to model
    start = time.time()
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": user_query}]
    )
    timings['model_call'] = time.time() - start

    # Step 2: Process response
    start = time.time()
    result = response.content[0].text
    timings['processing'] = time.time() - start

    return result, timings

# Test with a simple query
result, timings = timed_agent_call("What is the capital of France?")
print(f"Model call: {timings['model_call']:.2f}s")
print(f"Processing: {timings['processing']:.2f}s")
print(f"Total: {sum(timings.values()):.2f}s")

When you run this, you'll see something like:

Model call: 1.85s
Processing: 0.01s
Total: 1.86s

The vast majority of time is spent waiting for the model to generate a response. Processing the result is nearly instantaneous. This tells us where to focus our optimization efforts: the model call itself.

Choosing the Right Model for the Task

Not every task needs your most powerful model. Claude Sonnet 4.5 is excellent for complex reasoning and tool use, but it's overkill for simple questions. Using a smaller, faster model for straightforward tasks can cut response time in half or more.

Think of it like transportation. You wouldn't take a semi-truck to pick up groceries. A car works fine. Similarly, you don't need your most capable model for every query.

Example: Model Selection Strategy (GPT-5)

Let's build a simple router that chooses the right model based on query complexity:

from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

def classify_query_complexity(query):
    """Determine if a query needs a powerful model or a fast one."""
    # Simple heuristic: check for complexity indicators
    complex_indicators = [
        "explain", "analyze", "compare", "why", "how does",
        "step by step", "reasoning", "pros and cons"
    ]

    query_lower = query.lower()
    is_complex = any(indicator in query_lower for indicator in complex_indicators)

    return "complex" if is_complex else "simple"

def route_to_model(query):
    """Route a query to the appropriate model based on complexity."""
    complexity = classify_query_complexity(query)

    if complexity == "complex":
        # Use Claude Sonnet 4.5 for complex reasoning
        from anthropic import Anthropic
        anthropic_client = Anthropic(api_key="YOUR_ANTHROPIC_KEY")
        response = anthropic_client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": query}]
        )
        return response.content[0].text

    # Handle simple queries with GPT-5 for fast, short responses
    response = client.chat.completions.create(
        model="gpt-5",
        max_completion_tokens=150,
        messages=[{"role": "user", "content": query}]
    )

    return response.choices[0].message.content

# Test both paths
print("Simple query:", route_to_model("What's the capital of France?"))
print("Complex query:", route_to_model("Explain why France chose Paris as its capital"))

This approach gives you speed when you need it and power when you need it. The simple query gets a fast response from GPT-5, while the complex one gets the full reasoning capability of Claude Sonnet 4.5.

Limiting Response Length

Every token the model generates takes time. If your agent produces 500-word responses when 100 words would suffice, you're wasting time and money.

You can control this with the max_tokens parameter, but there's a better way: prompt engineering. Tell the model explicitly to be concise.

Example: Concise Responses (Claude Sonnet 4.5)

from anthropic import Anthropic

client = Anthropic(api_key="YOUR_API_KEY")

def get_concise_response(query):
    """Get a brief, focused response from the model."""
    # Using Claude Sonnet 4.5 for its excellent instruction-following
    system_prompt = """You are a helpful assistant. Provide concise, direct answers.
    Use 1-2 sentences for simple questions. Only elaborate when specifically asked.
    Avoid unnecessary explanations or examples unless requested."""

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=200,  # Hard limit for safety
        system=system_prompt,
        messages=[{"role": "user", "content": query}]
    )

    return response.content[0].text

# Compare against the default, verbose behavior
query = "What is Python?"

print("With concise prompt:")
print(get_concise_response(query))
# In practice, the concise prompt typically cuts output length substantially compared to the default behavior

The concise version might respond: "Python is a high-level programming language known for its readability and versatility." That's about a dozen words instead of a 200-word explanation. The user gets their answer faster, and you save on API costs.

Caching Responses

If users frequently ask the same questions, why recompute the answer every time? Cache the response and serve it instantly on subsequent requests.

Example: Simple Response Cache (Gemini 2.5 Flash)

import google.generativeai as genai
import hashlib
import time

genai.configure(api_key="YOUR_GOOGLE_API_KEY")

class CachedAgent:
    """Agent with response caching for repeated queries."""

    def __init__(self):
        # Using Gemini 2.5 Flash for fast responses
        self.model = genai.GenerativeModel('gemini-2.5-flash')
        self.cache = {}

    def _hash_query(self, query):
        """Create a cache key from the query."""
        return hashlib.md5(query.encode()).hexdigest()

    def respond(self, query):
        """Get response, using cache if available."""
        cache_key = self._hash_query(query)

        # Check cache first
        if cache_key in self.cache:
            print("Cache hit! Instant response.")
            return self.cache[cache_key]

        # Cache miss: call the model
        print("Cache miss. Calling model...")
        start = time.time()
        response = self.model.generate_content(query)
        elapsed = time.time() - start

        result = response.text
        self.cache[cache_key] = result

        print(f"Response time: {elapsed:.2f}s")
        return result

# Test the cache
agent = CachedAgent()

# First call: cache miss
print(agent.respond("What is machine learning?"))
print()

# Second call: cache hit (instant)
print(agent.respond("What is machine learning?"))

The first call takes the normal time (maybe 1.5 seconds). The second call is instant, returning in milliseconds. For a production system, you'd use a more sophisticated cache with expiration times and size limits, but this shows the basic idea.
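To sketch what that might look like, here's a small cache with a time-to-live and a size cap. The limits below are illustrative values rather than recommendations, and the md5 hashing mirrors the example above:

import hashlib
import time
from collections import OrderedDict

class TTLCache:
    """In-memory cache with expiration and a size cap (illustrative limits)."""

    def __init__(self, max_entries=1000, ttl_seconds=3600):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self._entries = OrderedDict()  # cache key -> (timestamp, value)

    def _key(self, query):
        return hashlib.md5(query.encode()).hexdigest()

    def get(self, query):
        """Return a cached value, or None if missing or expired."""
        key = self._key(query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        timestamp, value = entry
        if time.time() - timestamp > self.ttl_seconds:
            del self._entries[key]  # Expired: drop it and treat as a miss
            return None
        self._entries.move_to_end(key)  # Mark as recently used
        return value

    def set(self, query, value):
        """Store a value, evicting the least recently used entry if full."""
        key = self._key(query)
        self._entries[key] = (time.time(), value)
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

# Usage: check the cache before calling the model, store the result after
cache = TTLCache(max_entries=500, ttl_seconds=1800)
if cache.get("What is machine learning?") is None:
    cache.set("What is machine learning?", "...model response goes here...")

The same interface works no matter which model generates the responses, so you could drop it into the CachedAgent above in place of the plain dictionary.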

Streaming Responses

Sometimes you can't make the agent faster, but you can make it feel faster. Streaming responses show results as they're generated, rather than waiting for the complete answer.

Example: Streaming for Perceived Speed (Claude Sonnet 4.5)

from anthropic import Anthropic

client = Anthropic(api_key="YOUR_API_KEY")

def stream_response(query):
    """Stream the response as it's generated."""
    # Using Claude Sonnet 4.5 with streaming enabled
    with client.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
    print()  # New line at the end

# Try it out
print("Streaming response:")
stream_response("Explain what an API is in simple terms.")

The total time to generate the response doesn't change, but the user sees words appearing immediately. This makes the agent feel much more responsive. Instead of staring at a blank screen for three seconds, they see the answer forming in real time.
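If you want to put numbers on that perceived speed, measure time to first token separately from total generation time. Here's a minimal sketch using the same streaming API; the query is just an example:

import time
from anthropic import Anthropic

client = Anthropic(api_key="YOUR_API_KEY")

def measure_streaming_latency(query):
    """Compare time to first token (what the user feels) with total generation time."""
    start = time.time()
    first_token_time = None

    with client.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    ) as stream:
        for text in stream.text_stream:
            if first_token_time is None:
                first_token_time = time.time() - start  # The user sees output here
    total_time = time.time() - start

    print(f"Time to first token: {first_token_time:.2f}s")
    print(f"Total generation time: {total_time:.2f}s")

measure_streaming_latency("Explain what an API is in simple terms.")

Time to first token is usually a small fraction of the total, which is exactly why streaming feels so much faster.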

Parallel Tool Calls

If your agent needs to use multiple tools, doing them sequentially wastes time. Run them in parallel when possible.

Example: Sequential vs Parallel Tool Execution (Claude Sonnet 4.5)

import time
import asyncio

# Simulated tool functions
def get_weather(city):
    """Simulate weather API call."""
    time.sleep(1)  # Simulate network delay
    return f"Weather in {city}: Sunny, 72°F"

def get_news(topic):
    """Simulate news API call."""
    time.sleep(1)  # Simulate network delay
    return f"Latest news on {topic}: [News headlines...]"

def get_stock_price(symbol):
    """Simulate stock API call."""
    time.sleep(1)  # Simulate network delay
    return f"Stock price for {symbol}: $150.25"

# Sequential execution (slow)
def sequential_tools():
    """Call tools one after another."""
    start = time.time()

    weather = get_weather("San Francisco")
    news = get_news("technology")
    stock = get_stock_price("AAPL")

    elapsed = time.time() - start
    print(f"Sequential execution: {elapsed:.2f}s")
    return weather, news, stock

# Parallel execution (fast)
async def parallel_tools():
    """Call tools simultaneously."""
    start = time.time()

    # Run all tools in parallel
    weather_task = asyncio.to_thread(get_weather, "San Francisco")
    news_task = asyncio.to_thread(get_news, "technology")
    stock_task = asyncio.to_thread(get_stock_price, "AAPL")

    # Wait for all to complete
    weather, news, stock = await asyncio.gather(
        weather_task, news_task, stock_task
    )

    elapsed = time.time() - start
    print(f"Parallel execution: {elapsed:.2f}s")
    return weather, news, stock

# Compare the approaches
print("Testing sequential execution:")
sequential_tools()

print("\nTesting parallel execution:")
asyncio.run(parallel_tools())

Output:

Testing sequential execution:
Sequential execution: 3.01s

Testing parallel execution:
Parallel execution: 1.01s

The parallel version is three times faster because all three tools run simultaneously. For an agent that frequently uses multiple tools, this can dramatically improve response time.
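The example above simulates the tools directly. In a real agent loop, the same idea applies to the tool calls the model requests: Claude Sonnet 4.5 can ask for several tools in a single response, and you can dispatch those tool_use blocks concurrently instead of one at a time. The sketch below is not a full agent loop; it assumes you've registered the simulated functions above under the tool names the model knows, and that the tools don't depend on each other's results:

import asyncio

# Assumed mapping from tool names (as declared to the model) to local functions
TOOL_FUNCTIONS = {
    "get_weather": lambda args: get_weather(args["city"]),
    "get_news": lambda args: get_news(args["topic"]),
    "get_stock_price": lambda args: get_stock_price(args["symbol"]),
}

async def execute_tool_calls(response):
    """Run every tool_use block from a model response concurrently."""
    tool_blocks = [block for block in response.content if block.type == "tool_use"]

    # Each tool runs in its own thread, so slow network calls overlap
    results = await asyncio.gather(*[
        asyncio.to_thread(TOOL_FUNCTIONS[block.name], block.input)
        for block in tool_blocks
    ])

    # Package the results as tool_result blocks for the follow-up request
    return [
        {"type": "tool_result", "tool_use_id": block.id, "content": str(result)}
        for block, result in zip(tool_blocks, results)
    ]

You still send these results back to the model in a second request, but the waiting happens once for all tools instead of once per tool.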

Optimizing Prompt Size

Large prompts take longer to process. Every token in your prompt adds a small amount of latency. If you're including long system messages, conversation history, or retrieved documents, consider trimming them.

Here's a practical approach:

from anthropic import Anthropic

client = Anthropic(api_key="YOUR_API_KEY")

def trim_conversation_history(messages, max_messages=5):
    """Keep only the most recent messages to reduce prompt size."""
    if len(messages) <= max_messages:
        return messages

    # Keep the most recent exchanges
    return messages[-max_messages:]

def summarize_long_context(text, max_length=500):
    """Truncate very long context to essential information."""
    if len(text) <= max_length:
        return text

    # Simple truncation (in production, you might use a summary model)
    return text[:max_length] + "... [truncated]"

# Example usage
conversation_history = [
    {"role": "user", "content": "Tell me about Python"},
    {"role": "assistant", "content": "Python is a programming language..."},
    # ... many more messages ...
]

# Trim history before sending
trimmed = trim_conversation_history(conversation_history, max_messages=5)

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=512,
    messages=trimmed
)

By keeping only the five most recent messages, you reduce the prompt size and speed up processing. The agent loses some context, but for many conversations, recent messages are all that matter.
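Message count is a crude proxy, though; what the model actually processes is tokens. A slightly better sketch trims by an approximate token budget instead, using the rough rule of thumb that a token is about four characters of English text (an approximation, not an exact count):

def trim_to_token_budget(messages, max_tokens=2000, chars_per_token=4):
    """Keep the most recent messages that fit within an approximate token budget."""
    kept = []
    budget = max_tokens * chars_per_token  # Work in characters for simplicity

    # Walk backward from the newest message and keep whatever still fits
    for message in reversed(messages):
        cost = len(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost

    return list(reversed(kept))

# Same conversation history as above, trimmed by budget instead of message count
trimmed = trim_to_token_budget(conversation_history, max_tokens=2000)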

Precomputing When Possible

If your agent does the same computation repeatedly, precompute it. For example, if your agent frequently needs to know the current date, time zone conversions, or common calculations, compute these once and reuse them.

from datetime import datetime
import time

class PrecomputedAgent:
    """Agent that caches common computations."""

    def __init__(self):
        self.precomputed = {
            'current_date': datetime.now().strftime("%Y-%m-%d"),
            'current_time': datetime.now().strftime("%H:%M:%S"),
            'common_conversions': {
                'miles_to_km': 1.60934,
                'pounds_to_kg': 0.453592,
                'fahrenheit_to_celsius': lambda f: (f - 32) * 5/9
            }
        }
        self.last_update = time.time()

    def get_current_date(self):
        """Return precomputed date (refresh if stale)."""
        if time.time() - self.last_update > 3600:  # Refresh every hour
            self.precomputed['current_date'] = datetime.now().strftime("%Y-%m-%d")
            self.last_update = time.time()

        return self.precomputed['current_date']

    def convert_units(self, value, conversion_type):
        """Use precomputed conversion factors."""
        converter = self.precomputed['common_conversions'].get(conversion_type)
        if converter is None:
            raise ValueError(f"Unknown conversion: {conversion_type}")
        if callable(converter):
            return converter(value)
        return value * converter

agent = PrecomputedAgent()
print(f"Today's date: {agent.get_current_date()}")  # Instant
print(f"10 miles = {agent.convert_units(10, 'miles_to_km'):.2f} km")  # Instant

These operations are instant because the values are precomputed. Compare this to calling a model to do unit conversions or date formatting, which would take seconds.

Measuring the Impact

As you apply these optimizations, measure their impact. Here's a simple benchmarking approach:

import time
from anthropic import Anthropic

client = Anthropic(api_key="YOUR_API_KEY")

def benchmark_agent(queries, model="claude-sonnet-4-5", max_tokens=1024):
    """Measure average response time for a set of queries."""
    times = []

    for query in queries:
        start = time.time()
        response = client.messages.create(
            model=model,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": query}]
        )
        elapsed = time.time() - start
        times.append(elapsed)

    avg_time = sum(times) / len(times)
    min_time = min(times)
    max_time = max(times)

    return {
        'average': avg_time,
        'min': min_time,
        'max': max_time,
        'total': sum(times)
    }

# Test queries
test_queries = [
    "What is 47 times 83?",
    "What's the capital of France?",
    "How many days until Christmas?",
    "Convert 10 miles to kilometers",
    "What's the weather like today?"
]

# Benchmark before optimization
print("Before optimization:")
results_before = benchmark_agent(test_queries, max_tokens=1024)
print(f"Average: {results_before['average']:.2f}s")
print(f"Total: {results_before['total']:.2f}s")

# Benchmark after optimization (concise responses)
print("\nAfter optimization (concise responses):")
results_after = benchmark_agent(test_queries, max_tokens=200)
print(f"Average: {results_after['average']:.2f}s")
print(f"Total: {results_after['total']:.2f}s")

# Calculate improvement
improvement = (1 - results_after['average'] / results_before['average']) * 100
print(f"\nSpeed improvement: {improvement:.1f}%")

This gives you concrete numbers to evaluate your optimizations. You might find that limiting tokens saves 20% on response time, or that caching cuts average latency by 40%.
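Averages can also hide the occasional very slow response. If you keep the raw list of timings, a quick look at percentiles tells you what a typical request feels like versus a bad one. A minimal sketch (the numbers below are illustrative only):

def latency_percentiles(times):
    """Median and 95th-percentile latency from a list of response times in seconds."""
    ordered = sorted(times)
    p50 = ordered[len(ordered) // 2]
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    return {"p50": p50, "p95": p95}

# Illustrative timings only
print(latency_percentiles([1.2, 1.4, 1.3, 1.5, 4.8, 1.3, 1.4, 1.2, 1.6, 1.3]))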

Putting It All Together

Let's build a fast agent that combines several of these techniques:

from anthropic import Anthropic
import hashlib
import time

class FastAgent:
    """An optimized agent that prioritizes speed."""

    def __init__(self):
        self.client = Anthropic(api_key="YOUR_API_KEY")
        self.cache = {}
        self.system_prompt = """You are a helpful assistant. Provide concise,
        direct answers. Use 1-2 sentences for simple questions."""

    def _hash_query(self, query):
        """Create cache key."""
        return hashlib.md5(query.encode()).hexdigest()

    def _is_simple_query(self, query):
        """Determine if query is simple enough for a short response."""
        simple_patterns = ["what is", "who is", "when is", "where is"]
        return any(pattern in query.lower() for pattern in simple_patterns)

    def respond(self, query):
        """Get fast response using multiple optimization techniques."""
        # 1. Check cache first
        cache_key = self._hash_query(query)
        if cache_key in self.cache:
            return self.cache[cache_key], "cache"

        # 2. Choose appropriate model and token limit
        if self._is_simple_query(query):
            max_tokens = 150
            model = "claude-sonnet-4-5"  # Still fast for simple queries
        else:
            max_tokens = 500
            model = "claude-sonnet-4-5"

        # 3. Make the call with optimizations
        start = time.time()
        response = self.client.messages.create(
            model=model,
            max_tokens=max_tokens,
            system=self.system_prompt,
            messages=[{"role": "user", "content": query}]
        )
        elapsed = time.time() - start

        result = response.content[0].text

        # 4. Cache the result
        self.cache[cache_key] = result

        return result, f"model ({elapsed:.2f}s)"

# Test the fast agent
agent = FastAgent()

queries = [
    "What is Python?",
    "What is Python?",  # Should hit cache
    "Explain object-oriented programming",
]

for query in queries:
    result, source = agent.respond(query)
    print(f"Q: {query}")
    print(f"A: {result}")
    print(f"Source: {source}\n")
This agent combines caching, concise prompts, and smart token limits to deliver fast responses. The first query might take 1.5 seconds, but the cached version is instant. Simple queries use fewer tokens, saving time and money.

When Speed Isn't Everything

Before we wrap up, a word of caution: don't optimize prematurely. Speed is important, but accuracy matters more. If your agent gives wrong answers quickly, that's worse than giving right answers slowly.

Start by building a correct agent. Then measure where the bottlenecks are. Apply optimizations strategically, and always verify that accuracy doesn't suffer. Sometimes the best answer requires the most capable model and a longer response time. That's okay.

The goal isn't to make every response instant. It's to make the agent as fast as possible while maintaining the quality users expect.

Glossary

Caching: Storing the results of expensive operations so they can be reused without recomputation. For agents, this typically means saving model responses for repeated queries.

Latency: The time delay between when a user makes a request and when they receive a response. Lower latency means a faster, more responsive agent.

Max Tokens: A parameter that limits how many tokens (words or word pieces) a language model can generate in a single response. Lower values produce shorter, faster responses.

Parallel Execution: Running multiple operations simultaneously rather than one after another. This can significantly reduce total execution time when operations don't depend on each other.

Streaming: Sending response data to the user incrementally as it's generated, rather than waiting for the complete response. This improves perceived speed even if total generation time is unchanged.

Token: The basic unit of text that language models process. A token is roughly equivalent to a word or word piece. Both input and output are measured in tokens.

