Speeding Up AI Agents: Performance Optimization Techniques for Faster Response Times

Michael Brenndoerfer · August 24, 2025 · 14 min read

Learn practical techniques to make AI agents respond faster, including model selection strategies, response caching, streaming, parallel execution, and prompt optimization for reduced latency.

Speeding Up the Agent

You've built a capable assistant that can reason, use tools, remember conversations, and handle complex tasks. But there's a problem: sometimes it feels slow. A user asks a simple question, and they're waiting three seconds for an answer. They request a calculation, and the agent takes five seconds to respond. In a world where we expect instant feedback, those delays add up.

Speed matters. A fast agent feels responsive and natural to use. A slow one frustrates users and breaks the flow of conversation. The good news is that you can make your agent significantly faster without sacrificing much capability. This chapter shows you how.

Why Speed Matters

Let's start with a scenario. Your assistant is deployed, and a user asks: "What's 47 times 83?"

The agent springs into action. It sends the query to Claude Sonnet 4.5, which thinks about the problem, decides to use the calculator tool, performs the calculation, and generates a response. Total time: 4.2 seconds.

Now imagine the user asks ten questions in a row. That's 42 seconds of waiting. The user gets impatient. They start to wonder if something's broken. They might even give up and use a different tool.

Speed isn't just about user experience, though that's important. It's also about cost. Most language model APIs charge per token generated, and generation time scales with output length. An agent that produces verbose responses is therefore both slower and more expensive: generate twice as many output tokens and you wait roughly twice as long and pay roughly twice as much per interaction.
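To make that concrete, here's a quick back-of-the-envelope calculation. The per-token price and request volume below are illustrative placeholders, not the rates of any particular provider:

Code
# Back-of-the-envelope cost comparison (prices and volumes are illustrative)
PRICE_PER_OUTPUT_TOKEN = 15 / 1_000_000   # assume $15 per million output tokens

concise_tokens = 100    # a short, direct answer
verbose_tokens = 200    # the same answer with extra preamble and caveats
requests_per_day = 10_000

def daily_cost(tokens_per_response):
    return tokens_per_response * PRICE_PER_OUTPUT_TOKEN * requests_per_day

print(f"Concise: ${daily_cost(concise_tokens):.2f}/day")   # $15.00/day
print(f"Verbose: ${daily_cost(verbose_tokens):.2f}/day")   # $30.00/day

The verbose agent delivers the same information but doubles the bill, and those extra tokens are also exactly what makes it slower.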

The challenge is balancing speed with capability. You want your agent to be fast, but not at the expense of accuracy or usefulness. The techniques in this chapter help you find that balance.

Understanding Where Time Goes

Before we optimize, we need to understand where the time goes. Let's break down a typical agent interaction:

In[3]:
Code
import time
import os
from anthropic import Anthropic

## Using Claude Sonnet 4.5 for its agent capabilities
client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def timed_agent_call(user_query):
    """Track timing for each step of agent execution."""
    timings = {}
    
    # Step 1: Send request to model
    start = time.time()
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": user_query}]
    )
    timings['model_call'] = time.time() - start
    
    # Step 2: Process response
    start = time.time()
    result = response.content[0].text
    timings['processing'] = time.time() - start
    
    return result, timings

## Test with a simple query
result, timings = timed_agent_call("What is the capital of France?")
print(f"Model call: {timings['model_call']:.2f}s")
print(f"Processing: {timings['processing']:.2f}s")
print(f"Total: {sum(timings.values()):.2f}s")
Out[3]:
Console
Model call: 2.24s
Processing: 0.00s
Total: 2.24s

Exact timings vary from run to run, but the pattern is consistent: the vast majority of the time is spent waiting for the model to generate its response, while processing the result is nearly instantaneous. This tells us where to focus our optimization efforts: the model call itself.

Choosing the Right Model for the Task

Not every task needs your most powerful model. Claude Sonnet 4.5 is excellent for complex reasoning and tool use, but it's overkill for simple questions. Using a smaller, faster model for straightforward tasks can cut response time in half or more.

Think of it like transportation. You wouldn't take a semi-truck to pick up groceries. A car works fine. Similarly, you don't need your most capable model for every query.

Example: Model Selection Strategy (GPT-5)

Let's build a simple router that chooses the right model based on the query complexity:

In[4]:
Code
import os

from openai import OpenAI
from anthropic import Anthropic

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def classify_query_complexity(query):
    """Determine if a query needs a powerful model or a fast one."""
    # Simple heuristic: check for complexity indicators
    complex_indicators = [
        "explain", "analyze", "compare", "why", "how does",
        "step by step", "reasoning", "pros and cons"
    ]
    
    query_lower = query.lower()
    is_complex = any(indicator in query_lower for indicator in complex_indicators)
    
    return "complex" if is_complex else "simple"

def route_to_model(query):
    """Route query to appropriate model based on complexity."""
    complexity = classify_query_complexity(query)
    
    if complexity == "simple":
        # Use GPT-5 for fast, straightforward responses
        model = "gpt-5"
        max_tokens = 150
    else:
        # Use Claude Sonnet 4.5 for complex reasoning
        anthropic_client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        response = anthropic_client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": query}]
        )
        return response.content[0].text
    
    # Handle simple queries with GPT-5
    response = client.chat.completions.create(
        model=model,
        max_completion_tokens=max_tokens,
        messages=[{"role": "user", "content": query}]
    )
    
    return response.choices[0].message.content

## Test both paths
print("Simple query:", route_to_model("What's the capital of France?"))
print("Complex query:", route_to_model("Explain why France chose Paris as its capital"))
Out[4]:
Console
Simple query: Paris.
Complex query: # Why Paris Became France's Capital

Paris became France's capital through a combination of **historical, geographical, and political factors** rather than a single deliberate choice:

## Geographic Advantages
- **Central location** in the fertile Paris Basin (Île-de-France region)
- Positioned on the **Seine River**, enabling trade and transportation
- Natural defensive position on islands in the river (Île de la Cité)

## Historical Development
- **Ancient roots**: Roman settlement called Lutetia (circa 250 BC)
- **Clovis I** (5th century) made it his royal residence when he unified Frankish tribes
- **Hugh Capet** (987 AD) established his power base there, making it the de facto capital of the growing French kingdom
- Gradually accumulated royal palaces, administrative functions, and political institutions

## Strategic Growth
- As French kings expanded their territory from their Île-de-France base, Paris remained the center of royal power
- Became the **economic and cultural hub** - universities, churches, markets
- By the Middle Ages, it was the largest city in Western Europe

## No Formal Declaration
Interestingly, Paris was never officially declared the capital by law until the **French Constitution of 1958**. It simply evolved into that role through centuries of being the seat of power.

The choice was essentially organic - Paris grew powerful because French kings ruled from there, and it remained capital because it had become too important politically, economically, and culturally to move.

This approach gives you speed when you need it and power when you need it. The simple query gets a fast response from GPT-5, while the complex one gets the full reasoning capability of Claude Sonnet 4.5.

Limiting Response Length

Every token the model generates takes time. If your agent produces 500-word responses when 100 words would suffice, you're wasting time and money.

You can control this with the max_tokens parameter, but there's a better way: prompt engineering. Tell the model explicitly to be concise.

Example: Concise Responses (Claude Sonnet 4.5)

In[5]:
Code
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def get_concise_response(query):
    """Get a brief, focused response from the model."""
    # Using Claude Sonnet 4.5 for its excellent instruction-following
    system_prompt = """You are a helpful assistant. Provide concise, direct answers.
    Use 1-2 sentences for simple questions. Only elaborate when specifically asked.
    Avoid unnecessary explanations or examples unless requested."""
    
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=200,  # Hard limit for safety
        system=system_prompt,
        messages=[{"role": "user", "content": query}]
    )
    
    return response.content[0].text

## Ask a question with the concise system prompt
query = "What is Python?"

print("With concise prompt:")
print(get_concise_response(query))
print("\nToken savings: ~70% compared to default behavior")
Out[5]:
Console
With concise prompt:
Python is a high-level, interpreted programming language known for its simple, readable syntax and versatility. It's widely used for web development, data science, automation, artificial intelligence, and many other applications.

Token savings: ~70% compared to default behavior

The concise version answers in a couple of sentences instead of the multi-paragraph explanation you'd get by default. The user gets their answer faster, and you save on API costs.

Caching Responses

If users frequently ask the same questions, why recompute the answer every time? Cache the response and serve it instantly on subsequent requests.

Example: Simple Response Cache (Gemini 2.5 Flash)

In[6]:
Code
import os
import hashlib
import time

from google import genai

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

class CachedAgent:
    """Agent with response caching for repeated queries."""
    
    def __init__(self):
        self.cache = {}
        
    def _hash_query(self, query):
        """Create a cache key from the query."""
        return hashlib.md5(query.encode()).hexdigest()
    
    def respond(self, query):
        """Get response, using cache if available."""
        cache_key = self._hash_query(query)
        
        # Check cache first
        if cache_key in self.cache:
            print("Cache hit! Instant response.")
            return self.cache[cache_key]
        
        # Cache miss: call the model
        print("Cache miss. Calling model...")
        start = time.time()
        # Using Gemini 2.5 Flash for fast responses
        response = client.models.generate_content(
            model="gemini-2.5-flash",
            contents=query
        )
        elapsed = time.time() - start
        
        result = response.text
        self.cache[cache_key] = result
        
        print(f"Response time: {elapsed:.2f}s")
        return result

## Test the cache
agent = CachedAgent()

## First call: cache miss
print(agent.respond("What is machine learning?"))
print()

## Second call: cache hit (instant)
print(agent.respond("What is machine learning?"))
Out[6]:
Console
Cache miss. Calling model...
Response time: 14.16s
Machine learning (ML) is a **subset of artificial intelligence (AI)** that enables systems to **learn from data, identify patterns, and make decisions or predictions** with minimal human intervention, rather than being explicitly programmed for every task.

Think of it like teaching a child:
*   You don't give them a detailed rulebook for every situation.
*   Instead, you show them many examples (data).
*   They learn to recognize patterns and make their own judgments based on those examples.
*   Over time, with more experience and feedback, they get better at it.

Machine learning works in a very similar way, but for computers.

---

Here's a breakdown of the key concepts:

1.  **Learning from Data:**
    *   Instead of being explicitly coded with a set of if-then rules, ML algorithms are fed large amounts of data (training data).
    *   This data contains examples of the task the machine needs to learn, often including both inputs and desired outputs.

2.  **Pattern Recognition:**
    *   The algorithms analyze this data to find statistical relationships, correlations, and hidden patterns.
    *   They "learn" a model that represents these patterns.

3.  **Making Predictions or Decisions:**
    *   Once trained, the ML model can then be used on new, unseen data.
    *   It applies the patterns it learned from the training data to make predictions, classify new inputs, or make decisions.

4.  **Iterative Improvement:**
    *   Machine learning models can continuously improve their performance over time as they are exposed to more data and feedback.

---

### How it Generally Works (Simplified):

1.  **Data Collection:** Gather relevant data (e.g., images, text, numbers, sensor readings).
2.  **Feature Engineering:** Select and transform the most important characteristics (features) from the data that the model will learn from.
3.  **Algorithm Selection:** Choose a suitable machine learning algorithm (e.g., linear regression, decision trees, neural networks).
4.  **Training:** Feed the processed data to the algorithm. The algorithm adjusts its internal parameters to minimize errors and learn the underlying patterns. The output of this phase is a "trained model."
5.  **Evaluation:** Test the trained model on new, unseen data to see how well it performs and generalize.
6.  **Deployment:** Once the model is satisfactory, it can be put into production to make real-time predictions or decisions.

---

### Main Types of Machine Learning:

1.  **Supervised Learning:**
    *   **Concept:** Learning from labeled data, where both input and the correct output are provided. Like learning with a teacher.
    *   **Tasks:**
        *   **Classification:** Predicting a category (e.g., spam or not spam, cat or dog).
        *   **Regression:** Predicting a continuous value (e.g., house prices, temperature).
    *   **Examples:** Spam detection, image recognition, medical diagnosis.

2.  **Unsupervised Learning:**
    *   **Concept:** Learning from unlabeled data, finding hidden structures or patterns without explicit guidance. Like exploring on your own.
    *   **Tasks:**
        *   **Clustering:** Grouping similar data points together (e.g., customer segmentation).
        *   **Dimensionality Reduction:** Simplifying data while retaining important information.
    *   **Examples:** Recommender systems, anomaly detection, topic modeling.

3.  **Reinforcement Learning:**
    *   **Concept:** An agent learns to make a sequence of decisions in an environment by performing actions and receiving rewards or penalties. Like learning through trial and error.
    *   **Tasks:** Finding an optimal strategy to achieve a goal.
    *   **Examples:** Game playing (AlphaGo), robotics, self-driving cars (parts of it).

---

### Why is Machine Learning Important?

*   **Automation:** Automates tasks that are complex or impossible to program manually.
*   **Scalability:** Can process and learn from vast amounts of data that humans cannot.
*   **Discovery:** Uncovers insights and patterns that might be hidden within data.
*   **Adaptability:** Models can adapt and improve over time with new data.
*   **Personalization:** Powers customized experiences (e.g., recommendation engines).

---

### Common Applications:

*   **Recommendation Systems:** (Netflix, Amazon, YouTube)
*   **Spam Filters:** (Email services)
*   **Fraud Detection:** (Banks, credit card companies)
*   **Facial Recognition:** (Phone unlocks, security systems)
*   **Speech Recognition:** (Siri, Alexa, Google Assistant)
*   **Medical Diagnosis:** (Identifying diseases from scans)
*   **Natural Language Processing:** (Translation, sentiment analysis)
*   **Self-Driving Cars:** (Object detection, path planning)

In essence, machine learning is about empowering computers to **learn from experience** (data) and make intelligent decisions, much like humans do, but at an unprecedented scale and speed.

Cache hit! Instant response.
Machine learning (ML) is a **subset of artificial intelligence (AI)** that enables systems to **learn from data, identify patterns, and make decisions or predictions** with minimal human intervention, rather than being explicitly programmed for every task.

Think of it like teaching a child:
*   You don't give them a detailed rulebook for every situation.
*   Instead, you show them many examples (data).
*   They learn to recognize patterns and make their own judgments based on those examples.
*   Over time, with more experience and feedback, they get better at it.

Machine learning works in a very similar way, but for computers.

---

Here's a breakdown of the key concepts:

1.  **Learning from Data:**
    *   Instead of being explicitly coded with a set of if-then rules, ML algorithms are fed large amounts of data (training data).
    *   This data contains examples of the task the machine needs to learn, often including both inputs and desired outputs.

2.  **Pattern Recognition:**
    *   The algorithms analyze this data to find statistical relationships, correlations, and hidden patterns.
    *   They "learn" a model that represents these patterns.

3.  **Making Predictions or Decisions:**
    *   Once trained, the ML model can then be used on new, unseen data.
    *   It applies the patterns it learned from the training data to make predictions, classify new inputs, or make decisions.

4.  **Iterative Improvement:**
    *   Machine learning models can continuously improve their performance over time as they are exposed to more data and feedback.

---

### How it Generally Works (Simplified):

1.  **Data Collection:** Gather relevant data (e.g., images, text, numbers, sensor readings).
2.  **Feature Engineering:** Select and transform the most important characteristics (features) from the data that the model will learn from.
3.  **Algorithm Selection:** Choose a suitable machine learning algorithm (e.g., linear regression, decision trees, neural networks).
4.  **Training:** Feed the processed data to the algorithm. The algorithm adjusts its internal parameters to minimize errors and learn the underlying patterns. The output of this phase is a "trained model."
5.  **Evaluation:** Test the trained model on new, unseen data to see how well it performs and generalize.
6.  **Deployment:** Once the model is satisfactory, it can be put into production to make real-time predictions or decisions.

---

### Main Types of Machine Learning:

1.  **Supervised Learning:**
    *   **Concept:** Learning from labeled data, where both input and the correct output are provided. Like learning with a teacher.
    *   **Tasks:**
        *   **Classification:** Predicting a category (e.g., spam or not spam, cat or dog).
        *   **Regression:** Predicting a continuous value (e.g., house prices, temperature).
    *   **Examples:** Spam detection, image recognition, medical diagnosis.

2.  **Unsupervised Learning:**
    *   **Concept:** Learning from unlabeled data, finding hidden structures or patterns without explicit guidance. Like exploring on your own.
    *   **Tasks:**
        *   **Clustering:** Grouping similar data points together (e.g., customer segmentation).
        *   **Dimensionality Reduction:** Simplifying data while retaining important information.
    *   **Examples:** Recommender systems, anomaly detection, topic modeling.

3.  **Reinforcement Learning:**
    *   **Concept:** An agent learns to make a sequence of decisions in an environment by performing actions and receiving rewards or penalties. Like learning through trial and error.
    *   **Tasks:** Finding an optimal strategy to achieve a goal.
    *   **Examples:** Game playing (AlphaGo), robotics, self-driving cars (parts of it).

---

### Why is Machine Learning Important?

*   **Automation:** Automates tasks that are complex or impossible to program manually.
*   **Scalability:** Can process and learn from vast amounts of data that humans cannot.
*   **Discovery:** Uncovers insights and patterns that might be hidden within data.
*   **Adaptability:** Models can adapt and improve over time with new data.
*   **Personalization:** Powers customized experiences (e.g., recommendation engines).

---

### Common Applications:

*   **Recommendation Systems:** (Netflix, Amazon, YouTube)
*   **Spam Filters:** (Email services)
*   **Fraud Detection:** (Banks, credit card companies)
*   **Facial Recognition:** (Phone unlocks, security systems)
*   **Speech Recognition:** (Siri, Alexa, Google Assistant)
*   **Medical Diagnosis:** (Identifying diseases from scans)
*   **Natural Language Processing:** (Translation, sentiment analysis)
*   **Self-Driving Cars:** (Object detection, path planning)

In essence, machine learning is about empowering computers to **learn from experience** (data) and make intelligent decisions, much like humans do, but at an unprecedented scale and speed.

The first call takes the full model time (over 14 seconds in this run, since the response is long). The second call is instant, returning in milliseconds. For a production system, you'd use a more sophisticated cache with expiration times and size limits, but this shows the basic idea.
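Here's one way that more sophisticated cache could look: a minimal sketch of a TTL-and-size-bounded cache built on Python's standard library. The one-hour TTL and 1,000-entry limit are arbitrary choices for illustration, not recommendations.

Code
import time
from collections import OrderedDict

class TTLCache:
    """A minimal cache with per-entry expiration and a maximum size."""

    def __init__(self, ttl_seconds=3600, max_entries=1000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._store = OrderedDict()  # key -> (value, timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, created = entry
        if time.time() - created > self.ttl:
            del self._store[key]  # expired, treat as a miss
            return None
        self._store.move_to_end(key)  # mark as recently used
        return value

    def set(self, key, value):
        self._store[key] = (value, time.time())
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry

Swapping this in for the plain dictionary in CachedAgent keeps memory bounded and prevents stale answers from being served forever.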

Streaming Responses

Sometimes you can't make the agent faster, but you can make it feel faster. Streaming responses show results as they're generated, rather than waiting for the complete answer.

Example: Streaming for Perceived Speed (Claude Sonnet 4.5)

In[7]:
Code
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def stream_response(query):
    """Stream the response as it's generated."""
    # Using Claude Sonnet 4.5 with streaming enabled
    with client.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
    print()  # New line at the end

## Try it out
print("Streaming response:")
stream_response("Explain what an API is in simple terms.")
Out[7]:
Console
Streaming response:
# What is an API?

An **API (Application Programming Interface)** is like a messenger that lets different software programs talk to each other.

## Simple Analogy

Think of a restaurant:

- **You** (the customer) are one program
- **The kitchen** (where food is made) is another program
- **The waiter** is the API

You don't go into the kitchen to cook. Instead, you tell the waiter what you want, the waiter takes your order to the kitchen, and then brings back your food. The waiter is the go-between that makes everything work smoothly.

## Real-World Example

When you use a weather app on your phone:
- The app doesn't store all the weather data itself
- It uses an API to ask a weather service "What's the weather in New York?"
- The API sends back the information
- Your app displays it nicely for you

## Why APIs Matter

They let developers:
- Use features from other services without rebuilding them
- Connect different apps and services together
- Save time and effort

**Bottom line:** APIs are the behind-the-scenes connectors that make modern apps and websites work together.

The total time to generate the response doesn't change, but the user sees words appearing immediately. This makes the agent feel much more responsive. Instead of staring at a blank screen for three seconds, they see the answer forming in real time.
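A useful way to quantify this effect is to measure time to first token separately from total generation time. A minimal sketch, reusing the same client and streaming API as above:

Code
import time

def measure_streaming_latency(query):
    """Compare time-to-first-token with total generation time."""
    start = time.time()
    first_token_at = None

    with client.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    ) as stream:
        for text in stream.text_stream:
            if first_token_at is None:
                first_token_at = time.time()  # moment the first visible output arrives
    total = time.time() - start

    print(f"Time to first token: {first_token_at - start:.2f}s")
    print(f"Total generation time: {total:.2f}s")

measure_streaming_latency("Explain what an API is in simple terms.")

Time to first token is usually a small fraction of the total generation time, and that gap is exactly what streaming exploits: the user starts reading while the rest of the answer is still being produced.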

Parallel Tool Calls

If your agent needs to use multiple tools, doing them sequentially wastes time. Run them in parallel when possible.

Example: Sequential vs Parallel Tool Execution (Claude Sonnet 4.5)

In[8]:
Code
import time
import asyncio

## Simulated tool functions
def get_weather(city):
    """Simulate weather API call."""
    time.sleep(1)  # Simulate network delay
    return f"Weather in {city}: Sunny, 72°F"

def get_news(topic):
    """Simulate news API call."""
    time.sleep(1)  # Simulate network delay
    return f"Latest news on {topic}: [News headlines...]"

def get_stock_price(symbol):
    """Simulate stock API call."""
    time.sleep(1)  # Simulate network delay
    return f"Stock price for {symbol}: $150.25"

## Sequential execution (slow)
def sequential_tools():
    """Call tools one after another."""
    start = time.time()
    
    weather = get_weather("San Francisco")
    news = get_news("technology")
    stock = get_stock_price("AAPL")
    
    elapsed = time.time() - start
    print(f"Sequential execution: {elapsed:.2f}s")
    return weather, news, stock

## Parallel execution (fast)
async def parallel_tools():
    """Call tools simultaneously."""
    start = time.time()
    
    # Run all tools in parallel
    weather_task = asyncio.to_thread(get_weather, "San Francisco")
    news_task = asyncio.to_thread(get_news, "technology")
    stock_task = asyncio.to_thread(get_stock_price, "AAPL")
    
    # Wait for all to complete
    weather, news, stock = await asyncio.gather(
        weather_task, news_task, stock_task
    )
    
    elapsed = time.time() - start
    print(f"Parallel execution: {elapsed:.2f}s")
    return weather, news, stock

## Compare the approaches
print("Testing sequential execution:")
sequential_tools()

print("\nTesting parallel execution:")
await parallel_tools()
Out[8]:
Console
Testing sequential execution:
Sequential execution: 3.01s

Testing parallel execution:
Parallel execution: 1.01s
('Weather in San Francisco: Sunny, 72°F',
 'Latest news on technology: [News headlines...]',
 'Stock price for AAPL: $150.25')


The parallel version is three times faster because all three tools run simultaneously. For an agent that frequently uses multiple tools, this can dramatically improve response time.
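In a real agent loop, the tool calls come from the model itself. When Claude returns several tool_use blocks in one response, you can dispatch them concurrently instead of looping over them one at a time. Here's a sketch, assuming each block's input dictionary maps onto the keyword arguments of a local tool function (the TOOLS registry below is a hypothetical mapping to the simulated tools above):

Code
import asyncio

# Hypothetical registry mapping tool names to the local functions defined above
TOOLS = {
    "get_weather": get_weather,
    "get_news": get_news,
    "get_stock_price": get_stock_price,
}

async def run_tool_calls(tool_use_blocks):
    """Execute all tool_use blocks from a model response concurrently."""
    async def run_one(block):
        func = TOOLS[block.name]
        # Run the blocking tool in a worker thread so the calls overlap
        output = await asyncio.to_thread(func, **block.input)
        return {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": output,
        }

    # gather() preserves order, so results line up with the original blocks
    return await asyncio.gather(*(run_one(b) for b in tool_use_blocks))

The resulting tool_result blocks go back to the model in the next user message exactly as they would in a sequential loop; only the waiting happens in parallel.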

Optimizing Prompt Size

Large prompts take longer to process. Every token in your prompt adds a small amount of latency. If you're including long system messages, conversation history, or retrieved documents, consider trimming them.

Here's a practical approach:

In[9]:
Code
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def trim_conversation_history(messages, max_messages=5):
    """Keep only the most recent messages to reduce prompt size."""
    if len(messages) <= max_messages:
        return messages
    
    # Keep the most recent exchanges
    return messages[-max_messages:]

def summarize_long_context(text, max_length=500):
    """Truncate very long context to essential information."""
    if len(text) <= max_length:
        return text
    
    # Simple truncation (in production, you might use a summary model)
    return text[:max_length] + "... [truncated]"

## Example usage
conversation_history = [
    {"role": "user", "content": "Tell me about Python"},
    {"role": "assistant", "content": "Python is a programming language..."},
    # ... many more messages ...
]

## Trim history before sending
trimmed = trim_conversation_history(conversation_history, max_messages=5)

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=512,
    messages=trimmed
)

By keeping only the five most recent messages, you reduce the prompt size and speed up processing. The agent loses some context, but for many conversations, recent messages are all that matter.
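Trimming by message count is crude: five short questions and five pasted documents consume very different numbers of tokens. A variant that trims against an approximate token budget instead, using a rough four-characters-per-token estimate (both the budget and the estimate are illustrative assumptions):

Code
def trim_to_token_budget(messages, max_tokens=2000, chars_per_token=4):
    """Keep the most recent messages that fit within an approximate token budget."""
    kept = []
    budget = max_tokens * chars_per_token  # work in characters for simplicity

    # Walk backwards from the newest message, keeping what still fits
    for message in reversed(messages):
        cost = len(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost

    return list(reversed(kept))  # restore chronological order

For precise accounting, count tokens with your provider's tokenizer rather than estimating from character counts.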

Precomputing When Possible

If your agent does the same computation repeatedly, precompute it. For example, if your agent frequently needs to know the current date, time zone conversions, or common calculations, compute these once and reuse them.

In[10]:
Code
from datetime import datetime
import time

class PrecomputedAgent:
    """Agent that caches common computations."""
    
    def __init__(self):
        self.precomputed = {
            'current_date': datetime.now().strftime("%Y-%m-%d"),
            'current_time': datetime.now().strftime("%H:%M:%S"),
            'common_conversions': {
                'miles_to_km': 1.60934,
                'pounds_to_kg': 0.453592,
                'fahrenheit_to_celsius': lambda f: (f - 32) * 5/9
            }
        }
        self.last_update = time.time()
    
    def get_current_date(self):
        """Return precomputed date (refresh if stale)."""
        if time.time() - self.last_update > 3600:  # Refresh every hour
            self.precomputed['current_date'] = datetime.now().strftime("%Y-%m-%d")
            self.last_update = time.time()
        
        return self.precomputed['current_date']
    
    def convert_units(self, value, conversion_type):
        """Use precomputed conversion factors."""
        converter = self.precomputed['common_conversions'].get(conversion_type)
        if callable(converter):
            return converter(value)
        return value * converter

agent = PrecomputedAgent()
print(f"Today's date: {agent.get_current_date()}")  # Instant
print(f"10 miles = {agent.convert_units(10, 'miles_to_km'):.2f} km")  # Instant
Out[10]:
Console
Today's date: 2025-12-07
10 miles = 16.09 km

These operations are instant because the values are precomputed. Compare this to calling a model to do unit conversions or date formatting, which would take seconds.
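One way to exploit this inside the agent is to intercept queries that can be answered deterministically before they ever reach the model. A sketch built on the PrecomputedAgent above, where call_model is a placeholder for whatever model call your agent normally makes:

Code
import re

def respond_with_shortcuts(agent, query):
    """Answer trivial, deterministic queries locally; fall back to the model otherwise."""
    q = query.lower()

    if "today's date" in q or "current date" in q:
        return agent.get_current_date()  # precomputed, no API call

    match = re.search(r"convert (\d+(?:\.\d+)?) miles", q)
    if match:
        km = agent.convert_units(float(match.group(1)), 'miles_to_km')
        return f"{match.group(1)} miles = {km:.2f} km"

    # Everything else still goes to the language model
    return call_model(query)  # placeholder for your normal model call

The shortcut paths return in microseconds, and the model is only paid for, and waited on, when it's genuinely needed.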

Measuring the Impact

As you apply these optimizations, measure their impact. Here's a simple benchmarking approach:

In[11]:
Code
import time
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def benchmark_agent(queries, model="claude-sonnet-4-5", max_tokens=1024):
    """Measure average response time for a set of queries."""
    times = []
    
    for query in queries:
        start = time.time()
        response = client.messages.create(
            model=model,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": query}]
        )
        elapsed = time.time() - start
        times.append(elapsed)
    
    avg_time = sum(times) / len(times)
    min_time = min(times)
    max_time = max(times)
    
    return {
        'average': avg_time,
        'min': min_time,
        'max': max_time,
        'total': sum(times)
    }

## Test queries
test_queries = [
    "What is 47 times 83?",
    "What's the capital of France?",
    "How many days until Christmas?",
    "Convert 10 miles to kilometers",
    "What's the weather like today?"
]

## Benchmark before optimization
print("Before optimization:")
results_before = benchmark_agent(test_queries, max_tokens=1024)
print(f"Average: {results_before['average']:.2f}s")
print(f"Total: {results_before['total']:.2f}s")

## Benchmark after optimization (concise responses)
print("\nAfter optimization (concise responses):")
results_after = benchmark_agent(test_queries, max_tokens=200)
print(f"Average: {results_after['average']:.2f}s")
print(f"Total: {results_after['total']:.2f}s")

## Calculate improvement
improvement = (1 - results_after['average'] / results_before['average']) * 100
print(f"\nSpeed improvement: {improvement:.1f}%")
Out[11]:
Console
Before optimization:
Average: 2.92s
Total: 14.58s

After optimization (concise responses):
Average: 3.36s
Total: 16.78s

Speed improvement: -15.1%

This gives you concrete numbers instead of impressions. Notice that in the run above the "optimized" configuration actually came out slower: API latency is noisy, and a single pass over five queries isn't enough to draw a conclusion. Repeat the benchmark several times and compare averages before deciding whether an optimization is worth keeping.
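A small helper that repeats the benchmark and reports the spread makes those comparisons more trustworthy. A sketch building on benchmark_agent from above (the run count of 5 is arbitrary):

Code
import statistics

def repeated_benchmark(queries, runs=5, **kwargs):
    """Run benchmark_agent several times and summarize the variation."""
    averages = []
    for _ in range(runs):
        result = benchmark_agent(queries, **kwargs)
        averages.append(result['average'])

    return {
        'mean_of_averages': statistics.mean(averages),
        'stdev_of_averages': statistics.stdev(averages),
        'runs': runs,
    }

summary = repeated_benchmark(test_queries, runs=5, max_tokens=200)
print(f"Mean: {summary['mean_of_averages']:.2f}s "
      f"(± {summary['stdev_of_averages']:.2f}s over {summary['runs']} runs)")

If the difference between two configurations is smaller than the run-to-run standard deviation, treat it as noise rather than a real improvement.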

Putting It All Together

Let's build a fast agent that combines several of these techniques:

In[12]:
Code
import os
from anthropic import Anthropic
import hashlib
import time

class FastAgent:
    """An optimized agent that prioritizes speed."""
    
    def __init__(self):
        self.client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        self.cache = {}
        self.system_prompt = """You are a helpful assistant. Provide concise, 
        direct answers. Use 1-2 sentences for simple questions."""
    
    def _hash_query(self, query):
        """Create cache key."""
        return hashlib.md5(query.encode()).hexdigest()
    
    def _is_simple_query(self, query):
        """Determine if query is simple enough for fast model."""
        simple_patterns = ["what is", "who is", "when is", "where is"]
        return any(pattern in query.lower() for pattern in simple_patterns)
    
    def respond(self, query):
        """Get fast response using multiple optimization techniques."""
        # 1. Check cache first
        cache_key = self._hash_query(query)
        if cache_key in self.cache:
            return self.cache[cache_key], "cache"
        
        # 2. Choose appropriate model and token limit
        if self._is_simple_query(query):
            max_tokens = 150
            model = "claude-sonnet-4-5"  # Still fast for simple queries
        else:
            max_tokens = 500
            model = "claude-sonnet-4-5"
        
        # 3. Make the call with optimizations
        start = time.time()
        response = self.client.messages.create(
            model=model,
            max_tokens=max_tokens,
            system=self.system_prompt,
            messages=[{"role": "user", "content": query}]
        )
        elapsed = time.time() - start
        
        result = response.content[0].text
        
        # 4. Cache the result
        self.cache[cache_key] = result
        
        return result, f"model ({elapsed:.2f}s)"

## Test the fast agent
agent = FastAgent()

queries = [
    "What is Python?",
    "What is Python?",  # Should hit cache
    "Explain object-oriented programming",
]

for query in queries:
    result, source = agent.respond(query)
    print(f"Q: {query}")
    print(f"A: {result}")
    print(f"Source: {source}\n")
Out[12]:
Console
Q: What is Python?
A: Python is a high-level, interpreted programming language known for its clear syntax and readability. It's widely used for web development, data science, artificial intelligence, automation, and general-purpose programming.
Source: model (3.05s)

Q: What is Python?
A: Python is a high-level, interpreted programming language known for its clear syntax and readability. It's widely used for web development, data science, artificial intelligence, automation, and general-purpose programming.
Source: cache

Q: Explain object-oriented programming
A: **Object-Oriented Programming (OOP)** is a programming paradigm that organizes code around "objects" – data structures containing both data (attributes) and functions (methods) that operate on that data. The four core principles are **encapsulation** (bundling data with methods), **inheritance** (creating new classes from existing ones), **polymorphism** (objects taking multiple forms), and **abstraction** (hiding complex implementation details).
Source: model (3.42s)

This agent combines caching, concise prompts, and smart token limits to deliver fast responses. The first query takes about three seconds, but the cached repeat is instant. Simple queries use fewer tokens, saving time and money.

When Speed Isn't Everything

Before we wrap up, a word of caution: don't optimize prematurely. Speed is important, but accuracy matters more. If your agent gives wrong answers quickly, that's worse than giving right answers slowly.

Start by building a correct agent. Then measure where the bottlenecks are. Apply optimizations strategically, and always verify that accuracy doesn't suffer. Sometimes the best answer requires the most capable model and a longer response time. That's okay.

The goal isn't to make every response instant. It's to make the agent as fast as possible while maintaining the quality users expect.

Glossary

Caching: Storing the results of expensive operations so they can be reused without recomputation. For agents, this typically means saving model responses for repeated queries.

Latency: The time delay between when a user makes a request and when they receive a response. Lower latency means a faster, more responsive agent.

Max Tokens: A parameter that limits how many tokens (words or word pieces) a language model can generate in a single response. Lower values produce shorter, faster responses.

Parallel Execution: Running multiple operations simultaneously rather than one after another. This can significantly reduce total execution time when operations don't depend on each other.

Streaming: Sending response data to the user incrementally as it's generated, rather than waiting for the complete response. This improves perceived speed even if total generation time is unchanged.

Token: The basic unit of text that language models process. A token is roughly equivalent to a word or word piece. Both input and output are measured in tokens.

