Learn how to dramatically reduce AI agent API costs without sacrificing capability. Covers model selection, caching, batching, prompt optimization, and budget controls with practical Python examples.

This article is part of the free-to-read AI Agent Handbook
Managing and Reducing Costs
Your assistant works beautifully. It answers questions, uses tools, remembers context, and handles complex tasks. But there's a problem you might not have noticed yet: every interaction costs money.
Each time your agent calls Claude Sonnet 4.5, GPT-5, or Gemini 2.5, you're charged based on the number of tokens processed. Input tokens (your prompt) and output tokens (the response) both count. Run your agent at scale, and those costs add up fast. A single user might generate $0.50 in API costs per day. A thousand users? That's $500 daily, or $15,000 per month.
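To make the scaling concrete, here's a quick back-of-the-envelope projection using the illustrative numbers above (the $0.50 per user per day figure is an example, not a measured average):

```python
def project_monthly_cost(daily_cost_per_user: float, num_users: int, days: int = 30) -> float:
    """Project monthly API spend from an average per-user daily cost."""
    return daily_cost_per_user * num_users * days

# $0.50/day per user across 1,000 users -> $15,000/month
print(f"${project_monthly_cost(0.50, 1_000):,.2f}")
```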
The good news is that you can dramatically reduce costs without sacrificing much capability. This chapter shows you how to build an agent that's both powerful and economical.
Understanding the Cost Structure
Before we optimize, let's understand what you're paying for. Most language model APIs charge per token, with different rates for input and output.
Here's a simplified example of typical pricing (November 2025):
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|
| Claude Sonnet 4.5 | $3.00 | $15.00 | Complex reasoning, agents |
| GPT-5 | $2.50 | $10.00 | General-purpose tasks |
| Gemini 2.5 Flash | $0.40 | $1.20 | Simple queries, high volume |
| Gemini 2.5 Pro | $1.25 | $5.00 | Multimodal, large context |
Notice that output tokens cost more than input tokens. This makes sense because generating text requires more computation than processing it. It also means that verbose responses are expensive.
Let's calculate the cost of a typical interaction:
```python
def calculate_interaction_cost(input_tokens, output_tokens, model="claude-sonnet-4.5"):
    """Calculate the cost of a single model interaction."""
    # Pricing per million tokens (November 2025 rates)
    pricing = {
        "claude-sonnet-4.5": {"input": 3.00, "output": 15.00},
        "gpt-5": {"input": 2.50, "output": 10.00},
        "gemini-2.5-flash": {"input": 0.40, "output": 1.20},
        "gemini-2.5-pro": {"input": 1.25, "output": 5.00}
    }

    rates = pricing[model]

    # Calculate cost (rates are per million tokens)
    input_cost = (input_tokens / 1_000_000) * rates["input"]
    output_cost = (output_tokens / 1_000_000) * rates["output"]
    total_cost = input_cost + output_cost

    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": total_cost
    }

# Example: A conversation with context
input_tokens = 1500  # System prompt + conversation history + query
output_tokens = 500  # Agent's response

cost = calculate_interaction_cost(input_tokens, output_tokens, "claude-sonnet-4.5")
print(f"Input cost: ${cost['input_cost']:.6f}")
print(f"Output cost: ${cost['output_cost']:.6f}")
print(f"Total cost: ${cost['total_cost']:.6f}")
print(f"\nCost per 1000 interactions: ${cost['total_cost'] * 1000:.2f}")
```

Output:

```
Input cost: $0.004500
Output cost: $0.007500
Total cost: $0.012000

Cost per 1000 interactions: $12.00
```

A single interaction costs about one cent. That seems small, but multiply it by thousands of users and millions of interactions, and you're looking at serious money.
Tracking Costs in Your Agent
Before you can optimize, you need visibility into what you're spending. Let's add cost tracking to our assistant:
```python
from anthropic import Anthropic
from datetime import datetime

class CostTrackingAgent:
    """Agent that tracks API costs for monitoring and optimization."""

    def __init__(self):
        self.client = Anthropic(api_key="YOUR_API_KEY")
        self.cost_log = []

        # Pricing per million tokens
        self.pricing = {
            "claude-sonnet-4.5": {"input": 3.00, "output": 15.00}
        }

    def _calculate_cost(self, usage, model):
        """Calculate cost from token usage."""
        rates = self.pricing[model]
        input_cost = (usage.input_tokens / 1_000_000) * rates["input"]
        output_cost = (usage.output_tokens / 1_000_000) * rates["output"]
        return input_cost + output_cost

    def respond(self, query):
        """Generate response and track costs."""
        model = "claude-sonnet-4.5"

        response = self.client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": query}]
        )

        # Calculate and log cost
        cost = self._calculate_cost(response.usage, model)

        self.cost_log.append({
            "timestamp": datetime.now(),
            "query": query[:50] + "..." if len(query) > 50 else query,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "cost": cost,
            "model": model
        })

        return response.content[0].text

    def get_cost_summary(self):
        """Get summary of costs."""
        if not self.cost_log:
            return "No interactions yet."

        total_cost = sum(entry["cost"] for entry in self.cost_log)
        total_tokens = sum(
            entry["input_tokens"] + entry["output_tokens"]
            for entry in self.cost_log
        )

        return {
            "total_interactions": len(self.cost_log),
            "total_cost": total_cost,
            "total_tokens": total_tokens,
            "average_cost_per_interaction": total_cost / len(self.cost_log),
            "most_expensive": max(self.cost_log, key=lambda x: x["cost"])
        }

# Test the tracking
agent = CostTrackingAgent()

agent.respond("What is Python?")
agent.respond("Explain machine learning in simple terms.")
agent.respond("How do neural networks work?")

summary = agent.get_cost_summary()
print(f"Total interactions: {summary['total_interactions']}")
print(f"Total cost: ${summary['total_cost']:.4f}")
print(f"Average cost: ${summary['average_cost_per_interaction']:.4f}")
print(f"\nMost expensive query: {summary['most_expensive']['query']}")
print(f"Cost: ${summary['most_expensive']['cost']:.4f}")
```

This gives you visibility into where your money goes. You might discover that certain queries are far more expensive than others, or that a small percentage of interactions account for most of your costs.
Strategy 1: Use the Cheapest Model That Works
The most effective cost reduction strategy is simple: use cheaper models when possible. Not every task needs your most powerful model.
Think of it like choosing transportation. You wouldn't hire a helicopter to go to the grocery store. A car works fine. Similarly, you don't need Claude Sonnet 4.5 for every query.
Example: Cost-Aware Model Selection (Multi-Provider)
```python
from anthropic import Anthropic
from openai import OpenAI
from google import generativeai as genai

class CostOptimizedAgent:
    """Agent that chooses the most cost-effective model for each task."""

    def __init__(self):
        # Initialize clients for different providers
        self.anthropic = Anthropic(api_key="YOUR_ANTHROPIC_KEY")
        self.openai = OpenAI(api_key="YOUR_OPENAI_KEY")
        genai.configure(api_key="YOUR_GOOGLE_KEY")
        self.gemini = genai.GenerativeModel('gemini-2.5-flash')

    def _classify_task_complexity(self, query):
        """Determine what level of model capability is needed."""
        # High complexity: needs reasoning, tool use, or complex understanding
        high_complexity_indicators = [
            "explain why", "analyze", "compare and contrast",
            "step by step", "reasoning", "pros and cons",
            "evaluate", "critique"
        ]

        # Medium complexity: straightforward questions or tasks
        medium_complexity_indicators = [
            "how to", "what is", "describe", "summarize"
        ]

        query_lower = query.lower()

        if any(ind in query_lower for ind in high_complexity_indicators):
            return "high"
        elif any(ind in query_lower for ind in medium_complexity_indicators):
            return "medium"
        else:
            return "low"

    def respond(self, query):
        """Route to the most cost-effective model."""
        complexity = self._classify_task_complexity(query)

        if complexity == "high":
            # Use Claude Sonnet 4.5 for complex reasoning
            # Cost: ~$0.012 per interaction
            response = self.anthropic.messages.create(
                model="claude-sonnet-4.5",
                max_tokens=1024,
                messages=[{"role": "user", "content": query}]
            )
            return response.content[0].text, "claude-sonnet-4.5", "high"

        elif complexity == "medium":
            # Use GPT-5 for general tasks
            # Cost: ~$0.008 per interaction (33% savings)
            response = self.openai.chat.completions.create(
                model="gpt-5",
                max_tokens=512,
                messages=[{"role": "user", "content": query}]
            )
            return response.choices[0].message.content, "gpt-5", "medium"

        else:
            # Use Gemini 2.5 Flash for simple queries
            # Cost: ~$0.001 per interaction (92% savings!)
            response = self.gemini.generate_content(query)
            return response.text, "gemini-2.5-flash", "low"

# Test with different complexity levels
agent = CostOptimizedAgent()

queries = [
    ("What's the capital of France?", "low"),
    ("What is machine learning?", "medium"),
    ("Analyze the pros and cons of different database architectures", "high")
]

for query, expected in queries:
    result, model, complexity = agent.respond(query)
    print(f"Query: {query}")
    print(f"Complexity: {complexity} (expected: {expected})")
    print(f"Model: {model}")
    print(f"Response: {result[:100]}...")
    print()
```

By routing simple queries to Gemini 2.5 Flash, you can save 90% or more on those interactions. If 50% of your queries are simple, you've just cut your total costs by 45%.
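The arithmetic behind that claim is worth making explicit. Here's a minimal sketch of the blended cost, using the rough per-interaction estimates from the code comments above ($0.012 for Claude Sonnet 4.5, $0.001 for Gemini 2.5 Flash):

```python
def routing_savings(simple_fraction: float, expensive_cost: float, cheap_cost: float) -> float:
    """Fraction of total spend saved by routing simple queries to a cheaper model."""
    baseline = expensive_cost  # every query goes to the expensive model
    blended = (1 - simple_fraction) * expensive_cost + simple_fraction * cheap_cost
    return (baseline - blended) / baseline

# Half the traffic is simple: roughly 46% of total spend saved
print(f"{routing_savings(0.5, 0.012, 0.001):.0%}")
```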
Strategy 2: Reduce Output Length
Remember that output tokens cost more than input tokens. A response with 1000 tokens costs twice as much as one with 500 tokens. If your agent is verbose, you're wasting money.
Example: Concise Responses (Claude Sonnet 4.5)
```python
from anthropic import Anthropic

client = Anthropic(api_key="YOUR_API_KEY")

def compare_response_costs(query):
    """Compare costs of verbose vs concise responses."""

    # Verbose response (default behavior)
    verbose_response = client.messages.create(
        model="claude-sonnet-4.5",
        max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    )

    # Concise response (optimized)
    concise_system = """You are a helpful assistant. Provide concise, direct answers.
    Use 1-2 sentences for simple questions. Avoid unnecessary elaboration."""

    concise_response = client.messages.create(
        model="claude-sonnet-4.5",
        max_tokens=200,  # Hard limit
        system=concise_system,
        messages=[{"role": "user", "content": query}]
    )

    # Calculate costs
    def calc_cost(usage):
        input_cost = (usage.input_tokens / 1_000_000) * 3.00
        output_cost = (usage.output_tokens / 1_000_000) * 15.00
        return input_cost + output_cost

    verbose_cost = calc_cost(verbose_response.usage)
    concise_cost = calc_cost(concise_response.usage)
    savings = ((verbose_cost - concise_cost) / verbose_cost) * 100

    return {
        "verbose": {
            "response": verbose_response.content[0].text,
            "tokens": verbose_response.usage.output_tokens,
            "cost": verbose_cost
        },
        "concise": {
            "response": concise_response.content[0].text,
            "tokens": concise_response.usage.output_tokens,
            "cost": concise_cost
        },
        "savings_percent": savings
    }

# Test with a simple query
result = compare_response_costs("What is Python?")

print("Verbose response:")
print(f"Tokens: {result['verbose']['tokens']}")
print(f"Cost: ${result['verbose']['cost']:.6f}")
print(f"Response: {result['verbose']['response'][:150]}...")
print()

print("Concise response:")
print(f"Tokens: {result['concise']['tokens']}")
print(f"Cost: ${result['concise']['cost']:.6f}")
print(f"Response: {result['concise']['response']}")
print()

print(f"Cost savings: {result['savings_percent']:.1f}%")
```

The concise version might save 60-70% on output tokens for simple queries. Across thousands of interactions, that's substantial savings.
Strategy 3: Cache Aggressively
If users ask the same questions repeatedly, why pay to generate the answer every time? Cache responses and serve them instantly for free.
Example: Multi-Level Caching (Gemini 2.5 Flash)
```python
from google import generativeai as genai
import hashlib

genai.configure(api_key="YOUR_GOOGLE_API_KEY")

class CachingAgent:
    """Agent with intelligent caching to minimize API calls."""

    def __init__(self):
        self.model = genai.GenerativeModel('gemini-2.5-flash')

        # Short-term cache: exact query matches
        self.exact_cache = {}

        # Long-term cache: common queries that rarely change
        self.persistent_cache = {
            "what is python": "Python is a high-level programming language...",
            "what is machine learning": "Machine learning is a subset of AI...",
            # Preload common queries
        }

        # Cache metadata
        self.cache_stats = {
            "hits": 0,
            "misses": 0,
            "api_calls": 0,
            "cost_saved": 0.0
        }

    def _hash_query(self, query):
        """Create cache key from query."""
        return hashlib.md5(query.lower().strip().encode()).hexdigest()

    def _estimate_cost_saved(self, query):
        """Estimate cost saved by cache hit."""
        # Rough estimate: 100 input tokens + 200 output tokens
        input_cost = (100 / 1_000_000) * 0.40
        output_cost = (200 / 1_000_000) * 1.20
        return input_cost + output_cost

    def respond(self, query):
        """Get response, using cache when possible."""
        cache_key = self._hash_query(query)
        query_normalized = query.lower().strip()

        # Check persistent cache first (common queries)
        if query_normalized in self.persistent_cache:
            self.cache_stats["hits"] += 1
            self.cache_stats["cost_saved"] += self._estimate_cost_saved(query)
            return self.persistent_cache[query_normalized], "persistent_cache"

        # Check exact cache (recent queries)
        if cache_key in self.exact_cache:
            self.cache_stats["hits"] += 1
            self.cache_stats["cost_saved"] += self._estimate_cost_saved(query)
            return self.exact_cache[cache_key], "exact_cache"

        # Cache miss: call the model
        self.cache_stats["misses"] += 1
        self.cache_stats["api_calls"] += 1

        response = self.model.generate_content(query)
        result = response.text

        # Store in exact cache
        self.exact_cache[cache_key] = result

        # If cache is getting large, prune old entries
        if len(self.exact_cache) > 1000:
            # Keep only the most recent 500 (dicts preserve insertion order)
            keys_to_remove = list(self.exact_cache.keys())[:-500]
            for key in keys_to_remove:
                del self.exact_cache[key]

        return result, "api_call"

    def get_cache_stats(self):
        """Get caching performance metrics."""
        total_requests = self.cache_stats["hits"] + self.cache_stats["misses"]
        hit_rate = (self.cache_stats["hits"] / total_requests * 100) if total_requests > 0 else 0

        return {
            "total_requests": total_requests,
            "cache_hits": self.cache_stats["hits"],
            "cache_misses": self.cache_stats["misses"],
            "hit_rate": hit_rate,
            "api_calls": self.cache_stats["api_calls"],
            "estimated_cost_saved": self.cache_stats["cost_saved"]
        }

# Test the caching agent
agent = CachingAgent()

# Simulate user queries (with some repetition)
queries = [
    "What is Python?",
    "What is machine learning?",
    "How do I learn programming?",
    "What is Python?",  # Duplicate
    "What is machine learning?",  # Duplicate
    "How do I learn programming?",  # Duplicate
    "What are data structures?",
    "What is Python?",  # Duplicate again
]

for query in queries:
    result, source = agent.respond(query)
    print(f"Q: {query}")
    print(f"Source: {source}")
    print()

# Show cache performance
stats = agent.get_cache_stats()
print("Cache Performance:")
print(f"Total requests: {stats['total_requests']}")
print(f"Cache hits: {stats['cache_hits']} ({stats['hit_rate']:.1f}%)")
print(f"API calls: {stats['api_calls']}")
print(f"Estimated cost saved: ${stats['estimated_cost_saved']:.4f}")
```

With a 50% cache hit rate, you've cut your API costs in half. For high-traffic applications, caching is one of the most effective cost reduction strategies.
Strategy 4: Batch Similar Requests
If you need to process multiple similar queries, batch them into a single API call. This reduces overhead and can be more cost-effective.
Example: Batch Processing (GPT-5)
```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

def process_individually(queries):
    """Process each query separately (expensive)."""
    results = []
    total_cost = 0.0

    for query in queries:
        response = client.chat.completions.create(
            model="gpt-5",
            max_tokens=100,
            messages=[{"role": "user", "content": query}]
        )

        # Estimate cost (rough approximation)
        tokens = response.usage.total_tokens
        cost = (tokens / 1_000_000) * 6.25  # Average of input/output rates
        total_cost += cost

        results.append(response.choices[0].message.content)

    return results, total_cost

def process_batched(queries):
    """Process all queries in a single call (cheaper)."""
    # Combine queries into a single prompt
    batch_prompt = "Answer each of the following questions concisely:\n\n"
    for i, query in enumerate(queries, 1):
        batch_prompt += f"{i}. {query}\n"

    response = client.chat.completions.create(
        model="gpt-5",
        max_tokens=500,
        messages=[{"role": "user", "content": batch_prompt}]
    )

    # Estimate cost
    tokens = response.usage.total_tokens
    cost = (tokens / 1_000_000) * 6.25

    # Parse the batched response
    result = response.choices[0].message.content

    return result, cost

# Test both approaches
queries = [
    "What is Python?",
    "What is JavaScript?",
    "What is Ruby?",
    "What is Go?",
    "What is Rust?"
]

print("Individual processing:")
results_individual, cost_individual = process_individually(queries)
print(f"Cost: ${cost_individual:.4f}")
print()

print("Batched processing:")
result_batched, cost_batched = process_batched(queries)
print(f"Cost: ${cost_batched:.4f}")
print(f"Savings: ${cost_individual - cost_batched:.4f} ({((cost_individual - cost_batched) / cost_individual * 100):.1f}%)")
print()
print("Batched response:")
print(result_batched)
```

Batching can save 30-50% on costs for similar queries because you eliminate the overhead of multiple API calls and can share context more efficiently.
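One way to see where those savings come from: every separate call repeats the fixed prompt overhead (system prompt, instructions, formatting), while a batch pays it once. A rough sketch with illustrative token counts:

```python
def input_token_comparison(n_queries: int, overhead_tokens: int, query_tokens: int) -> tuple[int, int]:
    """Input tokens for N separate calls vs. one batched call.

    Assumes each call carries `overhead_tokens` of fixed prompt overhead
    on top of the query itself (illustrative numbers, not measured).
    """
    individual = n_queries * (overhead_tokens + query_tokens)
    batched = overhead_tokens + n_queries * query_tokens
    return individual, batched

ind, bat = input_token_comparison(5, 200, 20)
print(f"individual: {ind} tokens, batched: {bat} tokens ({(ind - bat) / ind:.0%} fewer)")
```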
Strategy 5: Trim Conversation History
Long conversation histories increase input token costs. If your agent includes the last 20 messages in every request, you're paying to process all that context repeatedly.
Example: Smart History Trimming (Claude Sonnet 4.5)
```python
from anthropic import Anthropic

client = Anthropic(api_key="YOUR_API_KEY")

class HistoryOptimizedAgent:
    """Agent that manages conversation history efficiently."""

    def __init__(self):
        self.conversation_history = []
        self.max_history_messages = 6  # Keep last 3 exchanges

    def _trim_history(self):
        """Keep only recent messages to reduce token costs."""
        if len(self.conversation_history) > self.max_history_messages:
            # Keep only the most recent messages
            self.conversation_history = self.conversation_history[-self.max_history_messages:]

    def _estimate_tokens(self, text):
        """Rough token estimate (4 chars ≈ 1 token)."""
        return len(text) // 4

    def respond(self, user_message):
        """Generate response with optimized history."""
        # Add user message to history
        self.conversation_history.append({
            "role": "user",
            "content": user_message
        })

        # Trim history before sending
        self._trim_history()

        # Estimate input size before the call (actual usage comes from the response)
        total_input_chars = sum(
            len(msg["content"]) for msg in self.conversation_history
        )
        estimated_input_tokens = total_input_chars // 4

        # Make the call
        response = client.messages.create(
            model="claude-sonnet-4.5",
            max_tokens=512,
            messages=self.conversation_history
        )

        # Add assistant response to history
        assistant_message = response.content[0].text
        self.conversation_history.append({
            "role": "assistant",
            "content": assistant_message
        })

        # Calculate cost
        input_cost = (response.usage.input_tokens / 1_000_000) * 3.00
        output_cost = (response.usage.output_tokens / 1_000_000) * 15.00
        total_cost = input_cost + output_cost

        return {
            "response": assistant_message,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "cost": total_cost,
            "history_length": len(self.conversation_history)
        }

# Test with a multi-turn conversation
agent = HistoryOptimizedAgent()

queries = [
    "What is Python?",
    "What are its main features?",
    "How does it compare to Java?",
    "What about performance?",
    "Should I learn it?",
    "What resources do you recommend?",
    "How long will it take?",
    "What projects should I build?"
]

total_cost = 0.0

for query in queries:
    result = agent.respond(query)
    total_cost += result["cost"]

    print(f"Q: {query}")
    print(f"Input tokens: {result['input_tokens']}")
    print(f"History length: {result['history_length']} messages")
    print(f"Cost: ${result['cost']:.6f}")
    print()

print(f"Total conversation cost: ${total_cost:.4f}")
print("\nNote: Without trimming, costs would be ~40% higher")
```

By keeping only the last 6 messages (3 exchanges), you prevent the input token count from growing unbounded. This is especially important for long conversations.
Strategy 6: Use Prompt Compression
For agents that need to include large amounts of context (like retrieved documents or long system prompts), consider compressing that information.
Example: Context Summarization (Claude Sonnet 4.5)
```python
from anthropic import Anthropic

client = Anthropic(api_key="YOUR_API_KEY")

def summarize_context(long_context, max_length=500):
    """Compress long context into a shorter summary."""
    if len(long_context) <= max_length:
        return long_context

    # Use the model to create a concise summary
    response = client.messages.create(
        model="claude-sonnet-4.5",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Summarize this in {max_length//4} words or less:\n\n{long_context}"
        }]
    )

    return response.content[0].text

def respond_with_context(query, long_context):
    """Answer query using compressed context."""
    # Compress the context first
    compressed_context = summarize_context(long_context, max_length=500)

    # Use compressed context in the actual query
    full_prompt = f"Context: {compressed_context}\n\nQuestion: {query}"

    response = client.messages.create(
        model="claude-sonnet-4.5",
        max_tokens=512,
        messages=[{"role": "user", "content": full_prompt}]
    )

    return response.content[0].text

# Example: Long document that needs to be included
long_document = """
[Imagine a 5000-word document about machine learning here...]
Machine learning is a field of artificial intelligence that focuses on...
[... many more paragraphs ...]
"""

# Without compression: ~5000 tokens input
# With compression: ~500 tokens input
# Savings: ~90% on input tokens for this context

query = "What are the key concepts in this document?"
answer = respond_with_context(query, long_document)
print(answer)
```

You pay for the summarization call, but if you use that compressed context multiple times, you save money overall. This is especially valuable for retrieval-augmented generation (RAG) systems where you're including retrieved documents in every query.
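Whether compression pays off is a simple break-even question: one summarization call versus the input tokens saved on each reuse. A minimal sketch using the token counts from this example and the Claude Sonnet 4.5 rates:

```python
def breakeven_reuses(full_tokens: int, compressed_tokens: int, summary_output_tokens: int,
                     input_rate: float = 3.00, output_rate: float = 15.00) -> float:
    """Reuses needed before compressing a context pays for itself.

    The one-time summarization call reads the full context and writes the
    summary; every later query then sends the compressed context instead.
    """
    summarize_cost = (full_tokens / 1e6) * input_rate + (summary_output_tokens / 1e6) * output_rate
    saving_per_reuse = ((full_tokens - compressed_tokens) / 1e6) * input_rate
    return summarize_cost / saving_per_reuse

# ~5000-token document compressed to ~500 tokens, with a ~200-token summary
print(f"break-even after ~{breakeven_reuses(5000, 500, 200):.1f} reuses")
```

Here the compression pays for itself by the second reuse; the more often the same context is included, the bigger the win.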
Strategy 7: Set Budget Limits
Prevent runaway costs by implementing budget controls in your agent.
Example: Budget-Aware Agent (Claude Sonnet 4.5)
```python
from anthropic import Anthropic
from datetime import datetime

class BudgetControlledAgent:
    """Agent with built-in budget limits and alerts."""

    def __init__(self, daily_budget=10.0):
        self.client = Anthropic(api_key="YOUR_API_KEY")
        self.daily_budget = daily_budget
        self.current_day = datetime.now().date()
        self.daily_spending = 0.0
        self.total_spending = 0.0

    def _reset_daily_budget_if_needed(self):
        """Reset daily spending counter at midnight."""
        today = datetime.now().date()
        if today != self.current_day:
            self.current_day = today
            self.daily_spending = 0.0

    def _calculate_cost(self, usage):
        """Calculate cost from token usage."""
        input_cost = (usage.input_tokens / 1_000_000) * 3.00
        output_cost = (usage.output_tokens / 1_000_000) * 15.00
        return input_cost + output_cost

    def respond(self, query):
        """Generate response if within budget."""
        self._reset_daily_budget_if_needed()

        # Check if we're over budget
        if self.daily_spending >= self.daily_budget:
            return {
                "response": None,
                "error": f"Daily budget of ${self.daily_budget:.2f} exceeded. Current spending: ${self.daily_spending:.2f}",
                "budget_remaining": 0.0
            }

        # Make the call
        response = self.client.messages.create(
            model="claude-sonnet-4.5",
            max_tokens=512,
            messages=[{"role": "user", "content": query}]
        )

        # Track spending
        cost = self._calculate_cost(response.usage)
        self.daily_spending += cost
        self.total_spending += cost

        # Check if approaching budget limit
        budget_remaining = self.daily_budget - self.daily_spending
        warning = None

        if budget_remaining < self.daily_budget * 0.2:  # Less than 20% remaining
            warning = f"Warning: Only ${budget_remaining:.2f} remaining in daily budget"

        return {
            "response": response.content[0].text,
            "cost": cost,
            "daily_spending": self.daily_spending,
            "budget_remaining": budget_remaining,
            "warning": warning
        }

    def get_spending_summary(self):
        """Get spending statistics."""
        return {
            "daily_spending": self.daily_spending,
            "daily_budget": self.daily_budget,
            "budget_used_percent": (self.daily_spending / self.daily_budget) * 100,
            "total_spending": self.total_spending
        }

# Test budget controls
agent = BudgetControlledAgent(daily_budget=0.10)  # $0.10 daily limit

queries = [
    "What is Python?",
    "Explain machine learning.",
    "How do neural networks work?",
    "What is deep learning?",
    "Describe reinforcement learning.",
    "What are transformers?",
    "Explain attention mechanisms.",
    "What is GPT?",
    "How does BERT work?",
    "What is transfer learning?"
]

for i, query in enumerate(queries, 1):
    print(f"\n--- Query {i} ---")
    result = agent.respond(query)

    if result["response"]:
        print(f"Response: {result['response'][:100]}...")
        print(f"Cost: ${result['cost']:.6f}")
        print(f"Daily spending: ${result['daily_spending']:.4f}")

        if result["warning"]:
            print(f"⚠️ {result['warning']}")
    else:
        print(f"❌ {result['error']}")
        break

summary = agent.get_spending_summary()
print("\n=== Spending Summary ===")
print(f"Daily budget: ${summary['daily_budget']:.2f}")
print(f"Daily spending: ${summary['daily_spending']:.4f}")
print(f"Budget used: {summary['budget_used_percent']:.1f}%")
```

Budget controls prevent unexpected bills and force you to think about cost optimization. If you hit your budget limit regularly, it's a signal that you need to optimize your agent's efficiency.
Measuring Cost Optimization Impact
As you apply these strategies, track the results. Here's a comprehensive cost analysis tool:
```python
from anthropic import Anthropic
import statistics

class CostAnalyzer:
    """Analyze and compare costs across different optimization strategies."""

    def __init__(self):
        self.client = Anthropic(api_key="YOUR_API_KEY")
        self.baseline_costs = []
        self.optimized_costs = []

    def _calculate_cost(self, usage):
        """Calculate cost from token usage."""
        input_cost = (usage.input_tokens / 1_000_000) * 3.00
        output_cost = (usage.output_tokens / 1_000_000) * 15.00
        return input_cost + output_cost

    def run_baseline(self, queries):
        """Run queries without optimization."""
        print("Running baseline (no optimization)...")

        for query in queries:
            response = self.client.messages.create(
                model="claude-sonnet-4.5",
                max_tokens=1024,  # No limits
                messages=[{"role": "user", "content": query}]
            )

            cost = self._calculate_cost(response.usage)
            self.baseline_costs.append(cost)

    def run_optimized(self, queries):
        """Run queries with optimization."""
        print("Running optimized version...")

        system_prompt = """You are a helpful assistant. Provide concise, direct answers.
        Use 1-2 sentences for simple questions."""

        for query in queries:
            response = self.client.messages.create(
                model="claude-sonnet-4.5",
                max_tokens=200,  # Limited
                system=system_prompt,
                messages=[{"role": "user", "content": query}]
            )

            cost = self._calculate_cost(response.usage)
            self.optimized_costs.append(cost)

    def generate_report(self):
        """Generate cost comparison report."""
        baseline_total = sum(self.baseline_costs)
        optimized_total = sum(self.optimized_costs)
        savings = baseline_total - optimized_total
        savings_percent = (savings / baseline_total) * 100

        baseline_avg = statistics.mean(self.baseline_costs)
        optimized_avg = statistics.mean(self.optimized_costs)

        report = f"""
=== Cost Optimization Report ===

Baseline (No Optimization):
  Total cost: ${baseline_total:.4f}
  Average per query: ${baseline_avg:.6f}
  Number of queries: {len(self.baseline_costs)}

Optimized:
  Total cost: ${optimized_total:.4f}
  Average per query: ${optimized_avg:.6f}
  Number of queries: {len(self.optimized_costs)}

Savings:
  Total saved: ${savings:.4f}
  Percentage saved: {savings_percent:.1f}%

Projected Monthly Savings (at 10,000 queries/month):
  ${savings * (10000 / len(self.baseline_costs)):.2f}
"""
        return report

# Run the analysis
analyzer = CostAnalyzer()

test_queries = [
    "What is Python?",
    "What is JavaScript?",
    "What is machine learning?",
    "What are neural networks?",
    "What is deep learning?",
    "What is natural language processing?",
    "What are transformers?",
    "What is computer vision?",
    "What is reinforcement learning?",
    "What is data science?"
]

analyzer.run_baseline(test_queries)
analyzer.run_optimized(test_queries)

print(analyzer.generate_report())
```

This gives you concrete numbers showing the impact of your optimizations. You might find that simple changes save 40-60% on costs.
Balancing Cost and Quality
Here's the key insight: cost optimization is about trade-offs. You can always make your agent cheaper by using worse models or shorter responses, but that might hurt quality.
The goal isn't to minimize cost at all costs. It's to maximize value: the best quality you can get for the money you're willing to spend.
Some guidelines:
- Use the best model for critical tasks. If accuracy matters more than cost (medical advice, financial decisions, legal questions), don't skimp on model quality.
- Optimize aggressively for high-volume, low-stakes queries. If you're answering "What's the weather?" thousands of times per day, use the cheapest model that works.
- Monitor quality metrics alongside cost metrics. Track both how much you're spending and how well your agent performs. If cost optimizations hurt user satisfaction, they're not worth it.
- Test before deploying. When you change models or prompts to save money, verify that quality doesn't suffer. Run your evaluation suite (from Chapter 11) to catch regressions.
- Be willing to spend more when it matters. If a user's query is complex or important, it's okay to use your most capable (and expensive) model. The cost of a bad answer is often higher than the cost of the API call.
Putting It All Together
Let's build a production-ready agent that implements multiple cost optimization strategies:
from anthropic import Anthropic
import google.generativeai as genai
import hashlib
import os
from datetime import datetime

class ProductionCostOptimizedAgent:
    """Production agent with comprehensive cost optimization."""

    def __init__(self, daily_budget=50.0):
        # Initialize clients; read API keys from environment variables
        # (set ANTHROPIC_API_KEY and GOOGLE_API_KEY) rather than
        # hardcoding them in source
        self.anthropic = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
        genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
        self.gemini = genai.GenerativeModel("gemini-2.5-flash")

        # Budget tracking
        self.daily_budget = daily_budget
        self.daily_spending = 0.0
        self.current_day = datetime.now().date()

        # Caching
        self.cache = {}
        self.cache_hits = 0
        self.cache_misses = 0

        # Conversation history (trimmed; reserved for multi-turn use,
        # not exercised in this single-turn example)
        self.history = []
        self.max_history = 6

    def _reset_if_new_day(self):
        """Reset daily counters when the date rolls over."""
        today = datetime.now().date()
        if today != self.current_day:
            self.current_day = today
            self.daily_spending = 0.0

    def _hash_query(self, query):
        """Create a cache key from the normalized query text."""
        return hashlib.md5(query.lower().strip().encode()).hexdigest()

    def _classify_complexity(self, query):
        """Determine query complexity with simple keyword heuristics."""
        high_complexity = ["explain why", "analyze", "compare", "evaluate"]
        simple_patterns = ["what is", "who is", "when is"]

        query_lower = query.lower()

        if any(p in query_lower for p in high_complexity):
            return "high"
        elif any(p in query_lower for p in simple_patterns):
            return "low"
        else:
            return "medium"

    def respond(self, query):
        """Generate an optimized response."""
        self._reset_if_new_day()

        # Check the cache first: cache hits are free, so they should
        # be served even when the budget is exhausted
        cache_key = self._hash_query(query)
        if cache_key in self.cache:
            self.cache_hits += 1
            return {
                "response": self.cache[cache_key],
                "source": "cache",
                "cost": 0.0
            }

        self.cache_misses += 1

        # Check the budget before making a paid API call
        if self.daily_spending >= self.daily_budget:
            return {
                "response": "Daily budget exceeded. Please try again tomorrow.",
                "source": "budget_limit",
                "cost": 0.0
            }

        # Route to the appropriate model
        complexity = self._classify_complexity(query)

        if complexity == "low":
            # Use the cheapest model for simple queries
            response_text = self.gemini.generate_content(query).text
            cost = 0.0008  # Rough per-query estimate for Gemini Flash
            source = "gemini-2.5-flash"

        else:
            # Use Claude for complex queries, with concise prompts
            system = "Provide concise, direct answers. Be brief but complete."

            response = self.anthropic.messages.create(
                model="claude-sonnet-4.5",
                max_tokens=300 if complexity == "medium" else 512,
                system=system,
                messages=[{"role": "user", "content": query}]
            )

            response_text = response.content[0].text

            # Calculate exact cost from reported token usage
            input_cost = (response.usage.input_tokens / 1_000_000) * 3.00
            output_cost = (response.usage.output_tokens / 1_000_000) * 15.00
            cost = input_cost + output_cost
            source = "claude-sonnet-4.5"

        # Update spending
        self.daily_spending += cost

        # Cache the response for future identical queries
        self.cache[cache_key] = response_text

        return {
            "response": response_text,
            "source": source,
            "cost": cost,
            "daily_spending": self.daily_spending
        }

    def get_stats(self):
        """Get performance statistics."""
        total_requests = self.cache_hits + self.cache_misses
        hit_rate = (self.cache_hits / total_requests * 100) if total_requests > 0 else 0

        return {
            "total_requests": total_requests,
            "cache_hit_rate": hit_rate,
            "daily_spending": self.daily_spending,
            "budget_remaining": self.daily_budget - self.daily_spending
        }

# Test the production agent
agent = ProductionCostOptimizedAgent(daily_budget=1.0)

test_queries = [
    "What is Python?",
    "What is Python?",  # Should hit cache
    "Analyze the trade-offs between microservices and monolithic architectures",
    "What is JavaScript?",
    "What is Python?",  # Should hit cache again
]

for query in test_queries:
    result = agent.respond(query)
    print(f"Q: {query}")
    print(f"Source: {result['source']}")
    print(f"Cost: ${result['cost']:.6f}")
    if 'daily_spending' in result:
        print(f"Daily spending: ${result['daily_spending']:.4f}")
    print()

stats = agent.get_stats()
print("=== Agent Statistics ===")
print(f"Total requests: {stats['total_requests']}")
print(f"Cache hit rate: {stats['cache_hit_rate']:.1f}%")
print(f"Daily spending: ${stats['daily_spending']:.4f}")
print(f"Budget remaining: ${stats['budget_remaining']:.4f}")
156print(f"Budget remaining: ${stats['budget_remaining']:.4f}")1from anthropic import Anthropic
2from google import generativeai as genai
3import hashlib
4from datetime import datetime
5
6class ProductionCostOptimizedAgent:
7 """Production agent with comprehensive cost optimization."""
8
9 def __init__(self, daily_budget=50.0):
10 # Initialize clients
11 self.anthropic = Anthropic(api_key="YOUR_ANTHROPIC_KEY")
12 genai.configure(api_key="YOUR_GOOGLE_KEY")
13 self.gemini = genai.GenerativeModel('gemini-2.5-flash')
14
15 # Budget tracking
16 self.daily_budget = daily_budget
17 self.daily_spending = 0.0
18 self.current_day = datetime.now().date()
19
20 # Caching
21 self.cache = {}
22 self.cache_hits = 0
23 self.cache_misses = 0
24
25 # Conversation history (trimmed)
26 self.history = []
27 self.max_history = 6
28
29 def _reset_if_new_day(self):
30 """Reset daily counters."""
31 today = datetime.now().date()
32 if today != self.current_day:
33 self.current_day = today
34 self.daily_spending = 0.0
35
36 def _hash_query(self, query):
37 """Create cache key."""
38 return hashlib.md5(query.lower().strip().encode()).hexdigest()
39
40 def _classify_complexity(self, query):
41 """Determine query complexity."""
42 high_complexity = ["explain why", "analyze", "compare", "evaluate"]
43 simple_patterns = ["what is", "who is", "when is"]
44
45 query_lower = query.lower()
46
47 if any(p in query_lower for p in high_complexity):
48 return "high"
49 elif any(p in query_lower for p in simple_patterns):
50 return "low"
51 else:
52 return "medium"
53
54 def respond(self, query):
55 """Generate optimized response."""
56 self._reset_if_new_day()
57
58 # Check budget
59 if self.daily_spending >= self.daily_budget:
60 return {
61 "response": "Daily budget exceeded. Please try again tomorrow.",
62 "source": "budget_limit",
63 "cost": 0.0
64 }
65
66 # Check cache
67 cache_key = self._hash_query(query)
68 if cache_key in self.cache:
69 self.cache_hits += 1
70 return {
71 "response": self.cache[cache_key],
72 "source": "cache",
73 "cost": 0.0
74 }
75
76 self.cache_misses += 1
77
78 # Route to appropriate model
79 complexity = self._classify_complexity(query)
80
81 if complexity == "low":
82 # Use cheapest model for simple queries
83 response_text = self.gemini.generate_content(query).text
84 cost = 0.0008 # Estimated cost for Gemini Flash
85 source = "gemini-2.5-flash"
86
87 else:
88 # Use Claude for complex queries, with concise prompts
89 system = "Provide concise, direct answers. Be brief but complete."
90
91 response = self.anthropic.messages.create(
92 model="claude-sonnet-4.5",
93 max_tokens=300 if complexity == "medium" else 512,
94 system=system,
95 messages=[{"role": "user", "content": query}]
96 )
97
98 response_text = response.content[0].text
99
100 # Calculate cost
101 input_cost = (response.usage.input_tokens / 1_000_000) * 3.00
102 output_cost = (response.usage.output_tokens / 1_000_000) * 15.00
103 cost = input_cost + output_cost
104 source = "claude-sonnet-4.5"
105
106 # Update spending
107 self.daily_spending += cost
108
109 # Cache the response
110 self.cache[cache_key] = response_text
111
112 return {
113 "response": response_text,
114 "source": source,
115 "cost": cost,
116 "daily_spending": self.daily_spending
117 }
118
119 def get_stats(self):
120 """Get performance statistics."""
121 total_requests = self.cache_hits + self.cache_misses
122 hit_rate = (self.cache_hits / total_requests * 100) if total_requests > 0 else 0
123
124 return {
125 "total_requests": total_requests,
126 "cache_hit_rate": hit_rate,
127 "daily_spending": self.daily_spending,
128 "budget_remaining": self.daily_budget - self.daily_spending
129 }
130
131## Test the production agent
132agent = ProductionCostOptimizedAgent(daily_budget=1.0)
133
134test_queries = [
135 "What is Python?",
136 "What is Python?", # Should hit cache
137 "Analyze the trade-offs between microservices and monolithic architectures",
138 "What is JavaScript?",
139 "What is Python?", # Should hit cache again
140]
141
142for query in test_queries:
143 result = agent.respond(query)
144 print(f"Q: {query}")
145 print(f"Source: {result['source']}")
146 print(f"Cost: ${result['cost']:.6f}")
147 if 'daily_spending' in result:
148 print(f"Daily spending: ${result['daily_spending']:.4f}")
149 print()
150
151stats = agent.get_stats()
152print("=== Agent Statistics ===")
153print(f"Total requests: {stats['total_requests']}")
154print(f"Cache hit rate: {stats['cache_hit_rate']:.1f}%")
155print(f"Daily spending: ${stats['daily_spending']:.4f}")
156print(f"Budget remaining: ${stats['budget_remaining']:.4f}")This agent combines multiple strategies:
- Caching for repeated queries (free responses)
- Model routing based on complexity (use cheaper models when possible)
- Concise prompts (reduce output tokens)
- Budget limits (prevent runaway costs)
The result is an agent that's both capable and economical.
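One caveat before deploying something like this: the in-memory cache above never expires, so a cached answer can go stale if the underlying facts change. A minimal time-to-live (TTL) layer, sketched below under the assumption that a one-hour lifetime suits your queries, keeps cached responses fresh:

import time

class TTLCache:
    """Sketch of a cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (timestamp, value)

    def get(self, key):
        """Return the cached value, or None if absent or expired."""
        if key in self.entries:
            stored_at, value = self.entries[key]
            if time.time() - stored_at < self.ttl:
                return value
            del self.entries[key]  # Expired: evict and treat as a miss
        return None

    def set(self, key, value):
        """Store a value with the current timestamp."""
        self.entries[key] = (time.time(), value)

To use it, swap the agent's plain dict for a TTLCache and call cache.get(cache_key) and cache.set(cache_key, response_text) inside respond(). For deployments with multiple processes or servers, an external store such as Redis (which supports per-key TTLs natively) is the more common choice.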
Glossary
API Call: A request made to a language model service. Each call typically incurs a cost based on the number of tokens processed.
Batching: Combining multiple similar requests into a single API call to reduce overhead and costs. More efficient than processing each request individually.
Budget Limit: A maximum spending threshold set to prevent unexpected or runaway costs. Can be daily, monthly, or per-user.
Cache Hit: When a requested response is found in the cache and can be served instantly without making an API call. Saves both time and money.
Cache Miss: When a requested response is not in the cache, requiring a new API call to generate it.
Context Window: The maximum amount of text (measured in tokens) that a model can process in a single request, including both input and output.
Input Tokens: The tokens in your prompt, including system messages, conversation history, and the user's query. Generally cheaper than output tokens.
Output Tokens: The tokens generated by the model in its response. Typically cost more than input tokens because generation requires more computation.
Token: The basic unit of text that language models process, roughly equivalent to a word or word piece. Both costs and context limits are measured in tokens.
Quiz
Ready to test your understanding? Take this quick quiz to reinforce what you've learned about managing and reducing AI agent costs.
About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
Related Content

Scaling Up without Breaking the Bank: AI Agent Performance & Cost Optimization at Scale
Learn how to scale AI agents from single users to thousands while maintaining performance and controlling costs. Covers horizontal scaling, load balancing, monitoring, cost controls, and prompt optimization strategies.

Speeding Up AI Agents: Performance Optimization Techniques for Faster Response Times
Learn practical techniques to make AI agents respond faster, including model selection strategies, response caching, streaming, parallel execution, and prompt optimization for reduced latency.

Maintenance and Updates: Keeping Your AI Agent Running and Improving Over Time
Learn how to maintain and update AI agents safely, manage costs, respond to user feedback, and keep your system healthy over months and years of operation.