Learn how to dramatically reduce AI agent API costs without sacrificing capability. Covers model selection, caching, batching, prompt optimization, and budget controls with practical Python examples.

This article is part of the free-to-read AI Agent Handbook
Your assistant works beautifully. It answers questions, uses tools, remembers context, and handles complex tasks. But there's a problem you might not have noticed yet: every interaction costs money.
Each time your agent calls Claude Sonnet 4.5, GPT-5, or Gemini 2.5, you're charged based on the number of tokens processed. Input tokens (your prompt) and output tokens (the response) both count. Run your agent at scale, and those costs add up fast. A single user might generate $0.50 in API costs per day. A thousand users? That's $500 daily, or $15,000 per month.
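The arithmetic is linear, so a quick back-of-envelope projection (using the illustrative figures above, not real billing data) looks like this:

```python
# Back-of-envelope cost projection using the illustrative figures above
cost_per_user_per_day = 0.50  # assumed average API spend per active user
active_users = 1_000

daily_cost = cost_per_user_per_day * active_users
monthly_cost = daily_cost * 30
print(f"${daily_cost:,.0f}/day -> ${monthly_cost:,.0f}/month")  # $500/day -> $15,000/month
```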
The good news is that you can dramatically reduce costs without sacrificing much capability. This chapter shows you how to build an agent that's both powerful and economical.
Before we optimize, let's understand what you're paying for. Most language model APIs charge per token, with different rates for input and output.
Here's a simplified example of typical pricing (November 2025):
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|
| Claude Sonnet 4.5 | $3.00 | $15.00 | Complex reasoning, agents |
| GPT-5 | $2.50 | $10.00 | General-purpose tasks |
| Gemini 2.5 Flash | $0.40 | $1.20 | Simple queries, high volume |
| Gemini 2.5 Pro | $1.25 | $5.00 | Multimodal, large context |
Notice that output tokens cost more than input tokens. That's because generation is sequential: the model runs a forward pass for every output token, while input tokens can be processed in parallel. It also means that verbose responses are disproportionately expensive.
Let's calculate the cost of a typical interaction:
```python
def calculate_interaction_cost(input_tokens, output_tokens, model="claude-sonnet-4-5"):
    """Calculate the cost of a single model interaction."""
    # Pricing per million tokens (November 2025 rates)
    pricing = {
        "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
        "gpt-5": {"input": 2.50, "output": 10.00},
        "gemini-2.5-flash": {"input": 0.40, "output": 1.20},
        "gemini-2.5-pro": {"input": 1.25, "output": 5.00}
    }
    rates = pricing[model]
    # Calculate cost (rates are per million tokens)
    input_cost = (input_tokens / 1_000_000) * rates["input"]
    output_cost = (output_tokens / 1_000_000) * rates["output"]
    total_cost = input_cost + output_cost
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": total_cost
    }

# Example: a conversation with context
input_tokens = 1500  # System prompt + conversation history + query
output_tokens = 500  # Agent's response
cost = calculate_interaction_cost(input_tokens, output_tokens, "claude-sonnet-4-5")
print(f"Input cost: ${cost['input_cost']:.6f}")
print(f"Output cost: ${cost['output_cost']:.6f}")
print(f"Total cost: ${cost['total_cost']:.6f}")
print(f"\nCost per 1000 interactions: ${cost['total_cost'] * 1000:.2f}")
```
"""Calculate the cost of a single model interaction."""
# Pricing per million tokens (November 2025 rates)
pricing = {
"claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
"gpt-5": {"input": 2.50, "output": 10.00},
"gemini-2.5-flash": {"input": 0.40, "output": 1.20},
"gemini-2.5-pro": {"input": 1.25, "output": 5.00}
}
rates = pricing[model]
# Calculate cost (rates are per million tokens)
input_cost = (input_tokens / 1_000_000) * rates["input"]
output_cost = (output_tokens / 1_000_000) * rates["output"]
total_cost = input_cost + output_cost
return {
"input_cost": input_cost,
"output_cost": output_cost,
"total_cost": total_cost
}
## Example: A conversation with context
input_tokens = 1500 # System prompt + conversation history + query
output_tokens = 500 # Agent's response
cost = calculate_interaction_cost(input_tokens, output_tokens, "claude-sonnet-4-5")
print(f"Input cost: ${cost['input_cost']:.6f}")
print(f"Output cost: ${cost['output_cost']:.6f}")
print(f"Total cost: ${cost['total_cost']:.6f}")
print(f"\nCost per 1000 interactions: ${cost['total_cost'] * 1000:.2f}")Input cost: $0.004500 Output cost: $0.007500 Total cost: $0.012000 Cost per 1000 interactions: $12.00
Output:

```
Input cost: $0.004500
Output cost: $0.007500
Total cost: $0.012000

Cost per 1000 interactions: $12.00
```

A single interaction costs about one cent. That seems small, but multiply it by thousands of users and millions of interactions, and you're looking at serious money.
Before you can optimize, you need visibility into what you're spending. Let's add cost tracking to our assistant:
```python
import os
from anthropic import Anthropic
from datetime import datetime

class CostTrackingAgent:
    """Agent that tracks API costs for monitoring and optimization."""

    def __init__(self):
        self.client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        self.cost_log = []
        # Pricing per million tokens
        self.pricing = {
            "claude-sonnet-4-5": {"input": 3.00, "output": 15.00}
        }

    def _calculate_cost(self, usage, model):
        """Calculate cost from token usage."""
        rates = self.pricing[model]
        input_cost = (usage.input_tokens / 1_000_000) * rates["input"]
        output_cost = (usage.output_tokens / 1_000_000) * rates["output"]
        return input_cost + output_cost

    def respond(self, query):
        """Generate response and track costs."""
        model = "claude-sonnet-4-5"
        response = self.client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": query}]
        )
        # Calculate and log cost
        cost = self._calculate_cost(response.usage, model)
        self.cost_log.append({
            "timestamp": datetime.now(),
            "query": query[:50] + "..." if len(query) > 50 else query,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "cost": cost,
            "model": model
        })
        return response.content[0].text

    def get_cost_summary(self):
        """Get summary of costs."""
        if not self.cost_log:
            return "No interactions yet."
        total_cost = sum(entry["cost"] for entry in self.cost_log)
        total_tokens = sum(
            entry["input_tokens"] + entry["output_tokens"]
            for entry in self.cost_log
        )
        return {
            "total_interactions": len(self.cost_log),
            "total_cost": total_cost,
            "total_tokens": total_tokens,
            "average_cost_per_interaction": total_cost / len(self.cost_log),
            "most_expensive": max(self.cost_log, key=lambda x: x["cost"])
        }

# Test the tracking
agent = CostTrackingAgent()
agent.respond("What is Python?")
agent.respond("Explain machine learning in simple terms.")
agent.respond("How do neural networks work?")

summary = agent.get_cost_summary()
print(f"Total interactions: {summary['total_interactions']}")
print(f"Total cost: ${summary['total_cost']:.4f}")
print(f"Average cost: ${summary['average_cost_per_interaction']:.4f}")
print(f"\nMost expensive query: {summary['most_expensive']['query']}")
print(f"Cost: ${summary['most_expensive']['cost']:.4f}")
```
Output:

```
Total interactions: 3
Total cost: $0.0136
Average cost: $0.0045

Most expensive query: How do neural networks work?
Cost: $0.0050
```
This gives you visibility into where your money goes. You might discover that certain queries are far more expensive than others, or that a small percentage of interactions account for most of your costs.
The most effective cost reduction strategy is simple: use cheaper models when possible. Not every task needs your most powerful model.
Think of it like choosing transportation. You wouldn't hire a helicopter to go to the grocery store. A car works fine. Similarly, you don't need Claude Sonnet 4.5 for every query.
```python
import os
from anthropic import Anthropic
from openai import OpenAI
from google import genai

class CostOptimizedAgent:
    """Agent that chooses the most cost-effective model for each task."""

    def __init__(self):
        # Initialize clients for different providers
        self.anthropic = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        self.openai = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.gemini = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

    def _classify_task_complexity(self, query):
        """Determine what level of model capability is needed."""
        # High complexity: needs reasoning, tool use, or complex understanding
        high_complexity_indicators = [
            "explain why", "analyze", "compare and contrast",
            "step by step", "reasoning", "pros and cons",
            "evaluate", "critique"
        ]
        # Medium complexity: straightforward questions or tasks
        medium_complexity_indicators = [
            "how to", "what is", "describe", "summarize"
        ]
        query_lower = query.lower()
        if any(ind in query_lower for ind in high_complexity_indicators):
            return "high"
        elif any(ind in query_lower for ind in medium_complexity_indicators):
            return "medium"
        else:
            return "low"

    def respond(self, query):
        """Route to the most cost-effective model."""
        complexity = self._classify_task_complexity(query)
        if complexity == "high":
            # Use Claude Sonnet 4.5 for complex reasoning
            # Cost: ~$0.012 per interaction
            response = self.anthropic.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=1024,
                messages=[{"role": "user", "content": query}]
            )
            return response.content[0].text, "claude-sonnet-4-5", "high"
        elif complexity == "medium":
            # Use GPT-5 for general tasks
            # Cost: ~$0.008 per interaction (33% savings)
            response = self.openai.chat.completions.create(
                model="gpt-5",
                max_completion_tokens=512,
                messages=[{"role": "user", "content": query}]
            )
            return response.choices[0].message.content, "gpt-5", "medium"
        else:
            # Use Gemini 2.5 Flash for simple queries
            # Cost: ~$0.001 per interaction (92% savings!)
            response = self.gemini.models.generate_content(
                model="gemini-2.5-flash",
                contents=query
            )
            return response.text, "gemini-2.5-flash", "low"

# Test with different complexity levels
agent = CostOptimizedAgent()
queries = [
    ("What's the capital of France?", "low"),
    ("What is machine learning?", "medium"),
    ("Analyze the pros and cons of different database architectures", "high")
]
for query, expected in queries:
    result, model, complexity = agent.respond(query)
    print(f"Query: {query}")
    print(f"Complexity: {complexity} (expected: {expected})")
    print(f"Model: {model}")
    print(f"Response: {result[:100]}...")
    print()
```
Output:

```
Query: What's the capital of France?
Complexity: low (expected: low)
Model: gemini-2.5-flash
Response: The capital of France is **Paris**....

Query: What is machine learning?
Complexity: medium (expected: medium)
Model: gpt-5
Response: Machine learning is a branch of artificial intelligence where computers learn patterns from data to ...

Query: Analyze the pros and cons of different database architectures
Complexity: high (expected: high)
Model: claude-sonnet-4-5
Response: # Database Architecture Analysis ## 1. **Relational Databases (RDBMS)** ### Pros - **ACID Complian...
```
By routing simple queries to Gemini 2.5 Flash, you can save 90% or more on those interactions. If 50% of your queries are simple, you've just cut your total costs by 45%.
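To see where that 45% comes from, here's the blended-cost arithmetic, using the rough per-interaction estimates from the code comments above (assumptions, not billed rates):

```python
# Blended cost when half the traffic is routed to a cheap model
cost_sonnet = 0.012  # ~cost per complex interaction on Claude Sonnet 4.5 (estimate)
cost_flash = 0.001   # ~cost per simple interaction on Gemini 2.5 Flash (estimate)

baseline = cost_sonnet                         # everything on Sonnet
routed = 0.5 * cost_sonnet + 0.5 * cost_flash  # 50% of queries routed to Flash
savings_percent = (baseline - routed) / baseline * 100
print(f"Blended savings: {savings_percent:.0f}%")  # ~46% with these estimates
```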
Remember that output tokens cost more than input tokens. A response with 1000 tokens costs twice as much as one with 500 tokens. If your agent is verbose, you're wasting money.
```python
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def compare_response_costs(query):
    """Compare costs of verbose vs concise responses."""
    # Verbose response (default behavior)
    verbose_response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    )
    # Concise response (optimized)
    concise_system = """You are a helpful assistant. Provide concise, direct answers.
Use 1-2 sentences for simple questions. Avoid unnecessary elaboration."""
    concise_response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=200,  # Hard limit
        system=concise_system,
        messages=[{"role": "user", "content": query}]
    )

    # Calculate costs at Claude Sonnet 4.5 rates
    def calc_cost(usage):
        input_cost = (usage.input_tokens / 1_000_000) * 3.00
        output_cost = (usage.output_tokens / 1_000_000) * 15.00
        return input_cost + output_cost

    verbose_cost = calc_cost(verbose_response.usage)
    concise_cost = calc_cost(concise_response.usage)
    savings = ((verbose_cost - concise_cost) / verbose_cost) * 100
    return {
        "verbose": {
            "response": verbose_response.content[0].text,
            "tokens": verbose_response.usage.output_tokens,
            "cost": verbose_cost
        },
        "concise": {
            "response": concise_response.content[0].text,
            "tokens": concise_response.usage.output_tokens,
            "cost": concise_cost
        },
        "savings_percent": savings
    }

# Test with a simple query
result = compare_response_costs("What is Python?")

print("Verbose response:")
print(f"Tokens: {result['verbose']['tokens']}")
print(f"Cost: ${result['verbose']['cost']:.6f}")
print(f"Response: {result['verbose']['response'][:150]}...")
print()
print("Concise response:")
print(f"Tokens: {result['concise']['tokens']}")
print(f"Cost: ${result['concise']['cost']:.6f}")
print(f"Response: {result['concise']['response']}")
print()
print(f"Cost savings: {result['savings_percent']:.1f}%")
```
Output:

```
Verbose response:
Tokens: 287
Cost: $0.004338
Response: Python is a high-level, general-purpose programming language created by Guido van Rossum and first released in 1991. Here are its key characteristics:...

Concise response:
Tokens: 44
Cost: $0.000792
Response: Python is a high-level, interpreted programming language known for its clear syntax and readability. It's widely used for web development, data science, automation, artificial intelligence, and general-purpose programming.

Cost savings: 81.7%
```
In this run, the concise version cut the cost of the interaction by over 80%; savings of 60-70% on output tokens are typical for simple queries. Across thousands of interactions, that's substantial savings.
If users ask the same questions repeatedly, why pay to generate the answer every time? Cache responses and serve them instantly for free.
```python
import os
import hashlib
from google import genai

gemini_client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

class CachingAgent:
    """Agent with intelligent caching to minimize API calls."""

    def __init__(self):
        self.client = gemini_client
        # Short-term cache: exact query matches
        self.exact_cache = {}
        # Long-term cache: preloaded common queries that rarely change
        self.persistent_cache = {
            "what is python": "Python is a high-level programming language...",
            "what is machine learning": "Machine learning is a subset of AI...",
        }
        # Cache metadata
        self.cache_stats = {
            "hits": 0,
            "misses": 0,
            "api_calls": 0,
            "cost_saved": 0.0
        }

    def _hash_query(self, query):
        """Create cache key from query."""
        return hashlib.md5(query.lower().strip().encode()).hexdigest()

    def _estimate_cost_saved(self, query):
        """Estimate cost saved by a cache hit."""
        # Rough estimate: 100 input tokens + 200 output tokens at Gemini Flash rates
        input_cost = (100 / 1_000_000) * 0.40
        output_cost = (200 / 1_000_000) * 1.20
        return input_cost + output_cost

    def respond(self, query):
        """Get response, using cache when possible."""
        cache_key = self._hash_query(query)
        query_normalized = query.lower().strip()
        # Check persistent cache first (common queries)
        # Note: normalization keeps punctuation, so "What is Python?" does not
        # match the preloaded key "what is python"
        if query_normalized in self.persistent_cache:
            self.cache_stats["hits"] += 1
            self.cache_stats["cost_saved"] += self._estimate_cost_saved(query)
            return self.persistent_cache[query_normalized], "persistent_cache"
        # Check exact cache (recent queries)
        if cache_key in self.exact_cache:
            self.cache_stats["hits"] += 1
            self.cache_stats["cost_saved"] += self._estimate_cost_saved(query)
            return self.exact_cache[cache_key], "exact_cache"
        # Cache miss: call the model
        self.cache_stats["misses"] += 1
        self.cache_stats["api_calls"] += 1
        response = self.client.models.generate_content(
            model="gemini-2.5-flash",
            contents=query
        )
        result = response.text
        # Store in exact cache
        self.exact_cache[cache_key] = result
        # If the cache is getting large, prune the oldest entries
        # (dicts preserve insertion order, so the first keys are the oldest)
        if len(self.exact_cache) > 1000:
            keys_to_remove = list(self.exact_cache.keys())[:-500]
            for key in keys_to_remove:
                del self.exact_cache[key]
        return result, "api_call"

    def get_cache_stats(self):
        """Get caching performance metrics."""
        total_requests = self.cache_stats["hits"] + self.cache_stats["misses"]
        hit_rate = (self.cache_stats["hits"] / total_requests * 100) if total_requests > 0 else 0
        return {
            "total_requests": total_requests,
            "cache_hits": self.cache_stats["hits"],
            "cache_misses": self.cache_stats["misses"],
            "hit_rate": hit_rate,
            "api_calls": self.cache_stats["api_calls"],
            "estimated_cost_saved": self.cache_stats["cost_saved"]
        }

# Test the caching agent
agent = CachingAgent()

# Simulate user queries (with some repetition)
queries = [
    "What is Python?",
    "What is machine learning?",
    "How do I learn programming?",
    "What is Python?",              # Duplicate
    "What is machine learning?",    # Duplicate
    "How do I learn programming?",  # Duplicate
    "What are data structures?",
    "What is Python?",              # Duplicate again
]
for query in queries:
    result, source = agent.respond(query)
    print(f"Q: {query}")
    print(f"Source: {source}")
    print()

# Show cache performance
stats = agent.get_cache_stats()
print("Cache Performance:")
print(f"Total requests: {stats['total_requests']}")
print(f"Cache hits: {stats['cache_hits']} ({stats['hit_rate']:.1f}%)")
print(f"API calls: {stats['api_calls']}")
print(f"Estimated cost saved: ${stats['estimated_cost_saved']:.4f}")
```
Output:

```
Q: What is Python?
Source: api_call

Q: What is machine learning?
Source: api_call

Q: How do I learn programming?
Source: api_call

Q: What is Python?
Source: exact_cache

Q: What is machine learning?
Source: exact_cache

Q: How do I learn programming?
Source: exact_cache

Q: What are data structures?
Source: api_call

Q: What is Python?
Source: exact_cache

Cache Performance:
Total requests: 8
Cache hits: 4 (50.0%)
API calls: 4
Estimated cost saved: $0.0011
```
With a 50% cache hit rate, you've cut your API costs in half. For high-traffic applications, caching is one of the most effective cost reduction strategies.
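Application-level caching like this only helps when the exact query repeats. Separately, most providers offer prompt caching at the API level, which discounts a repeated prompt prefix (a long system prompt, tool definitions, a shared document) even when the user's question changes. Here's a minimal sketch using Anthropic's cache_control; LONG_SYSTEM_PROMPT is a placeholder, and you should check your provider's docs for current cache pricing and minimum cacheable sizes:

```python
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

LONG_SYSTEM_PROMPT = "..."  # placeholder: imagine several thousand tokens of instructions

def respond(query):
    """Reuse a cached system-prompt prefix across requests."""
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                # Marks this block as cacheable; later requests that reuse the
                # same prefix are billed at a reduced cache-read rate
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": query}],
    )
    return response.content[0].text
```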
If you need to process multiple similar queries, batch them into a single API call. This reduces overhead and can be more cost-effective.
```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def process_individually(queries):
    """Process each query separately (one API call per query)."""
    results = []
    total_cost = 0.0
    for query in queries:
        response = client.chat.completions.create(
            model="gpt-5",
            max_completion_tokens=100,
            messages=[{"role": "user", "content": query}]
        )
        # Estimate cost (rough approximation)
        tokens = response.usage.total_tokens
        cost = (tokens / 1_000_000) * 6.25  # Average of input/output rates
        total_cost += cost
        results.append(response.choices[0].message.content)
    return results, total_cost

def process_batched(queries):
    """Process all queries in a single combined call."""
    # Combine queries into a single prompt
    batch_prompt = "Answer each of the following questions concisely:\n\n"
    for i, query in enumerate(queries, 1):
        batch_prompt += f"{i}. {query}\n"
    response = client.chat.completions.create(
        model="gpt-5",
        max_completion_tokens=500,
        messages=[{"role": "user", "content": batch_prompt}]
    )
    # Estimate cost
    tokens = response.usage.total_tokens
    cost = (tokens / 1_000_000) * 6.25
    # The batched response contains all the answers in one message
    result = response.choices[0].message.content
    return result, cost

# Test both approaches
queries = [
    "What is Python?",
    "What is JavaScript?",
    "What is Ruby?",
    "What is Go?",
    "What is Rust?"
]

print("Individual processing:")
results_individual, cost_individual = process_individually(queries)
print(f"Cost: ${cost_individual:.4f}")
print()
print("Batched processing:")
result_batched, cost_batched = process_batched(queries)
print(f"Cost: ${cost_batched:.4f}")
print(f"Savings: ${cost_individual - cost_batched:.4f} ({((cost_individual - cost_batched) / cost_individual * 100):.1f}%)")
print()
print("Batched response:")
print(result_batched)
```
Output:

```
Individual processing:
Cost: $0.0034

Batched processing:
Cost: $0.0034
Savings: $0.0000 (0.7%)

Batched response:
```
In this particular run the savings were negligible: output tokens dominate the cost, and both approaches generated about the same amount of text. Batching pays off when the queries share context you'd otherwise resend with every request, when per-request overhead (system prompts, repeated instructions) is significant, or when you can tolerate delay and use a provider's asynchronous batch API, which typically discounts requests by around 50%.
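For workloads that don't need immediate answers, the provider-side batch API is usually the bigger win. Here's a minimal sketch using OpenAI's Batch API; the file name and custom_id values are illustrative, and the exact discount and turnaround window are per OpenAI's current docs:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Each line of the JSONL file is one independent request
requests = [
    {
        "custom_id": f"q-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-5",
            "messages": [{"role": "user", "content": q}],
        },
    }
    for i, q in enumerate(["What is Python?", "What is Rust?"])
]
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload the file and start the batch job (completes within the window)
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll until status == "completed", then fetch results
```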
Long conversation histories increase input token costs. If your agent includes the last 20 messages in every request, you're paying to process all that context repeatedly.
```python
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

class HistoryOptimizedAgent:
    """Agent that manages conversation history efficiently."""

    def __init__(self):
        self.conversation_history = []
        self.max_history_messages = 6  # Keep last 3 exchanges

    def _trim_history(self):
        """Keep only recent messages to reduce token costs."""
        if len(self.conversation_history) > self.max_history_messages:
            self.conversation_history = self.conversation_history[-self.max_history_messages:]

    def _estimate_tokens(self, text):
        """Rough token estimate (4 chars ≈ 1 token)."""
        return len(text) // 4

    def respond(self, user_message):
        """Generate response with optimized history."""
        # Add user message to history
        self.conversation_history.append({
            "role": "user",
            "content": user_message
        })
        # Trim history before sending
        self._trim_history()
        # Rough pre-call estimate of input size (the API reports the real count)
        estimated_input_tokens = sum(
            self._estimate_tokens(msg["content"])
            for msg in self.conversation_history
        )
        # Make the call
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=512,
            messages=self.conversation_history
        )
        # Add assistant response to history
        assistant_message = response.content[0].text
        self.conversation_history.append({
            "role": "assistant",
            "content": assistant_message
        })
        # Calculate cost
        input_cost = (response.usage.input_tokens / 1_000_000) * 3.00
        output_cost = (response.usage.output_tokens / 1_000_000) * 15.00
        total_cost = input_cost + output_cost
        return {
            "response": assistant_message,
            "input_tokens": response.usage.input_tokens,
            "estimated_input_tokens": estimated_input_tokens,
            "output_tokens": response.usage.output_tokens,
            "cost": total_cost,
            "history_length": len(self.conversation_history)
        }

# Test with a multi-turn conversation
agent = HistoryOptimizedAgent()
queries = [
    "What is Python?",
    "What are its main features?",
    "How does it compare to Java?",
    "What about performance?",
    "Should I learn it?",
    "What resources do you recommend?",
    "How long will it take?",
    "What projects should I build?"
]
total_cost = 0.0
for query in queries:
    result = agent.respond(query)
    total_cost += result["cost"]
    print(f"Q: {query}")
    print(f"Input tokens: {result['input_tokens']}")
    print(f"History length: {result['history_length']} messages")
    print(f"Cost: ${result['cost']:.6f}")
    print()

print(f"Total conversation cost: ${total_cost:.4f}")
print("\nNote: Without trimming, costs would be ~40% higher")
```
Output:

```
Q: What is Python?
Input tokens: 11
History length: 2 messages
Cost: $0.004278

Q: What are its main features?
Input tokens: 303
History length: 4 messages
Cost: $0.006324

Q: How does it compare to Java?
Input tokens: 674
History length: 6 messages
Cost: $0.009702

Q: What about performance?
Input tokens: 1200
History length: 7 messages
Cost: $0.011280

Q: Should I learn it?
Input tokens: 1431
History length: 7 messages
Cost: $0.011973

Q: What resources do you recommend?
Input tokens: 1584
History length: 7 messages
Cost: $0.012432

Q: How long will it take?
Input tokens: 1587
History length: 7 messages
Cost: $0.012441

Q: What projects should I build?
Input tokens: 1588
History length: 7 messages
Cost: $0.012444

Total conversation cost: $0.0809

Note: Without trimming, costs would be ~40% higher
```
By keeping only the last 6 messages (3 exchanges), you prevent the input token count from growing unbounded. This is especially important for long conversations.
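Trimming by message count is a blunt instrument: six short messages cost far less than six long ones. A variant that trims to an estimated token budget (reusing the rough 4-characters-per-token heuristic from the class above) gives more predictable input costs; a minimal sketch:

```python
def trim_to_token_budget(history, max_input_tokens=2000):
    """Drop the oldest messages until the estimated total fits the budget.

    Uses the rough 4-characters-per-token estimate; always keeps the most
    recent exchange so the model retains the immediate context.
    """
    def estimate_tokens(message):
        return len(message["content"]) // 4

    trimmed = list(history)
    while len(trimmed) > 2 and sum(estimate_tokens(m) for m in trimmed) > max_input_tokens:
        trimmed.pop(0)  # oldest message first
    return trimmed

# Usage: replace the count-based _trim_history with
# self.conversation_history = trim_to_token_budget(self.conversation_history)
```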
For agents that need to include large amounts of context (like retrieved documents or long system prompts), consider compressing that information.
```python
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def summarize_context(long_context, max_length=500):
    """Compress long context into a shorter summary."""
    if len(long_context) <= max_length:
        return long_context
    # Use the model to create a concise summary
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Summarize this in {max_length//4} words or less:\n\n{long_context}"
        }]
    )
    return response.content[0].text

def respond_with_context(query, long_context):
    """Answer query using compressed context."""
    # Compress the context first
    compressed_context = summarize_context(long_context, max_length=500)
    # Use compressed context in the actual query
    full_prompt = f"Context: {compressed_context}\n\nQuestion: {query}"
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": full_prompt}]
    )
    return response.content[0].text

# Example: long document that needs to be included
long_document = """
[Imagine a 5000-word document about machine learning here...]
Machine learning is a field of artificial intelligence that focuses on...
[... many more paragraphs ...]
"""

# Without compression: ~5000 tokens input
# With compression: ~500 tokens input
# Savings: ~90% on input tokens for this context
query = "What are the key concepts in this document?"
answer = respond_with_context(query, long_document)
print(answer)
```
Output (the placeholder document draws a sensible objection from the model):

```
I don't actually have access to a real 5000-word document about machine learning - you've only provided a placeholder indicating where such a document would be.

From what you've shown me, I can only see:
- A fragment mentioning "Machine learning is a field of artificial intelligence that focuses on..."
- Placeholders indicating there would be more content

To provide you with the key concepts from a document, I would need the actual full text. If you'd like me to analyze a document about machine learning, please paste the complete content, and I'll be happy to:
1. Identify the main concepts covered
2. Summarize key themes
3. Highlight important terminology and ideas
4. Note any significant examples or applications mentioned

Would you like to share the actual document text?
```
You pay for the summarization call, but if you use that compressed context multiple times, you save money overall. This is especially valuable for retrieval-augmented generation (RAG) systems where you're including retrieved documents in every query.
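The break-even point is easy to estimate. With the numbers from the example above (5,000 tokens of raw context compressed to 500, at Claude Sonnet 4.5 rates), compression pays for itself after roughly one to two reuses:

```python
# Break-even estimate for context compression (figures from the example above)
input_rate = 3.00 / 1_000_000    # Claude Sonnet 4.5, per input token
output_rate = 15.00 / 1_000_000  # per output token

full_context = 5000        # tokens of raw context
compressed_context = 500   # tokens after summarization
summary_output = 200       # tokens the summarization call generates

# One-time cost to produce the summary
summarize_cost = full_context * input_rate + summary_output * output_rate
# Input tokens saved every time the compressed context is reused
saved_per_reuse = (full_context - compressed_context) * input_rate

print(f"Break-even after ~{summarize_cost / saved_per_reuse:.1f} reuses")  # ~1.3
```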
Prevent runaway costs by implementing budget controls in your agent.
```python
import os
from anthropic import Anthropic
from datetime import datetime

class BudgetControlledAgent:
    """Agent with built-in budget limits and alerts."""

    def __init__(self, daily_budget=10.0):
        self.client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        self.daily_budget = daily_budget
        self.current_day = datetime.now().date()
        self.daily_spending = 0.0
        self.total_spending = 0.0

    def _reset_daily_budget_if_needed(self):
        """Reset daily spending counter at midnight."""
        today = datetime.now().date()
        if today != self.current_day:
            self.current_day = today
            self.daily_spending = 0.0

    def _calculate_cost(self, usage):
        """Calculate cost from token usage."""
        input_cost = (usage.input_tokens / 1_000_000) * 3.00
        output_cost = (usage.output_tokens / 1_000_000) * 15.00
        return input_cost + output_cost

    def respond(self, query):
        """Generate response if within budget."""
        self._reset_daily_budget_if_needed()
        # Check if we're over budget
        if self.daily_spending >= self.daily_budget:
            return {
                "response": None,
                "error": f"Daily budget of ${self.daily_budget:.2f} exceeded. Current spending: ${self.daily_spending:.2f}",
                "budget_remaining": 0.0
            }
        # Make the call
        response = self.client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=512,
            messages=[{"role": "user", "content": query}]
        )
        # Track spending
        cost = self._calculate_cost(response.usage)
        self.daily_spending += cost
        self.total_spending += cost
        # Check if approaching budget limit
        budget_remaining = self.daily_budget - self.daily_spending
        warning = None
        if budget_remaining < self.daily_budget * 0.2:  # Less than 20% remaining
            warning = f"Warning: Only ${budget_remaining:.2f} remaining in daily budget"
        return {
            "response": response.content[0].text,
            "cost": cost,
            "daily_spending": self.daily_spending,
            "budget_remaining": budget_remaining,
            "warning": warning
        }

    def get_spending_summary(self):
        """Get spending statistics."""
        return {
            "daily_spending": self.daily_spending,
            "daily_budget": self.daily_budget,
            "budget_used_percent": (self.daily_spending / self.daily_budget) * 100,
            "total_spending": self.total_spending
        }

# Test budget controls
agent = BudgetControlledAgent(daily_budget=0.10)  # $0.10 daily limit
queries = [
    "What is Python?",
    "Explain machine learning.",
    "How do neural networks work?",
    "What is deep learning?",
    "Describe reinforcement learning.",
    "What are transformers?",
    "Explain attention mechanisms.",
    "What is GPT?",
    "How does BERT work?",
    "What is transfer learning?"
]
for i, query in enumerate(queries, 1):
    print(f"\n--- Query {i} ---")
    result = agent.respond(query)
    if result["response"]:
        print(f"Response: {result['response'][:100]}...")
        print(f"Cost: ${result['cost']:.6f}")
        print(f"Daily spending: ${result['daily_spending']:.4f}")
        if result["warning"]:
            print(f"⚠️ {result['warning']}")
    else:
        print(f"❌ {result['error']}")
        break

summary = agent.get_spending_summary()
print(f"\n=== Spending Summary ===")
print(f"Daily budget: ${summary['daily_budget']:.2f}")
print(f"Daily spending: ${summary['daily_spending']:.4f}")
print(f"Budget used: {summary['budget_used_percent']:.1f}%")
```
Output:

```
--- Query 1 ---
Response: Python is a high-level, interpreted programming language created by Guido van Rossum and first relea...
Cost: $0.004398
Daily spending: $0.0044

--- Query 2 ---
Response: # Machine Learning Explained **Machine learning** is a branch of artificial intelligence where comp...
Cost: $0.004986
Daily spending: $0.0094

--- Query 3 ---
Response: # How Neural Networks Work Neural networks are computing systems inspired by biological brains. Her...
Cost: $0.005199
Daily spending: $0.0146

--- Query 4 ---
Response: Deep learning is a subset of machine learning that uses artificial neural networks with multiple lay...
Cost: $0.004131
Daily spending: $0.0187

--- Query 5 ---
Response: # Reinforcement Learning **Reinforcement learning (RL)** is a type of machine learning where an age...
Cost: $0.005124
Daily spending: $0.0238

--- Query 6 ---
Response: # Transformers Transformers are a type of **deep learning architecture** introduced in 2017 that ha...
Cost: $0.004671
Daily spending: $0.0285

--- Query 7 ---
Response: # Attention Mechanisms Attention mechanisms allow neural networks to **focus on specific parts of t...
Cost: $0.006171
Daily spending: $0.0347

--- Query 8 ---
Response: GPT stands for **Generative Pre-trained Transformer**. It's a type of AI language model developed by...
Cost: $0.003576
Daily spending: $0.0383

--- Query 9 ---
Response: # How BERT Works BERT (Bidirectional Encoder Representations from Transformers) is a language model...
Cost: $0.005709
Daily spending: $0.0440

--- Query 10 ---
Response: # Transfer Learning Transfer learning is a machine learning technique where a model developed for o...
Cost: $0.004416
Daily spending: $0.0484

=== Spending Summary ===
Daily budget: $0.10
Daily spending: $0.0484
Budget used: 48.4%
```
Budget controls prevent unexpected bills and force you to think about cost optimization. If you hit your budget limit regularly, it's a signal that you need to optimize your agent's efficiency.
As you apply these strategies, track the results. Here's a comprehensive cost analysis tool:
```python
import os
import statistics
from anthropic import Anthropic

class CostAnalyzer:
    """Analyze and compare costs across different optimization strategies."""

    def __init__(self):
        self.client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        self.baseline_costs = []
        self.optimized_costs = []

    def _calculate_cost(self, usage):
        """Calculate cost from token usage."""
        input_cost = (usage.input_tokens / 1_000_000) * 3.00
        output_cost = (usage.output_tokens / 1_000_000) * 15.00
        return input_cost + output_cost

    def run_baseline(self, queries):
        """Run queries without optimization."""
        print("Running baseline (no optimization)...")
        for query in queries:
            response = self.client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=1024,  # No limits
                messages=[{"role": "user", "content": query}]
            )
            cost = self._calculate_cost(response.usage)
            self.baseline_costs.append(cost)

    def run_optimized(self, queries):
        """Run queries with optimization."""
        print("Running optimized version...")
        system_prompt = """You are a helpful assistant. Provide concise, direct answers.
Use 1-2 sentences for simple questions."""
        for query in queries:
            response = self.client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=200,  # Limited
                system=system_prompt,
                messages=[{"role": "user", "content": query}]
            )
            cost = self._calculate_cost(response.usage)
            self.optimized_costs.append(cost)

    def generate_report(self):
        """Generate cost comparison report."""
        baseline_total = sum(self.baseline_costs)
        optimized_total = sum(self.optimized_costs)
        savings = baseline_total - optimized_total
        savings_percent = (savings / baseline_total) * 100
        baseline_avg = statistics.mean(self.baseline_costs)
        optimized_avg = statistics.mean(self.optimized_costs)
        report = f"""
=== Cost Optimization Report ===

Baseline (No Optimization):
  Total cost: ${baseline_total:.4f}
  Average per query: ${baseline_avg:.6f}
  Number of queries: {len(self.baseline_costs)}

Optimized:
  Total cost: ${optimized_total:.4f}
  Average per query: ${optimized_avg:.6f}
  Number of queries: {len(self.optimized_costs)}

Savings:
  Total saved: ${savings:.4f}
  Percentage saved: {savings_percent:.1f}%

Projected Monthly Savings (at 10,000 queries/month):
  ${savings * (10000 / len(self.baseline_costs)):.2f}
"""
        return report

# Run the analysis
analyzer = CostAnalyzer()
test_queries = [
    "What is Python?",
    "What is JavaScript?",
    "What is machine learning?",
    "What are neural networks?",
    "What is deep learning?",
    "What is natural language processing?",
    "What are transformers?",
    "What is computer vision?",
    "What is reinforcement learning?",
    "What is data science?"
]
analyzer.run_baseline(test_queries)
analyzer.run_optimized(test_queries)
print(analyzer.generate_report())
```
Output:

```
Running baseline (no optimization)...
Running optimized version...

=== Cost Optimization Report ===

Baseline (No Optimization):
  Total cost: $0.0424
  Average per query: $0.004239
  Number of queries: 10

Optimized:
  Total cost: $0.0092
  Average per query: $0.000919
  Number of queries: 10

Savings:
  Total saved: $0.0332
  Percentage saved: 78.3%

Projected Monthly Savings (at 10,000 queries/month):
  $33.20
```
This gives you concrete numbers showing the impact of your optimizations. Here, two simple changes (a concise system prompt and a lower max_tokens limit) cut costs by almost 80%; even modest adjustments often save 40-60%.
Here's the key insight: cost optimization is about trade-offs. You can always make your agent cheaper by using worse models or shorter responses, but that might hurt quality.
The goal isn't to minimize cost at all costs. It's to maximize value: the best quality you can get for the money you're willing to spend.
Some guidelines:
- Use the best model for critical tasks. If accuracy matters more than cost (medical advice, financial decisions, legal questions), don't skimp on model quality.
- Optimize aggressively for high-volume, low-stakes queries. If you're answering "What's the weather?" thousands of times per day, use the cheapest model that works.
- Monitor quality metrics alongside cost metrics. Track both how much you're spending and how well your agent performs. If cost optimizations hurt user satisfaction, they're not worth it (a sketch of pairing the two metrics follows this list).
- Test before deploying. When you change models or prompts to save money, verify that quality doesn't suffer. Run your evaluation suite (from Chapter 11) to catch regressions.
- Be willing to spend more when it matters. If a user's query is complex or important, it's okay to use your most capable (and expensive) model. The cost of a bad answer is often higher than the cost of the API call.
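To make the monitoring guideline concrete, here's a minimal sketch of tracking a quality score next to cost. The evaluate_config function, the keyword-overlap scoring, and the run_query wrapper are illustrative assumptions, not part of this chapter's agent code; in practice, your evaluation suite from Chapter 11 would supply the real quality metric.

import statistics

def evaluate_config(run_query, eval_set):
    """Score one agent configuration on quality and cost.

    run_query(query) should return (answer_text, cost) -- a thin
    wrapper around whichever agent or model call you're testing.
    eval_set is a list of (query, expected_keywords) pairs; quality
    here is just the fraction of expected keywords in the answer.
    """
    scores, costs = [], []
    for query, keywords in eval_set:
        answer, cost = run_query(query)
        hits = sum(1 for kw in keywords if kw.lower() in answer.lower())
        scores.append(hits / len(keywords))
        costs.append(cost)
    total_cost = sum(costs)
    return {
        "avg_quality": statistics.mean(scores),
        "total_cost": total_cost,
        # The "value" metric: quality per dollar, higher is better
        "quality_per_dollar": statistics.mean(scores) / max(total_cost, 1e-9),
    }

Run this against the same eval_set before and after a cost change: if avg_quality holds steady while quality_per_dollar climbs, the optimization paid off; if quality drops, the savings probably aren't worth it.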
Let's build a production-ready agent that implements multiple cost optimization strategies:
import os
from anthropic import Anthropic
from google import genai
import hashlib
from datetime import datetime
class ProductionCostOptimizedAgent:
"""Production agent with comprehensive cost optimization."""
def __init__(self, daily_budget=50.0):
# Initialize clients
self.anthropic = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
self.gemini_client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
# Budget tracking
self.daily_budget = daily_budget
self.daily_spending = 0.0
self.current_day = datetime.now().date()
# Caching
self.cache = {}
self.cache_hits = 0
self.cache_misses = 0
# Conversation history (trimmed)
self.history = []
self.max_history = 6
def _reset_if_new_day(self):
"""Reset daily counters."""
today = datetime.now().date()
if today != self.current_day:
self.current_day = today
self.daily_spending = 0.0
def _hash_query(self, query):
"""Create cache key."""
return hashlib.md5(query.lower().strip().encode()).hexdigest()
def _classify_complexity(self, query):
"""Determine query complexity."""
high_complexity = ["explain why", "analyze", "compare", "evaluate"]
simple_patterns = ["what is", "who is", "when is"]
query_lower = query.lower()
if any(p in query_lower for p in high_complexity):
return "high"
elif any(p in query_lower for p in simple_patterns):
return "low"
else:
return "medium"
def respond(self, query):
"""Generate optimized response."""
self._reset_if_new_day()
# Check budget
if self.daily_spending >= self.daily_budget:
return {
"response": "Daily budget exceeded. Please try again tomorrow.",
"source": "budget_limit",
"cost": 0.0
}
# Check cache
cache_key = self._hash_query(query)
if cache_key in self.cache:
self.cache_hits += 1
return {
"response": self.cache[cache_key],
"source": "cache",
"cost": 0.0
}
self.cache_misses += 1
# Route to appropriate model
complexity = self._classify_complexity(query)
if complexity == "low":
# Use cheapest model for simple queries
response_text = self.gemini_client.models.generate_content(
model="gemini-2.5-flash",
contents=query
).text
cost = 0.0008 # Estimated cost for Gemini Flash
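# Note: a flat per-query estimate. For exact accounting, you could read
# token counts from the Gemini response's usage metadata (if your SDK
# version exposes it) and price them the way the Claude branch does.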
source = "gemini-2.5-flash"
else:
# Use Claude for complex queries, with concise prompts
system = "Provide concise, direct answers. Be brief but complete."
response = self.anthropic.messages.create(
model="claude-sonnet-4-5",
max_tokens=300 if complexity == "medium" else 512,
system=system,
messages=[{"role": "user", "content": query}]
)
response_text = response.content[0].text
# Calculate cost
input_cost = (response.usage.input_tokens / 1_000_000) * 3.00
output_cost = (response.usage.output_tokens / 1_000_000) * 15.00
cost = input_cost + output_cost
source = "claude-sonnet-4-5"
# Update spending
self.daily_spending += cost
# Cache the response
self.cache[cache_key] = response_text
return {
"response": response_text,
"source": source,
"cost": cost,
"daily_spending": self.daily_spending
}
def get_stats(self):
"""Get performance statistics."""
total_requests = self.cache_hits + self.cache_misses
hit_rate = (self.cache_hits / total_requests * 100) if total_requests > 0 else 0
return {
"total_requests": total_requests,
"cache_hit_rate": hit_rate,
"daily_spending": self.daily_spending,
"budget_remaining": self.daily_budget - self.daily_spending
}
## Test the production agent
agent = ProductionCostOptimizedAgent(daily_budget=1.0)
test_queries = [
"What is Python?",
"What is Python?", # Should hit cache
"Analyze the trade-offs between microservices and monolithic architectures",
"What is JavaScript?",
"What is Python?", # Should hit cache again
]
for query in test_queries:
result = agent.respond(query)
print(f"Q: {query}")
print(f"Source: {result['source']}")
print(f"Cost: ${result['cost']:.6f}")
if 'daily_spending' in result:
print(f"Daily spending: ${result['daily_spending']:.4f}")
print()
stats = agent.get_stats()
print("=== Agent Statistics ===")
print(f"Total requests: {stats['total_requests']}")
print(f"Cache hit rate: {stats['cache_hit_rate']:.1f}%")
print(f"Daily spending: ${stats['daily_spending']:.4f}")
print(f"Budget remaining: ${stats['budget_remaining']:.4f}")import os
print(f"Budget remaining: ${stats['budget_remaining']:.4f}")Q: What is Python? Source: gemini-2.5-flash Cost: $0.000800 Daily spending: $0.0008 Q: What is Python? Source: cache Cost: $0.000000
Q: Analyze the trade-offs between microservices and monolithic architectures Source: claude-sonnet-4-5 Cost: $0.007785 Daily spending: $0.0086
Q: What is JavaScript? Source: gemini-2.5-flash Cost: $0.000800 Daily spending: $0.0094 Q: What is Python? Source: cache Cost: $0.000000 === Agent Statistics === Total requests: 5 Cache hit rate: 40.0% Daily spending: $0.0094 Budget remaining: $0.9906
This agent combines multiple strategies:
- Caching for repeated queries (free responses)
- Model routing based on complexity (use cheaper models when possible)
- Concise prompts (reduce output tokens)
- Budget limits (prevent runaway costs)
The result is an agent that's both capable and economical.
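One caveat: the agent's in-memory dict cache never expires, so cached answers can go stale and memory grows without bound in a long-running process. Here's a minimal sketch of a TTL cache that could stand in for the plain dict; the TTLCache class, its 24-hour default, and the eviction policy are assumptions for illustration, not part of the agent above.

import time

class TTLCache:
    """Dict-like cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=24 * 3600, max_entries=10_000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._store = {}  # key -> (expiry_timestamp, value)

    def __contains__(self, key):
        entry = self._store.get(key)
        if entry is None:
            return False
        if entry[0] < time.time():
            del self._store[key]  # Expired: evict lazily on access
            return False
        return True

    def __getitem__(self, key):
        if key not in self:  # Triggers lazy eviction of expired entries
            raise KeyError(key)
        return self._store[key][1]

    def __setitem__(self, key, value):
        if len(self._store) >= self.max_entries:
            # At capacity: drop the entry closest to expiry
            oldest = min(self._store, key=lambda k: self._store[k][0])
            del self._store[oldest]
        self._store[key] = (time.time() + self.ttl, value)

Because respond() only uses the in operator and item access on the cache, swapping self.cache = {} for self.cache = TTLCache() in __init__ requires no other changes.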
API Call: A request made to a language model service. Each call typically incurs a cost based on the number of tokens processed.
Batching: Combining multiple similar requests into a single API call to reduce overhead and costs. More efficient than processing each request individually.
Budget Limit: A maximum spending threshold set to prevent unexpected or runaway costs. Can be daily, monthly, or per-user.
Cache Hit: When a requested response is found in the cache and can be served instantly without making an API call. Saves both time and money.
Cache Miss: When a requested response is not in the cache, requiring a new API call to generate it.
Context Window: The maximum amount of text (measured in tokens) that a model can process in a single request, including both input and output.
Input Tokens: The tokens in your prompt, including system messages, conversation history, and the user's query. Generally cheaper than output tokens.
Output Tokens: The tokens generated by the model in its response. Typically cost more than input tokens because generation requires more computation.
Token: The basic unit of text that language models process, roughly equivalent to a word or word piece. Both costs and context limits are measured in tokens.