Managing and Reducing AI Agent Costs: Complete Guide to Cost Optimization Strategies

Michael Brenndoerfer · August 26, 2025 · 22 min read

Learn how to dramatically reduce AI agent API costs without sacrificing capability. Covers model selection, caching, batching, prompt optimization, and budget controls with practical Python examples.

Managing and Reducing Costs

Your assistant works beautifully. It answers questions, uses tools, remembers context, and handles complex tasks. But there's a problem you might not have noticed yet: every interaction costs money.

Each time your agent calls Claude Sonnet 4.5, GPT-5, or Gemini 2.5, you're charged based on the number of tokens processed. Input tokens (your prompt) and output tokens (the response) both count. Run your agent at scale, and those costs add up fast. A single user might generate $0.50 in API costs per day. A thousand users? That's $500 daily, or $15,000 per month.

The good news is that you can dramatically reduce costs without sacrificing much capability. This chapter shows you how to build an agent that's both powerful and economical.

Understanding the Cost Structure

Before we optimize, let's understand what you're paying for. Most language model APIs charge per token, with different rates for input and output.

Here's a simplified example of typical pricing (November 2025):

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
| --- | --- | --- | --- |
| Claude Sonnet 4.5 | $3.00 | $15.00 | Complex reasoning, agents |
| GPT-5 | $2.50 | $10.00 | General-purpose tasks |
| Gemini 2.5 Flash | $0.40 | $1.20 | Simple queries, high volume |
| Gemini 2.5 Pro | $1.25 | $5.00 | Multimodal, large context |

Notice that output tokens cost more than input tokens. This makes sense because generating text requires more computation than processing it. It also means that verbose responses are expensive.

Let's calculate the cost of a typical interaction:

In[3]:
Code
def calculate_interaction_cost(input_tokens, output_tokens, model="claude-sonnet-4-5"):
    """Calculate the cost of a single model interaction."""
    # Pricing per million tokens (November 2025 rates)
    pricing = {
        "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
        "gpt-5": {"input": 2.50, "output": 10.00},
        "gemini-2.5-flash": {"input": 0.40, "output": 1.20},
        "gemini-2.5-pro": {"input": 1.25, "output": 5.00}
    }
    
    rates = pricing[model]
    
    # Calculate cost (rates are per million tokens)
    input_cost = (input_tokens / 1_000_000) * rates["input"]
    output_cost = (output_tokens / 1_000_000) * rates["output"]
    total_cost = input_cost + output_cost
    
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": total_cost
    }

## Example: A conversation with context
input_tokens = 1500  # System prompt + conversation history + query
output_tokens = 500  # Agent's response

cost = calculate_interaction_cost(input_tokens, output_tokens, "claude-sonnet-4-5")
print(f"Input cost: ${cost['input_cost']:.6f}")
print(f"Output cost: ${cost['output_cost']:.6f}")
print(f"Total cost: ${cost['total_cost']:.6f}")
print(f"\nCost per 1000 interactions: ${cost['total_cost'] * 1000:.2f}")
Out[3]:
Console
Input cost: $0.004500
Output cost: $0.007500
Total cost: $0.012000

Cost per 1000 interactions: $12.00


A single interaction costs about one cent. That seems small, but multiply it by thousands of users and millions of interactions, and you're looking at serious money.
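The multiplication is worth making explicit. A quick sketch of the projection from the opening example (the $0.50 per user per day figure is the assumption here):

Code
def monthly_cost(daily_cost_per_user, users, days=30):
    """Project monthly API spend from per-user daily cost."""
    return daily_cost_per_user * users * days

## $0.50 per user per day across 1,000 users
print(f"${monthly_cost(0.50, 1000):,.0f} per month")  # $15,000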

Tracking Costs in Your Agent

Before you can optimize, you need visibility into what you're spending. Let's add cost tracking to our assistant:

In[4]:
Code
import os
from anthropic import Anthropic
from datetime import datetime

class CostTrackingAgent:
    """Agent that tracks API costs for monitoring and optimization."""
    
    def __init__(self):
        self.client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        self.cost_log = []
        
        # Pricing per million tokens
        self.pricing = {
            "claude-sonnet-4-5": {"input": 3.00, "output": 15.00}
        }
    
    def _calculate_cost(self, usage, model):
        """Calculate cost from token usage."""
        rates = self.pricing[model]
        input_cost = (usage.input_tokens / 1_000_000) * rates["input"]
        output_cost = (usage.output_tokens / 1_000_000) * rates["output"]
        return input_cost + output_cost
    
    def respond(self, query):
        """Generate response and track costs."""
        model = "claude-sonnet-4-5"
        
        response = self.client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": query}]
        )
        
        # Calculate and log cost
        cost = self._calculate_cost(response.usage, model)
        
        self.cost_log.append({
            "timestamp": datetime.now(),
            "query": query[:50] + "..." if len(query) > 50 else query,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "cost": cost,
            "model": model
        })
        
        return response.content[0].text
    
    def get_cost_summary(self):
        """Get summary of costs."""
        if not self.cost_log:
            return "No interactions yet."
        
        total_cost = sum(entry["cost"] for entry in self.cost_log)
        total_tokens = sum(
            entry["input_tokens"] + entry["output_tokens"] 
            for entry in self.cost_log
        )
        
        return {
            "total_interactions": len(self.cost_log),
            "total_cost": total_cost,
            "total_tokens": total_tokens,
            "average_cost_per_interaction": total_cost / len(self.cost_log),
            "most_expensive": max(self.cost_log, key=lambda x: x["cost"])
        }

## Test the tracking
agent = CostTrackingAgent()

agent.respond("What is Python?")
agent.respond("Explain machine learning in simple terms.")
agent.respond("How do neural networks work?")

summary = agent.get_cost_summary()
print(f"Total interactions: {summary['total_interactions']}")
print(f"Total cost: ${summary['total_cost']:.4f}")
print(f"Average cost: ${summary['average_cost_per_interaction']:.4f}")
print(f"\nMost expensive query: {summary['most_expensive']['query']}")
print(f"Cost: ${summary['most_expensive']['cost']:.4f}")
Out[4]:
Console
Total interactions: 3
Total cost: $0.0136
Average cost: $0.0045

Most expensive query: How do neural networks work?
Cost: $0.0050

This gives you visibility into where your money goes. You might discover that certain queries are far more expensive than others, or that a small percentage of interactions account for most of your costs.
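Once you have a cost log, a few lines of analysis show where the money goes. A minimal sketch, operating on the cost_log structure defined above, that measures how concentrated your spending is:

Code
def analyze_cost_concentration(cost_log, top_fraction=0.1):
    """What share of total spend do the most expensive queries account for?"""
    if not cost_log:
        return None

    # Sort interactions from most to least expensive
    by_cost = sorted(cost_log, key=lambda e: e["cost"], reverse=True)
    total = sum(e["cost"] for e in by_cost)

    # Take the top fraction of interactions by cost
    n_top = max(1, int(len(by_cost) * top_fraction))
    top_spend = sum(e["cost"] for e in by_cost[:n_top])

    return {
        "top_queries": [e["query"] for e in by_cost[:n_top]],
        "share_of_total": top_spend / total if total else 0.0,
    }

## If the top 10% of queries account for half your spend, optimizing
## those few query patterns is the highest-leverage change available.
## print(analyze_cost_concentration(agent.cost_log))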

Strategy 1: Use the Cheapest Model That Works

The most effective cost reduction strategy is simple: use cheaper models when possible. Not every task needs your most powerful model.

Think of it like choosing transportation. You wouldn't hire a helicopter to go to the grocery store. A car works fine. Similarly, you don't need Claude Sonnet 4.5 for every query.

Example: Cost-Aware Model Selection (Multi-Provider)

In[5]:
Code
import os
from anthropic import Anthropic
from openai import OpenAI
from google import genai

class CostOptimizedAgent:
    """Agent that chooses the most cost-effective model for each task."""
    
    def __init__(self):
        # Initialize clients for different providers
        self.anthropic = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        self.openai = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.gemini = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
    
    def _classify_task_complexity(self, query):
        """Determine what level of model capability is needed."""
        # High complexity: needs reasoning, tool use, or complex understanding
        high_complexity_indicators = [
            "explain why", "analyze", "compare and contrast",
            "step by step", "reasoning", "pros and cons",
            "evaluate", "critique"
        ]
        
        # Medium complexity: straightforward questions or tasks
        medium_complexity_indicators = [
            "how to", "what is", "describe", "summarize"
        ]
        
        query_lower = query.lower()
        
        if any(ind in query_lower for ind in high_complexity_indicators):
            return "high"
        elif any(ind in query_lower for ind in medium_complexity_indicators):
            return "medium"
        else:
            return "low"
    
    def respond(self, query):
        """Route to the most cost-effective model."""
        complexity = self._classify_task_complexity(query)
        
        if complexity == "high":
            # Use Claude Sonnet 4.5 for complex reasoning
            # Cost: ~$0.012 per interaction
            response = self.anthropic.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=1024,
                messages=[{"role": "user", "content": query}]
            )
            return response.content[0].text, "claude-sonnet-4-5", "high"
        
        elif complexity == "medium":
            # Use GPT-5 for general tasks
            # Cost: ~$0.008 per interaction (33% savings)
            response = self.openai.chat.completions.create(
                model="gpt-5",
                max_completion_tokens=512,
                messages=[{"role": "user", "content": query}]
            )
            return response.choices[0].message.content, "gpt-5", "medium"
        
        else:
            # Use Gemini 2.5 Flash for simple queries
            # Cost: ~$0.001 per interaction (92% savings!)
            response = self.gemini.models.generate_content(
                model="gemini-2.5-flash",
                contents=query
            )
            return response.text, "gemini-2.5-flash", "low"

## Test with different complexity levels
agent = CostOptimizedAgent()

queries = [
    ("What's the capital of France?", "low"),
    ("What is machine learning?", "medium"),
    ("Analyze the pros and cons of different database architectures", "high")
]

for query, expected in queries:
    result, model, complexity = agent.respond(query)
    print(f"Query: {query}")
    print(f"Complexity: {complexity} (expected: {expected})")
    print(f"Model: {model}")
    print(f"Response: {result[:100]}...")
    print()
Out[5]:
Console
Query: What's the capital of France?
Complexity: low (expected: low)
Model: gemini-2.5-flash
Response: The capital of France is **Paris**....

Query: What is machine learning?
Complexity: medium (expected: medium)
Model: gpt-5
Response: Machine learning is a branch of artificial intelligence where computers learn patterns from data to ...

Query: Analyze the pros and cons of different database architectures
Complexity: high (expected: high)
Model: claude-sonnet-4-5
Response: # Database Architecture Analysis

## 1. **Relational Databases (RDBMS)**

### Pros
- **ACID Complian...

By routing simple queries to Gemini 2.5 Flash, you can save 90% or more on those interactions. If 50% of your queries are simple, you've just cut your total costs by 45%.
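The arithmetic behind that claim, as a sketch (the per-interaction costs are the rough estimates from the comments in the routing example above):

Code
def blended_cost(mix, per_interaction_cost):
    """Weighted average cost per interaction for a given routing mix."""
    return sum(share * per_interaction_cost[tier] for tier, share in mix.items())

## Rough per-interaction costs from the routing example
costs = {"low": 0.001, "medium": 0.008, "high": 0.012}

all_premium = blended_cost({"high": 1.0}, costs)
routed = blended_cost({"low": 0.5, "high": 0.5}, costs)  # half the queries are simple

print(f"All premium: ${all_premium:.4f} per interaction")
print(f"With routing: ${routed:.4f} per interaction")
print(f"Savings: {(1 - routed / all_premium) * 100:.0f}%")  # ~46% with these estimates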

Strategy 2: Reduce Output Length

Remember that output tokens cost more than input tokens. A response with 1000 tokens costs twice as much as one with 500 tokens. If your agent is verbose, you're wasting money.

Example: Concise Responses (Claude Sonnet 4.5)

In[6]:
Code
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def compare_response_costs(query):
    """Compare costs of verbose vs concise responses."""
    
    # Verbose response (default behavior)
    verbose_response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    )
    
    # Concise response (optimized)
    concise_system = """You are a helpful assistant. Provide concise, direct answers.
    Use 1-2 sentences for simple questions. Avoid unnecessary elaboration."""
    
    concise_response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=200,  # Hard limit
        system=concise_system,
        messages=[{"role": "user", "content": query}]
    )
    
    # Calculate costs
    def calc_cost(usage):
        input_cost = (usage.input_tokens / 1_000_000) * 3.00
        output_cost = (usage.output_tokens / 1_000_000) * 15.00
        return input_cost + output_cost
    
    verbose_cost = calc_cost(verbose_response.usage)
    concise_cost = calc_cost(concise_response.usage)
    savings = ((verbose_cost - concise_cost) / verbose_cost) * 100
    
    return {
        "verbose": {
            "response": verbose_response.content[0].text,
            "tokens": verbose_response.usage.output_tokens,
            "cost": verbose_cost
        },
        "concise": {
            "response": concise_response.content[0].text,
            "tokens": concise_response.usage.output_tokens,
            "cost": concise_cost
        },
        "savings_percent": savings
    }

## Test with a simple query
result = compare_response_costs("What is Python?")

print("Verbose response:")
print(f"Tokens: {result['verbose']['tokens']}")
print(f"Cost: ${result['verbose']['cost']:.6f}")
print(f"Response: {result['verbose']['response'][:150]}...")
print()

print("Concise response:")
print(f"Tokens: {result['concise']['tokens']}")
print(f"Cost: ${result['concise']['cost']:.6f}")
print(f"Response: {result['concise']['response']}")
print()

print(f"Cost savings: {result['savings_percent']:.1f}%")
Out[6]:
Console
Verbose response:
Tokens: 287
Cost: $0.004338
Response: Python is a high-level, general-purpose programming language created by Guido van Rossum and first released in 1991. Here are its key characteristics:...

Concise response:
Tokens: 44
Cost: $0.000792
Response: Python is a high-level, interpreted programming language known for its clear syntax and readability. It's widely used for web development, data science, automation, artificial intelligence, and general-purpose programming.

Cost savings: 81.7%

Here the concise version cut the total cost by more than 80%; savings of 60-70% on output tokens are typical for simple queries. Across thousands of interactions, that adds up quickly.

Strategy 3: Cache Aggressively

If users ask the same questions repeatedly, why pay to generate the answer every time? Cache responses and serve them instantly for free.

Example: Multi-Level Caching (Gemini 2.5 Flash)

In[7]:
Code
import os
import hashlib

from google import genai

gemini_client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

class CachingAgent:
    """Agent with intelligent caching to minimize API calls."""
    
    def __init__(self):
        self.client = gemini_client
        
        # Short-term cache: exact query matches
        self.exact_cache = {}
        
        # Long-term cache: common queries that rarely change
        self.persistent_cache = {
            "what is python": "Python is a high-level programming language...",
            "what is machine learning": "Machine learning is a subset of AI...",
            # Preload common queries
        }
        
        # Cache metadata
        self.cache_stats = {
            "hits": 0,
            "misses": 0,
            "api_calls": 0,
            "cost_saved": 0.0
        }
    
    def _hash_query(self, query):
        """Create cache key from query."""
        return hashlib.md5(query.lower().strip().encode()).hexdigest()
    
    def _estimate_cost_saved(self, query):
        """Estimate cost saved by cache hit."""
        # Rough estimate: 100 input tokens + 200 output tokens
        input_cost = (100 / 1_000_000) * 0.40
        output_cost = (200 / 1_000_000) * 1.20
        return input_cost + output_cost
    
    def respond(self, query):
        """Get response, using cache when possible."""
        cache_key = self._hash_query(query)
        query_normalized = query.lower().strip()
        
        # Check persistent cache first (common queries)
        if query_normalized in self.persistent_cache:
            self.cache_stats["hits"] += 1
            self.cache_stats["cost_saved"] += self._estimate_cost_saved(query)
            return self.persistent_cache[query_normalized], "persistent_cache"
        
        # Check exact cache (recent queries)
        if cache_key in self.exact_cache:
            self.cache_stats["hits"] += 1
            self.cache_stats["cost_saved"] += self._estimate_cost_saved(query)
            return self.exact_cache[cache_key], "exact_cache"
        
        # Cache miss: call the model
        self.cache_stats["misses"] += 1
        self.cache_stats["api_calls"] += 1
        
        response = self.client.models.generate_content(
            model="gemini-2.5-flash",
            contents=query
        )
        result = response.text
        
        # Store in exact cache
        self.exact_cache[cache_key] = result
        
        # If cache is getting large, prune old entries
        if len(self.exact_cache) > 1000:
            # Keep only the most recent 500
            keys_to_remove = list(self.exact_cache.keys())[:-500]
            for key in keys_to_remove:
                del self.exact_cache[key]
        
        return result, "api_call"
    
    def get_cache_stats(self):
        """Get caching performance metrics."""
        total_requests = self.cache_stats["hits"] + self.cache_stats["misses"]
        hit_rate = (self.cache_stats["hits"] / total_requests * 100) if total_requests > 0 else 0
        
        return {
            "total_requests": total_requests,
            "cache_hits": self.cache_stats["hits"],
            "cache_misses": self.cache_stats["misses"],
            "hit_rate": hit_rate,
            "api_calls": self.cache_stats["api_calls"],
            "estimated_cost_saved": self.cache_stats["cost_saved"]
        }

## Test the caching agent
agent = CachingAgent()

## Simulate user queries (with some repetition)
queries = [
    "What is Python?",
    "What is machine learning?",
    "How do I learn programming?",
    "What is Python?",  # Duplicate
    "What is machine learning?",  # Duplicate
    "How do I learn programming?",  # Duplicate
    "What are data structures?",
    "What is Python?",  # Duplicate again
]

for query in queries:
    result, source = agent.respond(query)
    print(f"Q: {query}")
    print(f"Source: {source}")
    print()

## Show cache performance
stats = agent.get_cache_stats()
print("Cache Performance:")
print(f"Total requests: {stats['total_requests']}")
print(f"Cache hits: {stats['cache_hits']} ({stats['hit_rate']:.1f}%)")
print(f"API calls: {stats['api_calls']}")
print(f"Estimated cost saved: ${stats['estimated_cost_saved']:.4f}")
Out[7]:
Console
Q: What is Python?
Source: api_call

Q: What is machine learning?
Source: api_call

Q: How do I learn programming?
Source: api_call

Q: What is Python?
Source: exact_cache

Q: What is machine learning?
Source: exact_cache

Q: How do I learn programming?
Source: exact_cache

Q: What are data structures?
Source: api_call

Q: What is Python?
Source: exact_cache

Cache Performance:
Total requests: 8
Cache hits: 4 (50.0%)
API calls: 4
Estimated cost saved: $0.0011

With a 50% cache hit rate, you've cut your API costs in half. For high-traffic applications, caching is one of the most effective cost reduction strategies.
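One caveat the simple version glosses over: cached answers can go stale. Setting a time-to-live (TTL) bounds how long a cached response can be served. A minimal sketch of a TTL wrapper you could swap in for exact_cache (the one-day TTL is an arbitrary assumption to tune per application):

Code
import time

class TTLCache:
    """Dict-like cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=24 * 3600):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, stored_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.time() - stored_at > self.ttl:
            # Expired: evict and treat as a miss
            del self.store[key]
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, time.time())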

Strategy 4: Batch Similar Requests

If you need to process multiple similar queries, batch them into a single API call. This reduces overhead and can be more cost-effective.

Example: Batch Processing (GPT-5)

In[8]:
Code
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def process_individually(queries):
    """Process each query separately (expensive)."""
    results = []
    total_cost = 0.0
    
    for query in queries:
        response = client.chat.completions.create(
            model="gpt-5",
            max_completion_tokens=100,
            messages=[{"role": "user", "content": query}]
        )
        
        # Estimate cost (rough approximation)
        tokens = response.usage.total_tokens
        cost = (tokens / 1_000_000) * 6.25  # Average of input/output rates
        total_cost += cost
        
        results.append(response.choices[0].message.content)
    
    return results, total_cost

def process_batched(queries):
    """Process all queries in a single call (cheaper)."""
    # Combine queries into a single prompt
    batch_prompt = "Answer each of the following questions concisely:\n\n"
    for i, query in enumerate(queries, 1):
        batch_prompt += f"{i}. {query}\n"
    
    response = client.chat.completions.create(
        model="gpt-5",
        max_completion_tokens=500,
        messages=[{"role": "user", "content": batch_prompt}]
    )
    
    # Estimate cost
    tokens = response.usage.total_tokens
    cost = (tokens / 1_000_000) * 6.25
    
    # Parse the batched response
    result = response.choices[0].message.content
    
    return result, cost

## Test both approaches
queries = [
    "What is Python?",
    "What is JavaScript?",
    "What is Ruby?",
    "What is Go?",
    "What is Rust?"
]

print("Individual processing:")
results_individual, cost_individual = process_individually(queries)
print(f"Cost: ${cost_individual:.4f}")
print()

print("Batched processing:")
result_batched, cost_batched = process_batched(queries)
print(f"Cost: ${cost_batched:.4f}")
print(f"Savings: ${cost_individual - cost_batched:.4f} ({((cost_individual - cost_batched) / cost_individual * 100):.1f}%)")
print()
print("Batched response:")
print(result_batched)
Out[8]:
Console
Individual processing:
Cost: $0.0034

Batched processing:
Cost: $0.0034
Savings: $0.0000 (0.7%)

Batched response:

In this run the savings were negligible: each individual call was already short, so there was little overhead to eliminate. Batching pays off when every call would otherwise repeat a long system prompt or shared context; in those cases, savings of 30-50% are realistic. For large non-urgent workloads, also consider the providers' dedicated batch APIs, which process requests asynchronously at discounted rates.
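Another practical detail: the batched call returns one blob of text, so you need to split it back into per-question answers. A minimal sketch that parses the numbered format the prompt requested (it assumes the model actually followed the 1., 2., ... numbering, so validate the count in production):

Code
import re

def split_batched_response(text, n_expected):
    """Split a numbered-list response into individual answers."""
    # Split on line-leading "1. ", "2. ", and so on
    parts = re.split(r"^\s*\d+\.\s*", text, flags=re.MULTILINE)
    answers = [p.strip() for p in parts if p.strip()]

    # If the model deviated from the format, signal it rather than guess
    if len(answers) != n_expected:
        return None
    return answers

## answers = split_batched_response(result_batched, len(queries))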

Strategy 5: Trim Conversation History

Long conversation histories increase input token costs. If your agent includes the last 20 messages in every request, you're paying to process all that context repeatedly.

Example: Smart History Trimming (Claude Sonnet 4.5)

In[9]:
Code
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

class HistoryOptimizedAgent:
    """Agent that manages conversation history efficiently."""
    
    def __init__(self):
        self.conversation_history = []
        self.max_history_messages = 6  # Keep last 3 exchanges
    
    def _trim_history(self):
        """Keep only recent messages to reduce token costs."""
        if len(self.conversation_history) > self.max_history_messages:
            # Keep only the most recent messages
            self.conversation_history = self.conversation_history[-self.max_history_messages:]
    
    def _estimate_tokens(self, text):
        """Rough token estimate (4 chars ≈ 1 token)."""
        return len(text) // 4
    
    def respond(self, user_message):
        """Generate response with optimized history."""
        # Add user message to history
        self.conversation_history.append({
            "role": "user",
            "content": user_message
        })
        
        # Trim history before sending
        self._trim_history()
        
        # Calculate token usage
        total_input_chars = sum(
            len(msg["content"]) for msg in self.conversation_history
        )
        estimated_input_tokens = total_input_chars // 4
        
        # Make the call
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=512,
            messages=self.conversation_history
        )
        
        # Add assistant response to history
        assistant_message = response.content[0].text
        self.conversation_history.append({
            "role": "assistant",
            "content": assistant_message
        })
        
        # Calculate cost
        input_cost = (response.usage.input_tokens / 1_000_000) * 3.00
        output_cost = (response.usage.output_tokens / 1_000_000) * 15.00
        total_cost = input_cost + output_cost
        
        return {
            "response": assistant_message,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "cost": total_cost,
            "history_length": len(self.conversation_history)
        }

## Test with a multi-turn conversation
agent = HistoryOptimizedAgent()

queries = [
    "What is Python?",
    "What are its main features?",
    "How does it compare to Java?",
    "What about performance?",
    "Should I learn it?",
    "What resources do you recommend?",
    "How long will it take?",
    "What projects should I build?"
]

total_cost = 0.0

for query in queries:
    result = agent.respond(query)
    total_cost += result["cost"]
    
    print(f"Q: {query}")
    print(f"Input tokens: {result['input_tokens']}")
    print(f"History length: {result['history_length']} messages")
    print(f"Cost: ${result['cost']:.6f}")
    print()

print(f"Total conversation cost: ${total_cost:.4f}")
print("\nNote: Without trimming, costs would be ~40% higher")
Out[9]:
Console
Q: What is Python?
Input tokens: 11
History length: 2 messages
Cost: $0.004278

Q: What are its main features?
Input tokens: 303
History length: 4 messages
Cost: $0.006324

Q: How does it compare to Java?
Input tokens: 674
History length: 6 messages
Cost: $0.009702

Q: What about performance?
Input tokens: 1200
History length: 7 messages
Cost: $0.011280

Q: Should I learn it?
Input tokens: 1431
History length: 7 messages
Cost: $0.011973

Q: What resources do you recommend?
Input tokens: 1584
History length: 7 messages
Cost: $0.012432

Q: How long will it take?
Input tokens: 1587
History length: 7 messages
Cost: $0.012441

Q: What projects should I build?
Input tokens: 1588
History length: 7 messages
Cost: $0.012444

Total conversation cost: $0.0809

Note: Without trimming, costs would be ~40% higher

By keeping only the last 6 messages (3 exchanges), you prevent the input token count from growing unbounded. This is especially important for long conversations.
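Dropping old messages is the bluntest option. A gentler variant replaces the trimmed messages with a one-message summary so early context isn't lost entirely. A hypothetical drop-in replacement for _trim_history, sketched below (it assumes an even max_history_messages so the retained window starts with a user message, and the summarization call itself costs tokens, so it only pays off in long conversations):

Code
def _trim_history_with_summary(self):
    """Summarize trimmed messages instead of discarding them."""
    if len(self.conversation_history) <= self.max_history_messages:
        return

    old = self.conversation_history[:-self.max_history_messages]
    recent = self.conversation_history[-self.max_history_messages:]

    # One short, cheap call to compress everything being trimmed
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in 3 sentences:\n\n{transcript}"
        }]
    ).content[0].text

    # Fold the summary into the oldest retained user message so the
    # user/assistant alternation stays valid
    first = recent[0]
    recent[0] = {
        "role": first["role"],
        "content": f"(Earlier conversation summary: {summary})\n\n{first['content']}"
    }
    self.conversation_history = recent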

Strategy 6: Use Prompt Compression

For agents that need to include large amounts of context (like retrieved documents or long system prompts), consider compressing that information.

Example: Context Summarization (Claude Sonnet 4.5)

In[10]:
Code
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def summarize_context(long_context, max_length=500):
    """Compress long context into a shorter summary."""
    if len(long_context) <= max_length:
        return long_context
    
    # Use the model to create a concise summary
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Summarize this in {max_length//4} words or less:\n\n{long_context}"
        }]
    )
    
    return response.content[0].text

def respond_with_context(query, long_context):
    """Answer query using compressed context."""
    # Compress the context first
    compressed_context = summarize_context(long_context, max_length=500)
    
    # Use compressed context in the actual query
    full_prompt = f"Context: {compressed_context}\n\nQuestion: {query}"
    
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": full_prompt}]
    )
    
    return response.content[0].text

## Example: Long document that needs to be included
long_document = """
[Imagine a 5000-word document about machine learning here...]
Machine learning is a field of artificial intelligence that focuses on...
[... many more paragraphs ...]
"""

## Without compression: ~5000 tokens input
## With compression: ~500 tokens input
## Savings: ~90% on input tokens for this context

query = "What are the key concepts in this document?"
answer = respond_with_context(query, long_document)
print(answer)
Out[10]:
Console
I don't actually have access to a real 5000-word document about machine learning - you've only provided a placeholder indicating where such a document would be.

From what you've shown me, I can only see:
- A fragment mentioning "Machine learning is a field of artificial intelligence that focuses on..."
- Placeholders indicating there would be more content

To provide you with the key concepts from a document, I would need the actual full text. If you'd like me to analyze a document about machine learning, please paste the complete content, and I'll be happy to:

1. Identify the main concepts covered
2. Summarize key themes
3. Highlight important terminology and ideas
4. Note any significant examples or applications mentioned

Would you like to share the actual document text?

You pay for the summarization call, but if you use that compressed context multiple times, you save money overall. This is especially valuable for retrieval-augmented generation (RAG) systems where you're including retrieved documents in every query.
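Summarization is lossy. When you need the full document verbatim across many calls, provider-side prompt caching is another lever: Anthropic's API lets you mark a large, stable prompt prefix with a cache_control block so repeat calls read it from cache at a discounted input rate. A minimal sketch (check the current Anthropic documentation for exact pricing, minimum cacheable sizes, and cache lifetime):

Code
def respond_with_cached_context(query, long_context):
    """Reuse a large context across calls via Anthropic prompt caching."""
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": f"Answer questions using this document:\n\n{long_context}",
                # Marks this prefix as cacheable; subsequent calls with the
                # same prefix are billed at a reduced cached-input rate
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": query}],
    )
    return response.content[0].text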

Strategy 7: Set Budget Limits

Prevent runaway costs by implementing budget controls in your agent.

Example: Budget-Aware Agent (Multi-Provider)

In[11]:
Code
import os
from anthropic import Anthropic
from datetime import datetime, timedelta

class BudgetControlledAgent:
    """Agent with built-in budget limits and alerts."""
    
    def __init__(self, daily_budget=10.0):
        self.client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        self.daily_budget = daily_budget
        self.current_day = datetime.now().date()
        self.daily_spending = 0.0
        self.total_spending = 0.0
    
    def _reset_daily_budget_if_needed(self):
        """Reset daily spending counter at midnight."""
        today = datetime.now().date()
        if today != self.current_day:
            self.current_day = today
            self.daily_spending = 0.0
    
    def _calculate_cost(self, usage):
        """Calculate cost from token usage."""
        input_cost = (usage.input_tokens / 1_000_000) * 3.00
        output_cost = (usage.output_tokens / 1_000_000) * 15.00
        return input_cost + output_cost
    
    def respond(self, query):
        """Generate response if within budget."""
        self._reset_daily_budget_if_needed()
        
        # Check if we're over budget
        if self.daily_spending >= self.daily_budget:
            return {
                "response": None,
                "error": f"Daily budget of ${self.daily_budget:.2f} exceeded. Current spending: ${self.daily_spending:.2f}",
                "budget_remaining": 0.0
            }
        
        # Make the call
        response = self.client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=512,
            messages=[{"role": "user", "content": query}]
        )
        
        # Track spending
        cost = self._calculate_cost(response.usage)
        self.daily_spending += cost
        self.total_spending += cost
        
        # Check if approaching budget limit
        budget_remaining = self.daily_budget - self.daily_spending
        warning = None
        
        if budget_remaining < self.daily_budget * 0.2:  # Less than 20% remaining
            warning = f"Warning: Only ${budget_remaining:.2f} remaining in daily budget"
        
        return {
            "response": response.content[0].text,
            "cost": cost,
            "daily_spending": self.daily_spending,
            "budget_remaining": budget_remaining,
            "warning": warning
        }
    
    def get_spending_summary(self):
        """Get spending statistics."""
        return {
            "daily_spending": self.daily_spending,
            "daily_budget": self.daily_budget,
            "budget_used_percent": (self.daily_spending / self.daily_budget) * 100,
            "total_spending": self.total_spending
        }

## Test budget controls
agent = BudgetControlledAgent(daily_budget=0.10)  # $0.10 daily limit

queries = [
    "What is Python?",
    "Explain machine learning.",
    "How do neural networks work?",
    "What is deep learning?",
    "Describe reinforcement learning.",
    "What are transformers?",
    "Explain attention mechanisms.",
    "What is GPT?",
    "How does BERT work?",
    "What is transfer learning?"
]

for i, query in enumerate(queries, 1):
    print(f"\n--- Query {i} ---")
    result = agent.respond(query)
    
    if result["response"]:
        print(f"Response: {result['response'][:100]}...")
        print(f"Cost: ${result['cost']:.6f}")
        print(f"Daily spending: ${result['daily_spending']:.4f}")
        
        if result["warning"]:
            print(f"⚠️  {result['warning']}")
    else:
        print(f"❌ {result['error']}")
        break

summary = agent.get_spending_summary()
print(f"\n=== Spending Summary ===")
print(f"Daily budget: ${summary['daily_budget']:.2f}")
print(f"Daily spending: ${summary['daily_spending']:.4f}")
print(f"Budget used: {summary['budget_used_percent']:.1f}%")
Out[11]:
Console

--- Query 1 ---
Response: Python is a high-level, interpreted programming language created by Guido van Rossum and first relea...
Cost: $0.004398
Daily spending: $0.0044

--- Query 2 ---
Response: # Machine Learning Explained

**Machine learning** is a branch of artificial intelligence where comp...
Cost: $0.004986
Daily spending: $0.0094

--- Query 3 ---
Response: # How Neural Networks Work

Neural networks are computing systems inspired by biological brains. Her...
Cost: $0.005199
Daily spending: $0.0146

--- Query 4 ---
Response: Deep learning is a subset of machine learning that uses artificial neural networks with multiple lay...
Cost: $0.004131
Daily spending: $0.0187

--- Query 5 ---
Response: # Reinforcement Learning

**Reinforcement learning (RL)** is a type of machine learning where an age...
Cost: $0.005124
Daily spending: $0.0238

--- Query 6 ---
Response: # Transformers

Transformers are a type of **deep learning architecture** introduced in 2017 that ha...
Cost: $0.004671
Daily spending: $0.0285

--- Query 7 ---
Response: # Attention Mechanisms

Attention mechanisms allow neural networks to **focus on specific parts of t...
Cost: $0.006171
Daily spending: $0.0347

--- Query 8 ---
Response: GPT stands for **Generative Pre-trained Transformer**. It's a type of AI language model developed by...
Cost: $0.003576
Daily spending: $0.0383

--- Query 9 ---
Response: # How BERT Works

BERT (Bidirectional Encoder Representations from Transformers) is a language model...
Cost: $0.005709
Daily spending: $0.0440

--- Query 10 ---
Response: # Transfer Learning

Transfer learning is a machine learning technique where a model developed for o...
Cost: $0.004416
Daily spending: $0.0484

=== Spending Summary ===
Daily budget: $0.10
Daily spending: $0.0484
Budget used: 48.4%

Budget controls prevent unexpected bills and force you to think about cost optimization. If you hit your budget limit regularly, it's a signal that you need to optimize your agent's efficiency.
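The same pattern extends to per-user limits, which matter in multi-tenant applications where one heavy user can drain a shared budget. A minimal in-memory sketch (a production system would persist this in a database; the $1.00 default limit is an arbitrary assumption):

Code
from collections import defaultdict
from datetime import datetime

class PerUserBudget:
    """Track daily spending per user against individual limits."""

    def __init__(self, daily_limit=1.0):
        self.daily_limit = daily_limit
        self.spending = defaultdict(float)  # (user_id, date) -> dollars

    def _key(self, user_id):
        return (user_id, datetime.now().date())

    def can_spend(self, user_id):
        return self.spending[self._key(user_id)] < self.daily_limit

    def record(self, user_id, cost):
        self.spending[self._key(user_id)] += cost

## Inside respond(): check can_spend(user_id) before calling the model,
## then record(user_id, cost) after.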

Measuring Cost Optimization Impact

As you apply these strategies, track the results. Here's a comprehensive cost analysis tool:

In[12]:
Code
import os
from anthropic import Anthropic
from datetime import datetime
import statistics

class CostAnalyzer:
    """Analyze and compare costs across different optimization strategies."""
    
    def __init__(self):
        self.client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        self.baseline_costs = []
        self.optimized_costs = []
    
    def _calculate_cost(self, usage):
        """Calculate cost from token usage."""
        input_cost = (usage.input_tokens / 1_000_000) * 3.00
        output_cost = (usage.output_tokens / 1_000_000) * 15.00
        return input_cost + output_cost
    
    def run_baseline(self, queries):
        """Run queries without optimization."""
        print("Running baseline (no optimization)...")
        
        for query in queries:
            response = self.client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=1024,  # No limits
                messages=[{"role": "user", "content": query}]
            )
            
            cost = self._calculate_cost(response.usage)
            self.baseline_costs.append(cost)
    
    def run_optimized(self, queries):
        """Run queries with optimization."""
        print("Running optimized version...")
        
        system_prompt = """You are a helpful assistant. Provide concise, direct answers.
        Use 1-2 sentences for simple questions."""
        
        for query in queries:
            response = self.client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=200,  # Limited
                system=system_prompt,
                messages=[{"role": "user", "content": query}]
            )
            
            cost = self._calculate_cost(response.usage)
            self.optimized_costs.append(cost)
    
    def generate_report(self):
        """Generate cost comparison report."""
        baseline_total = sum(self.baseline_costs)
        optimized_total = sum(self.optimized_costs)
        savings = baseline_total - optimized_total
        savings_percent = (savings / baseline_total) * 100
        
        baseline_avg = statistics.mean(self.baseline_costs)
        optimized_avg = statistics.mean(self.optimized_costs)
        
        report = f"""
=== Cost Optimization Report ===

Baseline (No Optimization):
  Total cost: ${baseline_total:.4f}
  Average per query: ${baseline_avg:.6f}
  Number of queries: {len(self.baseline_costs)}

Optimized:
  Total cost: ${optimized_total:.4f}
  Average per query: ${optimized_avg:.6f}
  Number of queries: {len(self.optimized_costs)}

Savings:
  Total saved: ${savings:.4f}
  Percentage saved: {savings_percent:.1f}%
  
Projected Monthly Savings (at 10,000 queries/month):
  ${savings * (10000 / len(self.baseline_costs)):.2f}
"""
        return report

## Run the analysis
analyzer = CostAnalyzer()

test_queries = [
    "What is Python?",
    "What is JavaScript?",
    "What is machine learning?",
    "What are neural networks?",
    "What is deep learning?",
    "What is natural language processing?",
    "What are transformers?",
    "What is computer vision?",
    "What is reinforcement learning?",
    "What is data science?"
]

analyzer.run_baseline(test_queries)
analyzer.run_optimized(test_queries)

print(analyzer.generate_report())
Out[12]:
Console
Running baseline (no optimization)...
Running optimized version...

=== Cost Optimization Report ===

Baseline (No Optimization):
  Total cost: $0.0424
  Average per query: $0.004239
  Number of queries: 10

Optimized:
  Total cost: $0.0092
  Average per query: $0.000919
  Number of queries: 10

Savings:
  Total saved: $0.0332
  Percentage saved: 78.3%

Projected Monthly Savings (at 10,000 queries/month):
  $33.20

This gives you concrete numbers showing the impact of your optimizations. Here, two simple changes (a concise system prompt and a lower max_tokens) cut costs by nearly 80%; even lighter-touch changes often save 40-60%.

Balancing Cost and Quality

Here's the key insight: cost optimization is about trade-offs. You can always make your agent cheaper by using worse models or shorter responses, but that might hurt quality.

The goal isn't to minimize cost at all costs. It's to maximize value: the best quality you can get for the money you're willing to spend.

Some guidelines:

  1. Use the best model for critical tasks. If accuracy matters more than cost (medical advice, financial decisions, legal questions), don't skimp on model quality.

  2. Optimize aggressively for high-volume, low-stakes queries. If you're answering "What's the weather?" thousands of times per day, use the cheapest model that works.

  3. Monitor quality metrics alongside cost metrics. Track both how much you're spending and how well your agent performs. If cost optimizations hurt user satisfaction, they're not worth it. (See the sketch after this list for one way to combine the two.)

  4. Test before deploying. When you change models or prompts to save money, verify that quality doesn't suffer. Run your evaluation suite (from Chapter 11) to catch regressions.

  5. Be willing to spend more when it matters. If a user's query is complex or important, it's okay to use your most capable (and expensive) model. The cost of a bad answer is often higher than the cost of the API call.
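To make guideline 3 concrete, here's one way to blend the two signals into a single number: dollars spent per satisfactory answer. This is a hypothetical sketch; quality_flags would come from whatever feedback you collect (thumbs up/down, or your evaluation suite):

Code
def cost_per_good_answer(cost_log, quality_flags):
    """Dollars spent per answer that met your quality bar."""
    total_cost = sum(entry["cost"] for entry in cost_log)
    good_answers = sum(1 for ok in quality_flags if ok)
    return total_cost / good_answers if good_answers else float("inf")

## A cheaper model that halves quality can raise this number even
## though raw spend went down, which is exactly the trade-off to watch.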

Putting It All Together

Let's build a production-ready agent that implements multiple cost optimization strategies:

In[13]:
Code
import os
from anthropic import Anthropic
from google import genai
import hashlib
from datetime import datetime

class ProductionCostOptimizedAgent:
    """Production agent with comprehensive cost optimization."""
    
    def __init__(self, daily_budget=50.0):
        # Initialize clients
        self.anthropic = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        self.gemini_client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
        
        # Budget tracking
        self.daily_budget = daily_budget
        self.daily_spending = 0.0
        self.current_day = datetime.now().date()
        
        # Caching
        self.cache = {}
        self.cache_hits = 0
        self.cache_misses = 0
        
        # Conversation history (trimmed)
        self.history = []
        self.max_history = 6
    
    def _reset_if_new_day(self):
        """Reset daily counters."""
        today = datetime.now().date()
        if today != self.current_day:
            self.current_day = today
            self.daily_spending = 0.0
    
    def _hash_query(self, query):
        """Create cache key."""
        return hashlib.md5(query.lower().strip().encode()).hexdigest()
    
    def _classify_complexity(self, query):
        """Determine query complexity."""
        high_complexity = ["explain why", "analyze", "compare", "evaluate"]
        simple_patterns = ["what is", "who is", "when is"]
        
        query_lower = query.lower()
        
        if any(p in query_lower for p in high_complexity):
            return "high"
        elif any(p in query_lower for p in simple_patterns):
            return "low"
        else:
            return "medium"
    
    def respond(self, query):
        """Generate optimized response."""
        self._reset_if_new_day()
        
        # Check budget
        if self.daily_spending >= self.daily_budget:
            return {
                "response": "Daily budget exceeded. Please try again tomorrow.",
                "source": "budget_limit",
                "cost": 0.0
            }
        
        # Check cache
        cache_key = self._hash_query(query)
        if cache_key in self.cache:
            self.cache_hits += 1
            return {
                "response": self.cache[cache_key],
                "source": "cache",
                "cost": 0.0
            }
        
        self.cache_misses += 1
        
        # Route to appropriate model
        complexity = self._classify_complexity(query)
        
        if complexity == "low":
            # Use cheapest model for simple queries
            response_text = self.gemini_client.models.generate_content(
                model="gemini-2.5-flash",
                contents=query
            ).text
            cost = 0.0008  # Estimated cost for Gemini Flash
            source = "gemini-2.5-flash"
        
        else:
            # Use Claude for complex queries, with concise prompts
            system = "Provide concise, direct answers. Be brief but complete."
            
            response = self.anthropic.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=300 if complexity == "medium" else 512,
                system=system,
                messages=[{"role": "user", "content": query}]
            )
            
            response_text = response.content[0].text
            
            # Calculate cost
            input_cost = (response.usage.input_tokens / 1_000_000) * 3.00
            output_cost = (response.usage.output_tokens / 1_000_000) * 15.00
            cost = input_cost + output_cost
            source = "claude-sonnet-4-5"
        
        # Update spending
        self.daily_spending += cost
        
        # Cache the response
        self.cache[cache_key] = response_text
        
        return {
            "response": response_text,
            "source": source,
            "cost": cost,
            "daily_spending": self.daily_spending
        }
    
    def get_stats(self):
        """Get performance statistics."""
        total_requests = self.cache_hits + self.cache_misses
        hit_rate = (self.cache_hits / total_requests * 100) if total_requests > 0 else 0
        
        return {
            "total_requests": total_requests,
            "cache_hit_rate": hit_rate,
            "daily_spending": self.daily_spending,
            "budget_remaining": self.daily_budget - self.daily_spending
        }

## Test the production agent
agent = ProductionCostOptimizedAgent(daily_budget=1.0)

test_queries = [
    "What is Python?",
    "What is Python?",  # Should hit cache
    "Analyze the trade-offs between microservices and monolithic architectures",
    "What is JavaScript?",
    "What is Python?",  # Should hit cache again
]

for query in test_queries:
    result = agent.respond(query)
    print(f"Q: {query}")
    print(f"Source: {result['source']}")
    print(f"Cost: ${result['cost']:.6f}")
    if 'daily_spending' in result:
        print(f"Daily spending: ${result['daily_spending']:.4f}")
    print()

stats = agent.get_stats()
print("=== Agent Statistics ===")
print(f"Total requests: {stats['total_requests']}")
print(f"Cache hit rate: {stats['cache_hit_rate']:.1f}%")
print(f"Daily spending: ${stats['daily_spending']:.4f}")
print(f"Budget remaining: ${stats['budget_remaining']:.4f}")
Out[13]:
Console
Q: What is Python?
Source: gemini-2.5-flash
Cost: $0.000800
Daily spending: $0.0008

Q: What is Python?
Source: cache
Cost: $0.000000

Q: Analyze the trade-offs between microservices and monolithic architectures
Source: claude-sonnet-4-5
Cost: $0.007785
Daily spending: $0.0086

Q: What is JavaScript?
Source: gemini-2.5-flash
Cost: $0.000800
Daily spending: $0.0094

Q: What is Python?
Source: cache
Cost: $0.000000

=== Agent Statistics ===
Total requests: 5
Cache hit rate: 40.0%
Daily spending: $0.0094
Budget remaining: $0.9906

This agent combines multiple strategies:

  • Caching for repeated queries (free responses)
  • Model routing based on complexity (use cheaper models when possible)
  • Concise prompts (reduce output tokens)
  • Budget limits (prevent runaway costs)

The result is an agent that's both capable and economical.

Glossary

API Call: A request made to a language model service. Each call typically incurs a cost based on the number of tokens processed.

Batching: Combining multiple similar requests into a single API call to reduce overhead and costs. More efficient than processing each request individually.

Budget Limit: A maximum spending threshold set to prevent unexpected or runaway costs. Can be daily, monthly, or per-user.

Cache Hit: When a requested response is found in the cache and can be served instantly without making an API call. Saves both time and money.

Cache Miss: When a requested response is not in the cache, requiring a new API call to generate it.

Context Window: The maximum amount of text (measured in tokens) that a model can process in a single request, including both input and output.

Input Tokens: The tokens in your prompt, including system messages, conversation history, and the user's query. Generally cheaper than output tokens.

Output Tokens: The tokens generated by the model in its response. Typically cost more than input tokens because generation requires more computation.

Token: The basic unit of text that language models process, roughly equivalent to a word or word piece. Both costs and context limits are measured in tokens.

