Learn how to create and use test cases to evaluate AI agent performance. Build comprehensive test suites, track results over time, and use testing frameworks like pytest, LangSmith, LangFuse, and Promptfoo to measure your agent's capabilities systematically.

This article is part of the free-to-read AI Agent Handbook
Testing the Agent with Examples
You've defined what success looks like for your agent. Now it's time to put those criteria into practice by actually testing the agent. Just as students take quizzes to demonstrate what they've learned, your agent needs a set of test cases to show whether it can handle the tasks you've designed it for.
Testing with examples isn't about catching your agent doing something wrong. It's about learning where the agent performs well and where it needs improvement. By the end of this chapter, you'll have a practical testing framework you can use to measure your assistant's performance and track its progress over time.
Building Your Test Suite
A good test suite covers the full range of your agent's capabilities. For our personal assistant, that means creating examples that exercise each of the skills we've built: answering questions, using tools, remembering information, and planning multi-step tasks.
Let's start simple. Here's what a basic test case looks like:
test_case = {
    "id": "math_001",
    "input": "What's 1,234 multiplied by 5,678?",
    "expected_behavior": "uses_calculator",
    "expected_output": "7,006,652",
    "category": "tool_use"
}
Each test case captures three essential pieces:
- What you're asking the agent to do (the input)
- How you expect it to approach the task (the behavior)
- What result you want to see (the output)
Notice that we're not just checking if the agent gets the right answer. We're also verifying that it uses the right approach. For a math problem, we expect the agent to use its calculator tool rather than guessing. This distinction matters because it tells us whether the agent is reasoning correctly, not just getting lucky.
Let's build a small test suite that covers different aspects of our assistant's capabilities:
# Using Claude Sonnet 4.5 for its superior agent reasoning and tool use
test_suite = [
    {
        "id": "math_001",
        "input": "What's 1,234 multiplied by 5,678?",
        "expected_behavior": "uses_calculator",
        "expected_output": "7,006,652",
        "category": "tool_use"
    },
    {
        "id": "memory_001",
        "input": "Remember that my birthday is July 20th.",
        "expected_behavior": "stores_to_memory",
        "expected_output": "confirmation of storage",
        "category": "memory"
    },
    {
        "id": "memory_002",
        "input": "When is my birthday?",
        "expected_behavior": "retrieves_from_memory",
        "expected_output": "July 20th",
        "category": "memory",
        "requires": ["memory_001"]  # This test depends on a previous one
    },
    {
        "id": "reasoning_001",
        "input": "If I have a meeting at 2 PM and it takes 45 minutes to get there, when should I leave?",
        "expected_behavior": "step_by_step_reasoning",
        "expected_output": "1:15 PM",
        "category": "reasoning"
    },
    {
        "id": "planning_001",
        "input": "Find today's weather and recommend what I should wear.",
        "expected_behavior": "uses_weather_tool_then_makes_recommendation",
        "expected_output": "weather report followed by clothing suggestion",
        "category": "planning"
    }
]
This suite tests five different capabilities. We have a math calculation that requires tool use, memory storage and retrieval, a reasoning problem, and a planning task that involves multiple steps. Notice how some tests depend on others. The second memory test only works if the first one successfully stored the birthday information.
Running the Tests
With our test suite ready, we need a way to run each test and check whether the agent passes. Let's build a simple test runner:
def run_test(agent, test_case):
    """
    Run a single test case and return the results.
    """
    print(f"\nRunning test: {test_case['id']}")
    print(f"Input: {test_case['input']}")

    # Run the agent with the test input
    response = agent.process(test_case['input'])

    # Record what happened
    result = {
        "test_id": test_case['id'],
        "input": test_case['input'],
        "output": response['text'],
        "behavior": response.get('actions_taken', []),
        "passed": False,
        "notes": []
    }

    # Check if the agent's behavior matches expectations (when the test specifies one)
    behavior_ok = True
    if 'expected_behavior' in test_case:
        behavior_ok = test_case['expected_behavior'] in str(response.get('actions_taken', []))
        if behavior_ok:
            result['notes'].append(f"✓ Correct behavior: {test_case['expected_behavior']}")
        else:
            result['notes'].append(f"✗ Expected {test_case['expected_behavior']}, got {response.get('actions_taken', [])}")

    # Check if the output matches expectations (when the test specifies one)
    output_ok = True
    if 'expected_output' in test_case:
        output_ok = test_case['expected_output'].lower() in response['text'].lower()
        if output_ok:
            result['notes'].append("✓ Correct output")
        else:
            result['notes'].append(f"✗ Output doesn't match expected: {test_case['expected_output']}")

    # The test passes only if every check it defines succeeds
    result['passed'] = behavior_ok and output_ok

    return result
This test runner does a few important things. First, it runs the agent with the test input and captures the response. Then it checks two things: did the agent use the right approach (behavior), and did it produce the right answer (output)? Both checks need to pass for the test to succeed, and if a test case defines only one of them, only that check applies.
The checking logic here is intentionally simple. We're looking for substrings and checking if certain actions were taken. In a production system, you might want more sophisticated matching, but this is enough to get started.
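For example, a slightly more forgiving check might ignore formatting differences such as thousands separators before comparing. Here's a minimal sketch using only the standard library; the output_matches helper is illustrative and not part of the runner above:

import re

def output_matches(expected, actual):
    """
    A more forgiving output check: strip thousands separators,
    collapse whitespace, and compare case-insensitively.
    """
    def normalize(text):
        text = text.lower()
        text = re.sub(r'(?<=\d),(?=\d)', '', text)  # "7,006,652" -> "7006652"
        return re.sub(r'\s+', ' ', text).strip()

    return normalize(expected) in normalize(actual)

print(output_matches("7,006,652", "The answer is 7006652."))  # True

You could swap a helper like this into run_test's output check without changing anything else about the runner.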
Let's run our entire test suite:
def run_test_suite(agent, test_suite):
    """
    Run all tests and generate a report.
    """
    results = []

    for test_case in test_suite:
        # Check if this test has dependencies
        if 'requires' in test_case:
            # Make sure required tests ran and passed first
            dependencies_met = True
            for required_id in test_case['requires']:
                required_result = next((r for r in results if r['test_id'] == required_id), None)
                if not required_result or not required_result['passed']:
                    print(f"Skipping {test_case['id']} - dependency {required_id} failed")
                    dependencies_met = False
                    break
            if not dependencies_met:
                continue  # Skip this test entirely

        result = run_test(agent, test_case)
        results.append(result)

        # Print immediate feedback
        status = "PASS" if result['passed'] else "FAIL"
        print(f"Result: {status}")
        for note in result['notes']:
            print(f"  {note}")

    return results
When you run this, you'll see output like this:
Running test: math_001
Input: What's 1,234 multiplied by 5,678?
Result: PASS
  ✓ Correct behavior: uses_calculator
  ✓ Correct output

Running test: memory_001
Input: Remember that my birthday is July 20th.
Result: PASS
  ✓ Correct behavior: stores_to_memory
  ✓ Correct output

Running test: memory_002
Input: When is my birthday?
Result: PASS
  ✓ Correct behavior: retrieves_from_memory
  ✓ Correct output

Running test: reasoning_001
Input: If I have a meeting at 2 PM and it takes 45 minutes to get there, when should I leave?
Result: PASS
  ✓ Correct behavior: step_by_step_reasoning
  ✓ Correct output

Running test: planning_001
Input: Find today's weather and recommend what I should wear.
Result: FAIL
  ✓ Correct behavior: uses_weather_tool_then_makes_recommendation
  ✗ Output doesn't match expected: weather report followed by clothing suggestion
In this example, four tests passed and one failed. The planning test shows the right behavior (it called the weather tool and made a recommendation), but the output format didn't match what we expected. This is valuable feedback. Maybe our expected output was too strict, or maybe the agent needs to format its responses more consistently.
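For expectations like "weather report followed by clothing suggestion", substring matching will always be brittle. One option is to let a model judge whether the response satisfies the expectation. The sketch below uses the Anthropic SDK with the model name from earlier chapters; the prompt wording and the judge_output helper are assumptions, and a production version would need error handling:

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_output(expected_description, actual_output):
    """
    Ask a model whether the agent's output satisfies a natural-language
    expectation. Returns True only for a clear YES verdict.
    """
    verdict = client.messages.create(
        model="claude-sonnet-4.5",  # model name reused from earlier chapters
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                "Does the following response satisfy this expectation?\n"
                f"Expectation: {expected_description}\n"
                f"Response: {actual_output}\n"
                "Answer with only YES or NO."
            )
        }]
    )
    return verdict.content[0].text.strip().upper().startswith("YES")

You could then call judge_output from run_test for categories such as planning, where phrasing varies too much for exact matches.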
Analyzing Test Results
Once you've run your tests, you need to understand what they're telling you. Let's create a simple summary function:
def summarize_results(results, test_suite):
    """
    Generate a summary of test results.
    """
    total = len(results)
    passed = sum(1 for r in results if r['passed'])
    failed = total - passed

    print(f"\n{'='*50}")
    print("Test Summary")
    print(f"{'='*50}")
    print(f"Total tests: {total}")
    print(f"Passed: {passed} ({passed/total*100:.1f}%)")
    print(f"Failed: {failed} ({failed/total*100:.1f}%)")

    # Break down by category
    categories = {}
    for result in results:
        # Find the test case to get its category
        test_case = next((tc for tc in test_suite if tc['id'] == result['test_id']), None)
        if test_case:
            cat = test_case['category']
            if cat not in categories:
                categories[cat] = {'total': 0, 'passed': 0}
            categories[cat]['total'] += 1
            if result['passed']:
                categories[cat]['passed'] += 1

    print("\nBy Category:")
    for cat, stats in categories.items():
        success_rate = stats['passed'] / stats['total'] * 100
        print(f"  {cat}: {stats['passed']}/{stats['total']} ({success_rate:.1f}%)")

    # List failed tests
    if failed > 0:
        print("\nFailed Tests:")
        for result in results:
            if not result['passed']:
                print(f"  {result['test_id']}: {result['input'][:50]}...")
This summary takes the test suite as a second argument so it can look up each test's category. It shows you not just the overall pass rate, but also a breakdown of performance by category. You might discover that your agent is excellent at tool use but struggles with planning tasks. That tells you exactly where to focus your improvement efforts.
When you run this summary on our example results, you'll see:
==================================================
Test Summary
==================================================
Total tests: 5
Passed: 4 (80.0%)
Failed: 1 (20.0%)

By Category:
  tool_use: 1/1 (100.0%)
  memory: 2/2 (100.0%)
  reasoning: 1/1 (100.0%)
  planning: 0/1 (0.0%)

Failed Tests:
  planning_001: Find today's weather and recommend what I should we...
This tells us that our agent handles basic capabilities well but needs work on planning tasks. That's actionable information.
Tracking Performance Over Time
Testing once is useful, but testing regularly is powerful. By running the same test suite after each improvement to your agent, you can see whether your changes actually help or hurt performance.
Here's a simple way to track results over time:
import json
from datetime import datetime

def save_test_run(results, version_name):
    """
    Save test results with a timestamp for historical tracking.
    """
    run_data = {
        "timestamp": datetime.now().isoformat(),
        "version": version_name,
        "results": results,
        "summary": {
            "total": len(results),
            "passed": sum(1 for r in results if r['passed']),
            "pass_rate": sum(1 for r in results if r['passed']) / len(results)
        }
    }

    filename = f"test_results_{version_name}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(filename, 'w') as f:
        json.dump(run_data, f, indent=2)

    print(f"Results saved to {filename}")
    return filename
Now you can track how your agent improves. Maybe you started with a pass rate of 60%, then after adding better tool selection logic you're at 75%, and after refining prompts you hit 90%. Each test run gives you confidence that you're moving in the right direction.
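Because each run is saved as JSON, you can also compare two runs directly and flag regressions, meaning tests that passed before but fail now. Here's a small sketch; the file names in the usage comment are placeholders for whatever save_test_run produced:

import json

def compare_runs(old_file, new_file):
    """
    Compare two saved test runs and report regressions (tests that
    passed before but fail now) and newly fixed tests.
    """
    with open(old_file) as f:
        old = {r['test_id']: r['passed'] for r in json.load(f)['results']}
    with open(new_file) as f:
        new = {r['test_id']: r['passed'] for r in json.load(f)['results']}

    regressions = [t for t in old if old[t] and not new.get(t, False)]
    fixes = [t for t in old if not old[t] and new.get(t, False)]

    print(f"Regressions: {regressions or 'none'}")
    print(f"Fixed since last run: {fixes or 'none'}")

# Example usage with two files produced by save_test_run (names are placeholders):
# compare_runs("test_results_v1.1_....json", "test_results_v1.2_....json")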
You can even create a simple visualization:
import json
import matplotlib.pyplot as plt

def plot_progress(test_history_files):
    """
    Plot the agent's test performance over time.
    """
    versions = []
    pass_rates = []

    for file in test_history_files:
        with open(file, 'r') as f:
            data = json.load(f)
            versions.append(data['version'])
            pass_rates.append(data['summary']['pass_rate'] * 100)

    plt.figure(figsize=(10, 6))
    plt.plot(versions, pass_rates, marker='o', linewidth=2, markersize=8)
    plt.xlabel('Version')
    plt.ylabel('Pass Rate (%)')
    plt.title('Agent Test Performance Over Time')
    plt.ylim(0, 100)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('agent_progress.png')
    print("Progress chart saved to agent_progress.png")
Seeing your agent's performance improve over time is motivating. It transforms testing from a chore into a feedback loop that drives continuous improvement.
Testing Frameworks and Tools
While we've built our own simple test runner, you don't have to start from scratch. Several tools can help you test AI agents more effectively.
Python Testing Frameworks:
The standard Python testing frameworks work well for agent testing. You can use pytest or unittest to structure your tests in a familiar way:
import pytest

class TestAgentCapabilities:
    def test_calculator_tool(self, agent):
        response = agent.process("What's 1,234 multiplied by 5,678?")
        assert "7,006,652" in response['text']
        assert 'calculator' in response['actions_taken']

    def test_memory_storage(self, agent):
        agent.process("Remember that my birthday is July 20th.")
        response = agent.process("When is my birthday?")
        assert "july 20" in response['text'].lower()  # lowercase both sides of the comparison
Using pytest gives you powerful features like fixtures (for setting up test agents), parametrized tests (for running the same test with different inputs), and detailed failure reports. You can run your tests with pytest tests/ and get professional-grade test output.
LLM-Specific Testing Tools:
Several tools have emerged specifically for testing language model applications:
LangSmith (from LangChain) gives you tracing, evaluation, and testing in one package. You can create test datasets, run evaluations, and track your agent's performance across versions. If you're already using LangChain, this integrates naturally. Learn more at https://docs.smith.langchain.com/
LangFuse is an open-source observability and evaluation platform. You can trace execution, create test cases, and run automated evaluations. It works with various LLM frameworks and includes a dashboard for monitoring results. See https://langfuse.com/docs
Promptfoo specializes in testing LLM outputs. You define test cases in YAML files, run batch evaluations, and compare different prompts or models. This works well when you're iterating on prompt design. Visit https://promptfoo.dev/docs/intro
OpenAI Evals offers standardized benchmarks and custom evaluation tools for GPT-5 and other OpenAI models. You can build your own evals or use existing ones to measure specific capabilities. Check out https://github.com/openai/evals
Braintrust helps you evaluate and improve AI applications with tools for creating datasets, running experiments, and tracking metrics over time. See https://www.braintrust.dev/docs
These frameworks handle much of the infrastructure we built manually: running tests in parallel, tracking results over time, comparing versions, and generating reports. If you're building a production agent, they're worth exploring. However, understanding how testing works at the fundamental level (as we've done here) helps you use these tools more effectively.
Making Tests Maintainable
As your agent evolves, your test suite needs to keep pace. Here are some practices that keep tests useful over time:
Use clear test IDs and descriptions. Six months from now, test_001 won't mean anything to you. But memory_birthday_storage tells you exactly what's being tested.
Group related tests. Keep all memory tests together, all tool tests together. This makes it easier to find and update tests when you modify a specific capability.
Document edge cases. When you discover a bug or unexpected behavior, write a test for it. Your test suite becomes a record of problems you've solved.
Keep expected outputs realistic. Don't require exact text matches. Language models vary their responses, and that's okay. Test for the presence of key information, not specific phrasings.
Review failed tests carefully. Sometimes a failed test means your agent is broken. But sometimes it means your test expectations were wrong. Be willing to update tests as your understanding improves.
A Complete Testing Example
Let's put everything together. Here's a complete, runnable example of testing our personal assistant:
# Using Claude Sonnet 4.5 for comprehensive agent testing
import json
from datetime import datetime
from anthropic import Anthropic

# Initialize your agent (implementation from previous chapters)
agent = PersonalAssistant(
    model="claude-sonnet-4.5",
    tools=[calculator, weather_api, memory_manager]
)

# Define test suite
test_suite = [
    {
        "id": "basic_question",
        "input": "What is the capital of France?",
        "expected_output": "Paris",
        "category": "knowledge"
    },
    {
        "id": "tool_calculator",
        "input": "Calculate 789 times 456",
        "expected_behavior": "uses_calculator",
        "expected_output": "359,784",
        "category": "tool_use"
    },
    {
        "id": "memory_store",
        "input": "Remember that I prefer coffee in the morning",
        "expected_behavior": "stores_to_memory",
        "category": "memory"
    },
    {
        "id": "memory_retrieve",
        "input": "What do I like to drink in the morning?",
        "expected_output": "coffee",
        "category": "memory",
        "requires": ["memory_store"]
    }
]

# Run tests
results = run_test_suite(agent, test_suite)

# Generate summary
summarize_results(results, test_suite)

# Save for historical tracking
save_test_run(results, version_name="v1.2")
When you run this, you'll get immediate feedback on how your agent performs, a summary of results by category, and a saved record you can compare against future test runs.
What You've Built
You now have a practical framework for testing your AI agent. You can create test cases that cover all your agent's capabilities, run those tests systematically, and track performance over time. More importantly, you understand what makes a good test and how to interpret the results.
Testing isn't just about finding bugs. It's about understanding your agent's strengths and weaknesses so you can make informed decisions about where to improve. Every test run gives you data to work with, and that data guides your development.
In the next section, we'll look at how to create a feedback loop that uses these test results to continuously improve your agent. Testing measures performance. The feedback loop turns those measurements into action.
Glossary
Test Case: A specific input, expected behavior, and expected output used to verify that an agent works correctly. Think of it like a practice problem that tests whether your agent learned a particular skill.
Test Suite: A collection of related test cases that together verify multiple aspects of an agent's functionality. A comprehensive test suite covers all the major capabilities your agent needs.
Pass Rate: The percentage of tests that succeed in a test run. A pass rate of 80% means 8 out of 10 tests passed, giving you a quantitative measure of your agent's performance.
Test Dependency: When one test requires another test to run successfully first. For example, testing memory retrieval requires that memory storage worked first.
Test Fixture: Setup code that prepares the environment for testing, such as initializing an agent or loading test data. Fixtures help you avoid repeating setup code in every test.
Regression: When a change to your agent causes previously passing tests to fail. Tracking regressions helps ensure that improvements don't accidentally break existing functionality.
Quiz
Ready to test your understanding? Take this quick quiz to reinforce what you've learned about testing AI agents with examples.