Testing AI Agents with Examples: Building Test Suites for Evaluation & Performance Tracking

Michael Brenndoerfer • November 10, 2025 • 11 min read • 1,574 words

Learn how to create and use test cases to evaluate AI agent performance. Build comprehensive test suites, track results over time, and use testing frameworks like pytest, LangSmith, LangFuse, and Promptfoo to measure your agent's capabilities systematically.

This article is part of the free-to-read AI Agent Handbook

Testing the Agent with Examples

You've defined what success looks like for your agent. Now it's time to put those criteria into practice by actually testing the agent. Just as students take quizzes to demonstrate what they've learned, your agent needs a set of test cases to show whether it can handle the tasks you've designed it for.

Testing with examples isn't about catching your agent doing something wrong. It's about learning where the agent performs well and where it needs improvement. By the end of this chapter, you'll have a practical testing framework you can use to measure your assistant's performance and track its progress over time.

Building Your Test Suite

A good test suite covers the full range of your agent's capabilities. For our personal assistant, that means creating examples that exercise each of the skills we've built: answering questions, using tools, remembering information, and planning multi-step tasks.

Let's start simple. Here's what a basic test case looks like:

test_case = {
    "id": "math_001",
    "input": "What's 1,234 multiplied by 5,678?",
    "expected_behavior": "uses_calculator",
    "expected_output": "7,006,652",
    "category": "tool_use"
}

Each test case captures three essential pieces:

  • What you're asking the agent to do (the input)
  • How you expect it to approach the task (the behavior)
  • What result you want to see (the output)

Notice that we're not just checking if the agent gets the right answer. We're also verifying that it uses the right approach. For a math problem, we expect the agent to use its calculator tool rather than guessing. This distinction matters because it tells us whether the agent is reasoning correctly, not just getting lucky.

Let's build a small test suite that covers different aspects of our assistant's capabilities:

# Using Claude Sonnet 4.5 for its superior agent reasoning and tool use
test_suite = [
    {
        "id": "math_001",
        "input": "What's 1,234 multiplied by 5,678?",
        "expected_behavior": "uses_calculator",
        "expected_output": "7,006,652",
        "category": "tool_use"
    },
    {
        "id": "memory_001",
        "input": "Remember that my birthday is July 20th.",
        "expected_behavior": "stores_to_memory",
        "expected_output": "July 20",  # the confirmation should echo the stored date
        "category": "memory"
    },
    {
        "id": "memory_002",
        "input": "When is my birthday?",
        "expected_behavior": "retrieves_from_memory",
        "expected_output": "July 20th",
        "category": "memory",
        "requires": ["memory_001"]  # This test depends on a previous one
    },
    {
        "id": "reasoning_001",
        "input": "If I have a meeting at 2 PM and it takes 45 minutes to get there, when should I leave?",
        "expected_behavior": "step_by_step_reasoning",
        "expected_output": "1:15 PM",
        "category": "reasoning"
    },
    {
        "id": "planning_001",
        "input": "Find today's weather and recommend what I should wear.",
        "expected_behavior": "uses_weather_tool_then_makes_recommendation",
        "expected_output": "weather report followed by clothing suggestion",
        "category": "planning"
    }
]

This suite tests five different capabilities. We have a math calculation that requires tool use, memory storage and retrieval, a reasoning problem, and a planning task that involves multiple steps. Notice how some tests depend on others. The second memory test only works if the first one successfully stored the birthday information.

Running the Tests

With our test suite ready, we need a way to run each test and check whether the agent passes. Let's build a simple test runner:

def run_test(agent, test_case):
    """
    Run a single test case and return the results.
    """
    print(f"\nRunning test: {test_case['id']}")
    print(f"Input: {test_case['input']}")

    # Run the agent with the test input
    response = agent.process(test_case['input'])
    actions_taken = response.get('actions_taken', [])

    # Record what happened
    result = {
        "test_id": test_case['id'],
        "input": test_case['input'],
        "output": response['text'],
        "behavior": actions_taken,
        "passed": True,
        "notes": []
    }

    # Check if the agent's behavior matches expectations (when the test specifies one)
    if 'expected_behavior' in test_case:
        if test_case['expected_behavior'] in str(actions_taken):
            result['notes'].append(f"✓ Correct behavior: {test_case['expected_behavior']}")
        else:
            result['notes'].append(f"✗ Expected {test_case['expected_behavior']}, got {actions_taken}")
            result['passed'] = False

    # Check if the output matches expectations (when the test specifies one)
    if 'expected_output' in test_case:
        if test_case['expected_output'].lower() in response['text'].lower():
            result['notes'].append("✓ Correct output")
        else:
            result['notes'].append(f"✗ Output doesn't match expected: {test_case['expected_output']}")
            result['passed'] = False

    return result

This test runner does a few important things. First, it runs the agent with the test input and captures the response. Then it checks two things: did the agent use the right approach (behavior), and did it produce the right answer (output)? Every check a test specifies must pass for it to succeed; if a test omits one of the expectations, that check is simply skipped.

The checking logic here is intentionally simple. We're looking for substrings and checking if certain actions were taken. In a production system, you might want more sophisticated matching, but this is enough to get started.
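
For instance, a slightly more forgiving check might normalize formatting differences before comparing. Here's a minimal sketch (the output_matches helper below is hypothetical, not part of the runner above): it strips thousands separators from numbers and ignores case, so "7,006,652" and "7006652" count as the same answer.

import re

def output_matches(expected, actual):
    """Looser comparison: ignore case and thousands separators in numbers."""
    def normalize(text):
        # "7,006,652" -> "7006652", then lowercase everything
        return re.sub(r'(?<=\d),(?=\d)', '', text).lower()
    return normalize(expected) in normalize(actual)

# Both of these pass
print(output_matches("7,006,652", "The result is 7006652."))      # True
print(output_matches("1:15 PM", "You should leave by 1:15 pm."))  # True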

Let's run our entire test suite:

def run_test_suite(agent, test_suite):
    """
    Run all tests and generate a report.
    """
    results = []

    for test_case in test_suite:
        # Check if this test has dependencies
        if 'requires' in test_case:
            # Make sure every required test ran and passed first
            failed_deps = [
                required_id for required_id in test_case['requires']
                if not any(r['test_id'] == required_id and r['passed'] for r in results)
            ]
            if failed_deps:
                print(f"Skipping {test_case['id']} - dependency {failed_deps[0]} failed")
                continue

        result = run_test(agent, test_case)
        results.append(result)

        # Print immediate feedback
        status = "PASS" if result['passed'] else "FAIL"
        print(f"Result: {status}")
        for note in result['notes']:
            print(f"  {note}")

    return results

When you run this, you'll see output like this:

Running test: math_001
Input: What's 1,234 multiplied by 5,678?
Result: PASS
  ✓ Correct behavior: uses_calculator
  ✓ Correct output

Running test: memory_001
Input: Remember that my birthday is July 20th.
Result: PASS
  ✓ Correct behavior: stores_to_memory
  ✓ Correct output

Running test: memory_002
Input: When is my birthday?
Result: PASS
  ✓ Correct behavior: retrieves_from_memory
  ✓ Correct output

Running test: reasoning_001
Input: If I have a meeting at 2 PM and it takes 45 minutes to get there, when should I leave?
Result: PASS
  ✓ Correct behavior: step_by_step_reasoning
  ✓ Correct output

Running test: planning_001
Input: Find today's weather and recommend what I should wear.
Result: FAIL
  ✓ Correct behavior: uses_weather_tool_then_makes_recommendation
  ✗ Output doesn't match expected: weather report followed by clothing suggestion

In this example, four tests passed and one failed. The planning test shows the right behavior (it called the weather tool and made a recommendation), but the output format didn't match what we expected. This is valuable feedback. Maybe our expected output was too strict, or maybe the agent needs to format its responses more consistently.

Analyzing Test Results

Once you've run your tests, you need to understand what they're telling you. Let's create a simple summary function:

def summarize_results(results, test_suite):
    """
    Generate a summary of test results.
    """
    total = len(results)
    passed = sum(1 for r in results if r['passed'])
    failed = total - passed

    print(f"\n{'='*50}")
    print("Test Summary")
    print(f"{'='*50}")
    print(f"Total tests: {total}")
    print(f"Passed: {passed} ({passed/total*100:.1f}%)")
    print(f"Failed: {failed} ({failed/total*100:.1f}%)")

    # Break down by category
    categories = {}
    for result in results:
        # Find the test case to get its category
        test_case = next((tc for tc in test_suite if tc['id'] == result['test_id']), None)
        if test_case:
            cat = test_case['category']
            if cat not in categories:
                categories[cat] = {'total': 0, 'passed': 0}
            categories[cat]['total'] += 1
            if result['passed']:
                categories[cat]['passed'] += 1

    print("\nBy Category:")
    for cat, stats in categories.items():
        success_rate = stats['passed'] / stats['total'] * 100
        print(f"  {cat}: {stats['passed']}/{stats['total']} ({success_rate:.1f}%)")

    # List failed tests
    if failed > 0:
        print("\nFailed Tests:")
        for result in results:
            if not result['passed']:
                print(f"  {result['test_id']}: {result['input'][:50]}...")

This summary shows you not just the overall pass rate, but also breaks down performance by category. You might discover that your agent is excellent at tool use but struggles with planning tasks. That tells you exactly where to focus your improvement efforts.

When you run this summary on our example results, you'll see:

==================================================
Test Summary
==================================================
Total tests: 5
Passed: 4 (80.0%)
Failed: 1 (20.0%)

By Category:
  tool_use: 1/1 (100.0%)
  memory: 2/2 (100.0%)
  reasoning: 1/1 (100.0%)
  planning: 0/1 (0.0%)

Failed Tests:
  planning_001: Find today's weather and recommend what I should w...

This tells us that our agent handles basic capabilities well but needs work on planning tasks. That's actionable information.

Tracking Performance Over Time

Testing once is useful, but testing regularly is powerful. By running the same test suite after each improvement to your agent, you can see whether your changes actually help or hurt performance.

Here's a simple way to track results over time:

1import json
2from datetime import datetime
3
4def save_test_run(results, version_name):
5    """
6    Save test results with a timestamp for historical tracking.
7    """
8    run_data = {
9        "timestamp": datetime.now().isoformat(),
10        "version": version_name,
11        "results": results,
12        "summary": {
13            "total": len(results),
14            "passed": sum(1 for r in results if r['passed']),
15            "pass_rate": sum(1 for r in results if r['passed']) / len(results)
16        }
17    }
18    
19    filename = f"test_results_{version_name}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
20    with open(filename, 'w') as f:
21        json.dump(run_data, f, indent=2)
22    
23    print(f"Results saved to {filename}")
24    return filename

Now you can track how your agent improves. Maybe you started with a pass rate of 60%, then after adding better tool selection logic you're at 75%, and after refining prompts you hit 90%. Each test run gives you confidence that you're moving in the right direction.
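
To make those comparisons concrete, you can diff two saved runs and flag regressions, tests that passed before but fail now. Here's a small sketch; the compare_runs helper is illustrative and assumes the JSON files written by save_test_run above.

import json

def compare_runs(old_file, new_file):
    """
    Compare two saved test runs and flag regressions.
    """
    with open(old_file) as f:
        old_run = json.load(f)
    with open(new_file) as f:
        new_run = json.load(f)

    # Tests that passed in the old run but fail in the new one
    old_passed = {r['test_id'] for r in old_run['results'] if r['passed']}
    regressions = [
        r['test_id'] for r in new_run['results']
        if r['test_id'] in old_passed and not r['passed']
    ]

    print(f"Pass rate: {old_run['summary']['pass_rate']:.0%} -> {new_run['summary']['pass_rate']:.0%}")
    for test_id in regressions:
        print(f"  Regression: {test_id}")

A regression list like this often tells you more than the raw pass rate, because it points at the exact tests your latest change broke.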

You can even create a simple visualization:

import json
import matplotlib.pyplot as plt

def plot_progress(test_history_files):
    """
    Plot the agent's test performance over time.
    """
    versions = []
    pass_rates = []

    for file in test_history_files:
        with open(file, 'r') as f:
            data = json.load(f)
            versions.append(data['version'])
            pass_rates.append(data['summary']['pass_rate'] * 100)

    plt.figure(figsize=(10, 6))
    plt.plot(versions, pass_rates, marker='o', linewidth=2, markersize=8)
    plt.xlabel('Version')
    plt.ylabel('Pass Rate (%)')
    plt.title('Agent Test Performance Over Time')
    plt.ylim(0, 100)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('agent_progress.png')
    print("Progress chart saved to agent_progress.png")
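
To feed it, collect the saved result files in chronological order. A quick sketch, assuming the test_results_*.json files produced by save_test_run live in the current directory:

import glob
import os

# Sort saved runs by file modification time so the x-axis is chronological
history_files = sorted(glob.glob("test_results_*.json"), key=os.path.getmtime)
plot_progress(history_files)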

Seeing your agent's performance improve over time is motivating. It transforms testing from a chore into a feedback loop that drives continuous improvement.

Testing Frameworks and Tools

While we've built our own simple test runner, you don't have to start from scratch. Several tools can help you test AI agents more effectively.

Python Testing Frameworks:

The standard Python testing frameworks work well for agent testing. You can use pytest or unittest to structure your tests in a familiar way:

import pytest

@pytest.fixture
def agent():
    # Build a fresh assistant for each test (PersonalAssistant comes from previous chapters)
    return PersonalAssistant(
        model="claude-sonnet-4.5",
        tools=[calculator, weather_api, memory_manager]
    )

class TestAgentCapabilities:
    def test_calculator_tool(self, agent):
        response = agent.process("What's 1,234 multiplied by 5,678?")
        assert "7,006,652" in response['text']
        assert 'calculator' in response['actions_taken']

    def test_memory_storage(self, agent):
        agent.process("Remember that my birthday is July 20th.")
        response = agent.process("When is my birthday?")
        assert "july 20" in response['text'].lower()

Using pytest gives you powerful features like fixtures (for setting up test agents), parametrized tests (for running the same test with different inputs), and detailed failure reports. You can run your tests with pytest tests/ and get professional-grade test output.
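
For example, a parametrized test can run the same calculator check against several inputs without duplicating code. A short sketch (the test name is illustrative) that reuses the agent fixture from above:

import pytest

@pytest.mark.parametrize("question, expected", [
    ("What's 1,234 multiplied by 5,678?", "7,006,652"),
    ("Calculate 789 times 456", "359,784"),
])
def test_calculator_variants(agent, question, expected):
    response = agent.process(question)
    assert expected in response['text']
    assert 'calculator' in response['actions_taken']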

LLM-Specific Testing Tools:

Several tools have emerged specifically for testing language model applications:

LangSmith (from LangChain) gives you tracing, evaluation, and testing in one package. You can create test datasets, run evaluations, and track your agent's performance across versions. If you're already using LangChain, this integrates naturally. Learn more at https://docs.smith.langchain.com/

LangFuse is an open-source observability and evaluation platform. You can trace execution, create test cases, and run automated evaluations. It works with various LLM frameworks and includes a dashboard for monitoring results. See https://langfuse.com/docs

Promptfoo specializes in testing LLM outputs. You define test cases in YAML files, run batch evaluations, and compare different prompts or models. This works well when you're iterating on prompt design. Visit https://promptfoo.dev/docs/intro

OpenAI Evals offers standardized benchmarks and custom evaluation tools for GPT-5 and other OpenAI models. You can build your own evals or use existing ones to measure specific capabilities. Check out https://github.com/openai/evals

Braintrust helps you evaluate and improve AI applications with tools for creating datasets, running experiments, and tracking metrics over time. See https://www.braintrust.dev/docs

These frameworks handle much of the infrastructure we built manually: running tests in parallel, tracking results over time, comparing versions, and generating reports. If you're building a production agent, they're worth exploring. However, understanding how testing works at the fundamental level (as we've done here) helps you use these tools more effectively.

Making Tests Maintainable

As your agent evolves, your test suite needs to keep pace. Here are some practices that keep tests useful over time:

Use clear test IDs and descriptions. Six months from now, test_001 won't mean anything to you. But memory_birthday_storage tells you exactly what's being tested.

Group related tests. Keep all memory tests together, all tool tests together. This makes it easier to find and update tests when you modify a specific capability.

Document edge cases. When you discover a bug or unexpected behavior, write a test for it (a short example follows this list of practices). Your test suite becomes a record of problems you've solved.

Keep expected outputs realistic. Don't require exact text matches. Language models vary their responses, and that's okay. Test for the presence of key information, not specific phrasings.

Review failed tests carefully. Sometimes a failed test means your agent is broken. But sometimes it means your test expectations were wrong. Be willing to update tests as your understanding improves.
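
Here's what documenting an edge case might look like. Suppose the assistant once mishandled the comma in "1,234"; a hypothetical test case pins that behavior down so the bug can't quietly return:

# Hypothetical edge case captured after a discovered failure: commas inside numbers
edge_case = {
    "id": "math_comma_in_number",
    "input": "What's 1,234 plus 10?",
    "expected_behavior": "uses_calculator",
    "expected_output": "1,244",
    "category": "tool_use"
}

test_suite.append(edge_case)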

A Complete Testing Example

Let's put everything together. Here's a complete, runnable example of testing our personal assistant:

# Using Claude Sonnet 4.5 for comprehensive agent testing
import json
from datetime import datetime

# Initialize your agent (implementation from previous chapters)
agent = PersonalAssistant(
    model="claude-sonnet-4.5",
    tools=[calculator, weather_api, memory_manager]
)

# Define test suite
test_suite = [
    {
        "id": "basic_question",
        "input": "What is the capital of France?",
        "expected_output": "Paris",
        "category": "knowledge"
    },
    {
        "id": "tool_calculator",
        "input": "Calculate 789 times 456",
        "expected_behavior": "uses_calculator",
        "expected_output": "359,784",
        "category": "tool_use"
    },
    {
        "id": "memory_store",
        "input": "Remember that I prefer coffee in the morning",
        "expected_behavior": "stores_to_memory",
        "category": "memory"
    },
    {
        "id": "memory_retrieve",
        "input": "What do I like to drink in the morning?",
        "expected_output": "coffee",
        "category": "memory",
        "requires": ["memory_store"]
    }
]

# Run tests
results = run_test_suite(agent, test_suite)

# Generate summary
summarize_results(results, test_suite)

# Save for historical tracking
save_test_run(results, version_name="v1.2")

When you run this, you'll get immediate feedback on how your agent performs, a summary of results by category, and a saved record you can compare against future test runs.

What You've Built

You now have a practical framework for testing your AI agent. You can create test cases that cover all your agent's capabilities, run those tests systematically, and track performance over time. More importantly, you understand what makes a good test and how to interpret the results.

Testing isn't just about finding bugs. It's about understanding your agent's strengths and weaknesses so you can make informed decisions about where to improve. Every test run gives you data to work with, and that data guides your development.

In the next section, we'll look at how to create a feedback loop that uses these test results to continuously improve your agent. Testing measures performance. The feedback loop turns those measurements into action.

Glossary

Test Case: A specific input, expected behavior, and expected output used to verify that an agent works correctly. Think of it like a practice problem that tests whether your agent learned a particular skill.

Test Suite: A collection of related test cases that together verify multiple aspects of an agent's functionality. A comprehensive test suite covers all the major capabilities your agent needs.

Pass Rate: The percentage of tests that succeed in a test run. A pass rate of 80% means 8 out of 10 tests passed, giving you a quantitative measure of your agent's performance.

Test Dependency: When one test requires another test to run successfully first. For example, testing memory retrieval requires that memory storage worked first.

Test Fixture: Setup code that prepares the environment for testing, such as initializing an agent or loading test data. Fixtures help you avoid repeating setup code in every test.

Regression: When a change to your agent causes previously passing tests to fail. Tracking regressions helps ensure that improvements don't accidentally break existing functionality.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about testing AI agents with examples.

About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
