Testing AI Agents with Examples: Building Test Suites for Evaluation & Performance Tracking

Michael Brenndoerfer • November 10, 2025 • 11 min read • 1,574 words

Learn how to create and use test cases to evaluate AI agent performance. Build comprehensive test suites, track results over time, and use testing frameworks like pytest, LangSmith, LangFuse, and Promptfoo to measure your agent's capabilities systematically.

This article is part of the free-to-read AI Agent Handbook

Testing the Agent with Examples

You've defined what success looks like for your agent. Now it's time to put those criteria into practice by actually testing the agent. Just as students take quizzes to demonstrate what they've learned, your agent needs a set of test cases to show whether it can handle the tasks you've designed it for.

Testing with examples isn't about catching your agent doing something wrong. It's about learning where the agent performs well and where it needs improvement. By the end of this chapter, you'll have a practical testing framework you can use to measure your assistant's performance and track its progress over time.

Building Your Test Suite

A good test suite covers the full range of your agent's capabilities. For our personal assistant, that means creating examples that exercise each of the skills we've built: answering questions, using tools, remembering information, and planning multi-step tasks.

Let's start simple. Here's what a basic test case looks like:

test_case = {
    "id": "math_001",
    "input": "What's 1,234 multiplied by 5,678?",
    "expected_behavior": "uses_calculator",
    "expected_output": "7,006,652",
    "category": "tool_use"
}

Each test case captures three essential pieces:

  • What you're asking the agent to do (the input)
  • How you expect it to approach the task (the behavior)
  • What result you want to see (the output)

Notice that we're not just checking if the agent gets the right answer. We're also verifying that it uses the right approach. For a math problem, we expect the agent to use its calculator tool rather than guessing. This distinction matters because it tells us whether the agent is reasoning correctly, not just getting lucky.

Let's build a small test suite that covers different aspects of our assistant's capabilities:

# Using Claude Sonnet 4.5 for its superior agent reasoning and tool use
test_suite = [
    {
        "id": "math_001",
        "input": "What's 1,234 multiplied by 5,678?",
        "expected_behavior": "uses_calculator",
        "expected_output": "7,006,652",
        "category": "tool_use"
    },
    {
        "id": "memory_001",
        "input": "Remember that my birthday is July 20th.",
        "expected_behavior": "stores_to_memory",
        "expected_output": "July 20",  # the confirmation should echo the stored date
        "category": "memory"
    },
    {
        "id": "memory_002",
        "input": "When is my birthday?",
        "expected_behavior": "retrieves_from_memory",
        "expected_output": "July 20th",
        "category": "memory",
        "requires": ["memory_001"]  # This test depends on a previous one
    },
    {
        "id": "reasoning_001",
        "input": "If I have a meeting at 2 PM and it takes 45 minutes to get there, when should I leave?",
        "expected_behavior": "step_by_step_reasoning",
        "expected_output": "1:15 PM",
        "category": "reasoning"
    },
    {
        "id": "planning_001",
        "input": "Find today's weather and recommend what I should wear.",
        "expected_behavior": "uses_weather_tool_then_makes_recommendation",
        "expected_output": "weather report followed by clothing suggestion",
        "category": "planning"
    }
]

This suite tests five different capabilities. We have a math calculation that requires tool use, memory storage and retrieval, a reasoning problem, and a planning task that involves multiple steps. Notice how some tests depend on others. The second memory test only works if the first one successfully stored the birthday information.

Running the Tests

With our test suite ready, we need a way to run each test and check whether the agent passes. Let's build a simple test runner:

def run_test(agent, test_case):
    """
    Run a single test case and return the results.
    """
    print(f"\nRunning test: {test_case['id']}")
    print(f"Input: {test_case['input']}")

    # Run the agent with the test input
    response = agent.process(test_case['input'])
    actions_taken = response.get('actions_taken', [])

    # Record what happened
    result = {
        "test_id": test_case['id'],
        "input": test_case['input'],
        "output": response['text'],
        "behavior": actions_taken,
        "passed": True,
        "notes": []
    }

    # Check if the agent's behavior matches expectations (when the test specifies one)
    if 'expected_behavior' in test_case:
        if test_case['expected_behavior'] in str(actions_taken):
            result['notes'].append(f"✓ Correct behavior: {test_case['expected_behavior']}")
        else:
            result['notes'].append(f"✗ Expected {test_case['expected_behavior']}, got {actions_taken}")
            result['passed'] = False

    # Check if the output matches expectations (when the test specifies one)
    if 'expected_output' in test_case:
        if test_case['expected_output'].lower() in response['text'].lower():
            result['notes'].append("✓ Correct output")
        else:
            result['notes'].append(f"✗ Output doesn't match expected: {test_case['expected_output']}")
            result['passed'] = False

    return result

This test runner does a few important things. First, it runs the agent with the test input and captures the response. Then it checks two things: did the agent use the right approach (behavior), and did it produce the right answer (output)? Every check a test specifies must pass for it to succeed; if a test omits one of the expectations, that check is simply skipped.

The checking logic here is intentionally simple. We're looking for substrings and checking if certain actions were taken. In a production system, you might want more sophisticated matching, but this is enough to get started.
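
For instance, a slightly more forgiving check might normalize formatting differences before comparing. Here's a minimal sketch (the output_matches helper below is hypothetical, not part of the runner above): it strips thousands separators from numbers and ignores case, so "7,006,652" and "7006652" count as the same answer.

import re

def output_matches(expected, actual):
    """Looser comparison: ignore case and thousands separators in numbers."""
    def normalize(text):
        # "7,006,652" -> "7006652", then lowercase everything
        return re.sub(r'(?<=\d),(?=\d)', '', text).lower()
    return normalize(expected) in normalize(actual)

# Both of these pass
print(output_matches("7,006,652", "The result is 7006652."))      # True
print(output_matches("1:15 PM", "You should leave by 1:15 pm."))  # True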

Let's run our entire test suite:

def run_test_suite(agent, test_suite):
    """
    Run all tests and generate a report.
    """
    results = []

    for test_case in test_suite:
        # Check if this test has dependencies
        if 'requires' in test_case:
            # Make sure every required test ran and passed first
            failed_deps = [
                required_id for required_id in test_case['requires']
                if not any(r['test_id'] == required_id and r['passed'] for r in results)
            ]
            if failed_deps:
                print(f"Skipping {test_case['id']} - dependency {failed_deps[0]} failed")
                continue

        result = run_test(agent, test_case)
        results.append(result)

        # Print immediate feedback
        status = "PASS" if result['passed'] else "FAIL"
        print(f"Result: {status}")
        for note in result['notes']:
            print(f"  {note}")

    return results

When you run this, you'll see output like this:

Running test: math_001
Input: What's 1,234 multiplied by 5,678?
Result: PASS
  ✓ Correct behavior: uses_calculator
  ✓ Correct output

Running test: memory_001
Input: Remember that my birthday is July 20th.
Result: PASS
  ✓ Correct behavior: stores_to_memory
  ✓ Correct output

Running test: memory_002
Input: When is my birthday?
Result: PASS
  ✓ Correct behavior: retrieves_from_memory
  ✓ Correct output

Running test: reasoning_001
Input: If I have a meeting at 2 PM and it takes 45 minutes to get there, when should I leave?
Result: PASS
  ✓ Correct behavior: step_by_step_reasoning
  ✓ Correct output

Running test: planning_001
Input: Find today's weather and recommend what I should wear.
Result: FAIL
  ✓ Correct behavior: uses_weather_tool_then_makes_recommendation
  ✗ Output doesn't match expected: weather report followed by clothing suggestion

In this example, four tests passed and one failed. The planning test shows the right behavior (it called the weather tool and made a recommendation), but the output format didn't match what we expected. This is valuable feedback. Maybe our expected output was too strict, or maybe the agent needs to format its responses more consistently.

Analyzing Test Results

Once you've run your tests, you need to understand what they're telling you. Let's create a simple summary function:

def summarize_results(results, test_suite):
    """
    Generate a summary of test results.
    """
    total = len(results)
    passed = sum(1 for r in results if r['passed'])
    failed = total - passed

    print(f"\n{'='*50}")
    print("Test Summary")
    print(f"{'='*50}")
    print(f"Total tests: {total}")
    print(f"Passed: {passed} ({passed/total*100:.1f}%)")
    print(f"Failed: {failed} ({failed/total*100:.1f}%)")

    # Break down by category
    categories = {}
    for result in results:
        # Find the test case to get its category
        test_case = next((tc for tc in test_suite if tc['id'] == result['test_id']), None)
        if test_case:
            cat = test_case['category']
            if cat not in categories:
                categories[cat] = {'total': 0, 'passed': 0}
            categories[cat]['total'] += 1
            if result['passed']:
                categories[cat]['passed'] += 1

    print("\nBy Category:")
    for cat, stats in categories.items():
        success_rate = stats['passed'] / stats['total'] * 100
        print(f"  {cat}: {stats['passed']}/{stats['total']} ({success_rate:.1f}%)")

    # List failed tests
    if failed > 0:
        print("\nFailed Tests:")
        for result in results:
            if not result['passed']:
                print(f"  {result['test_id']}: {result['input'][:50]}...")

This summary shows you not just the overall pass rate, but also breaks down performance by category. You might discover that your agent is excellent at tool use but struggles with planning tasks. That tells you exactly where to focus your improvement efforts.

When you run this summary on our example results, you'll see:

==================================================
Test Summary
==================================================
Total tests: 5
Passed: 4 (80.0%)
Failed: 1 (20.0%)

By Category:
  tool_use: 1/1 (100.0%)
  memory: 2/2 (100.0%)
  reasoning: 1/1 (100.0%)
  planning: 0/1 (0.0%)

Failed Tests:
  planning_001: Find today's weather and recommend what I should w...

This tells us that our agent handles basic capabilities well but needs work on planning tasks. That's actionable information.

Tracking Performance Over Time

Testing once is useful, but testing regularly is powerful. By running the same test suite after each improvement to your agent, you can see whether your changes actually help or hurt performance.

Here's a simple way to track results over time:

1import json
2from datetime import datetime
3
4def save_test_run(results, version_name):
5    """
6    Save test results with a timestamp for historical tracking.
7    """
8    run_data = {
9        "timestamp": datetime.now().isoformat(),
10        "version": version_name,
11        "results": results,
12        "summary": {
13            "total": len(results),
14            "passed": sum(1 for r in results if r['passed']),
15            "pass_rate": sum(1 for r in results if r['passed']) / len(results)
16        }
17    }
18    
19    filename = f"test_results_{version_name}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
20    with open(filename, 'w') as f:
21        json.dump(run_data, f, indent=2)
22    
23    print(f"Results saved to {filename}")
24    return filename

Now you can track how your agent improves. Maybe you started with a pass rate of 60%, then after adding better tool selection logic you're at 75%, and after refining prompts you hit 90%. Each test run gives you confidence that you're moving in the right direction.
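
To make those comparisons concrete, you can diff two saved runs and flag regressions, tests that passed before but fail now. Here's a small sketch; the compare_runs helper is illustrative and assumes the JSON files written by save_test_run above.

import json

def compare_runs(old_file, new_file):
    """
    Compare two saved test runs and flag regressions.
    """
    with open(old_file) as f:
        old_run = json.load(f)
    with open(new_file) as f:
        new_run = json.load(f)

    # Tests that passed in the old run but fail in the new one
    old_passed = {r['test_id'] for r in old_run['results'] if r['passed']}
    regressions = [
        r['test_id'] for r in new_run['results']
        if r['test_id'] in old_passed and not r['passed']
    ]

    print(f"Pass rate: {old_run['summary']['pass_rate']:.0%} -> {new_run['summary']['pass_rate']:.0%}")
    for test_id in regressions:
        print(f"  Regression: {test_id}")

A regression list like this often tells you more than the raw pass rate, because it points at the exact tests your latest change broke.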

You can even create a simple visualization:

import json
import matplotlib.pyplot as plt

def plot_progress(test_history_files):
    """
    Plot the agent's test performance over time.
    """
    versions = []
    pass_rates = []

    for file in test_history_files:
        with open(file, 'r') as f:
            data = json.load(f)
            versions.append(data['version'])
            pass_rates.append(data['summary']['pass_rate'] * 100)

    plt.figure(figsize=(10, 6))
    plt.plot(versions, pass_rates, marker='o', linewidth=2, markersize=8)
    plt.xlabel('Version')
    plt.ylabel('Pass Rate (%)')
    plt.title('Agent Test Performance Over Time')
    plt.ylim(0, 100)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('agent_progress.png')
    print("Progress chart saved to agent_progress.png")
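
To feed it, collect the saved result files in chronological order. A quick sketch, assuming the test_results_*.json files produced by save_test_run live in the current directory:

import glob
import os

# Sort saved runs by file modification time so the x-axis is chronological
history_files = sorted(glob.glob("test_results_*.json"), key=os.path.getmtime)
plot_progress(history_files)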

Seeing your agent's performance improve over time is motivating. It transforms testing from a chore into a feedback loop that drives continuous improvement.

Testing Frameworks and Tools

While we've built our own simple test runner, you don't have to start from scratch. Several tools can help you test AI agents more effectively.

Python Testing Frameworks:

The standard Python testing frameworks work well for agent testing. You can use pytest or unittest to structure your tests in a familiar way:

import pytest

@pytest.fixture
def agent():
    # Build a fresh assistant for each test (PersonalAssistant comes from previous chapters)
    return PersonalAssistant(
        model="claude-sonnet-4.5",
        tools=[calculator, weather_api, memory_manager]
    )

class TestAgentCapabilities:
    def test_calculator_tool(self, agent):
        response = agent.process("What's 1,234 multiplied by 5,678?")
        assert "7,006,652" in response['text']
        assert 'calculator' in response['actions_taken']

    def test_memory_storage(self, agent):
        agent.process("Remember that my birthday is July 20th.")
        response = agent.process("When is my birthday?")
        assert "july 20" in response['text'].lower()

Using pytest gives you powerful features like fixtures (for setting up test agents), parametrized tests (for running the same test with different inputs), and detailed failure reports. You can run your tests with pytest tests/ and get professional-grade test output.
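
For example, a parametrized test can run the same calculator check against several inputs without duplicating code. A short sketch (the test name is illustrative) that reuses the agent fixture from above:

import pytest

@pytest.mark.parametrize("question, expected", [
    ("What's 1,234 multiplied by 5,678?", "7,006,652"),
    ("Calculate 789 times 456", "359,784"),
])
def test_calculator_variants(agent, question, expected):
    response = agent.process(question)
    assert expected in response['text']
    assert 'calculator' in response['actions_taken']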

LLM-Specific Testing Tools:

Several tools have emerged specifically for testing language model applications:

LangSmith (from LangChain) gives you tracing, evaluation, and testing in one package. You can create test datasets, run evaluations, and track your agent's performance across versions. If you're already using LangChain, this integrates naturally. Learn more at https://docs.smith.langchain.com/

LangFuse is an open-source observability and evaluation platform. You can trace execution, create test cases, and run automated evaluations. It works with various LLM frameworks and includes a dashboard for monitoring results. See https://langfuse.com/docs

Promptfoo specializes in testing LLM outputs. You define test cases in YAML files, run batch evaluations, and compare different prompts or models. This works well when you're iterating on prompt design. Visit https://promptfoo.dev/docs/intro

OpenAI Evals offers standardized benchmarks and custom evaluation tools for GPT-5 and other OpenAI models. You can build your own evals or use existing ones to measure specific capabilities. Check out https://github.com/openai/evals

Braintrust helps you evaluate and improve AI applications with tools for creating datasets, running experiments, and tracking metrics over time. See https://www.braintrust.dev/docs

These frameworks handle much of the infrastructure we built manually: running tests in parallel, tracking results over time, comparing versions, and generating reports. If you're building a production agent, they're worth exploring. However, understanding how testing works at the fundamental level (as we've done here) helps you use these tools more effectively.

Making Tests Maintainable

As your agent evolves, your test suite needs to keep pace. Here are some practices that keep tests useful over time:

Use clear test IDs and descriptions. Six months from now, test_001 won't mean anything to you. But memory_birthday_storage tells you exactly what's being tested.

Group related tests. Keep all memory tests together, all tool tests together. This makes it easier to find and update tests when you modify a specific capability.

Document edge cases. When you discover a bug or unexpected behavior, write a test for it (a short example follows this list of practices). Your test suite becomes a record of problems you've solved.

Keep expected outputs realistic. Don't require exact text matches. Language models vary their responses, and that's okay. Test for the presence of key information, not specific phrasings.

Review failed tests carefully. Sometimes a failed test means your agent is broken. But sometimes it means your test expectations were wrong. Be willing to update tests as your understanding improves.
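
Here's what documenting an edge case might look like. Suppose the assistant once mishandled the comma in "1,234"; a hypothetical test case pins that behavior down so the bug can't quietly return:

# Hypothetical edge case captured after a discovered failure: commas inside numbers
edge_case = {
    "id": "math_comma_in_number",
    "input": "What's 1,234 plus 10?",
    "expected_behavior": "uses_calculator",
    "expected_output": "1,244",
    "category": "tool_use"
}

test_suite.append(edge_case)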

A Complete Testing Example

Let's put everything together. Here's a complete, runnable example of testing our personal assistant:

# Using Claude Sonnet 4.5 for comprehensive agent testing
import json
from datetime import datetime

# Initialize your agent (implementation from previous chapters)
agent = PersonalAssistant(
    model="claude-sonnet-4.5",
    tools=[calculator, weather_api, memory_manager]
)

# Define test suite
test_suite = [
    {
        "id": "basic_question",
        "input": "What is the capital of France?",
        "expected_output": "Paris",
        "category": "knowledge"
    },
    {
        "id": "tool_calculator",
        "input": "Calculate 789 times 456",
        "expected_behavior": "uses_calculator",
        "expected_output": "359,784",
        "category": "tool_use"
    },
    {
        "id": "memory_store",
        "input": "Remember that I prefer coffee in the morning",
        "expected_behavior": "stores_to_memory",
        "category": "memory"
    },
    {
        "id": "memory_retrieve",
        "input": "What do I like to drink in the morning?",
        "expected_output": "coffee",
        "category": "memory",
        "requires": ["memory_store"]
    }
]

# Run tests
results = run_test_suite(agent, test_suite)

# Generate summary
summarize_results(results, test_suite)

# Save for historical tracking
save_test_run(results, version_name="v1.2")

When you run this, you'll get immediate feedback on how your agent performs, a summary of results by category, and a saved record you can compare against future test runs.

What You've Built

You now have a practical framework for testing your AI agent. You can create test cases that cover all your agent's capabilities, run those tests systematically, and track performance over time. More importantly, you understand what makes a good test and how to interpret the results.

Testing isn't just about finding bugs. It's about understanding your agent's strengths and weaknesses so you can make informed decisions about where to improve. Every test run gives you data to work with, and that data guides your development.

In the next section, we'll look at how to create a feedback loop that uses these test results to continuously improve your agent. Testing measures performance. The feedback loop turns those measurements into action.

Glossary

Test Case: A specific input, expected behavior, and expected output used to verify that an agent works correctly. Think of it like a practice problem that tests whether your agent learned a particular skill.

Test Suite: A collection of related test cases that together verify multiple aspects of an agent's functionality. A comprehensive test suite covers all the major capabilities your agent needs.

Pass Rate: The percentage of tests that succeed in a test run. A pass rate of 80% means 8 out of 10 tests passed, giving you a quantitative measure of your agent's performance.

Test Dependency: When one test requires another test to run successfully first. For example, testing memory retrieval requires that memory storage worked first.

Test Fixture: Setup code that prepares the environment for testing, such as initializing an agent or loading test data. Fixtures help you avoid repeating setup code in every test.

Regression: When a change to your agent causes previously passing tests to fail. Tracking regressions helps ensure that improvements don't accidentally break existing functionality.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about testing AI agents with examples.

About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
