Learn how to define clear, measurable success criteria for AI agents including correctness, reliability, efficiency, safety, and user experience metrics to guide evaluation and improvement.

This article is part of the free-to-read AI Agent Handbook
Setting Goals and Success Criteria
You've built an AI agent that can reason, use tools, maintain memory, make plans, and even coordinate with other agents. But here's a question that should nag at you: How do you know if it actually works well?
It's tempting to judge by feel. "The agent seems smart," or "It usually gets things right," or "My colleagues were impressed in the demo." But relying on intuition alone is risky. Without clear success criteria, you can't systematically improve your agent, catch regressions when you make changes, or confidently deploy it to users.
Before you write a single test or collect any feedback, you need to answer a fundamental question: What does success look like for your agent? This chapter teaches you how to define clear, measurable goals that will guide your evaluation efforts and help you build a better agent.
Why Success Criteria Matter
Imagine you've added planning capabilities to your personal assistant. A user asks, "Schedule a team meeting next Tuesday at 2 PM and send the agenda to everyone." The agent goes through its plan, makes several tool calls, and responds, "Done!"
Is that success? Maybe. But what if the meeting was scheduled for the wrong time? What if the agenda was sent to the wrong people? What if it worked perfectly but took five minutes to complete? Without predefined success criteria, you're left guessing.
Clear success criteria serve several purposes:
- They make success concrete. Instead of "works well," you have "correctly schedules meetings 95% of the time."
- They guide development. When you know what matters, you can prioritize improvements that move the needle.
- They enable systematic testing. You can create test cases that directly measure whether you're meeting your goals.
- They catch regressions. When you change something, you can verify you haven't broken what already worked.
- They build confidence. Concrete measurements let you deploy with evidence, not hope.
Let's explore how to set these criteria for your agent.
Start with User Goals
The best success criteria emerge from understanding what users actually need from your agent. Our personal assistant exists to help users accomplish tasks, so we should define success in terms of those tasks.
Start by listing the main capabilities you've given your agent. For our assistant, that might include:
- Answering factual questions
- Performing calculations
- Managing calendar events
- Retrieving information from memory
- Planning and executing multi-step tasks
- Searching the web for current information
For each capability, ask: What would a successful interaction look like from the user's perspective?
Example: Calendar Management
When a user says, "Add a dentist appointment next Friday at 3 PM," success means:
- The event is created in the calendar
- The date is correct (next Friday, not this Friday or some other day)
- The time is correct (3 PM, not 3 AM or 2 PM)
- The description is accurate ("dentist appointment")
- The user receives confirmation
Notice how specific this is. We're not just saying "the agent should handle calendar requests." We're defining exactly what "handle" means in measurable terms.
Types of Success Criteria
Different capabilities call for different kinds of success criteria. Let's explore the main categories.
Correctness Criteria
Correctness measures whether the agent produces the right answer or performs the right action. This is your most fundamental criterion.
For our assistant:
- Factual questions: Does the agent provide accurate information? If asked "What's the capital of France?", does it say "Paris"?
- Calculations: Does the agent compute the correct result? If asked "What's 15% of 36"?
- Tool use: Does the agent call the right tool with the right parameters? If asked to "check the weather," does it invoke the weather API with the correct location?
- Task completion: Does the agent accomplish what the user requested? If asked to "schedule a meeting," is the meeting actually scheduled?
For our personal assistant, we might set a correctness goal: "The agent should provide correct answers or complete tasks successfully in at least 90% of interactions."
Reliability Criteria
Reliability measures consistency. An agent that works sometimes isn't good enough.
Consider these reliability dimensions:
- Consistency: Does the agent give the same answer to the same question asked twice?
- Robustness: Does the agent handle variations in phrasing? If a user says "What's tomorrow's weather?" versus "Tell me the weather forecast for tomorrow," does it work both times?
- Error handling: When something goes wrong (a tool fails, information is missing), does the agent handle it gracefully?
For our assistant, a reliability goal might be: "The agent should handle at least 95% of paraphrased queries correctly" or "The agent should gracefully handle tool failures without crashing or giving confusing responses."
Efficiency Criteria
Even correct answers lose value if they take too long. Efficiency criteria measure speed and resource usage.
For our assistant:
- Response time: How quickly does the agent respond? For a simple question, 2 seconds might be acceptable, but 30 seconds is not.
- Tool usage: Does the agent make unnecessary tool calls? If it can answer from memory, it shouldn't search the web.
- Cost: How much does each interaction cost in API calls? If you're using a paid model, efficiency directly affects your budget.
An efficiency goal might be: "90% of simple queries should receive responses within 3 seconds" or "The agent should complete multi-step tasks using no more than 5 tool calls on average."
Safety and Compliance Criteria
Your agent should refuse inappropriate requests and respect boundaries.
Safety criteria for our assistant:
- Refusal handling: Does the agent politely decline requests it shouldn't fulfill?
- Privacy: Does the agent protect sensitive information and not leak it in responses?
- Permissions: Does the agent respect access controls? It shouldn't read files or access data it's not authorized to see.
A safety goal: "The agent should refuse 100% of requests that violate safety policies" or "The agent should never expose API keys or passwords in its responses."
User Experience Criteria
Sometimes success isn't just about correctness but about how the interaction feels.
- Clarity: Are the agent's responses clear and helpful? Or confusing and verbose?
- Tone: Is the agent appropriately professional, friendly, or conversational?
- Confirmation: For destructive or important actions, does the agent ask for confirmation before proceeding?
A UX goal: "95% of users should rate the agent's responses as 'clear and helpful' on a post-interaction survey."
Making Criteria Measurable
Good success criteria are specific and measurable. "The agent should be good" isn't useful. "The agent should correctly answer 90% of factual questions from our test set" is.
Here's how to transform vague goals into measurable criteria:
Vague: "The agent should usually work."
Measurable: "The agent should successfully complete tasks in at least 85% of test cases."
Vague: "Responses should be fast."
Measurable: "90% of single-turn queries should receive responses within 3 seconds."
Vague: "The agent should handle errors well."
Measurable: "When a tool call fails, the agent should provide a helpful error message to the user and suggest alternatives 100% of the time."
Notice the pattern: measurable criteria include numbers and clear conditions. You should be able to look at an interaction and definitively say whether it meets the criterion.
Setting Thresholds
Once you know what to measure, you need to decide what counts as success. This means setting thresholds.
Start with aspirational but realistic targets. If you're just beginning to evaluate your agent, you might not hit 95% correctness immediately, and that's okay. The point is to know where you stand.
For our personal assistant, we might set these initial thresholds:
- Correctness: ≥85% correct responses (goal: improve to 95%)
- Reliability: ≥90% consistency on paraphrased queries
- Efficiency: ≥80% of simple queries under 3 seconds
- Safety: 100% refusal of unsafe requests (no compromise here)
As you improve your agent, you can raise these thresholds. But start with levels that let you make progress without getting discouraged.
Prioritizing Criteria
You probably can't optimize for everything at once. Some criteria will be more important than others.
For our assistant, we might prioritize like this:
- Safety (highest priority): The agent must never do something harmful or expose sensitive data.
- Correctness: The agent should get answers right. An agent that's fast but wrong isn't useful.
- Reliability: The agent should work consistently. Users won't trust an agent that's flaky.
- Efficiency: The agent should be reasonably fast, but we'd accept slower responses if it means higher correctness.
- User experience: Responses should be clear, but this is less critical than the above.
This hierarchy guides trade-offs. If improving speed reduces correctness, you don't do it. If making responses clearer adds a bit of latency, you might.
Your priorities will depend on your use case. A customer support agent might prioritize response time over exhaustive research. A medical diagnosis assistant should prioritize correctness above all else.
Example: Success Criteria for Our Personal Assistant
Let's put this together with a concrete example. Here are the success criteria we'll use for our personal assistant:
Core Functionality
- Factual accuracy: ≥90% of factual questions answered correctly
- Calculation accuracy: 100% of mathematical calculations correct (no room for error here)
- Task completion: ≥85% of multi-step tasks completed successfully
- Tool selection: ≥95% of tool calls use the correct tool with correct parameters
Reliability
- Paraphrase handling: ≥90% of paraphrased queries handled correctly
- Error recovery: 100% of tool failures result in helpful error messages (not crashes)
Efficiency
- Response time (simple): ≥90% of single-turn queries respond within 3 seconds
- Response time (complex): ≥80% of multi-step tasks complete within 15 seconds
- Tool efficiency: Multi-step tasks use ≤6 tool calls on average
Safety
- Refusal rate: 100% of unsafe requests refused politely
- Data privacy: 0% of responses leak API keys, passwords, or sensitive user data
User Experience
- Clarity: ≥90% of responses rated "clear" in user surveys
- Confirmation: 100% of destructive actions (delete, send email) require user confirmation
These criteria are specific, measurable, and tied to real user needs. They give us clear targets for testing and improvement.
Documenting Your Criteria
Once you've defined success criteria, write them down. This might seem obvious, but it's easy to let criteria remain implicit. That leads to confusion and inconsistency.
Create a simple document that lists:
- Capability: What the agent should do
- Success criterion: How you'll measure success
- Threshold: The target value
- Priority: How important this is relative to other criteria
Here's a simple format:
1CAPABILITY: Answer factual questions
2SUCCESS CRITERION: Percentage of correct answers on test set
3THRESHOLD: ≥90%
4PRIORITY: High
5
6CAPABILITY: Schedule calendar events
7SUCCESS CRITERION: Percentage of events correctly scheduled (right date, time, description)
8THRESHOLD: ≥95%
9PRIORITY: High
10
11CAPABILITY: Respond to simple queries quickly
12SUCCESS CRITERION: Percentage of responses under 3 seconds
13THRESHOLD: ≥90%
14PRIORITY: Medium1CAPABILITY: Answer factual questions
2SUCCESS CRITERION: Percentage of correct answers on test set
3THRESHOLD: ≥90%
4PRIORITY: High
5
6CAPABILITY: Schedule calendar events
7SUCCESS CRITERION: Percentage of events correctly scheduled (right date, time, description)
8THRESHOLD: ≥95%
9PRIORITY: High
10
11CAPABILITY: Respond to simple queries quickly
12SUCCESS CRITERION: Percentage of responses under 3 seconds
13THRESHOLD: ≥90%
14PRIORITY: MediumShare this document with your team. When you make changes to the agent, refer back to these criteria to ensure you're improving what matters.
Iterating on Success Criteria
Your success criteria aren't set in stone. As you learn more about how users interact with your agent, you'll refine them.
You might discover that some criteria were too strict or too lenient. You might find that you forgot important dimensions. You might realize that a criterion you thought mattered doesn't actually affect user satisfaction.
That's fine. Treat your success criteria as a living document. After each round of testing or each deployment cycle, revisit them:
- Are these criteria still relevant?
- Are the thresholds appropriate?
- Are we missing important dimensions of success?
- Have our priorities changed?
The goal isn't perfection on the first try. The goal is to have clear, shared criteria that evolve as your understanding deepens.
What We've Covered
Before you test your agent, you need to know what you're testing for. Success criteria transform vague notions of "working well" into concrete, measurable targets.
We've explored how to define success criteria by starting with user goals and identifying multiple dimensions: correctness, reliability, efficiency, safety, and user experience. We've seen how to make criteria measurable by setting specific thresholds and how to prioritize when you can't optimize for everything at once.
For our personal assistant, we've established a comprehensive set of success criteria covering everything from factual accuracy to response time to safety. These criteria will guide the testing and improvement work in the chapters ahead.
You now have a framework for defining what success means for your agent. In the next sections, we'll explore how to actually measure whether you're meeting these criteria and how to use that information to make your agent better.
Glossary
Correctness: A measure of whether an agent produces accurate answers or successfully completes requested tasks. This is the most fundamental evaluation criterion.
Efficiency Criteria: Standards that measure how quickly an agent responds and how many resources (tool calls, API requests, compute time) it uses. Efficiency matters for both user experience and cost.
Reliability: A measure of how consistently an agent performs across different phrasings, contexts, and edge cases. A reliable agent gives correct answers not just sometimes, but predictably.
Safety Criteria: Standards ensuring an agent refuses inappropriate requests, protects sensitive information, and operates within defined boundaries. For many applications, safety criteria should have no tolerance for failure.
Success Criteria: Specific, measurable standards that define what counts as successful agent performance. They transform vague goals like "works well" into concrete targets like "answers correctly 90% of the time."
Threshold: The minimum acceptable value for a success criterion. For example, "response time under 3 seconds" or "correctness above 85%." Thresholds make success criteria actionable.
User Experience (UX) Criteria: Standards that measure qualitative aspects of agent interactions, such as clarity, tone, and helpfulness. These criteria often require user feedback to evaluate.
Quiz
Ready to test your understanding? Take this quick quiz to reinforce what you've learned about setting goals and success criteria for AI agents.
Reference

About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
Related Content

Scaling Up without Breaking the Bank: AI Agent Performance & Cost Optimization at Scale
Learn how to scale AI agents from single users to thousands while maintaining performance and controlling costs. Covers horizontal scaling, load balancing, monitoring, cost controls, and prompt optimization strategies.

Managing and Reducing AI Agent Costs: Complete Guide to Cost Optimization Strategies
Learn how to dramatically reduce AI agent API costs without sacrificing capability. Covers model selection, caching, batching, prompt optimization, and budget controls with practical Python examples.

Speeding Up AI Agents: Performance Optimization Techniques for Faster Response Times
Learn practical techniques to make AI agents respond faster, including model selection strategies, response caching, streaming, parallel execution, and prompt optimization for reduced latency.
Stay updated
Get notified when I publish new articles on data and AI, private equity, technology, and more.

