Setting Goals and Success Criteria: How to Define What Success Means for Your AI Agent

Michael Brenndoerfer

AI Agent Handbook Machine Learning Data, Analytics & AI Software Engineering

Learn how to define clear, measurable success criteria for AI agents including correctness, reliability, efficiency, safety, and user experience metrics to guide evaluation and improvement.

Track your reading progress

Sign in to mark chapters as read and track your learning journey

Sign in →

Setting Goals and Success CriteriaLink Copied

You've built an AI agent that can reason, use tools, maintain memory, make plans, and even coordinate with other agents. But here's a question that should nag at you: How do you know if it actually works well?

It's tempting to judge by feel. "The agent seems smart," or "It usually gets things right," or "My colleagues were impressed in the demo." But relying on intuition alone is risky. Without clear success criteria, you can't systematically improve your agent, catch regressions when you make changes, or confidently deploy it to users.

Before you write a single test or collect any feedback, you need to answer a fundamental question: What does success look like for your agent? This chapter teaches you how to define clear, measurable goals that will guide your evaluation efforts and help you build a better agent.

Why Success Criteria MatterLink Copied

Imagine you've added planning capabilities to your personal assistant. A user asks, "Schedule a team meeting next Tuesday at 2 PM and send the agenda to everyone." The agent goes through its plan, makes several tool calls, and responds, "Done!"

Is that success? Maybe. But what if the meeting was scheduled for the wrong time? What if the agenda was sent to the wrong people? What if it worked perfectly but took five minutes to complete? Without predefined success criteria, you're left guessing.

Clear success criteria serve several purposes:

They make success concrete. Instead of "works well," you have "correctly schedules meetings 95% of the time."
They guide development. When you know what matters, you can prioritize improvements that move the needle.
They enable systematic testing. You can create test cases that directly measure whether you're meeting your goals.
They catch regressions. When you change something, you can verify you haven't broken what already worked.
They build confidence. Concrete measurements let you deploy with evidence, not hope.

Let's explore how to set these criteria for your agent.

Start with User GoalsLink Copied

The best success criteria emerge from understanding what users actually need from your agent. Our personal assistant exists to help users accomplish tasks, so we should define success in terms of those tasks.

Start by listing the main capabilities you've given your agent. For our assistant, that might include:

Answering factual questions
Performing calculations
Managing calendar events
Retrieving information from memory
Planning and executing multi-step tasks
Searching the web for current information

For each capability, ask: What would a successful interaction look like from the user's perspective?

Example: Calendar Management

When a user says, "Add a dentist appointment next Friday at 3 PM," success means:

The event is created in the calendar
The date is correct (next Friday, not this Friday or some other day)
The time is correct (3 PM, not 3 AM or 2 PM)
The description is accurate ("dentist appointment")
The user receives confirmation

Notice how specific this is. We're not just saying "the agent should handle calendar requests." We're defining exactly what "handle" means in measurable terms.

Types of Success CriteriaLink Copied

Different capabilities call for different kinds of success criteria. Let's explore the main categories.

Correctness CriteriaLink Copied

Correctness measures whether the agent produces the right answer or performs the right action. This is your most fundamental criterion.

For our assistant:

Factual questions: Does the agent provide accurate information? If asked "What's the capital of France?", does it say "Paris"?
Calculations: Does the agent compute the correct result? If asked "What's 15% of $240?", does it return "$ 36"?
Tool use: Does the agent call the right tool with the right parameters? If asked to "check the weather," does it invoke the weather API with the correct location?
Task completion: Does the agent accomplish what the user requested? If asked to "schedule a meeting," is the meeting actually scheduled?

For our personal assistant, we might set a correctness goal: "The agent should provide correct answers or complete tasks successfully in at least 90% of interactions."

Reliability CriteriaLink Copied

Reliability measures consistency. An agent that works sometimes isn't good enough.

Consider these reliability dimensions:

Consistency: Does the agent give the same answer to the same question asked twice?
Robustness: Does the agent handle variations in phrasing? If a user says "What's tomorrow's weather?" versus "Tell me the weather forecast for tomorrow," does it work both times?
Error handling: When something goes wrong (a tool fails, information is missing), does the agent handle it gracefully?

For our assistant, a reliability goal might be: "The agent should handle at least 95% of paraphrased queries correctly" or "The agent should gracefully handle tool failures without crashing or giving confusing responses."

Efficiency CriteriaLink Copied

Even correct answers lose value if they take too long. Efficiency criteria measure speed and resource usage.

For our assistant:

Response time: How quickly does the agent respond? For a simple question, 2 seconds might be acceptable, but 30 seconds is not.
Tool usage: Does the agent make unnecessary tool calls? If it can answer from memory, it shouldn't search the web.
Cost: How much does each interaction cost in API calls? If you're using a paid model, efficiency directly affects your budget.

An efficiency goal might be: "90% of simple queries should receive responses within 3 seconds" or "The agent should complete multi-step tasks using no more than 5 tool calls on average."

Safety and Compliance CriteriaLink Copied

Your agent should refuse inappropriate requests and respect boundaries.

Safety criteria for our assistant:

Refusal handling: Does the agent politely decline requests it shouldn't fulfill?
Privacy: Does the agent protect sensitive information and not leak it in responses?
Permissions: Does the agent respect access controls? It shouldn't read files or access data it's not authorized to see.

A safety goal: "The agent should refuse 100% of requests that violate safety policies" or "The agent should never expose API keys or passwords in its responses."

User Experience CriteriaLink Copied

Sometimes success isn't just about correctness but about how the interaction feels.

Clarity: Are the agent's responses clear and helpful? Or confusing and verbose?
Tone: Is the agent appropriately professional, friendly, or conversational?
Confirmation: For destructive or important actions, does the agent ask for confirmation before proceeding?

A UX goal: "95% of users should rate the agent's responses as 'clear and helpful' on a post-interaction survey."

Making Criteria MeasurableLink Copied

Good success criteria are specific and measurable. "The agent should be good" isn't useful. "The agent should correctly answer 90% of factual questions from our test set" is.

Here's how to transform vague goals into measurable criteria:

Vague: "The agent should usually work."

Measurable: "The agent should successfully complete tasks in at least 85% of test cases."

Vague: "Responses should be fast."

Measurable: "90% of single-turn queries should receive responses within 3 seconds."

Vague: "The agent should handle errors well."

Measurable: "When a tool call fails, the agent should provide a helpful error message to the user and suggest alternatives 100% of the time."

Notice the pattern: measurable criteria include numbers and clear conditions. You should be able to look at an interaction and definitively say whether it meets the criterion.

Setting ThresholdsLink Copied

Once you know what to measure, you need to decide what counts as success. This means setting thresholds.

Start with aspirational but realistic targets. If you're just beginning to evaluate your agent, you might not hit 95% correctness immediately, and that's okay. The point is to know where you stand.

For our personal assistant, we might set these initial thresholds:

Correctness: ≥85% correct responses (goal: improve to 95%)
Reliability: ≥90% consistency on paraphrased queries
Efficiency: ≥80% of simple queries under 3 seconds
Safety: 100% refusal of unsafe requests (no compromise here)

As you improve your agent, you can raise these thresholds. But start with levels that let you make progress without getting discouraged.

Prioritizing CriteriaLink Copied

You probably can't optimize for everything at once. Some criteria will be more important than others.

For our assistant, we might prioritize like this:

Safety (highest priority): The agent must never do something harmful or expose sensitive data.
Correctness: The agent should get answers right. An agent that's fast but wrong isn't useful.
Reliability: The agent should work consistently. Users won't trust an agent that's flaky.
Efficiency: The agent should be reasonably fast, but we'd accept slower responses if it means higher correctness.
User experience: Responses should be clear, but this is less critical than the above.

This hierarchy guides trade-offs. If improving speed reduces correctness, you don't do it. If making responses clearer adds a bit of latency, you might.

Your priorities will depend on your use case. A customer support agent might prioritize response time over exhaustive research. A medical diagnosis assistant should prioritize correctness above all else.

Example: Success Criteria for Our Personal AssistantLink Copied

Let's put this together with a concrete example. Here are the success criteria we'll use for our personal assistant:

Core Functionality

Factual accuracy: ≥90% of factual questions answered correctly
Calculation accuracy: 100% of mathematical calculations correct (no room for error here)
Task completion: ≥85% of multi-step tasks completed successfully
Tool selection: ≥95% of tool calls use the correct tool with correct parameters

Reliability

Paraphrase handling: ≥90% of paraphrased queries handled correctly
Error recovery: 100% of tool failures result in helpful error messages (not crashes)

Efficiency

Response time (simple): ≥90% of single-turn queries respond within 3 seconds
Response time (complex): ≥80% of multi-step tasks complete within 15 seconds
Tool efficiency: Multi-step tasks use ≤6 tool calls on average

Safety

Refusal rate: 100% of unsafe requests refused politely
Data privacy: 0% of responses leak API keys, passwords, or sensitive user data

User Experience

Clarity: ≥90% of responses rated "clear" in user surveys
Confirmation: 100% of destructive actions (delete, send email) require user confirmation

These criteria are specific, measurable, and tied to real user needs. They give us clear targets for testing and improvement.

Documenting Your CriteriaLink Copied

Once you've defined success criteria, write them down. This might seem obvious, but it's easy to let criteria remain implicit. That leads to confusion and inconsistency.

Create a simple document that lists:

Capability: What the agent should do
Success criterion: How you'll measure success
Threshold: The target value
Priority: How important this is relative to other criteria

Here's a simple format:

CAPABILITY: Answer factual questions
SUCCESS CRITERION: Percentage of correct answers on test set
THRESHOLD: ≥90%
PRIORITY: High

CAPABILITY: Schedule calendar events
SUCCESS CRITERION: Percentage of events correctly scheduled (right date, time, description)
THRESHOLD: ≥95%
PRIORITY: High

CAPABILITY: Respond to simple queries quickly
SUCCESS CRITERION: Percentage of responses under 3 seconds
THRESHOLD: ≥90%
PRIORITY: Medium

CAPABILITY: Answer factual questions
SUCCESS CRITERION: Percentage of correct answers on test set
THRESHOLD: ≥90%
PRIORITY: High

CAPABILITY: Schedule calendar events
SUCCESS CRITERION: Percentage of events correctly scheduled (right date, time, description)
THRESHOLD: ≥95%
PRIORITY: High

CAPABILITY: Respond to simple queries quickly
SUCCESS CRITERION: Percentage of responses under 3 seconds
THRESHOLD: ≥90%
PRIORITY: Medium

Share this document with your team. When you make changes to the agent, refer back to these criteria to ensure you're improving what matters.

Iterating on Success CriteriaLink Copied

Your success criteria aren't set in stone. As you learn more about how users interact with your agent, you'll refine them.

You might discover that some criteria were too strict or too lenient. You might find that you forgot important dimensions. You might realize that a criterion you thought mattered doesn't actually affect user satisfaction.

That's fine. Treat your success criteria as a living document. After each round of testing or each deployment cycle, revisit them:

Are these criteria still relevant?
Are the thresholds appropriate?
Are we missing important dimensions of success?
Have our priorities changed?

The goal isn't perfection on the first try. The goal is to have clear, shared criteria that evolve as your understanding deepens.

What We've CoveredLink Copied

Before you test your agent, you need to know what you're testing for. Success criteria transform vague notions of "working well" into concrete, measurable targets.

We've explored how to define success criteria by starting with user goals and identifying multiple dimensions: correctness, reliability, efficiency, safety, and user experience. We've seen how to make criteria measurable by setting specific thresholds and how to prioritize when you can't optimize for everything at once.

For our personal assistant, we've established a comprehensive set of success criteria covering everything from factual accuracy to response time to safety. These criteria will guide the testing and improvement work in the chapters ahead.

You now have a framework for defining what success means for your agent. In the next sections, we'll explore how to actually measure whether you're meeting these criteria and how to use that information to make your agent better.

GlossaryLink Copied

Correctness: A measure of whether an agent produces accurate answers or successfully completes requested tasks. This is the most fundamental evaluation criterion.

Efficiency Criteria: Standards that measure how quickly an agent responds and how many resources (tool calls, API requests, compute time) it uses. Efficiency matters for both user experience and cost.

Reliability: A measure of how consistently an agent performs across different phrasings, contexts, and edge cases. A reliable agent gives correct answers not just sometimes, but predictably.

Safety Criteria: Standards ensuring an agent refuses inappropriate requests, protects sensitive information, and operates within defined boundaries. For many applications, safety criteria should have no tolerance for failure.

Success Criteria: Specific, measurable standards that define what counts as successful agent performance. They transform vague goals like "works well" into concrete targets like "answers correctly 90% of the time."

Threshold: The minimum acceptable value for a success criterion. For example, "response time under 3 seconds" or "correctness above 85%." Thresholds make success criteria actionable.

User Experience (UX) Criteria: Standards that measure qualitative aspects of agent interactions, such as clarity, tone, and helpfulness. These criteria often require user feedback to evaluate.

QuizLink Copied

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about setting goals and success criteria for AI agents.

Loading component...

Track your reading progress

Sign in to mark chapters as read and track your learning journey

Sign in →

Comments

Back to AI Agent Handbook

Previous Chapter

Benefits and Challenges of Multi-Agent Systems

Next Chapter

Testing the Agent with Examples

Reference

BIBTEXAcademic

@misc{settinggoalsandsuccesscriteriahowtodefinewhatsuccessmeansforyouraiagent, author = {Michael Brenndoerfer}, title = {Setting Goals and Success Criteria: How to Define What Success Means for Your AI Agent}, year = {2025}, url = {https://mbrenndoerfer.com/writing/setting-goals-and-success-criteria-ai-agent-evaluation}, organization = {mbrenndoerfer.com}, note = {Accessed: 2025-12-25} }

APAAcademic

Michael Brenndoerfer (2025). Setting Goals and Success Criteria: How to Define What Success Means for Your AI Agent. Retrieved from https://mbrenndoerfer.com/writing/setting-goals-and-success-criteria-ai-agent-evaluation

MLAAcademic

Michael Brenndoerfer. "Setting Goals and Success Criteria: How to Define What Success Means for Your AI Agent." 2025. Web. 12/25/2025. <https://mbrenndoerfer.com/writing/setting-goals-and-success-criteria-ai-agent-evaluation>.

CHICAGOAcademic

Michael Brenndoerfer. "Setting Goals and Success Criteria: How to Define What Success Means for Your AI Agent." Accessed 12/25/2025. https://mbrenndoerfer.com/writing/setting-goals-and-success-criteria-ai-agent-evaluation.

HARVARDAcademic

Michael Brenndoerfer (2025) 'Setting Goals and Success Criteria: How to Define What Success Means for Your AI Agent'. Available at: https://mbrenndoerfer.com/writing/setting-goals-and-success-criteria-ai-agent-evaluation (Accessed: 12/25/2025).

SimpleBasic

Michael Brenndoerfer (2025). Setting Goals and Success Criteria: How to Define What Success Means for Your AI Agent. https://mbrenndoerfer.com/writing/setting-goals-and-success-criteria-ai-agent-evaluation

Direct link:

https://mbrenndoerfer.com/writing/setting-goals-and-success-criteria-ai-agent-evaluation

About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.

View Full Resume Publications Contact Books

Setting Goals and Success Criteria: How to Define What Success Means for Your AI Agent

Setting Goals and Success CriteriaLink Copied

Why Success Criteria MatterLink Copied

Start with User GoalsLink Copied

Types of Success CriteriaLink Copied

Correctness CriteriaLink Copied

Reliability CriteriaLink Copied

Efficiency CriteriaLink Copied

Safety and Compliance CriteriaLink Copied

User Experience CriteriaLink Copied

Making Criteria MeasurableLink Copied

Setting ThresholdsLink Copied

Prioritizing CriteriaLink Copied

Example: Success Criteria for Our Personal AssistantLink Copied

Documenting Your CriteriaLink Copied

Iterating on Success CriteriaLink Copied

What We've CoveredLink Copied

GlossaryLink Copied

QuizLink Copied

Comments

Reference

About the author: Michael Brenndoerfer

Related Content

Numerical Methods in Finance: Algorithms for Pricing & Risk

Continuous Feedback and Improvement: Building Better AI Agents Through Iteration

Testing AI Agents with Examples: Building Test Suites for Evaluation & Performance Tracking

Stay updated

Comments

About the author: Michael Brenndoerfer

Related Content

Numerical Methods in Finance: Algorithms for Pricing & Risk

Continuous Feedback and Improvement: Building Better AI Agents Through Iteration

Testing AI Agents with Examples: Building Test Suites for Evaluation & Performance Tracking

Stay updated