Continuous Feedback and Improvement: Building Better AI Agents Through Iteration

Michael Brenndoerfer • November 10, 2025 • 14 min read

Learn how to create feedback loops that continuously improve your AI agent through real-world usage data, pattern analysis, and targeted improvements.

This article is part of the free-to-read AI Agent Handbook.

Continuous Feedback and Improvement

In the previous chapters, you learned how to set success criteria and test your agent with examples. But here's the truth: evaluation isn't something you do once and forget. The best AI agents get better over time because their developers treat evaluation as an ongoing conversation, not a final exam.

Think about how you improve at anything. You try something, see what works and what doesn't, adjust your approach, and try again. Your agent deserves the same treatment. In this chapter, we'll explore how to create a feedback loop that continuously improves your assistant based on real-world use.

Why Continuous Improvement Matters

Let's say you've built the personal assistant we've been developing and tested it thoroughly. It passed all your test cases with flying colors. You deploy it, feeling confident. Then users start interacting with it, and something unexpected happens.

A user asks: "What's the weather like for my trip next week?" Your assistant has access to a weather tool and a calendar tool. It should check the calendar for the trip dates, then look up the weather. But instead, it assumes "next week" means exactly seven days from now and returns weather for the wrong dates. Your test cases never caught this because they always used explicit dates.

This is exactly why continuous improvement matters. Real users will surprise you with questions, phrasings, and scenarios you never anticipated. Each unexpected interaction is a gift: it shows you where your agent needs to grow.

Building a Feedback Loop

A feedback loop is simple in concept: use what you learn from the agent's performance to make it better. Here's how this looks in practice.

Capture What Goes Wrong

First, you need to know when things go wrong. There are several ways to collect this information:

User feedback: The most direct approach. When users interact with your assistant, give them a way to signal if something went wrong. This could be as simple as a thumbs up or thumbs down button, or a "Report an issue" link. Some developers include a quick feedback form: "Was this answer helpful? If not, what went wrong?"

Automatic failure detection: Your agent can monitor itself for certain red flags. Did a tool call fail? Did the agent say "I don't know" when it should have known? Did it take an unusually long time to respond? Log these events automatically.

Manual review: Periodically review a sample of conversations. You might pick random interactions, or focus on edge cases (very short queries, very long conversations, or sessions where the user asked follow-up questions repeatedly).

Here's a simple example of capturing feedback in code:

# Using Claude Sonnet 4.5 for its superior agent reasoning capabilities
import anthropic
import json
from datetime import datetime

client = anthropic.Anthropic()

def log_interaction(user_query, agent_response, success, feedback=None):
    """Log each interaction for later review."""
    log_entry = {
        "timestamp": datetime.now().isoformat(),
        "query": user_query,
        "response": agent_response,
        "success": success,
        "feedback": feedback
    }

    # In practice, you'd write this to a database or file
    with open("agent_feedback.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")

# Example usage
user_query = "What's the weather like for my trip next week?"
response = "The weather forecast for next week shows sunny skies with temperatures around 75°F."

# User indicates this was unhelpful
log_interaction(user_query, response, success=False, feedback="Wrong dates - didn't check my calendar")

This logs every interaction along with whether it succeeded. Over time, you'll build a dataset of what works and what doesn't.
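
Automatic failure detection fits naturally on top of this logging. You can wrap the agent call itself and record a failure entry whenever a red flag appears. Here's a minimal sketch, assuming a hypothetical run_agent function that returns the response text along with any failed tool calls; adjust the checks to match your own agent loop:

import time

def run_with_monitoring(user_query, run_agent, latency_threshold=15.0):
    """Run the agent and automatically log red flags.

    run_agent is a placeholder: assumed to return (response_text, tool_errors),
    where tool_errors is a list of failed tool calls.
    """
    start = time.time()
    response, tool_errors = run_agent(user_query)
    elapsed = time.time() - start

    # Heuristic red flags: failed tools, non-answers, slow responses
    flags = []
    if tool_errors:
        flags.append("tool_call_failure")
    if "I don't know" in response or "I'm not sure" in response:
        flags.append("possible_non_answer")
    if elapsed > latency_threshold:
        flags.append("slow_response")

    # Reuses log_interaction from the snippet above
    log_interaction(user_query, response, success=not flags, feedback=", ".join(flags) or None)
    return response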

Analyze Patterns

Once you're collecting feedback, look for patterns. This is where the learning happens. You're not trying to fix every single mistake individually. Instead, you want to understand categories of problems.

Let's say after a week of use, you review your logs and notice:

  • 15 queries where the agent misunderstood relative time phrases ("next week," "tomorrow," "in a few days")
  • 8 queries where the agent used a tool but got an error and didn't recover gracefully
  • 5 queries where the agent gave overly long answers when the user clearly wanted something brief

These patterns tell you where to focus your improvement efforts. The time phrases are the biggest issue, affecting the most users. That's your priority.
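
You can get a rough first pass at this categorization in code before reading individual transcripts. The sketch below reads the feedback log from earlier and buckets failures using a hypothetical set of keyword rules; in practice, assigning categories usually needs a human reviewer or the AI critic described later in this chapter:

import json
from collections import Counter

# Hypothetical keyword rules for bucketing failures into categories
CATEGORY_KEYWORDS = {
    "relative_time_error": ["next week", "tomorrow", "in a few days", "wrong dates"],
    "tool_call_failure": ["tool error", "failed", "couldn't access"],
    "response_too_long": ["too long", "verbose", "shorter"],
}

def categorize_failures(log_file="agent_feedback.jsonl"):
    """Count failure categories from the feedback log."""
    counts = Counter()
    with open(log_file) as f:
        for line in f:
            entry = json.loads(line)
            if entry["success"]:
                continue
            text = " ".join(filter(None, [entry.get("feedback"), entry.get("query")])).lower()
            for category, keywords in CATEGORY_KEYWORDS.items():
                if any(kw in text for kw in keywords):
                    counts[category] += 1
                    break
            else:
                counts["uncategorized"] += 1
    return counts

# Most common categories first: this is your priority list
for category, count in categorize_failures().most_common():
    print(f"{category}: {count}")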

Make Targeted Improvements

Now you can improve your agent based on what you learned. For the time phrase issue, you might:

Refine the prompt: Add explicit guidance about handling relative time. For example: "When the user mentions relative time phrases like 'next week' or 'tomorrow,' always check the calendar tool first to determine the exact dates they mean."

Add examples: Include few-shot examples in your prompt showing how to handle these cases correctly.

Improve tool descriptions: Make sure your calendar tool description explicitly mentions that it can resolve relative dates (a sketch of this follows the prompt example below).

Here's how you might update the system prompt:

# Using Claude Sonnet 4.5 for agent tasks
import anthropic

client = anthropic.Anthropic()

# Original prompt (had issues with relative time)
original_prompt = """You are a helpful personal assistant with access to tools.
Help the user with their requests using the available tools when needed."""

# Improved prompt based on feedback
improved_prompt = """You are a helpful personal assistant with access to tools.

When handling time-related requests:
- If the user mentions relative time ("next week", "tomorrow", "in a few days"), ALWAYS check the calendar tool first to determine exact dates
- Never assume relative time without checking the current date and the user's schedule

Example:
User: "What's the weather for my trip next week?"
Your thinking: I need to check the calendar to see when "next week" is and if there's a trip scheduled.
Action: Use calendar_tool to find trips in the next 7-14 days
Then: Use weather_tool with the specific dates found

Help the user with their requests using the available tools when needed."""

# Now when you make calls, use the improved prompt
message = client.messages.create(
    model="claude-sonnet-4.5",
    max_tokens=1024,
    system=improved_prompt,
    messages=[
        {"role": "user", "content": "What's the weather like for my trip next week?"}
    ]
)

The key is making changes that address categories of issues, not one-off fixes.
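
The same principle applies to the tool description improvement mentioned above: that change lives in the tool definition you pass to the API, not in the system prompt. Here's an illustrative sketch; the calendar_tool name and schema are placeholders rather than definitions from earlier chapters:

# Illustrative tool definition: the name and schema are placeholders
calendar_tool = {
    "name": "calendar_tool",
    "description": (
        "Look up events on the user's calendar. Accepts explicit dates or relative "
        "phrases like 'next week' or 'tomorrow' and resolves them to exact dates "
        "based on today's date. Use this whenever a request involves relative time, "
        "before calling other tools."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "date_range": {
                "type": "string",
                "description": "Explicit dates (e.g. '2025-11-15 to 2025-11-17') or a relative phrase (e.g. 'next week')"
            }
        },
        "required": ["date_range"]
    }
}

# Pass it alongside the improved prompt
message = client.messages.create(
    model="claude-sonnet-4.5",
    max_tokens=1024,
    system=improved_prompt,
    tools=[calendar_tool],
    messages=[{"role": "user", "content": "What's the weather like for my trip next week?"}]
)

A clearer description like this nudges the model to reach for the calendar tool on its own, complementing the prompt guidance rather than duplicating it.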

Test the Improvements

After making changes, you need to verify they actually help. This is where your test cases from Chapter 11.2 come back into play.

First, add new test cases based on the problems you found. For the relative time issue, you'd add tests like:

test_cases = [
    {
        "query": "What's the weather for my trip next week?",
        "expected_behavior": "Should check calendar first, then check weather for specific dates found"
    },
    {
        "query": "Remind me about my meeting tomorrow",
        "expected_behavior": "Should identify tomorrow's date before searching calendar"
    },
    {
        "query": "What do I have planned in a few days?",
        "expected_behavior": "Should interpret 'a few days' as 2-4 days from now"
    }
]

Run these new tests to verify your changes fixed the problem. Also run your existing tests to make sure you didn't break anything that was working before. This is called regression testing, and it's essential. Sometimes a fix for one problem introduces a new problem elsewhere.
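
A lightweight way to run the whole suite is to loop over the test cases, call the agent with each query, and store the response next to the expected behavior. The sketch below assumes a hypothetical run_agent function; pass/fail judgment can come from manual review or from the AI evaluator introduced later in this chapter:

import json

def run_regression_suite(test_cases, run_agent, results_file="regression_results.jsonl"):
    """Run every test case and record the response alongside the expected behavior."""
    with open(results_file, "w") as f:
        for case in test_cases:
            response = run_agent(case["query"])  # run_agent is a placeholder for your agent call
            f.write(json.dumps({
                "query": case["query"],
                "expected_behavior": case["expected_behavior"],
                "response": response,
            }) + "\n")
    print(f"Ran {len(test_cases)} test cases, results in {results_file}")

# Include your existing test cases alongside the new ones so regressions get caught too
# run_regression_suite(existing_test_cases + test_cases, run_agent)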

Deploy and Monitor

Once you're confident in your improvements, deploy the updated agent. But your work isn't done. Continue monitoring to see if the changes actually helped in production.

Check your feedback logs after the update. Did the number of relative-time errors decrease? Are users giving more positive feedback? You're looking for evidence that your changes made a real difference.

Sometimes you'll discover that your fix didn't fully solve the problem, or it introduced a new edge case. That's fine. You'll capture that feedback and loop back to the improvement step.
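
One concrete way to answer these questions is to split the feedback log at the deployment time and compare success rates before and after. A minimal sketch, assuming the same agent_feedback.jsonl format used throughout this chapter (the deployment date shown is just an example):

import json
from datetime import datetime

def compare_around_deploy(log_file, deploy_time):
    """Compare success rates before and after a deployment timestamp."""
    before = {"total": 0, "successful": 0}
    after = {"total": 0, "successful": 0}
    with open(log_file) as f:
        for line in f:
            entry = json.loads(line)
            bucket = after if datetime.fromisoformat(entry["timestamp"]) >= deploy_time else before
            bucket["total"] += 1
            bucket["successful"] += entry["success"]
    for label, data in [("Before", before), ("After", after)]:
        rate = data["successful"] / data["total"] * 100 if data["total"] else 0
        print(f"{label} deploy: {data['total']} interactions, {rate:.1f}% success")

# Example: compare around a hypothetical November 20 deployment
compare_around_deploy("agent_feedback.jsonl", datetime(2025, 11, 20))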

Using AI to Evaluate AI

Here's an advanced technique: you can use another AI model to help evaluate your agent's responses. This is particularly useful when you have many interactions to review and can't manually check them all.

The idea is simple. You take your agent's response and ask a second model: "Was this a good answer?" This second model acts as an AI critic, scoring or judging the quality of the first model's output.

Let's look at how this works:

# Using GPT-5 as an evaluator for our agent's responses
from openai import OpenAI

evaluator = OpenAI()

def evaluate_response(user_query, agent_response, expected_behavior):
    """Use an AI model to evaluate the agent's response quality."""

    evaluation_prompt = f"""You are evaluating an AI assistant's response for quality and accuracy.

User Query: {user_query}

Agent Response: {agent_response}

Expected Behavior: {expected_behavior}

Please evaluate the response on these criteria:
1. Did it address the user's actual question?
2. Was it accurate and helpful?
3. Did it follow the expected behavior?
4. Were there any mistakes or issues?

Provide a score from 1-5 (5 being excellent) and explain your reasoning."""

    evaluation = evaluator.chat.completions.create(
        model="gpt-5",
        messages=[
            {"role": "system", "content": "You are a careful evaluator of AI assistant responses."},
            {"role": "user", "content": evaluation_prompt}
        ]
    )

    return evaluation.choices[0].message.content

# Example usage
query = "What's the weather like for my trip next week?"
response = "I checked your calendar and found you have a trip to Seattle from November 15-17. The forecast shows rain with temperatures around 52°F."
expected = "Should check calendar first, then check weather for specific dates found"

evaluation = evaluate_response(query, response, expected)
print(evaluation)

The evaluator might respond with something like:

Score: 5/5

Reasoning: The agent correctly followed the expected behavior by checking the calendar first to identify the specific dates ("November 15-17") rather than making assumptions about "next week." It then provided weather information for those exact dates. The response was accurate, helpful, and directly addressed the user's question. The agent also mentioned the location (Seattle), adding useful context. Excellent response.

This AI-based evaluation is powerful because it scales. You can evaluate hundreds or thousands of interactions quickly. However, it has important limitations.

The Limits of AI Evaluation

AI evaluators aren't perfect. They can miss subtle issues, and they might disagree with human judgment. Always treat AI evaluation as a supplement to human review, not a replacement.

Here are some cases where human judgment is essential:

Subjective quality: Is the response's tone appropriate? Does it sound natural? AI evaluators might focus only on factual accuracy and miss tone issues.

Domain expertise: For specialized topics, you need human experts to verify the agent got the details right. An AI evaluator might not catch domain-specific errors.

Ethical concerns: If the agent's response touches on sensitive topics, safety issues, or potential harm, human oversight is crucial.

Context understanding: Humans understand subtle context that evaluators might miss. For example, if a user asks "Should I cancel my plans?" the appropriateness of the agent's answer depends heavily on what those plans are and the user's situation.

A good practice is to use AI evaluation for initial screening (identifying potentially problematic responses for human review) and random sampling (evaluating a subset of interactions automatically), while reserving human review for high-stakes decisions, edge cases, and periodic quality checks.
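
In practice, the screening step can be a small script: sample a subset of logged interactions, run each through the evaluator, and flag anything scoring below a threshold for human review. Here's a sketch that reuses evaluate_response from above and assumes the score can be pulled out of the evaluator's free-text reply with a regex; asking the evaluator for structured JSON output would be more robust:

import json
import random
import re

def screen_interactions(log_file, sample_size=50, score_threshold=3):
    """Evaluate a random sample of interactions and flag low scores for human review."""
    with open(log_file) as f:
        entries = [json.loads(line) for line in f]

    flagged = []
    for entry in random.sample(entries, min(sample_size, len(entries))):
        verdict = evaluate_response(
            entry["query"],
            entry["response"],
            expected_behavior="Respond accurately and helpfully, using tools when needed",
        )
        # Naive score extraction from free text like "Score: 5/5"
        match = re.search(r"Score:\s*(\d)", verdict)
        score = int(match.group(1)) if match else None
        if score is None or score < score_threshold:
            flagged.append({"entry": entry, "verdict": verdict})

    print(f"{len(flagged)} interactions flagged for human review")
    return flagged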

Creating Your Feedback Loop

Let's put this all together into a systematic process you can follow. Here's a practical week-by-week rhythm for continuous improvement:

Week 1-2: Collect data

  • Deploy your agent and let it interact with users
  • Log every interaction with success metrics
  • Collect user feedback when available
  • Monitor for automatic failure signals

Week 3: Analyze and prioritize

  • Review your logs and identify patterns
  • Categorize issues by type and frequency
  • Pick the top 2-3 issues to address (focus on high-impact problems)
  • Create new test cases for these issues

Week 4: Improve and test

  • Make targeted changes to address the issues
  • Run your full test suite including new tests
  • Do a small pilot deployment if changes are significant
  • Verify improvements with both automated tests and manual review

Week 5: Deploy and monitor

  • Deploy the updated agent
  • Monitor closely for the first few days
  • Check if the issues decreased
  • Watch for any new issues introduced

Then repeat the cycle. You're not trying to achieve perfection in one iteration. You're building a habit of continuous learning and refinement.

Tracking Progress Over Time

It helps to track metrics over time so you can see if your agent is actually getting better. Here are some metrics worth monitoring:

Success rate: What percentage of interactions end with positive user feedback or successful task completion? Track this weekly or monthly. You want this number going up.

Error frequency: How often do specific types of errors occur? For example, "tool call failures," "misunderstood queries," or "incomplete responses." You want these going down.

Response quality scores: If you're using AI evaluation, track the average quality score over time.

User satisfaction: If you collect user ratings, track average satisfaction over time.

Here's a simple script to track these metrics:

import json
from collections import defaultdict
from datetime import datetime

def analyze_weekly_metrics(log_file, weeks=4):
    """Analyze metrics over the past N weeks."""

    # Group interactions by week
    weekly_data = defaultdict(lambda: {"total": 0, "successful": 0, "errors": defaultdict(int)})

    with open(log_file, "r") as f:
        for line in f:
            entry = json.loads(line)
            timestamp = datetime.fromisoformat(entry["timestamp"])
            week = timestamp.strftime("%Y-W%W")

            weekly_data[week]["total"] += 1
            if entry["success"]:
                weekly_data[week]["successful"] += 1

            if "error_type" in entry:
                weekly_data[week]["errors"][entry["error_type"]] += 1

    # Print summary
    print("Weekly Performance Metrics\n")
    for week in sorted(weekly_data.keys())[-weeks:]:
        data = weekly_data[week]
        success_rate = (data["successful"] / data["total"] * 100) if data["total"] > 0 else 0
        print(f"{week}:")
        print(f"  Total interactions: {data['total']}")
        print(f"  Success rate: {success_rate:.1f}%")
        if data["errors"]:
            print(f"  Most common error: {max(data['errors'].items(), key=lambda x: x[1])}")
        print()

# Run the analysis
analyze_weekly_metrics("agent_feedback.jsonl")

This might output something like:

Weekly Performance Metrics

2025-W44:
  Total interactions: 127
  Success rate: 78.0%
  Most common error: ('relative_time_error', 15)

2025-W45:
  Total interactions: 143
  Success rate: 84.6%
  Most common error: ('tool_call_failure', 8)

2025-W46:
  Total interactions: 156
  Success rate: 89.1%
  Most common error: ('tool_call_failure', 7)

2025-W47:
  Total interactions: 168
  Success rate: 91.7%
  Most common error: ('response_too_long', 5)

You can see the agent improving. Success rate went from 78% to 91.7% over four weeks. The relative time errors (which were most common in week 44) decreased after improvements were made. Now a different issue (responses being too long) is emerging as something to address next.

Learning from Success Too

Don't only focus on failures. Pay attention to what your agent does well. When users give positive feedback or an interaction goes particularly smoothly, that's valuable information too.

Maybe you notice that your agent handles scheduling requests exceptionally well, but struggles with information lookup tasks. This tells you where the agent's strengths are. You might decide to emphasize the scheduling capabilities in how you present the assistant to users, and focus improvement efforts on information lookup.

Or perhaps you find that when the agent explains its reasoning to users ("I'm checking your calendar first, then I'll look up the weather"), users are more satisfied even when the final answer takes longer. This suggests adding more transparency about the agent's process.

Success patterns can guide you toward doing more of what works, not just fixing what doesn't.

When to Stop Improving

Here's a question you might be wondering: when is the agent "good enough"? When can you stop this cycle?

The honest answer is: probably never, if you want the agent to remain useful over time. User needs evolve, the world changes, and new edge cases will always emerge. But that doesn't mean you need to invest the same level of effort forever.

In the early days, you'll make frequent, substantial improvements. The agent will have obvious gaps and issues. This is the heavy lifting phase.

Over time, improvements become more incremental. You're polishing edges, handling increasingly rare edge cases, and making small refinements. You might shift from weekly improvement cycles to monthly or quarterly ones.

The goal isn't perfection. It's to have an agent that handles common cases reliably and degrades gracefully on unusual ones. If your success rate is high (say, above 90%) and staying stable, and users are generally satisfied, you've reached a good equilibrium.

But stay vigilant. Check in periodically. The moment you stop paying attention is often when issues start creeping back in.

Practical Tips for Sustainable Improvement

Let's close with some practical advice for maintaining this feedback loop long-term:

Start simple: Don't build complex evaluation systems right away. Start with basic logging and manual review. Add sophistication as you need it.

Prioritize ruthlessly: You can't fix everything. Focus on high-frequency issues and high-impact problems. A bug that affects 1% of users is less urgent than one affecting 20%.

Make small changes: When possible, make incremental improvements rather than large overhauls. Small changes are easier to test, less risky to deploy, and you'll learn what works more quickly.

Document your changes: Keep a log of what you changed and why. When you review metrics later, you'll want to connect improvements (or regressions) to specific changes you made.

Involve real users: If possible, get feedback from actual users, not just your own testing. Real users will use your agent in ways you never imagined.

Balance speed and care: Move quickly enough to improve regularly, but carefully enough not to introduce new problems. Rushed changes often backfire.

Celebrate progress: When your metrics improve, acknowledge it. Building AI agents is hard work, and incremental progress is worth recognizing.

Bringing It All Together

You've now learned the complete evaluation cycle. In Chapter 11.1, you set clear goals and success criteria. In Chapter 11.2, you created test cases to measure performance. And in this chapter, you've learned how to close the loop: gather feedback from real use, identify patterns, make improvements, test them, and deploy with confidence.

This cycle is what separates a prototype from a production-ready agent. It's the difference between "it works in the demo" and "it works reliably for real users, day after day."

Our personal assistant has come a long way. It started as a simple model that could respond to questions. Now it has memory, tools, reasoning capabilities, and a systematic process for getting better over time. You've built something that learns and grows.

As you continue working with your agent, remember that evaluation isn't overhead or a nice-to-have. It's central to building AI systems that truly help people. Every interaction is an opportunity to learn. Every improvement makes the agent more useful. Every cycle of the feedback loop brings you closer to an assistant that understands your needs and reliably delivers value.

Build, test, learn, improve. Keep that cycle running, and your agent will continue to get better for as long as you maintain it.

Glossary

AI Critic: A second AI model used to evaluate the quality of another AI's outputs. Acts as an automated judge to help scale evaluation efforts.

Error Frequency: The rate at which specific types of errors occur over time. Tracking this metric helps identify which problems are most common and whether improvements are reducing errors.

Feedback Loop: A cyclical process where you use evaluation results to identify weaknesses, make improvements, test them, deploy, and then gather new evaluation data. The foundation of continuous improvement.

Pattern Analysis: The process of reviewing multiple failures or issues to identify common categories or themes, rather than treating each problem as unique. Helps prioritize improvement efforts.

Regression Testing: Running existing test cases after making changes to ensure that fixes for new problems didn't break functionality that was previously working.

Success Rate: The percentage of agent interactions that complete successfully, typically measured through user feedback or task completion metrics. A key indicator of overall agent performance.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about continuous feedback and improvement for AI agents.



About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
