Refining AI Agents Using Observability: Continuous Improvement Through Log Analysis

Michael Brenndoerfer · August 10, 2025 · 13 min read

Learn how to use observability for continuous agent improvement. Discover patterns in logs, turn observations into targeted improvements, track quantitative metrics, and build a feedback loop that makes your AI agent smarter over time.

Refining the Agent Using Observability

In the previous sections, we added logs to our assistant and learned how to debug its behavior when things go wrong. But observability isn't just for finding bugs. It's also a powerful tool for making your agent better over time.

Think of it this way: when you watch someone learn a new skill, you notice patterns. Maybe they always stumble on the same step, or they've developed a habit that works but could be more efficient. The same applies to your AI agent. By regularly reviewing its logs, you can spot patterns that reveal opportunities for improvement.

Let's explore how to use observability to continuously refine your assistant, making it smarter and more reliable with each iteration.

Discovering Patterns in Agent Behavior

When you monitor your agent's logs over time, patterns emerge. These patterns tell you stories about how your agent actually works in practice, not just how you think it works.

Here are some common patterns you might discover:

Repeated "I don't know" responses: If your agent frequently says it doesn't know the answer to certain types of questions, that's a signal. Maybe it needs access to a new tool, or perhaps its knowledge base is missing key information.

Unnecessary tool calls: Sometimes an agent calls a tool when it doesn't need to. For example, it might search the web for information it already has in memory, or it might call a calculator for simple arithmetic it could handle directly.

Inefficient reasoning chains: You might notice the agent takes a long, winding path to reach a conclusion it could have arrived at more directly. This wastes both time and tokens.

Repeated wrong assumptions: If the agent consistently makes the same incorrect assumption in its chain-of-thought, that's a clear signal that something in its prompt or reasoning process needs adjustment.

Let's look at a practical example. Suppose you've been running your assistant for a week, and you review the logs:

In[4]:
Code
## Example: Analyzing a week's worth of logs
import json
from collections import Counter

def analyze_logs(log_file):
    """Analyze patterns in agent logs."""
    tool_calls = []
    unknown_responses = []
    reasoning_lengths = []
    
    with open(log_file, 'r') as f:
        for line in f:
            log = json.loads(line)
            
            # Track tool usage
            if 'tool_call' in log:
                tool_calls.append(log['tool_call']['name'])
            
            # Track "I don't know" responses
            if "I don't know" in log.get('response', ''):
                unknown_responses.append(log['query'])
            
            # Track reasoning chain length
            if 'reasoning_steps' in log:
                reasoning_lengths.append(len(log['reasoning_steps']))
    
    return {
        'tool_usage': Counter(tool_calls),
        'unknown_queries': unknown_responses,
        'avg_reasoning_length': sum(reasoning_lengths) / len(reasoning_lengths) if reasoning_lengths else 0
    }

## Analyze the logs
patterns = analyze_logs('assistant_logs.jsonl')

print("Tool Usage:")
for tool, count in patterns['tool_usage'].most_common():
    print(f"  {tool}: {count} times")

print(f"\nAverage reasoning chain length: {patterns['avg_reasoning_length']:.1f} steps")

print(f"\nQueries resulting in 'I don't know': {len(patterns['unknown_queries'])}")
for query in patterns['unknown_queries'][:5]:  # Show first 5
    print(f"  - {query}")

This simple analysis might reveal that your assistant:

  • Called the weather API 50 times but the calculator only 3 times
  • Has an average reasoning chain of 4.2 steps
  • Said "I don't know" to 12 queries, mostly about recent news events

Each of these observations suggests a potential improvement.

Turning Observations into Improvements

Once you've identified patterns, you can make targeted improvements. Let's walk through some examples.

Example 1: Addressing Knowledge Gaps

Suppose your logs show the agent frequently says "I don't know" when asked about recent events. The pattern is clear: it lacks access to current information.

The fix: Add a news search tool or update its knowledge base more frequently.

In[3]:
Code
## Using Claude Sonnet 4.5 for its superior tool-use capabilities
import os

from anthropic import Anthropic

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

## Define a news search tool
tools = [
    {
        "name": "search_news",
        "description": "Search for recent news articles about a topic. Use this when the user asks about current events or recent developments.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query for news articles"
                }
            },
            "required": ["query"]
        }
    }
]

def search_news(query):
    """Simulate a news search (in practice, call a real news API)."""
    # This would call a real news API
    return f"Recent articles about {query}: [Article 1], [Article 2], [Article 3]"

## Now when the agent encounters a question about recent events,
## it has a tool to help instead of saying "I don't know"
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=tools,
    messages=[
        {"role": "user", "content": "What happened in the tech industry this week?"}
    ]
)

print(response.content)
Out[3]:
Console
[ToolUseBlock(id='toolu_01XrkeejvGDWWmvY5ddjgBAB', input={'query': 'tech industry this week'}, name='search_news', type='tool_use')]

After adding this tool, you'd monitor the logs again. You should see fewer "I don't know" responses to news-related questions and more successful tool calls to search_news.
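
To verify the change, you could run a quick before-and-after comparison on the logs. The sketch below assumes the same JSONL log format as the analyze_logs example above (query, response, and optional tool_call fields); the keyword list for spotting news-related queries and the weekly log filenames are just illustrations.

Code
import json

def news_query_stats(log_file, keywords=("news", "this week", "latest", "recent")):
    """Tally how news-related queries were handled: tool call vs. unknown response."""
    used_news_tool = 0
    said_unknown = 0

    with open(log_file, 'r') as f:
        for line in f:
            log = json.loads(line)
            query = log.get('query', '').lower()

            # Only look at queries that appear to be about current events
            if not any(kw in query for kw in keywords):
                continue

            if log.get('tool_call', {}).get('name') == 'search_news':
                used_news_tool += 1
            elif "I don't know" in log.get('response', ''):
                said_unknown += 1

    return {'search_news_calls': used_news_tool, 'unknown_responses': said_unknown}

## Compare logs from before and after the change
print("Before:", news_query_stats('assistant_logs_week1.jsonl'))
print("After: ", news_query_stats('assistant_logs_week2.jsonl'))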

Example 2: Optimizing Tool Usage

Your logs might show the agent calling the web search tool for information it already has in its conversation history. This wastes time and costs money.

The fix: Adjust the prompt to encourage the agent to check its memory first.

In[4]:
Code
## Improved system prompt that emphasizes checking memory first
system_prompt = """You are a helpful personal assistant.

Before using any tools, always check if you already have the information you need:

1. Review the conversation history
2. Check if the user has already provided this information
3. Only use tools if you genuinely need new information

This saves money and gives the user faster responses."""

## The agent will now be more thoughtful about when to use tools
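
To put the revised instructions into effect, pass them as the system prompt on each request. A minimal sketch, reusing the client and tools set up in Example 1; the user message is just an illustration of a query the agent should now answer from the conversation history instead of reaching for a tool:

Code
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=system_prompt,  # the memory-first instructions defined above
    tools=tools,           # whatever tools your assistant already uses
    messages=[
        {"role": "user", "content": "What was the name of the restaurant I mentioned earlier?"}
    ]
)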

Example 3: Streamlining Reasoning

If your logs show the agent's reasoning chains are consistently long and winding, you can guide it toward more efficient thinking.

The fix: Add examples of concise reasoning to the prompt.

In[5]:
Code
system_prompt = """You are a helpful personal assistant.

When reasoning through problems, be concise and direct:

Good example:
User asks: "What's 15% of 80?"
Reasoning: Need to calculate 15% of 80. That's 0.15 × 80 = 12.
Answer: 12

Avoid overthinking simple problems. Break down complex problems, but keep each step clear and necessary."""

Quantitative Metrics from Logs

Logs aren't just qualitative stories. They can provide quantitative metrics that help you track your agent's performance over time. These metrics connect directly to the evaluation concepts we covered in Chapter 11.

Here are some useful metrics you can extract from logs:

Tool retry rate: How often does the agent need to retry a tool call because it failed the first time? A high retry rate might indicate unreliable tools or poor error handling.

In[12]:
Code
def calculate_retry_rate(logs):
    """Calculate how often tool calls need to be retried."""
    total_tool_calls = 0
    retried_calls = 0
    
    for log in logs:
        if 'tool_call' in log:
            total_tool_calls += 1
            if log.get('retry_count', 0) > 0:
                retried_calls += 1
    
    return retried_calls / total_tool_calls if total_tool_calls > 0 else 0

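## Assumes logs is a list of parsed log dicts, e.g. loaded from assistant_logs.jsonl above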
retry_rate = calculate_retry_rate(logs)
print(f"Tool retry rate: {retry_rate:.1%}")

Average reasoning length: How many steps does the agent typically take to reach a conclusion? This can indicate efficiency.

In[14]:
Code
def calculate_avg_reasoning_length(logs):
    """Calculate average number of reasoning steps."""
    lengths = [len(log.get('reasoning_steps', [])) for log in logs if 'reasoning_steps' in log]
    return sum(lengths) / len(lengths) if lengths else 0

avg_length = calculate_avg_reasoning_length(logs)
print(f"Average reasoning chain length: {avg_length:.1f} steps")

Response time by query type: How long does the agent take to respond to different types of queries? This helps identify bottlenecks.

In[16]:
Code
def analyze_response_times(logs):
    """Break down response times by query type."""
    from collections import defaultdict
    
    times_by_type = defaultdict(list)
    
    for log in logs:
        query_type = log.get('query_type', 'unknown')
        response_time = log.get('response_time_ms', 0)
        times_by_type[query_type].append(response_time)
    
    # Calculate averages
    avg_times = {
        qtype: sum(times) / len(times)
        for qtype, times in times_by_type.items()
    }
    
    return avg_times

response_times = analyze_response_times(logs)
for query_type, avg_time in sorted(response_times.items(), key=lambda x: x[1], reverse=True):
    print(f"{query_type}: {avg_time:.0f}ms")

These metrics give you concrete numbers to track. You can set goals like "reduce average reasoning length from 4.2 to 3.5 steps" or "decrease tool retry rate from 8% to under 5%."
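
One way to keep such targets visible is to encode them alongside the metric functions and check them during each review. A small sketch, reusing calculate_retry_rate and calculate_avg_reasoning_length from above; the goal values here are only examples, not recommendations:

Code
## Illustrative targets; set these from your own baseline numbers
goals = {
    'tool_retry_rate': 0.05,        # keep retries under 5%
    'avg_reasoning_length': 3.5,    # aim for at most 3.5 steps on average
}

current = {
    'tool_retry_rate': calculate_retry_rate(logs),
    'avg_reasoning_length': calculate_avg_reasoning_length(logs),
}

for metric, target in goals.items():
    status = "on track" if current[metric] <= target else "needs work"
    print(f"{metric}: {current[metric]:.2f} (target <= {target}) -> {status}")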

Building a Continuous Improvement Loop

The real power of observability comes from making it a habit. Here's a simple process for continuous improvement:

1. Monitor regularly: Set aside time each week to review your agent's logs. Even 15 minutes can reveal valuable patterns.

2. Identify one improvement: Don't try to fix everything at once. Pick the most impactful issue you've observed.

3. Make a targeted change: Adjust a prompt, add a tool, or modify the reasoning process. Keep the change small and focused.

4. Measure the impact: After deploying your change, monitor the same metrics to see if the improvement worked.

5. Repeat: Once you've validated one improvement, move on to the next pattern you've observed.

Let's see this in action with a complete example:

In[6]:
Code
## Week 1: Initial observation
## Logs show: Agent uses web search 40 times, but 15 of those searches
## are for information already in the conversation history

## Week 2: Make a change
## Update the system prompt to emphasize checking memory first
new_prompt = """Before searching the web, always check:

1. Has the user already mentioned this in our conversation?
2. Did I already look this up earlier in this session?

Only search if you genuinely need new information."""

## Week 3: Measure impact
## New logs show: Agent uses web search 28 times, with only 2 redundant searches
## Success! Redundant searches dropped from 37.5% to 7.1%

## Week 4: Move to next improvement
## Now focus on the next pattern: reasoning chains are too long for simple questions

This cycle of observe, improve, and measure creates a feedback loop that makes your agent better over time.

Connecting to Evaluation

Remember the evaluation framework from Chapter 11? Observability feeds directly into that process. Your logs provide the data you need to evaluate your agent's real-world performance.

For example, if you set a success criterion that "the agent should complete tasks in under 5 seconds," your logs tell you whether you're meeting that goal:

In[20]:
Code
def check_performance_goal(logs, max_response_time_ms=5000):
    """Check if agent meets response time goal."""
    response_times = [log.get('response_time_ms', 0) for log in logs]
    
    within_goal = sum(1 for t in response_times if t <= max_response_time_ms)
    total = len(response_times)
    
    success_rate = within_goal / total if total > 0 else 0
    
    print(f"Response time goal: {success_rate:.1%} of responses under {max_response_time_ms}ms")
    print(f"Average response time: {sum(response_times) / total:.0f}ms")
    
    return success_rate

success_rate = check_performance_goal(logs)

if success_rate < 0.90:  # Goal: 90% of responses under 5 seconds
    print("Need to optimize response time!")

Your logs also help you create better test cases. When you see a query that the agent handled poorly, add it to your test suite. This ensures you don't regress when making future changes.
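
For example, you could pull out the queries that errored or ended in "I don't know" and append them to a regression file that your evaluation suite replays before each change. A minimal sketch, assuming the log fields used earlier in this section; the regression_queries.jsonl filename is just a placeholder:

Code
import json

def collect_regression_cases(logs, output_file='regression_queries.jsonl'):
    """Append queries the agent handled poorly to a regression test file."""
    with open(output_file, 'a') as f:
        for log in logs:
            failed = log.get('error') or "I don't know" in log.get('response', '')
            if failed and log.get('query'):
                f.write(json.dumps({
                    'query': log['query'],
                    'reason': 'error' if log.get('error') else 'unknown_response',
                }) + '\n')

collect_regression_cases(logs)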

Observability in Production

Once your agent is deployed and real users are interacting with it, observability becomes even more important. You're no longer just testing with your own queries. You're seeing how diverse users with different needs interact with your assistant.

In production, you want to:

Track aggregate metrics: Monitor overall performance, not just individual queries. Are response times trending up? Is the error rate increasing?

Set up alerts: If something goes seriously wrong (error rate spikes, response times exceed a threshold), you want to know immediately.

In[22]:
Code
def check_for_anomalies(logs, window_size=100):
    """Simple anomaly detection for production logs."""
    recent_logs = logs[-window_size:]
    
    # Check error rate
    errors = sum(1 for log in recent_logs if log.get('error'))
    error_rate = errors / len(recent_logs)
    
    if error_rate > 0.05:  # More than 5% errors
        print(f"⚠️  Alert: Error rate is {error_rate:.1%} (threshold: 5%)")
    
    # Check response times
    response_times = [log.get('response_time_ms', 0) for log in recent_logs]
    avg_time = sum(response_times) / len(response_times)
    
    if avg_time > 8000:  # Slower than 8 seconds
        print(f"⚠️  Alert: Average response time is {avg_time:.0f}ms (threshold: 8000ms)")

Respect user privacy: When logging in production, be careful about what you record. Redact sensitive information like personal data, passwords, or confidential business information.

In[24]:
Code
import re
from datetime import datetime

def redact_sensitive_info(text):
    """Remove sensitive information from logs."""
    # Redact email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', text)
    
    # Redact phone numbers
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    
    # Redact credit card numbers
    text = re.sub(r'\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b', '[CARD]', text)
    
    return text

## Use this when logging user queries
log_entry = {
    'query': redact_sensitive_info(user_query),
    'response': redact_sensitive_info(agent_response),
    'timestamp': datetime.now().isoformat()
}

Making Observability a Habit

The best way to ensure continuous improvement is to make observability part of your regular workflow. Here are some practical tips:

Schedule regular reviews: Block 30 minutes each week to review logs. Make it a recurring calendar event.

Create a dashboard: Build a simple dashboard that shows key metrics at a glance. It doesn't need to be fancy. Even a script that prints a summary is helpful.

In[26]:
Code
def print_weekly_summary(logs):
    """Print a summary of the week's agent performance."""
    print("=== Weekly Agent Performance Summary ===\n")
    
    print(f"Total queries: {len(logs)}")
    
    # Tool usage
    tool_calls = [log.get('tool_call', {}).get('name') for log in logs if 'tool_call' in log]
    print(f"Tool calls: {len(tool_calls)}")
    
    # Success rate
    errors = sum(1 for log in logs if log.get('error'))
    success_rate = (len(logs) - errors) / len(logs) if logs else 0
    print(f"Success rate: {success_rate:.1%}")
    
    # Average response time
    times = [log.get('response_time_ms', 0) for log in logs]
    avg_time = sum(times) / len(times) if times else 0
    print(f"Avg response time: {avg_time:.0f}ms")
    
    print("\n" + "="*40)

## Run this every Monday morning
print_weekly_summary(last_week_logs)

Keep an improvement log: Maintain a simple document where you record what you observed, what you changed, and what impact it had. This creates a history of your agent's evolution.
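
If you prefer something machine-readable over a plain document, an append-only JSONL file works just as well. The entry layout below is only one possibility; the example values echo the web-search improvement from earlier in this section:

Code
import json
from datetime import date

improvement_entry = {
    'date': date.today().isoformat(),
    'observation': 'Agent re-searches the web for facts already in the conversation',
    'change': 'Added memory-first instructions to the system prompt',
    'impact': 'Redundant searches dropped from 37.5% to 7.1%',
}

## Append to the running improvement log
with open('improvement_log.jsonl', 'a') as f:
    f.write(json.dumps(improvement_entry) + '\n')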

Share insights with your team: If you're working with others, share interesting patterns you've discovered. Someone else might have ideas for improvements you haven't considered.

Putting It All Together

Let's walk through a complete example of using observability to refine our assistant over a month:

Week 1: You notice the agent says "I don't know" to 15% of queries about local businesses. You add a local search tool.

Week 2: After adding the tool, "I don't know" responses drop to 5% for those queries. But you notice the agent now takes longer to respond because it searches even when the user just mentioned the business name.

Week 3: You update the prompt to check the conversation history before searching. Response times improve by 20% for local business queries.

Week 4: You review the metrics and confirm both improvements are stable. You move on to the next pattern: the agent's reasoning chains for math problems are unnecessarily long.

This iterative process, guided by observability, transforms your agent from a prototype into a polished, reliable assistant.

Summary

Observability isn't just about debugging. It's a tool for continuous improvement. By regularly monitoring your agent's logs, you can discover patterns in how your agent actually behaves, identify opportunities for improvement, and make targeted changes based on real data. Track quantitative metrics like retry rates, reasoning length, and response times. Build a continuous improvement loop: observe, improve, measure, repeat. Connect observability to evaluation by using logs to validate success criteria.

The key is making observability a habit. Set aside time regularly to review your agent's behavior. Each observation is an opportunity to make your assistant smarter, faster, and more reliable.

Your agent's logs tell the story of how it really works in practice. By listening to that story and acting on what you learn, you create an agent that gets better over time.

