
Understanding and Debugging Agent Behavior: Complete Guide to Reading Logs & Fixing AI Issues

Michael Brenndoerfer • November 10, 2025

Learn how to read agent logs, trace reasoning chains, identify common problems, and systematically debug AI agents. Master the art of understanding what your agent is thinking and why.

Part of AI Agent Handbook


Understanding and Debugging Agent Behavior

Now that you've added logging to your agent, you have a window into its decision-making process. But logs are only useful if you know how to read them and use them to find problems. When your agent gives an unexpected answer or behaves strangely, the logs become your debugging tool. They let you trace back through what the agent did, step by step, until you find where things went wrong.

Debugging an AI agent might sound intimidating at first. After all, you're debugging something that makes its own decisions based on language understanding. But here's the good news: with proper logs, debugging an agent becomes remarkably similar to debugging any other program. You follow the execution path, check the inputs and outputs at each step, and look for where the logic diverged from what you expected.

Let's learn how to read logs effectively and use them to diagnose and fix common agent problems.

Reading Agent Logs

When you look at your agent's logs, you're essentially reading a story of what happened during a request. Each log entry is a sentence in that story. Your job is to follow the narrative and spot where it goes off track.

Start by identifying the boundaries of a single request. Find the log entry where the user's query came in, then follow the entries until you see the final response. Everything between those two points shows you what the agent did to process that request.

2025-11-10 15:42:10 - assistant - INFO - User query: Schedule a meeting with Alice tomorrow at 2pm
2025-11-10 15:42:10 - assistant - INFO - Tool required: True
2025-11-10 15:42:10 - assistant - INFO - Selected tool: calendar
2025-11-10 15:42:11 - assistant - INFO - Calling calendar with: create_event(attendee='Alice', date='2025-11-11', time='14:00')
2025-11-10 15:42:11 - assistant - INFO - Calendar returned: Event created successfully
2025-11-10 15:42:11 - assistant - INFO - Assistant response: I've scheduled a meeting with Alice for tomorrow at 2pm.

This log sequence tells a straightforward story. The agent received a scheduling request, recognized it needed the calendar tool, called the tool with the right parameters, got confirmation, and told the user it worked. Nothing went wrong here.
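If you built your logging with Python's standard logging module, as in the previous chapter, a setup roughly like the sketch below would produce entries in this timestamp - name - level - message format. The logger name "assistant" and the specific log calls are illustrative, not a fixed API:

import logging

# Configure output to match the format in the examples above:
# timestamp - name - level - message
logging.basicConfig(
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
)
logger = logging.getLogger("assistant")

# Log each decision point as the request is processed (illustrative calls)
logger.info("User query: Schedule a meeting with Alice tomorrow at 2pm")
logger.info("Tool required: True")
logger.info("Selected tool: calendar")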

Now look at a case where something did go wrong:

2025-11-10 15:45:33 - assistant - INFO - User query: What's 15% of 240?
2025-11-10 15:45:33 - assistant - INFO - Tool required: False
2025-11-10 15:45:34 - assistant - INFO - Generating direct response (no tool)
2025-11-10 15:45:34 - assistant - INFO - Assistant response: Approximately 35.

The agent gave the wrong answer (15% of 240 is 36, not 35). By reading the logs, you can see exactly what went wrong. The agent decided it didn't need a tool and tried to calculate the answer directly. It got close but made a small error. The fix is clear: improve the logic that decides when to use the calculator tool, so it recognizes percentage calculations as something that needs precise computation.
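One way to make that decision more reliable is a simple pre-check before the agent answers directly. The sketch below is an illustrative assumption (the needs_calculator helper and its keyword list are not the handbook's actual routing code), but it captures the idea: anything that looks like arithmetic gets routed to the calculator tool.

import re

def needs_calculator(query: str) -> bool:
    """Heuristic check: route arithmetic-looking queries to the calculator tool."""
    arithmetic_keywords = ["%", "percent", "plus", "minus", "times", "divided by"]
    has_keyword = any(word in query.lower() for word in arithmetic_keywords)
    has_numbers = len(re.findall(r"\d+(?:\.\d+)?", query)) >= 2
    return has_keyword or has_numbers

print(needs_calculator("What's 15% of 240?"))            # True  -> use the calculator
print(needs_calculator("What's the capital of France?")) # False -> answer directly

In practice you might let the model make this call, but a small heuristic guard like this is easy to test and easy to log.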

Tracing Reasoning Chains

When your agent uses chain-of-thought reasoning, the logs become even more valuable. They let you follow the agent's logic step by step and see where its thinking went astray.

Imagine your agent is trying to answer: "If a train travels 60 miles in 45 minutes, how far will it travel in 2 hours?"

The logs might look like this:

2025-11-10 16:10:22 - assistant - INFO - User query: If a train travels 60 miles in 45 minutes, how far will it travel in 2 hours?
2025-11-10 16:10:22 - assistant - INFO - Starting reasoning chain
2025-11-10 16:10:23 - assistant - INFO - Reasoning step 1: First, I need to find the train's speed
2025-11-10 16:10:23 - assistant - INFO - Reasoning step 2: Speed = 60 miles / 45 minutes = 1.33 miles per minute
2025-11-10 16:10:24 - assistant - INFO - Reasoning step 3: Convert 2 hours to minutes: 2 * 60 = 120 minutes
2025-11-10 16:10:24 - assistant - INFO - Reasoning step 4: Distance = speed * time = 1.33 * 120 = 160 miles
2025-11-10 16:10:24 - assistant - INFO - Assistant response: The train will travel 160 miles in 2 hours.

Here the reasoning is correct, and you can verify each step makes sense. But what if the agent made an error?

2025-11-10 16:15:10 - assistant - INFO - User query: If a train travels 60 miles in 45 minutes, how far will it travel in 2 hours?
2025-11-10 16:15:10 - assistant - INFO - Starting reasoning chain
2025-11-10 16:15:11 - assistant - INFO - Reasoning step 1: The train goes 60 miles in 45 minutes
2025-11-10 16:15:11 - assistant - INFO - Reasoning step 2: 2 hours is about 2.5 times longer than 45 minutes
2025-11-10 16:15:12 - assistant - INFO - Reasoning step 3: So the distance is 60 * 2.5 = 150 miles
2025-11-10 16:15:12 - assistant - INFO - Assistant response: The train will travel 150 miles in 2 hours.

Now you can see the problem. In step 2, the agent estimated that 2 hours (120 minutes) is "about 2.5 times" 45 minutes. But 120 / 45 is actually 2.67, not 2.5. The agent's approximation introduced an error. The logs pinpoint exactly where the reasoning went wrong, making it clear what needs to be fixed. You might adjust the prompt to encourage more precise calculations or have the agent use the calculator tool for division.
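To get reasoning traces like these into your logs in the first place, one option is to prompt the model to write one reasoning step per line and log each line as you parse it. This sketch assumes the response text is already in that one-step-per-line format and that logging was configured as in the earlier sketch; both are assumptions, not requirements:

import logging

logger = logging.getLogger("assistant")

def log_reasoning_chain(response_text: str) -> list:
    """Log each non-empty line of the model's reasoning as a numbered step."""
    logger.info("Starting reasoning chain")
    steps = [line.strip() for line in response_text.splitlines() if line.strip()]
    for number, step in enumerate(steps, start=1):
        logger.info("Reasoning step %d: %s", number, step)
    return steps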

Common Agent Problems and How to Spot Them

As you gain experience debugging agents, certain patterns of problems show up repeatedly. Knowing what to look for helps you diagnose issues faster.

Wrong Tool Selection: The agent picks a tool that doesn't match the task.

Look for logs where the selected tool doesn't make sense for the query. For example:

2025-11-10 17:20:15 - assistant - INFO - User query: What's the capital of France?
2025-11-10 17:20:15 - assistant - INFO - Selected tool: calculator

The agent chose the calculator for a geography question. This suggests the tool selection logic needs improvement, or the agent needs clearer descriptions of what each tool does.
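Clearer descriptions give the model a better basis for choosing. The structure and wording below are illustrative assumptions, but the principle is to say what each tool is for and, just as importantly, what it is not for:

# Illustrative tool descriptions passed to the model when it selects a tool.
TOOLS = {
    "calculator": (
        "Performs precise arithmetic: percentages, division, conversions. "
        "Use for any numeric computation. Do not use for factual questions."
    ),
    "calendar": (
        "Creates and looks up calendar events. "
        "Use for scheduling requests. Do not use for general date math."
    ),
    "search_engine": (
        "Looks up facts and current information. "
        "Use for factual questions like geography or news. Do not use for greetings or math."
    ),
}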

Missing Context: The agent doesn't retrieve or use relevant information from memory.

Look for cases where the agent should have remembered something but didn't:

2025-11-10 17:25:30 - assistant - INFO - User query: What did I tell you my favorite color was?
2025-11-10 17:25:30 - assistant - INFO - Searching memory for: favorite color
2025-11-10 17:25:30 - assistant - INFO - Found 0 relevant items in memory
2025-11-10 17:25:31 - assistant - INFO - Assistant response: I don't have that information.

If the user previously mentioned their favorite color, the memory search should have found it. This indicates either the information wasn't stored properly, or the search query didn't match how it was stored.
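A quick way to tell those two causes apart is to check both sides: what was actually written to memory, and what the search does to the query. The sketch below uses a deliberately simple keyword store (an assumption; your memory component may work quite differently) and shows how normalizing text the same way at write time and read time keeps "My favorite color is blue" findable from a search for "favorite color":

def normalize(text: str) -> set:
    """Lowercase and split into words so storage and search use the same form."""
    return set(text.lower().split())

# Hypothetical in-memory store of remembered statements
memory = ["My favorite color is blue", "I live in Singapore"]

def search_memory(query: str) -> list:
    """Return stored items sharing at least two words with the query."""
    query_words = normalize(query)
    return [item for item in memory if len(normalize(item) & query_words) >= 2]

print(search_memory("favorite color"))  # ['My favorite color is blue']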

Incorrect Reasoning: The agent makes a logical error in its thinking.

We saw this in the train example above. Look for reasoning steps that don't follow logically from previous steps, or where the agent makes incorrect assumptions.

Tool Call Failures: A tool returns an error or unexpected result.

2025-11-10 17:30:45 - assistant - INFO - Calling weather_api with: get_weather(city='Nowhere')
2025-11-10 17:30:46 - assistant - ERROR - Weather API failed: City not found
2025-11-10 17:30:46 - assistant - INFO - Assistant response: I couldn't get the weather information.

The tool failed because the city doesn't exist. This might mean the agent needs to validate inputs before calling tools, or handle errors more gracefully by asking the user for clarification.
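The error-handling half of that fix can be small. In the sketch below, the weather tool, its failure mode, and the function name are all illustrative assumptions; the point is to catch the failure, log it, and turn it into a clarifying question rather than a dead end:

import logging

logger = logging.getLogger("assistant")

def get_weather_safely(city: str, weather_tool) -> str:
    """Call a weather tool (any callable that raises on failure) and recover gracefully."""
    if not city or not city.strip():
        return "Which city would you like the weather for?"
    try:
        result = weather_tool(city)
        logger.info("Weather tool returned: %s", result)
        return result
    except Exception as error:
        logger.error("Weather tool failed for %r: %s", city, error)
        return f"I couldn't find weather data for '{city}'. Could you check the city name?"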

Hallucination: The agent makes up information instead of admitting it doesn't know.

This is harder to spot in logs alone because the agent's logs might look reasonable even though the information is wrong. You typically catch this by comparing the agent's response to ground truth. Once you identify a hallucination, check the logs to see if the agent should have used a tool to look up the information but didn't.

A Systematic Debugging Approach

When your agent misbehaves, follow this systematic process to diagnose and fix the problem.

Step 1: Reproduce the Issue

Try to make the problem happen again with the same or similar input. If you can reproduce it consistently, you can test whether your fix works.

Step 2: Examine the Logs

Find the log entries for the problematic request. Read through them from start to finish, checking each decision point:

  • Did the agent correctly identify what type of request this was?
  • Did it choose the right tool (or correctly decide not to use a tool)?
  • Did it retrieve the right information from memory?
  • Did its reasoning steps make sense?
  • Did any tools return errors or unexpected results?

Step 3: Identify the Root Cause

Based on the logs, pinpoint where things went wrong. Was it a bad decision early in the process that led to downstream problems? Was it a single reasoning step that made an error? Was it a tool that failed?

Step 4: Form a Hypothesis

Develop a theory about why the problem occurred. For example: "The agent chose the wrong tool because the tool descriptions are too vague" or "The reasoning failed because the agent tried to do complex math without using the calculator."

Step 5: Test a Fix

Make a change that should address the root cause. This might mean:

  • Improving prompts to guide better decisions
  • Adjusting the logic for tool selection
  • Fixing how information is stored in or retrieved from memory
  • Adding error handling for tool failures
  • Encouraging the agent to use tools for calculations

Step 6: Verify the Fix

Run the same request again and check the logs. Did the agent behave correctly this time? Try a few variations to make sure the fix works generally, not just for that specific case.
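Since reproduction and verification replay the same inputs, it pays to keep those inputs around as a small regression check. The sketch below assumes your agent exposes a hypothetical answer(query) -> str entry point; the test cases and expected substrings are illustrative:

# Hypothetical regression harness: replay known-problem queries after each fix.
test_cases = [
    ("What's 15% of 240?", "36"),      # should now route to the calculator tool
    ("What is 12.5% of 80?", "10"),    # a variation of the same pattern
]

def run_regression(answer) -> None:
    for query, expected_substring in test_cases:
        response = answer(query)
        status = "PASS" if expected_substring in response else "FAIL"
        print(f"{status}: {query!r} -> {response!r}")

# Usage, assuming your agent exposes answer(query) -> str:
# run_regression(my_agent.answer)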

Debugging Example: The Wrong Answer

Let's walk through a complete debugging session. A user reports that when they asked "How many days until Christmas?", the agent gave the wrong number.

You check the logs:

2025-11-10 18:00:00 - assistant - INFO - User query: How many days until Christmas?
2025-11-10 18:00:00 - assistant - INFO - Tool required: False
2025-11-10 18:00:01 - assistant - INFO - Generating direct response (no tool)
2025-11-10 18:00:01 - assistant - INFO - Assistant response: There are 45 days until Christmas.

Today is November 10, and Christmas is December 25. That's actually 45 days, so the answer looks correct. But wait, you realize the agent is giving a static answer. If someone asks tomorrow, it will still say 45 days.

The root cause: The agent tried to calculate the answer directly instead of using a date calculation tool. It happened to get lucky with the right answer today, but the answer will be wrong tomorrow.

The fix: Adjust the tool selection logic to recognize date-related questions as requiring the date calculator tool.

After the fix, the logs look like this:

2025-11-10 18:05:00 - assistant - INFO - User query: How many days until Christmas?
2025-11-10 18:05:00 - assistant - INFO - Tool required: True
2025-11-10 18:05:00 - assistant - INFO - Selected tool: date_calculator
2025-11-10 18:05:01 - assistant - INFO - Calling date_calculator with: days_between(today='2025-11-10', target='2025-12-25')
2025-11-10 18:05:01 - assistant - INFO - Date calculator returned: 45
2025-11-10 18:05:01 - assistant - INFO - Assistant response: There are 45 days until Christmas.

Now the agent uses the date calculator tool, which will always compute the correct answer based on the current date. Problem solved.
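The date_calculator tool itself can be a thin wrapper around the standard library. Here is a minimal sketch of the days_between function seen in the logs; the name and arguments come from the log example, while the implementation shown is an assumption:

from datetime import date

def days_between(today: str, target: str) -> int:
    """Number of days from `today` until `target`, both in ISO format (YYYY-MM-DD)."""
    return (date.fromisoformat(target) - date.fromisoformat(today)).days

print(days_between(today="2025-11-10", target="2025-12-25"))  # 45

In the real agent you would fill in `today` from the system clock rather than trusting the model to know the current date.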

Using Logs to Improve Your Agent

Debugging isn't just about fixing errors. Logs also help you identify patterns that suggest improvements.

Performance Bottlenecks: If you see certain operations taking a long time, you might optimize them or cache results.

2025-11-10 19:00:00 - assistant - INFO - Searching memory for: previous conversations
2025-11-10 19:00:05 - assistant - INFO - Found 150 relevant items in memory

A five-second memory search is slow. You might add an index or limit the search scope to speed this up.
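One low-effort fix, sketched below under the assumption that memory is just a list of text entries, is to cap how much of it you scan per request; a proper index is the next step if that is not enough:

def search_recent_memory(memory: list, query: str, limit: int = 50) -> list:
    """Search only the most recent `limit` items instead of the full history."""
    query_words = set(query.lower().split())
    return [item for item in memory[-limit:] if query_words & set(item.lower().split())]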

Unnecessary Tool Calls: If the agent frequently calls tools when it doesn't need to, you're wasting time and potentially money (if the tools cost per call).

2025-11-10 19:10:00 - assistant - INFO - User query: Hello!
2025-11-10 19:10:00 - assistant - INFO - Selected tool: search_engine
2025-11-10 19:10:01 - assistant - INFO - Search returned: [results]
2025-11-10 19:10:01 - assistant - INFO - Assistant response: Hello! How can I help you?

The agent called the search engine for a simple greeting. That's wasteful. Improve the tool selection logic to avoid unnecessary calls.
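A lightweight guard ahead of tool selection catches most of these. The greeting list and function name below are illustrative assumptions:

GREETINGS = {"hello", "hi", "hey", "good morning", "thanks", "thank you"}

def is_small_talk(query: str) -> bool:
    """Return True for greetings and pleasantries that never need a tool."""
    return query.lower().strip(" !?.") in GREETINGS

print(is_small_talk("Hello!"))                       # True  -> respond directly
print(is_small_talk("What's the weather in Paris?")) # False -> run tool selection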

Repeated Patterns: If you see the same type of error happening frequently, that's a signal to address it systematically rather than fixing individual instances.

Frameworks and Tools for Observability

While we've been building our observability from first principles using Python's logging module, several frameworks can help with agent debugging and monitoring in production:

LangSmith (from LangChain): Provides tracing and debugging tools specifically designed for LLM applications. It automatically captures prompts, completions, and chain execution. https://docs.smith.langchain.com/

Weights & Biases: Offers experiment tracking and model monitoring that works well for agents. You can log agent runs, compare different prompt versions, and track performance metrics. https://wandb.ai/

Arize AI: Specializes in ML observability with features for monitoring LLM applications, including prompt tracking and performance analysis. https://arize.com/

Phoenix (from Arize): An open-source tool for LLM observability that helps you trace and debug agent behavior. https://github.com/Arize-ai/phoenix

These tools build on the same principles we've covered: logging decisions, capturing intermediate steps, and making agent behavior observable. They add features like visualization, automatic prompt tracking, and production monitoring. We're learning the fundamentals so you understand what's happening under the hood, but these frameworks can save significant time when you're ready to deploy your agent.

Glossary

Root Cause: The fundamental reason why a problem occurred, as opposed to its symptoms. Finding the root cause means identifying what needs to be fixed so the problem doesn't happen again.

Debugging: The process of finding and fixing problems in a program. For agents, debugging involves using logs to trace through what the agent did and identify where its behavior diverged from what was expected.

Hallucination: When an AI agent generates information that sounds plausible but is actually incorrect or made up. Hallucinations are particularly problematic because the agent presents false information confidently.

Reproduction: The ability to make a problem happen again consistently. If you can reproduce an issue, you can test whether your attempted fix actually solves it.

