Learn how to guide AI agents to verify and refine their reasoning through self-checking techniques. Discover practical methods for catching errors, improving accuracy, and building more reliable AI systems.

This article is part of the free-to-read AI Agent Handbook
Checking and Refining the Agent's Reasoning
In the previous chapter, you learned how chain-of-thought prompting helps agents break down complex problems step by step. But thinking through a problem once isn't always enough. Sometimes the agent makes a mistake in its reasoning, overlooks an important detail, or jumps to a conclusion too quickly.
What if you could teach your agent to double-check its own work? To pause and ask itself, "Wait, does this actually make sense?" This chapter explores how to guide agents to verify and refine their reasoning, catching errors before they become wrong answers.
Why Agents Need to Check Their Work
Imagine asking your personal assistant to calculate the best time to leave for the airport. It reasons through the problem: "Flight is at 3 PM, need to arrive 2 hours early, drive takes 30 minutes, so leave at 12:30 PM." Sounds good, right?
But what if it's Friday afternoon and traffic will be terrible? What if you need to return a rental car first? What if the airport is known for long security lines? A single pass through the reasoning might miss these factors.
Language models, despite their impressive capabilities, can make similar oversights. They might:
- Make arithmetic errors in multi-step calculations
- Misinterpret a key detail in the problem
- Apply the wrong formula or approach
- Forget to consider edge cases or constraints
- Reach a conclusion that contradicts their own reasoning
The good news? You can prompt the agent to review its thinking, just like you might ask a colleague, "Are you sure about that?" or "Can you walk me through your reasoning again?"
The Power of Self-Verification
Let's start with a simple example. Here's an agent solving a word problem:
Problem: "A store has 15 apples. They sell 8 apples in the morning and 6 apples in the afternoon. How many apples are left?"
A single pass through the arithmetic looks reasonable. But what if we prompt the agent to verify its answer?
Example (GPT-5)
Initial answer:
- Start with 15 apples.
- After selling 8 in the morning: 15 − 8 = 7 apples left.
- After selling 6 in the afternoon: 7 − 6 = 1 apple left.

Answer: 1 apple.

Verification:
1) Check each calculation:
   - Morning: 15 − 8 = 7. Correct.
   - Afternoon: 7 − 6 = 1. Correct.
   - Combined check: 15 − (8 + 6) = 15 − 14 = 1. Matches.
2) Logic check:
   - They had enough apples for each sale (7 left after the morning, so selling 6 in the afternoon is feasible).
   - Sequential subtraction and combined subtraction both yield the same result.
3) Consider what might be missing:
   - No mention of restocking, spoilage, or returns, so subtracting the total sold from the initial stock is appropriate.

Conclusion: The answer is correct. 1 apple left. I'm confident in this result.
In this case, verification revealed no errors; the answer was correct. But the process adds confidence. The agent didn't just solve the problem; it confirmed the solution makes sense.
Techniques for Verification
You can guide agents to check their work in several ways. Each technique serves a different purpose.
Ask for Confirmation
The simplest approach: explicitly ask the agent if it's sure. For example: "Take another look at your answer. Are you sure? Walk through each step again and confirm it still holds."

This prompt nudges the agent to review its work without dictating how to do it. Sometimes that's all you need.
Request Alternative Approaches
Ask the agent to solve the problem a different way, then compare results. For example: "Now solve the same problem using a different method, and check whether you arrive at the same answer."

If both approaches yield the same result, you can be more confident. If they differ, something went wrong.
Prompt for Explanation
Ask the agent to explain its reasoning in more detail. For example: "Explain why each step of your solution is valid, as if teaching it to someone new to the topic."

When the agent has to justify its reasoning, it often catches its own mistakes. This is similar to how explaining a problem to someone else helps you spot errors in your own thinking.
Check Against Constraints
Remind the agent of any constraints or requirements, then ask if its answer satisfies them. For example: "Recall the requirements: the total must stay within budget and every category must be covered. Does your answer satisfy each one?"

This structured check helps catch violations the agent might have overlooked.
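The four techniques above can be collected into small, reusable prompt templates. The following is an illustrative sketch; the function names and prompt wording are my own, not from any particular framework:

```python
# Hypothetical prompt templates for the four verification techniques.
# The wording is illustrative; adapt it to your agent's style.

def confirmation_prompt(answer: str) -> str:
    """Ask for confirmation: nudge the agent to double-check itself."""
    return (
        f"You answered: {answer}\n"
        "Are you sure? Re-check each step of your reasoning and "
        "state whether you still stand by this answer."
    )

def alternative_prompt(problem: str) -> str:
    """Request an alternative approach, then compare results."""
    return (
        f"Solve this problem again using a different method:\n{problem}\n"
        "Then compare the two results and report whether they agree."
    )

def explanation_prompt(answer: str) -> str:
    """Prompt for explanation: justifying steps often surfaces errors."""
    return (
        f"You answered: {answer}\n"
        "Explain, step by step, why each part of your reasoning is valid."
    )

def constraint_prompt(answer: str, constraints: list[str]) -> str:
    """Check against constraints: list them explicitly in the prompt."""
    bullet_list = "\n".join(f"- {c}" for c in constraints)
    return (
        f"You answered: {answer}\n"
        f"Check that this answer satisfies every constraint below:\n{bullet_list}"
    )
```

Each template returns a string you would send back to the model as a follow-up turn, so the same helpers work regardless of which provider or SDK your agent uses.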
A More Complex Example
Let's see verification in action with a trickier problem.
Problem: "You have a 3-gallon jug and a 5-gallon jug. How can you measure exactly 4 gallons of water?"
This is a classic puzzle that requires creative thinking. Let's see how an agent might solve it, then verify its solution.
Example (Claude Sonnet 4.5)
Initial solution:

I need to measure exactly 4 gallons using a 3-gallon jug and a 5-gallon jug. Let me work through this step-by-step:

**Solution:**

1. **Fill the 5-gallon jug completely**
   - 3-gal jug: 0 gallons
   - 5-gal jug: 5 gallons
2. **Pour from the 5-gallon jug into the 3-gallon jug**
   - 3-gal jug: 3 gallons (full)
   - 5-gal jug: 2 gallons (remaining)
3. **Empty the 3-gallon jug**
   - 3-gal jug: 0 gallons
   - 5-gal jug: 2 gallons
4. **Pour the 2 gallons from the 5-gallon jug into the 3-gallon jug**
   - 3-gal jug: 2 gallons
   - 5-gal jug: 0 gallons
5. **Fill the 5-gallon jug completely again**
   - 3-gal jug: 2 gallons
   - 5-gal jug: 5 gallons
6. **Pour from the 5-gallon jug into the 3-gallon jug (which already has 2 gallons)**
   - The 3-gallon jug can only take 1 more gallon
   - 3-gal jug: 3 gallons (full)
   - 5-gal jug: **4 gallons** ✓

**Result:** The 5-gallon jug now contains exactly 4 gallons!

Verification:

Let me carefully track each step:

**Step-by-step verification**

| Step | Action | 3-gal jug | 5-gal jug | ✓/✗ |
|------|--------|-----------|-----------|-----|
| Start | Initial state | 0 | 0 | ✓ |
| 1 | Fill 5-gal jug | 0 | 5 | ✓ |
| 2 | Pour 5-gal → 3-gal (pour 3) | 3 | 2 | ✓ |
| 3 | Empty 3-gal jug | 0 | 2 | ✓ |
| 4 | Pour 5-gal → 3-gal (pour 2) | 2 | 0 | ✓ |
| 5 | Fill 5-gal jug | 2 | 5 | ✓ |
| 6 | Pour 5-gal → 3-gal (pour 1) | 3 | 4 | ✓ |

**1. Each step is valid:**
- ✓ Step 2: 3-gal jug empty, can accept 3 gallons from 5-gal
- ✓ Step 4: Pouring 2 gallons into empty 3-gal jug works
- ✓ Step 6: 3-gal jug has 2 gallons, can only accept 1 more (5 − 1 = 4)

**2. Final result:**
- ✓ **5-gallon jug contains exactly 4 gallons**

**3. Simpler solution?** Let me check an alternative:

| Step | Action | 3-gal jug | 5-gal jug |
|------|--------|-----------|-----------|
| 1 | Fill 3-gal | 3 | 0 |
| 2 | Pour 3-gal → 5-gal | 0 | 3 |
| 3 | Fill 3-gal again | 3 | 3 |
| 4 | Pour 3-gal → 5-gal (fills it) | 1 | 5 |
| 5 | Empty 5-gal | 1 | 0 |
| 6 | Pour 3-gal → 5-gal | 0 | 1 |
| 7 | Fill 3-gal | 3 | 1 |
| 8 | Pour 3-gal → 5-gal | 0 | 4 |

This takes **8 steps** vs the original **6 steps**.

**Conclusion**

✅ **The original solution is VERIFIED and CORRECT**
- It successfully produces exactly 4 gallons in the 5-gallon jug
- It uses only 6 steps
- It appears to be the optimal (shortest) solution
As the transcript shows, the agent first lays out a six-step pouring sequence, then walks through each step during verification, tracking both jug levels and even checking an alternative route. It caught no errors here, but the systematic check builds confidence.
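For a puzzle like this, self-verification can also be backed by an external check: a few lines of ordinary code can replay the pouring sequence and confirm the final state. This simulator is a sketch written for this example, not part of any agent framework:

```python
# Mechanically replay the agent's pouring sequence and confirm that the
# 5-gallon jug ends with exactly 4 gallons.

CAPACITY = {"3gal": 3, "5gal": 5}

def pour(state, src, dst):
    """Pour from src into dst until dst is full or src is empty."""
    amount = min(state[src], CAPACITY[dst] - state[dst])
    state[src] -= amount
    state[dst] += amount

def run_moves(moves):
    """Apply a list of (action, jug[, jug]) moves to empty jugs."""
    state = {"3gal": 0, "5gal": 0}
    for action, *args in moves:
        if action == "fill":
            state[args[0]] = CAPACITY[args[0]]
        elif action == "empty":
            state[args[0]] = 0
        elif action == "pour":
            pour(state, args[0], args[1])
    return state

# The six steps from the agent's solution:
solution = [
    ("fill", "5gal"),
    ("pour", "5gal", "3gal"),
    ("empty", "3gal"),
    ("pour", "5gal", "3gal"),
    ("fill", "5gal"),
    ("pour", "5gal", "3gal"),
]

final = run_moves(solution)
print(final)  # {'3gal': 3, '5gal': 4}
```

Running the model's plan through a deterministic simulator is a simple instance of combining self-verification with an external check, a theme that returns later in this chapter.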
When Verification Catches Mistakes
Verification really shines when the initial reasoning has flaws. Consider this scenario:
Problem: "A train travels 60 miles in 1 hour. At this rate, how far will it travel in 90 minutes?"
Initial answer: "60 miles in 60 minutes, so 90 miles in 90 minutes."
At first glance, "90 miles in 90 minutes" looks like it could be the naive assumption that minutes map one-to-one onto miles. Asking the agent to verify forces it to compute the rate explicitly: 60 miles in 60 minutes is 1 mile per minute, so in 90 minutes the train travels 1 × 90 = 90 miles.

Here the initial answer was actually right, because the rate happens to be exactly 1 mile per minute. The verification still earned its keep: it forced the agent to show its work, making the reasoning transparent rather than a lucky pattern match.
Let's try a problem where the initial answer is genuinely wrong:
Problem: "If 5 machines can produce 5 widgets in 5 minutes, how many machines are needed to produce 100 widgets in 100 minutes?"
Initial answer: "20 machines (scaling up proportionally)."
This is a classic trick question. Let's verify:
The verification reveals the hidden rate: if 5 machines produce 5 widgets in 5 minutes, each machine produces 1 widget every 5 minutes. In 100 minutes, a single machine therefore produces 20 widgets, so only 5 machines are needed to make 100 widgets, not 20. By breaking the problem into smaller questions, the agent reconsidered its initial (incorrect) assumption.
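Because the widget puzzle is pure arithmetic, the verification can also be done deterministically in code. A sketch of the rate calculation:

```python
# Verify the widget puzzle numerically: 5 machines make 5 widgets in
# 5 minutes, so each machine makes 1 widget per 5 minutes.
import math

machines, widgets, minutes = 5, 5, 5
rate = widgets / (machines * minutes)  # widgets per machine per minute

# How many widgets does one machine produce in 100 minutes?
per_machine = rate * 100

# Machines needed to produce 100 widgets in 100 minutes:
needed = math.ceil(100 / per_machine)
print(needed)  # 5
```

The intuitive "scale everything by 20" answer fails precisely because the per-machine rate, not the headline numbers, is what scales.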
Iterative Refinement
Sometimes one verification pass isn't enough. You can create a refinement loop where the agent repeatedly improves its answer.
The pattern for iterative refinement is a loop: solve the problem, verify the answer, revise if the verification finds a problem, and repeat until the check passes or a retry limit is reached. This lets the agent improve its answer over multiple passes, catching progressively subtler issues.
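A minimal sketch of such a refinement loop, with a stubbed `ask_model` function standing in for a real model call (the function name and the VERIFIED/ISSUE reply convention are illustrative assumptions, not a real API):

```python
# Iterative refinement: solve, verify, revise, repeat.

def ask_model(prompt: str) -> str:
    # Stub: a real implementation would call your model API here.
    if "Verify" in prompt:
        return "VERIFIED" if "1 apple" in prompt else "ISSUE: recheck the subtraction"
    return "1 apple"

def refine(problem: str, max_passes: int = 3) -> str:
    """Improve an answer over up to max_passes verify/revise rounds."""
    answer = ask_model(f"Solve step by step: {problem}")
    for _ in range(max_passes):
        verdict = ask_model(f"Verify this answer: {answer}\nProblem: {problem}")
        if verdict.startswith("VERIFIED"):
            break  # verification passed; stop refining
        answer = ask_model(f"Revise. Issue found: {verdict}\nProblem: {problem}")
    return answer

print(refine("15 apples, sell 8 then 6; how many remain?"))  # 1 apple
```

The retry limit matters: without it, an agent that keeps failing its own check would loop forever, and each extra pass costs another model call.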
For intermediate readers: This iterative refinement pattern is related to several advanced techniques in AI research. Self-consistency checking (running the same problem multiple times and comparing results) and self-critique (having the model evaluate its own outputs) are active research areas. The key insight is that language models can often recognize errors in reasoning when prompted appropriately, even if they made those errors initially. This works because the verification task is different from the generation task. During generation, the model is sampling from a probability distribution. During verification, it's evaluating a concrete proposal, which can activate different reasoning patterns. However, this isn't foolproof. Models can still miss errors or even introduce new ones during refinement. In production systems, you might combine self-verification with external checks (like running code, querying databases, or using specialized verification models).
Practical Applications
Let's apply these verification techniques to our personal assistant.
Scenario: Planning a Budget
Your assistant helps you plan monthly expenses. You want it to check its own calculations.
Example (GPT-5)
Initial budget:

Here's a simple monthly budget based on $4,000 take-home pay:
- Retirement savings (20%): $800
- Rent (up to 30%): $1,200
- Groceries: $400
- Entertainment: $200
- Utilities: $150

Total allocated (assuming rent at the 30% cap): $2,750
Left for other expenses: $1,250

Notes:
- This leaves 31.25% of your income for other expenses (transportation, insurance, debt payments, phone/internet, healthcare, clothing, misc, or extra savings).
- If your rent is less than $1,200, the leftover amount will be higher by the difference.

Verification:

Verification of the proposed budget (assuming rent at the 30% cap):

1) Add up all allocations
   - Retirement: $800
   - Rent: $1,200
   - Groceries: $400
   - Entertainment: $200
   - Utilities: $150
   - Total allocated = 800 + 1,200 + 400 + 200 + 150 = $2,750
   - Check against income: $2,750 ≤ $4,000 (OK)
2) Check percentage calculations
   - Retirement: 20% of $4,000 = 0.20 × 4,000 = $800 (matches)
   - Rent: 30% of $4,000 = 0.30 × 4,000 = $1,200 (matches)
   - Groceries as %: 400 / 4,000 = 10.0%
   - Entertainment as %: 200 / 4,000 = 5.0%
   - Utilities as %: 150 / 4,000 = 3.75%
   - Total % allocated = 20% + 30% + 10% + 5% + 3.75% = 68.75%
   - Remaining % = 100% − 68.75% = 31.25%
3) Confirm remaining amount
   - Remaining dollars = $4,000 − $2,750 = $1,250
   - Cross-check via %: 31.25% × $4,000 = 0.3125 × 4,000 = $1,250 (matches)
4) Notes on realism
   - Rent at $1,200 (30%) may be challenging in high-cost areas but is a solid target if feasible.
   - Utilities at $150 could be low if it needs to include electric, gas, water, trash, internet, and phone; you may want to budget internet/phone separately in "other expenses."
   - Groceries at $400 is reasonable for one person; for multiple people it may be tight.
   - Retirement at 20% of take-home is ambitious and excellent if cash flow allows; ensure you still cover essentials, insurance, and any debt.

Result: Left for other expenses = $1,250. If your actual rent is below $1,200, the remaining amount increases dollar-for-dollar.
In this case, the verification confirms the arithmetic (allocations sum to $2,750, leaving $1,250, or 31.25% of income) and flags realism concerns, such as the low utilities estimate. This systematic check ensures the budget is accurate before you rely on it.
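The dollar arithmetic in this verification is exactly the kind of check you can also run outside the model. Re-computing the transcript's figures in a few lines of code:

```python
# Cross-check the assistant's budget arithmetic with the same figures.
income = 4000
allocations = {
    "retirement": 0.20 * income,   # 20% of take-home pay
    "rent": 0.30 * income,         # rent at the 30% cap
    "groceries": 400,
    "entertainment": 200,
    "utilities": 150,
}

total = sum(allocations.values())
remaining = income - total
remaining_pct = remaining / income * 100

print(total, remaining, remaining_pct)
```

Running the numbers independently confirms the model's totals: $2,750 allocated, $1,250 (31.25%) left over. For financial outputs, this kind of deterministic cross-check is cheap insurance.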
Limitations and Considerations
While verification is powerful, it's not magic. Keep these limitations in mind:
Verification isn't perfect: The agent can still miss errors, especially subtle ones. It's checking its own work with the same reasoning capabilities that produced the initial answer.
It adds cost and latency: Each verification pass means another API call, which takes time and costs money. Use verification judiciously for important decisions, not every trivial query.
Over-verification can confuse: Asking the agent to verify too many times might lead it to second-guess correct answers or introduce new errors.
Some errors are hard to catch: If the agent fundamentally misunderstands the problem, verification might not help. It will just verify the wrong approach more confidently.
Think of verification as a safety net, not a guarantee. It significantly improves reliability, but it doesn't eliminate the need for human oversight on important decisions.
When to Use Verification
Use verification strategically:
High-stakes decisions: When the cost of an error is high (financial calculations, medical information, legal advice), always verify.
Complex reasoning: Multi-step problems with many opportunities for errors benefit from verification.
Unfamiliar domains: When the agent is working in an area where it might lack knowledge, verification helps catch knowledge gaps.
User-facing outputs: Before presenting an answer to a user, especially in professional contexts, verification adds polish.
Skip verification for: Simple queries, creative tasks where there's no "right" answer, or when speed matters more than perfect accuracy.
Combining Verification with Chain-of-Thought
Verification works even better when combined with chain-of-thought reasoning from the previous chapter. Here's the pattern:
- Think step by step (chain-of-thought): Break down the problem
- Solve: Work through each step
- Verify: Check the reasoning and calculations
- Refine: Correct any errors found
This four-stage process (think, solve, verify, refine) creates a robust reasoning pipeline for your agent.
Example (Claude Sonnet 4.5)
Problem: "A rectangular garden is 12 meters long and 8 meters wide. A path 1 meter wide runs around the entire garden. What is the area of the path?"
Solution:

Let me work through this step by step.

**Step 1: Dimensions of the garden**
The garden is a rectangle with:
- **Length = 12 meters**
- **Width = 8 meters**
- **Area of garden = 12 × 8 = 96 m²**

**Step 2: Dimensions including the path**
The path is 1 meter wide and goes around the **entire** garden, which means it adds 1 meter on **all sides**. So we add:
- 1 meter on the left + 1 meter on the right = 2 meters to the width
- 1 meter on the top + 1 meter on the bottom = 2 meters to the length

New dimensions (garden + path):
- **Length = 12 + 2 = 14 meters**
- **Width = 8 + 2 = 10 meters**
- **Total area = 14 × 10 = 140 m²**

**Step 3: Calculate the path area**
The path area is the difference between the total area and the garden area:
**Path area = Total area − Garden area = 140 − 96 = 44 m²**

**Answer:** The area of the path is **44 square meters**.

Verification:

Your solution is **completely correct**! Let me verify each step:

**Step 1: Garden dimensions ✓**
- Length = 12 meters
- Width = 8 meters
- Garden area = 12 × 8 = 96 m² ✓

**Step 2: Outer dimensions (garden + path) ✓**
You correctly identified that a 1-meter path around the **entire** garden adds:
- 1 meter on each side (left AND right) = +2 meters to width
- 1 meter on each end (top AND bottom) = +2 meters to length

Outer dimensions:
- Length = 12 + 2 = 14 meters ✓
- Width = 8 + 2 = 10 meters ✓
- Total area = 14 × 10 = 140 m² ✓

**Step 3: Path area calculation ✓**
Path area = Total area − Garden area = 140 − 96 = **44 m²** ✓

**Alternative verification method:**
The path can be thought of as four rectangles (with overlapping corners):
- Two long sides: 2 × (12 × 1) = 24 m²
- Two short sides: 2 × (8 × 1) = 16 m²
- Four corners: 4 × (1 × 1) = 4 m²
- Total: 24 + 16 + 4 = 44 m² ✓

**Your answer of 44 square meters is correct!** 🎉
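Both verification routes in the transcript (outer rectangle minus garden, and summing the path's strips and corners) are easy to reproduce in code as an external cross-check:

```python
# Verify the garden-path answer two independent ways.
length, width, path = 12, 8, 1

# Method 1: outer rectangle minus the garden itself
outer_area = (length + 2 * path) * (width + 2 * path)  # 14 * 10 = 140
garden_area = length * width                           # 12 * 8 = 96
path_area = outer_area - garden_area

# Method 2: two long strips + two short strips + four corner squares
alt = 2 * (length * path) + 2 * (width * path) + 4 * (path * path)

print(path_area, alt)  # 44 44
```

Two independent methods agreeing is the code-level analogue of the "request alternative approaches" technique from earlier in this chapter.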
This combined approach gives you both the benefits of structured thinking and the safety of verification.
Building Verification Into Your Agent
As you develop your personal assistant, consider building verification into its core workflow for critical tasks: generate an answer, optionally run a verification pass, and revise only if the check finds a problem. Gating verification behind a simple flag makes it easy to enable or disable based on the task's importance.
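One way to sketch that gated workflow, again with a stubbed `ask_model` placeholder (the function names and the simple "OK" verdict convention are illustrative assumptions, not a real API):

```python
# Verification gated behind a flag: cheap fast path for low-stakes
# queries, extra checking for high-stakes ones.

def ask_model(prompt: str) -> str:
    # Stub so the example runs; replace with a real model API call.
    return "OK" if prompt.startswith("Verify") else "draft answer"

def answer(question: str, verify: bool = False) -> str:
    draft = ask_model(f"Answer: {question}")
    if not verify:
        return draft  # fast path: skip verification entirely
    verdict = ask_model(f"Verify: {draft}")
    if verdict == "OK":
        return draft
    # Verification flagged a problem: ask for a revision.
    return ask_model(f"Revise: {draft}\nIssue: {verdict}")

# Enable the check only where the stakes justify the extra call:
print(answer("what's 2 + 2?"))                 # no verification pass
print(answer("plan my budget", verify=True))   # verified before returning
```

In a real system the flag might be set per task type (financial calculations always verified, small talk never), which keeps the latency and cost of verification where it pays off.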
Key Takeaways
- Verification improves accuracy: Prompting agents to check their work catches many errors
- Multiple techniques exist: Confirmation, alternative approaches, explanation, and constraint checking all help
- Iterative refinement: Multiple verification passes can progressively improve answers
- Combine with chain-of-thought: Verification works best alongside structured reasoning
- Use strategically: Apply verification to high-stakes or complex problems, not every query
- Not foolproof: Verification helps but doesn't guarantee correctness
With verification techniques in your toolkit, your agent becomes more reliable and trustworthy. It doesn't just solve problems; it double-checks its work, catching errors before they reach you.
Quiz
Ready to test your understanding? Take this quick quiz to reinforce what you've learned about checking and refining agent reasoning.
About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.