Learn how to guide AI agents to verify and refine their reasoning through self-checking techniques. Discover practical methods for catching errors, improving accuracy, and building more reliable AI systems.

This article is part of the free-to-read AI Agent Handbook
Checking and Refining the Agent's Reasoning
In the previous chapter, you learned how chain-of-thought prompting helps agents break down complex problems step by step. But thinking through a problem once isn't always enough. Sometimes the agent makes a mistake in its reasoning, overlooks an important detail, or jumps to a conclusion too quickly.
What if you could teach your agent to double-check its own work? To pause and ask itself, "Wait, does this actually make sense?" This chapter explores how to guide agents to verify and refine their reasoning, catching errors before they become wrong answers.
Why Agents Need to Check Their Work
Imagine asking your personal assistant to calculate the best time to leave for the airport. It reasons through the problem: "Flight is at 3 PM, need to arrive 2 hours early, drive takes 30 minutes, so leave at 12:30 PM." Sounds good, right?
But what if it's Friday afternoon and traffic will be terrible? What if you need to return a rental car first? What if the airport is known for long security lines? A single pass through the reasoning might miss these factors.
Language models, despite their impressive capabilities, can make similar oversights. They might:
- Make arithmetic errors in multi-step calculations
- Misinterpret a key detail in the problem
- Apply the wrong formula or approach
- Forget to consider edge cases or constraints
- Reach a conclusion that contradicts their own reasoning
The good news? You can prompt the agent to review its thinking, just like you might ask a colleague, "Are you sure about that?" or "Can you walk me through your reasoning again?"
The Power of Self-Verification
Let's start with a simple example. Here's an agent solving a word problem:
Problem: "A store has 15 apples. They sell 8 apples in the morning and 6 apples in the afternoon. How many apples are left?"
A single pass through the arithmetic looks reasonable. But what if we prompt the agent to verify its answer?
Example (GPT-5)
Initial answer:
- Start with 15 apples.
- After selling 8 in the morning: 15 − 8 = 7 apples left.
- After selling 6 in the afternoon: 7 − 6 = 1 apple left.

Answer: 1 apple.

Verification:
1) Check each calculation:
   - Morning: 15 − 8 = 7. Correct.
   - Afternoon: 7 − 6 = 1. Correct.
   - Combined check: 15 − (8 + 6) = 15 − 14 = 1. Matches.
2) Logic check:
   - They had enough apples for each sale (7 left after the morning, so selling 6 in the afternoon is feasible).
   - Sequential subtraction and combined subtraction both yield the same result.
3) Consider what might be missing:
   - No mention of restocking, spoilage, or returns, so subtracting the total sold from the initial stock is appropriate.

Conclusion: The answer is correct. 1 apple left. I'm confident in this result.
In this case, verification revealed no errors; the answer was correct. But the process adds confidence. The agent didn't just solve the problem; it confirmed the solution makes sense.
Techniques for Verification
You can guide agents to check their work in several ways. Each technique serves a different purpose.
Ask for Confirmation
The simplest approach: explicitly ask the agent if it's sure. For example: "Take another look at your answer. Are you sure? Walk through each step again and confirm it still holds."

This prompt nudges the agent to review its work without dictating how to do it. Sometimes that's all you need.
Request Alternative Approaches
Ask the agent to solve the problem a different way, then compare results. For example: "Now solve the same problem using a different method, and check whether you arrive at the same answer."

If both approaches yield the same result, you can be more confident. If they differ, something went wrong.
Prompt for Explanation
Ask the agent to explain its reasoning in more detail. For example: "Explain why each step of your solution is valid, as if teaching it to someone new to the topic."

When the agent has to justify its reasoning, it often catches its own mistakes. This is similar to how explaining a problem to someone else helps you spot errors in your own thinking.
Check Against Constraints
Remind the agent of any constraints or requirements, then ask if its answer satisfies them. For example: "Recall the requirements: the total must stay within budget and every category must be covered. Does your answer satisfy each one?"

This structured check helps catch violations the agent might have overlooked.
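The four techniques above can be collected into small, reusable prompt templates. The following is an illustrative sketch; the function names and prompt wording are my own, not from any particular framework:

```python
# Hypothetical prompt templates for the four verification techniques.
# The wording is illustrative; adapt it to your agent's style.

def confirmation_prompt(answer: str) -> str:
    """Ask for confirmation: nudge the agent to double-check itself."""
    return (
        f"You answered: {answer}\n"
        "Are you sure? Re-check each step of your reasoning and "
        "state whether you still stand by this answer."
    )

def alternative_prompt(problem: str) -> str:
    """Request an alternative approach, then compare results."""
    return (
        f"Solve this problem again using a different method:\n{problem}\n"
        "Then compare the two results and report whether they agree."
    )

def explanation_prompt(answer: str) -> str:
    """Prompt for explanation: justifying steps often surfaces errors."""
    return (
        f"You answered: {answer}\n"
        "Explain, step by step, why each part of your reasoning is valid."
    )

def constraint_prompt(answer: str, constraints: list[str]) -> str:
    """Check against constraints: list them explicitly in the prompt."""
    bullet_list = "\n".join(f"- {c}" for c in constraints)
    return (
        f"You answered: {answer}\n"
        f"Check that this answer satisfies every constraint below:\n{bullet_list}"
    )
```

Each template returns a string you would send back to the model as a follow-up turn, so the same helpers work regardless of which provider or SDK your agent uses.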
A More Complex Example
Let's see verification in action with a trickier problem.
Problem: "You have a 3-gallon jug and a 5-gallon jug. How can you measure exactly 4 gallons of water?"
This is a classic puzzle that requires creative thinking. Let's see how an agent might solve it, then verify its solution.
Example (Claude Sonnet 4.5)
Initial solution:

I need to measure exactly 4 gallons using a 3-gallon jug and a 5-gallon jug. Let me work through this step-by-step:

**Solution:**

1. **Fill the 5-gallon jug completely**
   - 3-gal jug: 0 gallons
   - 5-gal jug: 5 gallons
2. **Pour from the 5-gallon jug into the 3-gallon jug**
   - 3-gal jug: 3 gallons (full)
   - 5-gal jug: 2 gallons (remaining)
3. **Empty the 3-gallon jug**
   - 3-gal jug: 0 gallons
   - 5-gal jug: 2 gallons
4. **Pour the 2 gallons from the 5-gallon jug into the 3-gallon jug**
   - 3-gal jug: 2 gallons
   - 5-gal jug: 0 gallons
5. **Fill the 5-gallon jug completely again**
   - 3-gal jug: 2 gallons
   - 5-gal jug: 5 gallons
6. **Pour from the 5-gallon jug into the 3-gallon jug (which already has 2 gallons)**
   - The 3-gallon jug can only take 1 more gallon
   - 3-gal jug: 3 gallons (full)
   - 5-gal jug: **4 gallons** ✓

**Result:** The 5-gallon jug now contains exactly 4 gallons!

Verification:

Let me carefully track each step:

**Step-by-step verification**

| Step | Action | 3-gal jug | 5-gal jug | ✓/✗ |
|------|--------|-----------|-----------|-----|
| Start | Initial state | 0 | 0 | ✓ |
| 1 | Fill 5-gal jug | 0 | 5 | ✓ |
| 2 | Pour 5-gal → 3-gal (pour 3) | 3 | 2 | ✓ |
| 3 | Empty 3-gal jug | 0 | 2 | ✓ |
| 4 | Pour 5-gal → 3-gal (pour 2) | 2 | 0 | ✓ |
| 5 | Fill 5-gal jug | 2 | 5 | ✓ |
| 6 | Pour 5-gal → 3-gal (pour 1) | 3 | 4 | ✓ |

**1. Each step is valid:**
- ✓ Step 2: 3-gal jug empty, can accept 3 gallons from 5-gal
- ✓ Step 4: Pouring 2 gallons into empty 3-gal jug works
- ✓ Step 6: 3-gal jug has 2 gallons, can only accept 1 more (5 − 1 = 4)

**2. Final result:**
- ✓ **5-gallon jug contains exactly 4 gallons**

**3. Simpler solution?** Let me check an alternative:

| Step | Action | 3-gal jug | 5-gal jug |
|------|--------|-----------|-----------|
| 1 | Fill 3-gal | 3 | 0 |
| 2 | Pour 3-gal → 5-gal | 0 | 3 |
| 3 | Fill 3-gal again | 3 | 3 |
| 4 | Pour 3-gal → 5-gal (fills it) | 1 | 5 |
| 5 | Empty 5-gal | 1 | 0 |
| 6 | Pour 3-gal → 5-gal | 0 | 1 |
| 7 | Fill 3-gal | 3 | 1 |
| 8 | Pour 3-gal → 5-gal | 0 | 4 |

This takes **8 steps** vs the original **6 steps**.

**Conclusion**

✅ **The original solution is VERIFIED and CORRECT**
- It successfully produces exactly 4 gallons in the 5-gallon jug
- It uses only 6 steps
- It appears to be the optimal (shortest) solution
As the transcript shows, the agent first lays out a six-step pouring sequence, then walks through each step during verification, tracking both jug levels and even checking an alternative route. It caught no errors here, but the systematic check builds confidence.
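For a puzzle like this, self-verification can also be backed by an external check: a few lines of ordinary code can replay the pouring sequence and confirm the final state. This simulator is a sketch written for this example, not part of any agent framework:

```python
# Mechanically replay the agent's pouring sequence and confirm that the
# 5-gallon jug ends with exactly 4 gallons.

CAPACITY = {"3gal": 3, "5gal": 5}

def pour(state, src, dst):
    """Pour from src into dst until dst is full or src is empty."""
    amount = min(state[src], CAPACITY[dst] - state[dst])
    state[src] -= amount
    state[dst] += amount

def run_moves(moves):
    """Apply a list of (action, jug[, jug]) moves to empty jugs."""
    state = {"3gal": 0, "5gal": 0}
    for action, *args in moves:
        if action == "fill":
            state[args[0]] = CAPACITY[args[0]]
        elif action == "empty":
            state[args[0]] = 0
        elif action == "pour":
            pour(state, args[0], args[1])
    return state

# The six steps from the agent's solution:
solution = [
    ("fill", "5gal"),
    ("pour", "5gal", "3gal"),
    ("empty", "3gal"),
    ("pour", "5gal", "3gal"),
    ("fill", "5gal"),
    ("pour", "5gal", "3gal"),
]

final = run_moves(solution)
print(final)  # {'3gal': 3, '5gal': 4}
```

Running the model's plan through a deterministic simulator is a simple instance of combining self-verification with an external check, a theme that returns later in this chapter.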
When Verification Catches Mistakes
Verification really shines when the initial reasoning has flaws. Consider this scenario:
Problem: "A train travels 60 miles in 1 hour. At this rate, how far will it travel in 90 minutes?"
Initial answer: "60 miles in 60 minutes, so 90 miles in 90 minutes."
At first glance, "90 miles in 90 minutes" looks like it could be the naive assumption that minutes map one-to-one onto miles. Asking the agent to verify forces it to compute the rate explicitly: 60 miles in 60 minutes is 1 mile per minute, so in 90 minutes the train travels 1 × 90 = 90 miles.

Here the initial answer was actually right, because the rate happens to be exactly 1 mile per minute. The verification still earned its keep: it forced the agent to show its work, making the reasoning transparent rather than a lucky pattern match.
Let's try a problem where the initial answer is genuinely wrong:
Problem: "If 5 machines can produce 5 widgets in 5 minutes, how many machines are needed to produce 100 widgets in 100 minutes?"
Initial answer: "20 machines (scaling up proportionally)."
This is a classic trick question. Let's verify:
The verification reveals the hidden rate: if 5 machines produce 5 widgets in 5 minutes, each machine produces 1 widget every 5 minutes. In 100 minutes, a single machine therefore produces 20 widgets, so only 5 machines are needed to make 100 widgets, not 20. By breaking the problem into smaller questions, the agent reconsidered its initial (incorrect) assumption.
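Because the widget puzzle is pure arithmetic, the verification can also be done deterministically in code. A sketch of the rate calculation:

```python
# Verify the widget puzzle numerically: 5 machines make 5 widgets in
# 5 minutes, so each machine makes 1 widget per 5 minutes.
import math

machines, widgets, minutes = 5, 5, 5
rate = widgets / (machines * minutes)  # widgets per machine per minute

# How many widgets does one machine produce in 100 minutes?
per_machine = rate * 100

# Machines needed to produce 100 widgets in 100 minutes:
needed = math.ceil(100 / per_machine)
print(needed)  # 5
```

The intuitive "scale everything by 20" answer fails precisely because the per-machine rate, not the headline numbers, is what scales.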
Iterative Refinement
Sometimes one verification pass isn't enough. You can create a refinement loop where the agent repeatedly improves its answer.
The pattern for iterative refinement is a loop: solve the problem, verify the answer, revise if the verification finds a problem, and repeat until the check passes or a retry limit is reached. This lets the agent improve its answer over multiple passes, catching progressively subtler issues.
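A minimal sketch of such a refinement loop, with a stubbed `ask_model` function standing in for a real model call (the function name and the VERIFIED/ISSUE reply convention are illustrative assumptions, not a real API):

```python
# Iterative refinement: solve, verify, revise, repeat.

def ask_model(prompt: str) -> str:
    # Stub: a real implementation would call your model API here.
    if "Verify" in prompt:
        return "VERIFIED" if "1 apple" in prompt else "ISSUE: recheck the subtraction"
    return "1 apple"

def refine(problem: str, max_passes: int = 3) -> str:
    """Improve an answer over up to max_passes verify/revise rounds."""
    answer = ask_model(f"Solve step by step: {problem}")
    for _ in range(max_passes):
        verdict = ask_model(f"Verify this answer: {answer}\nProblem: {problem}")
        if verdict.startswith("VERIFIED"):
            break  # verification passed; stop refining
        answer = ask_model(f"Revise. Issue found: {verdict}\nProblem: {problem}")
    return answer

print(refine("15 apples, sell 8 then 6; how many remain?"))  # 1 apple
```

The retry limit matters: without it, an agent that keeps failing its own check would loop forever, and each extra pass costs another model call.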
For intermediate readers: This iterative refinement pattern is related to several advanced techniques in AI research. Self-consistency checking (running the same problem multiple times and comparing results) and self-critique (having the model evaluate its own outputs) are active research areas. The key insight is that language models can often recognize errors in reasoning when prompted appropriately, even if they made those errors initially. This works because the verification task is different from the generation task. During generation, the model is sampling from a probability distribution. During verification, it's evaluating a concrete proposal, which can activate different reasoning patterns. However, this isn't foolproof. Models can still miss errors or even introduce new ones during refinement. In production systems, you might combine self-verification with external checks (like running code, querying databases, or using specialized verification models).
Practical Applications
Let's apply these verification techniques to our personal assistant.
Scenario: Planning a Budget
Your assistant helps you plan monthly expenses. You want it to check its own calculations.
Example (GPT-5)
Initial budget:

Here's a simple monthly budget based on $4,000 take-home pay:
- Retirement savings (20%): $800
- Rent (up to 30%): $1,200
- Groceries: $400
- Entertainment: $200
- Utilities: $150

Total allocated (assuming rent at the 30% cap): $2,750
Left for other expenses: $1,250

Notes:
- This leaves 31.25% of your income for other expenses (transportation, insurance, debt payments, phone/internet, healthcare, clothing, misc, or extra savings).
- If your rent is less than $1,200, the leftover amount will be higher by the difference.

Verification:

Verification of the proposed budget (assuming rent at the 30% cap):

1) Add up all allocations
   - Retirement: $800
   - Rent: $1,200
   - Groceries: $400
   - Entertainment: $200
   - Utilities: $150
   - Total allocated = 800 + 1,200 + 400 + 200 + 150 = $2,750
   - Check against income: $2,750 ≤ $4,000 (OK)
2) Check percentage calculations
   - Retirement: 20% of $4,000 = 0.20 × 4,000 = $800 (matches)
   - Rent: 30% of $4,000 = 0.30 × 4,000 = $1,200 (matches)
   - Groceries as %: 400 / 4,000 = 10.0%
   - Entertainment as %: 200 / 4,000 = 5.0%
   - Utilities as %: 150 / 4,000 = 3.75%
   - Total % allocated = 20% + 30% + 10% + 5% + 3.75% = 68.75%
   - Remaining % = 100% − 68.75% = 31.25%
3) Confirm remaining amount
   - Remaining dollars = $4,000 − $2,750 = $1,250
   - Cross-check via %: 31.25% × $4,000 = 0.3125 × 4,000 = $1,250 (matches)
4) Notes on realism
   - Rent at $1,200 (30%) may be challenging in high-cost areas but is a solid target if feasible.
   - Utilities at $150 could be low if it needs to include electric, gas, water, trash, internet, and phone; you may want to budget internet/phone separately in "other expenses."
   - Groceries at $400 is reasonable for one person; for multiple people it may be tight.
   - Retirement at 20% of take-home is ambitious and excellent if cash flow allows; ensure you still cover essentials, insurance, and any debt.

Result: Left for other expenses = $1,250. If your actual rent is below $1,200, the remaining amount increases dollar-for-dollar.
In this case, the verification confirms the arithmetic (allocations sum to $2,750, leaving $1,250, or 31.25% of income) and flags realism concerns, such as the low utilities estimate. This systematic check ensures the budget is accurate before you rely on it.
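The dollar arithmetic in this verification is exactly the kind of check you can also run outside the model. Re-computing the transcript's figures in a few lines of code:

```python
# Cross-check the assistant's budget arithmetic with the same figures.
income = 4000
allocations = {
    "retirement": 0.20 * income,   # 20% of take-home pay
    "rent": 0.30 * income,         # rent at the 30% cap
    "groceries": 400,
    "entertainment": 200,
    "utilities": 150,
}

total = sum(allocations.values())
remaining = income - total
remaining_pct = remaining / income * 100

print(total, remaining, remaining_pct)
```

Running the numbers independently confirms the model's totals: $2,750 allocated, $1,250 (31.25%) left over. For financial outputs, this kind of deterministic cross-check is cheap insurance.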
Limitations and Considerations
While verification is powerful, it's not magic. Keep these limitations in mind:
Verification isn't perfect: The agent can still miss errors, especially subtle ones. It's checking its own work with the same reasoning capabilities that produced the initial answer.
It adds cost and latency: Each verification pass means another API call, which takes time and costs money. Use verification judiciously for important decisions, not every trivial query.
Over-verification can confuse: Asking the agent to verify too many times might lead it to second-guess correct answers or introduce new errors.
Some errors are hard to catch: If the agent fundamentally misunderstands the problem, verification might not help. It will just verify the wrong approach more confidently.
Think of verification as a safety net, not a guarantee. It significantly improves reliability, but it doesn't eliminate the need for human oversight on important decisions.
When to Use Verification
Use verification strategically:
High-stakes decisions: When the cost of an error is high (financial calculations, medical information, legal advice), always verify.
Complex reasoning: Multi-step problems with many opportunities for errors benefit from verification.
Unfamiliar domains: When the agent is working in an area where it might lack knowledge, verification helps catch knowledge gaps.
User-facing outputs: Before presenting an answer to a user, especially in professional contexts, verification adds polish.
Skip verification for: Simple queries, creative tasks where there's no "right" answer, or when speed matters more than perfect accuracy.
Combining Verification with Chain-of-Thought
Verification works even better when combined with chain-of-thought reasoning from the previous chapter. Here's the pattern:
- Think step by step (chain-of-thought): Break down the problem
- Solve: Work through each step
- Verify: Check the reasoning and calculations
- Refine: Correct any errors found
This four-stage process (think, solve, verify, refine) creates a robust reasoning pipeline for your agent.
Example (Claude Sonnet 4.5)
Problem: "A rectangular garden is 12 meters long and 8 meters wide. A path 1 meter wide runs around the entire garden. What is the area of the path?"
Solution:

Let me work through this step by step.

**Step 1: Dimensions of the garden**
The garden is a rectangle with:
- **Length = 12 meters**
- **Width = 8 meters**
- **Area of garden = 12 × 8 = 96 m²**

**Step 2: Dimensions including the path**
The path is 1 meter wide and goes around the **entire** garden, which means it adds 1 meter on **all sides**. So we add:
- 1 meter on the left + 1 meter on the right = 2 meters to the width
- 1 meter on the top + 1 meter on the bottom = 2 meters to the length

New dimensions (garden + path):
- **Length = 12 + 2 = 14 meters**
- **Width = 8 + 2 = 10 meters**
- **Total area = 14 × 10 = 140 m²**

**Step 3: Calculate the path area**
The path area is the difference between the total area and the garden area:
**Path area = Total area − Garden area = 140 − 96 = 44 m²**

**Answer:** The area of the path is **44 square meters**.

Verification:

Your solution is **completely correct**! Let me verify each step:

**Step 1: Garden dimensions ✓**
- Length = 12 meters
- Width = 8 meters
- Garden area = 12 × 8 = 96 m² ✓

**Step 2: Outer dimensions (garden + path) ✓**
You correctly identified that a 1-meter path around the **entire** garden adds:
- 1 meter on each side (left AND right) = +2 meters to width
- 1 meter on each end (top AND bottom) = +2 meters to length

Outer dimensions:
- Length = 12 + 2 = 14 meters ✓
- Width = 8 + 2 = 10 meters ✓
- Total area = 14 × 10 = 140 m² ✓

**Step 3: Path area calculation ✓**
Path area = Total area − Garden area = 140 − 96 = **44 m²** ✓

**Alternative verification method:**
The path can be thought of as four rectangles (with overlapping corners):
- Two long sides: 2 × (12 × 1) = 24 m²
- Two short sides: 2 × (8 × 1) = 16 m²
- Four corners: 4 × (1 × 1) = 4 m²
- Total: 24 + 16 + 4 = 44 m² ✓

**Your answer of 44 square meters is correct!** 🎉
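Both verification routes in the transcript (outer rectangle minus garden, and summing the path's strips and corners) are easy to reproduce in code as an external cross-check:

```python
# Verify the garden-path answer two independent ways.
length, width, path = 12, 8, 1

# Method 1: outer rectangle minus the garden itself
outer_area = (length + 2 * path) * (width + 2 * path)  # 14 * 10 = 140
garden_area = length * width                           # 12 * 8 = 96
path_area = outer_area - garden_area

# Method 2: two long strips + two short strips + four corner squares
alt = 2 * (length * path) + 2 * (width * path) + 4 * (path * path)

print(path_area, alt)  # 44 44
```

Two independent methods agreeing is the code-level analogue of the "request alternative approaches" technique from earlier in this chapter.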
This combined approach gives you both the benefits of structured thinking and the safety of verification.
Building Verification Into Your Agent
As you develop your personal assistant, consider building verification into its core workflow for critical tasks: generate an answer, optionally run a verification pass, and revise only if the check finds a problem. Gating verification behind a simple flag makes it easy to enable or disable based on the task's importance.
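One way to sketch that gated workflow, again with a stubbed `ask_model` placeholder (the function names and the simple "OK" verdict convention are illustrative assumptions, not a real API):

```python
# Verification gated behind a flag: cheap fast path for low-stakes
# queries, extra checking for high-stakes ones.

def ask_model(prompt: str) -> str:
    # Stub so the example runs; replace with a real model API call.
    return "OK" if prompt.startswith("Verify") else "draft answer"

def answer(question: str, verify: bool = False) -> str:
    draft = ask_model(f"Answer: {question}")
    if not verify:
        return draft  # fast path: skip verification entirely
    verdict = ask_model(f"Verify: {draft}")
    if verdict == "OK":
        return draft
    # Verification flagged a problem: ask for a revision.
    return ask_model(f"Revise: {draft}\nIssue: {verdict}")

# Enable the check only where the stakes justify the extra call:
print(answer("what's 2 + 2?"))                 # no verification pass
print(answer("plan my budget", verify=True))   # verified before returning
```

In a real system the flag might be set per task type (financial calculations always verified, small talk never), which keeps the latency and cost of verification where it pays off.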
Key Takeaways
- Verification improves accuracy: Prompting agents to check their work catches many errors
- Multiple techniques exist: Confirmation, alternative approaches, explanation, and constraint checking all help
- Iterative refinement: Multiple verification passes can progressively improve answers
- Combine with chain-of-thought: Verification works best alongside structured reasoning
- Use strategically: Apply verification to high-stakes or complex problems, not every query
- Not foolproof: Verification helps but doesn't guarantee correctness
With verification techniques in your toolkit, your agent becomes more reliable and trustworthy. It doesn't just solve problems; it double-checks its work, catching errors before they reach you.
Quiz
Ready to test your understanding? Take this quick quiz to reinforce what you've learned about checking and refining agent reasoning.
About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.