Step-by-Step Problem Solving: Chain-of-Thought Reasoning for AI Agents

Michael Brenndoerfer · June 17, 2025 · 15 min read

Learn how to teach AI agents to think through problems step by step using chain-of-thought reasoning. Discover practical techniques for improving accuracy and transparency in complex tasks.

Step-by-Step Problem Solving (Chain-of-Thought)

You've learned how to write clear prompts and use strategies like roles and examples to guide your AI agent. But what happens when you ask your agent a question that requires real thinking? Not just recalling facts, but working through a problem step by step?

Try this experiment. Ask a language model: "If a train leaves Chicago at 2 PM traveling 60 mph, and another train leaves St. Louis (300 miles away) at 3 PM traveling 75 mph toward Chicago, when do they meet?"

You might get an answer. But is it right? The model might jump straight to a conclusion without showing its work. And when the answer is wrong, you have no idea where the reasoning broke down.

Now try adding one simple phrase: "Let's think this through step by step."

Suddenly, the model shows its reasoning. It breaks down the problem, considers each piece, and works toward the answer methodically. This simple technique, called chain-of-thought reasoning, transforms how AI agents handle complex problems.

Why Reasoning Matters

Language models are excellent at pattern matching and generating text. They can recall facts, write coherently, and follow instructions. But complex problems require more than pattern matching. They require reasoning: breaking down a problem, considering relationships, and building toward a solution.

Without explicit guidance to reason, models often take shortcuts. They might pattern-match to similar problems they've seen in training and output an answer that looks plausible but is actually wrong. This is especially common with:

  • Math problems: Where each step depends on the previous one
  • Logic puzzles: Where you need to track multiple constraints
  • Multi-step tasks: Where you must plan a sequence of actions
  • Analytical questions: Where you need to weigh evidence and draw conclusions

The solution isn't a more powerful model (though that can help). The solution is teaching the model to think through problems explicitly, showing its work as it goes.

What Is Chain-of-Thought Reasoning?

Chain-of-thought (CoT) reasoning is simple: instead of asking the model to jump straight to an answer, you prompt it to explain its thinking step by step. You're essentially asking it to "show its work," just like a math teacher would require.

When you use chain-of-thought prompting, the model generates intermediate reasoning steps before arriving at a final answer. These steps serve two purposes:

  1. They improve accuracy: By working through the problem explicitly, the model is less likely to make logical errors or skip important considerations.

  2. They provide transparency: You can see how the model arrived at its answer, which helps you trust the result and debug when something goes wrong.

Think of it like the difference between asking someone "What's $17 \times 23$?" versus "What's $17 \times 23$? Show me how you calculated it." The second request produces not just an answer, but a process you can verify.
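Worked out, the second request might look like $17 \times 23 = 17 \times 20 + 17 \times 3 = 340 + 51 = 391$. Each intermediate product can be checked on its own, which is exactly what chain-of-thought gives you for harder problems.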

The Magic Phrase: "Let's Think Step by Step"

The simplest way to trigger chain-of-thought reasoning is to add a phrase like "Let's think through this step by step" or "Let's solve this step by step" to your prompt. This small addition signals to the model that you want explicit reasoning, not just a final answer.

Example (GPT-5)

Let's see this in action with a simple word problem:

In[3]:
Code
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

## Without chain-of-thought
prompt_simple = """A restaurant has 23 tables. Each table has 4 chairs. 
If 12 chairs are broken and removed, how many chairs are left?"""

## With chain-of-thought
prompt_cot = """A restaurant has 23 tables. Each table has 4 chairs. 
If 12 chairs are broken and removed, how many chairs are left?

Let's think through this step by step."""

## Get both responses
## Using GPT-5 for basic prompting and text generation
response_simple = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": prompt_simple}]
)

response_cot = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": prompt_cot}]
)

print("Without CoT:")
print(response_simple.choices[0].message.content)
print("\nWith CoT:")
print(response_cot.choices[0].message.content)
Out[3]:
Console
Without CoT:
80

Explanation:
- Total chairs = 23 × 4 = 92
- After removing 12 broken chairs: 92 − 12 = 80

With CoT:
- Total chairs originally: 23 tables × 4 chairs/table = 92
- Chairs removed: 12
- Chairs left: 92 − 12 = 80

Answer: 80 chairs.

The first response might give you just the number, with little or no justification. The second spells out the reasoning:

Let's think through this step by step.

Step 1: Calculate the total number of chairs
- 23 tables $\times$ 4 chairs per table = 92 chairs

Step 2: Subtract the broken chairs
- 92 chairs - 12 broken chairs = 80 chairs

Therefore, there are 80 chairs left in the restaurant.

Both prompts land on 80, and that answer is correct: $23 \times 4 = 92$, then $92 - 12 = 80$. The point is that with chain-of-thought, you can verify each step. If there were an error, you'd see exactly where it occurred.

When Chain-of-Thought Helps Most

Chain-of-thought reasoning isn't necessary for every task. If you're asking for a fact ("What's the capital of France?") or a simple transformation ("Translate this to Spanish"), you don't need explicit reasoning steps. The model can answer directly.

But chain-of-thought becomes valuable when:

The problem has multiple steps: Each step builds on the previous one, and skipping a step leads to errors.

The answer isn't obvious: The problem requires analysis, not just recall.

You need to verify the logic: Being able to check the reasoning is as important as getting the right answer.

The model tends to make mistakes: If you notice the model getting certain types of questions wrong, adding chain-of-thought often helps.
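If you find yourself appending the same trigger phrase to many prompts, you can fold it into a small helper. Here's a minimal sketch, assuming the hypothetical helper name with_cot and the same client setup used throughout this chapter:

import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def with_cot(task: str) -> str:
    """Append a chain-of-thought trigger phrase to a task description (hypothetical helper)."""
    return f"{task}\n\nLet's think through this step by step."

## Wrap any prompt that needs explicit reasoning
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": with_cot(
        "A theater has 14 rows with 8 seats each. If 19 seats are reserved, "
        "how many seats are still available?"
    )}]
)
print(response.choices[0].message.content)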

Example: A Tricky Riddle

Riddles are a great test case because they require careful thinking. Let's try one:

Example (GPT-5)

In[4]:
Code
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

riddle = """I speak without a mouth and hear without ears. 
I have no body, but I come alive with wind. What am I?

Let's think through this step by step."""

## Using GPT-5 for reasoning through riddles and logic puzzles
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": riddle}]
)

print(response.choices[0].message.content)
Out[4]:
Console
An echo.

- It “speaks” by repeating sounds without a mouth.
- It “hears” without ears because it only reflects what it receives.
- It has no body.
- Sound travels through air; wind can carry and shape it, making the echo seem to come alive.

The model might respond:

Let's think through this step by step.

Step 1: Analyze the first clue - "I speak without a mouth"
This suggests something that makes sound but isn't alive or doesn't have a physical mouth.

Step 2: Consider the second clue - "hear without ears"
This means it can respond to or carry sound without having ears. This reinforces that it's not a living thing.

Step 3: Examine the third clue - "I have no body"
This confirms it's not a physical creature or person.

Step 4: Look at the final clue - "I come alive with wind"
This is the key clue. Something that needs wind to function and creates sound.

Step 5: Put it together
Something without a body that makes sound and requires wind... this describes an echo. Sound travels through air (wind), bounces off surfaces, and returns as an echo. An echo "speaks" (produces sound) without a mouth, "hears" (responds to sound) without ears, has no physical body, and exists because of sound waves traveling through air.

Answer: An echo

By working through each clue systematically, the model arrives at the correct answer with clear reasoning you can follow.

Teaching Through Examples: Few-Shot Chain-of-Thought

Remember few-shot prompting from the previous chapter? You can combine that technique with chain-of-thought by showing the model examples of step-by-step reasoning. This is especially powerful for tasks where you want consistent reasoning patterns.

Instead of just showing input-output pairs, you show input-reasoning-output triplets. The model learns not just what to answer, but how to think about the problem.

Example (GPT-5)

Let's say you're building a feature where your assistant helps evaluate whether claims are supported by evidence:

In[5]:
Code
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

prompt = """Determine if the claim is supported by the evidence. Show your reasoning.

Example 1:
Claim: "Exercise improves mental health"
Evidence: "A study of 1,000 participants found that those who exercised 30 minutes daily reported 25% lower anxiety levels than those who didn't exercise."
Reasoning:
- The evidence comes from a study with a large sample size (1,000 participants)
- It shows a specific, measurable benefit (25% lower anxiety)
- Anxiety is a component of mental health
- The evidence directly relates to the claim
Conclusion: Supported

Example 2:
Claim: "Coffee causes heart disease"
Evidence: "Some people who drink coffee have reported heart palpitations."
Reasoning:
- The evidence is anecdotal ("some people reported")
- Heart palpitations are not the same as heart disease
- No causal relationship is established (correlation vs causation)
- The evidence is too weak to support the strong claim
Conclusion: Not supported

Now evaluate this:
Claim: "Reading before bed improves sleep quality"
Evidence: "A survey found that 60% of people who read before bed felt they slept better."
"""

## Using GPT-5 for analytical reasoning with few-shot examples
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": prompt}]
)

print(response.choices[0].message.content)
Out[5]:
Console
Reasoning:
- The evidence is from a survey (observational, self-reported), not an experiment.
- “Felt they slept better” is subjective and lacks objective measures of sleep quality.
- No comparison group or baseline is provided; 60% reporting better sleep doesn’t establish that reading causes improvement.
- Correlation does not imply causation; people who already sleep well may be more likely to read before bed.
- Sample size, representativeness, and controls are unknown.

Conclusion: Not supported. The evidence is suggestive of a perceived association but is too weak to establish that reading before bed improves sleep quality.

The model follows the reasoning pattern you demonstrated. Depending on the run, a response might look like this:

Reasoning:
- The evidence comes from a survey, which captures self-reported data
- 60% is a majority, suggesting a notable correlation
- "Felt they slept better" is subjective, not an objective measure of sleep quality
- The evidence shows correlation but doesn't prove causation (other factors could be involved)
- The sample size and methodology aren't specified, which limits confidence
Conclusion: Partially supported (shows correlation but not causation)

By providing examples of good reasoning, you've taught the model how to approach this type of analysis.
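If you plan to evaluate many claims this way, you can store the exemplars once and assemble the prompt programmatically. A minimal sketch, assuming the hypothetical names FEW_SHOT_EXAMPLES and build_claim_prompt (the reasoning bullets are abbreviated here):

## Store the worked examples once and reuse them for every new claim
FEW_SHOT_EXAMPLES = """Example 1:
Claim: "Exercise improves mental health"
Evidence: "A study of 1,000 participants found that those who exercised 30 minutes daily reported 25% lower anxiety levels than those who didn't exercise."
Reasoning:
- Large sample, a specific measurable benefit, and a direct link to the claim
Conclusion: Supported

Example 2:
Claim: "Coffee causes heart disease"
Evidence: "Some people who drink coffee have reported heart palpitations."
Reasoning:
- Anecdotal evidence, palpitations are not heart disease, no causal link established
Conclusion: Not supported"""

def build_claim_prompt(claim: str, evidence: str) -> str:
    """Combine the stored exemplars with a new claim/evidence pair."""
    return (
        "Determine if the claim is supported by the evidence. Show your reasoning.\n\n"
        f"{FEW_SHOT_EXAMPLES}\n\n"
        "Now evaluate this:\n"
        f'Claim: "{claim}"\n'
        f'Evidence: "{evidence}"'
    )

prompt = build_claim_prompt(
    "Reading before bed improves sleep quality",
    "A survey found that 60% of people who read before bed felt they slept better.",
)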

Practical Applications for Your Personal Assistant

Let's apply chain-of-thought reasoning to make your personal assistant more capable. Here are some scenarios where it helps:

Planning a Multi-Step Task

In[6]:
Code
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

prompt = """I need to prepare for a presentation next Tuesday. I need to:
- Research the topic (3 hours)
- Create slides (4 hours)
- Practice presenting (2 hours)
- Get feedback from a colleague (1 hour)

Today is Thursday. I have 2 hours available each evening (Thu, Fri, Mon).
I have 6 hours available on Saturday.

Create a schedule for completing these tasks. Think through this step by step."""

## Using GPT-5 for multi-step planning and scheduling
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": prompt}]
)

print(response.choices[0].message.content)
Out[6]:
Console
Here’s the step-by-step reasoning and a concrete schedule.

How I’m thinking about it
- Dependencies: Research should come before creating slides; slides should be mostly done before serious practice; feedback is most useful once a draft deck exists; a final practice after feedback locks it in.
- Time math: You need 10 hours total. You have 12 hours available (Thu 2 + Fri 2 + Sat 6 + Mon 2), so we’ll build in buffer.
- Goal: Finish the core work by Saturday, use Monday for feedback and final refinement/practice.

Schedule
- Thursday evening (2h)
  - Research: 2h
  - Admin: Message your colleague to book a 1h feedback slot on Monday and tell them you’ll send the deck beforehand.

- Friday evening (2h)
  - Research: 1h (completes 3h total)
  - Slides: 1h (outline/storyboard, define key messages and flow)

- Saturday (6h)
  - Slides: 3h (build the full deck; this, plus Friday’s 1h, completes 4h slides)
  - Practice: 1h (first timed run-through to check flow and timing)
  - Buffer/Polish: 2h (optional but recommended)
    - Use for improving visuals, tightening wording, adding speaker notes, and addressing issues uncovered in the practice.
    - Send the draft deck to your colleague when done so they can skim before Monday.

- Monday evening (2h)
  - Feedback with colleague: 1h (walk them through the deck; capture notes)
  - Final practice: 1h (incorporate key feedback, then do a final timed run)

Notes
- You’ll complete the required 10 hours by using: Research 3h (Thu+Fri), Slides 4h (Fri+Sat), Practice 2h (Sat+Mon), Feedback 1h (Mon).
- You still have up to 2 hours of buffer on Saturday for polish or rest.
- If your colleague can meet Saturday instead, move the 1h feedback into Saturday and use both Monday hours for final tweaks and one last full run.

The model will break down the problem:

Let's think through this step by step.

Step 1: Calculate total time needed
- Research: 3 hours
- Create slides: 4 hours
- Practice: 2 hours
- Feedback: 1 hour
Total: 10 hours

Step 2: Calculate available time
- Thursday evening: 2 hours
- Friday evening: 2 hours
- Saturday: 6 hours
- Monday evening: 2 hours
Total: 12 hours (enough time with 2 hours buffer)

Step 3: Consider dependencies
- Must research before creating slides
- Must create slides before practicing
- Should get feedback after practicing

Step 4: Create schedule
Thursday evening (2 hours): Research (part 1)
Friday evening (2 hours): Research (part 2, complete)
Saturday (6 hours): Create slides (4 hours) + Practice (2 hours)
Monday evening (2 hours): Get colleague feedback (1 hour) + final practice (1 hour)

This schedule respects the dependencies and uses 11 of your 12 available hours.

Debugging Logic

Chain-of-thought is also valuable for helping users debug their own thinking:

In[7]:
Code
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

prompt = """I'm trying to decide if I should buy a new laptop. Help me think through this decision step by step.

Current laptop: 4 years old, still works but slow
New laptop cost: $1,200
My budget: $2,000 saved for tech upgrades
Other needs: Might need a new phone in 6 months ($800)

Should I buy the laptop now?"""

## Using GPT-5 for structured decision-making and analysis
response = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": prompt}]
)

print(response.choices[0].message.content)
Out[7]:
Console
Here’s a simple step-by-step way to decide, with a clear recommendation at the end.

1) Clarify urgency and value
- How often does the slowness hurt you? If it costs you 5–10 minutes a day and you use the laptop for work/school, a faster machine can pay for itself quickly.
  • Quick check: If your time is worth $30/hour and a new laptop saves 5 minutes/day, 5 days/week, over 2 years, that’s ~43 hours saved ≈ $1,300—enough to justify a $1,200 laptop.

2) Do the budget math (include tax and extras)
- Current tech fund: $2,000
- Laptop: $1,200
- Remaining: $800
- Phone in ~6 months: $800
- That leaves $0 before tax/warranty/accessories.
- Add estimated sales tax and basics:
  • If tax is ~8%: laptop tax ≈ $96; phone tax ≈ $64; total ≈ $160
  • Common extras to consider: case/charger/dock ≈ $50–$100; optional warranty ≈ $100–$200
- Likely shortfall: ~$160–$360 unless you can save more in the next 6 months.
- What you need to save: About $45–$60/month for 6 months would comfortably cover tax/extras.

3) Consider lower-cost alternatives (to buy time or save money)
- Tune-up now (free): OS cleanup, remove startup apps, check for malware, update drivers, reset/clean install.
- Cheap hardware refresh (if your model allows):
  • Upgrade to SSD (if you don’t have one) and/or add RAM: $60–$150; can make a 4‑year‑old laptop feel dramatically faster.
  • Replace battery if degraded: $50–$100.
  • These can extend usable life 1–2 years at a fraction of the cost.
- Shop smarter:
  • Look for refurbished/open-box from reputable sellers: often $600–$900 for very capable machines.
  • Watch for sales; 10–20% off turns a $1,200 laptop into ~$960–$1,080, which likely fixes the tax/extra shortfall.
  • Check trade-in credits or card benefits (cashback/extended warranty).

4) Timing and risk
- If your current laptop is stable and the slowness is tolerable, waiting 1–2 months can net post-holiday/clearance deals or new model announcements, often improving value.
- If the laptop is mission-critical and costing time daily (or showing signs of impending failure), the productivity and reliability gains favor buying sooner.

Recommendation
- Buy now if:
  • You use the laptop daily for work/school and estimate at least ~5 minutes/day saved, and
  • You can save roughly $50/month over the next 6 months OR you can find a deal that keeps the laptop near $1,100 out-the-door (price plus tax) so you’re not short when the phone purchase comes due.
- Wait/bridge if:
  • You can’t comfortably save ~$45–$60/month before the phone purchase, or
  • Slowness is tolerable and a tune-up/SSD/RAM upgrade could carry you another 6–12 months, letting you buy during a strong sale or after replenishing savings.

Quick next steps
- Tell me your sales tax rate and whether you can save ~$50/month until the phone purchase.
- Share your current laptop specs (CPU, RAM, storage type, battery health) and main tasks. I can estimate how much a tune-up or cheap upgrade would help and suggest specific models/deals if buying now makes sense.

The model will help structure the decision:

Let's think through this step by step.

Step 1: Assess the urgency
- Current laptop still works, just slow
- This suggests want, not urgent need
- You can likely wait if needed

Step 2: Check the budget
- Total budget: $2,000
- Laptop cost: $1,200
- Phone cost (in 6 months): $800
- Total needed: $2,000
- Remaining after both: $0

Step 3: Consider timing
- If you buy the laptop now: $800 left for phone
- That's exactly enough for the phone
- But no buffer for unexpected expenses

Step 4: Evaluate alternatives
- Wait on laptop, buy phone when needed: keeps more flexibility
- Buy laptop now: commits most of your budget
- Fix/upgrade current laptop: might be cheaper option

Step 5: Recommendation
Given that your current laptop works (just slow) and you have a more predictable phone need in 6 months, consider:
- Option A: Wait on the laptop purchase until after you assess the phone situation
- Option B: Look into cheaper laptop upgrades (RAM, SSD) to extend current laptop life
- Option C: If the laptop significantly impacts your work/productivity, buy it now but plan carefully for the phone purchase

The decision depends on how much the slow laptop affects your daily life versus the importance of maintaining budget flexibility.

This kind of structured thinking helps users make better decisions by explicitly considering all factors.

Common Patterns in Chain-of-Thought Reasoning

As you use chain-of-thought prompting, you'll notice certain patterns that work well for different types of problems:

For math problems: Break into calculation steps, show each intermediate result

For logic puzzles: List constraints, test possibilities, eliminate what doesn't work

For planning tasks: Identify requirements, check resources, sequence actions

For analytical questions: State the question, gather relevant facts, weigh evidence, draw conclusions

For decision-making: Define options, list pros and cons for each, compare, recommend

You don't need to specify these patterns in your prompt. Just asking for step-by-step thinking often triggers the appropriate pattern. But if the model isn't structuring its reasoning the way you want, you can provide an example that demonstrates the pattern you prefer.
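One lightweight way to steer the structure is to keep a small mapping from problem type to a reasoning instruction and attach the right one to each prompt. Here's a minimal sketch, with the hypothetical names REASONING_PATTERNS and structured_prompt:

## Hypothetical mapping from problem type to a reasoning-pattern instruction
REASONING_PATTERNS = {
    "math": "Break the problem into calculation steps and show each intermediate result.",
    "logic": "List the constraints, test the possibilities, and eliminate what doesn't work.",
    "planning": "Identify the requirements, check the available resources, then sequence the actions.",
    "analysis": "State the question, gather the relevant facts, weigh the evidence, and draw a conclusion.",
    "decision": "Define the options, list pros and cons for each, compare them, and recommend one.",
}

def structured_prompt(task: str, kind: str) -> str:
    """Attach a pattern-specific instruction, falling back to the generic trigger phrase."""
    instruction = REASONING_PATTERNS.get(kind, "Let's think through this step by step.")
    return f"{task}\n\n{instruction}"

print(structured_prompt("Should we migrate the database this quarter?", "decision"))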

Limitations and When Not to Use Chain-of-Thought

Chain-of-thought reasoning is powerful, but it's not always the right tool:

It's slower: Generating reasoning steps takes more time than jumping to an answer. For simple questions, this overhead isn't worth it.

It uses more tokens: More generated text means higher API costs. Use chain-of-thought when accuracy matters more than speed or cost.

It can be verbose: Sometimes you just want a quick answer, not a detailed explanation. Match the technique to your needs.

It doesn't guarantee correctness: Chain-of-thought improves accuracy, but the model can still make errors in its reasoning. Always verify critical results.

The key is knowing when the benefits (better accuracy, transparency, debuggability) outweigh the costs (time, tokens, verbosity).
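You can see the token cost directly by comparing usage between the two restaurant-chair responses from earlier (this assumes response_simple and response_cot are still in scope; the OpenAI response object reports token counts on its usage field):

## Compare token usage between the plain and chain-of-thought responses
print("Without CoT:", response_simple.usage.total_tokens, "total tokens")
print("With CoT:   ", response_cot.usage.total_tokens, "total tokens")

If the reasoning steps aren't needed downstream, that difference is pure overhead; when accuracy and auditability matter, it's usually worth paying.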

Building Intuition

Start by applying chain-of-thought to problems where the model makes mistakes. If a simple prompt produces wrong answers, try adding "Let's think step by step." You'll quickly notice which types of problems benefit most from explicit reasoning.

Keep track of what works. When you find a prompt pattern that produces good reasoning for a particular type of problem, save it. Over time, you'll build a library of effective chain-of-thought prompts you can reuse and adapt.

Pay attention to how the model structures its reasoning. You'll start recognizing good reasoning patterns versus sloppy ones. This helps you craft better prompts and evaluate the model's outputs more effectively.

Looking Ahead

Chain-of-thought reasoning is your first tool for teaching agents to think, not just respond. By prompting the model to show its work, you get more accurate answers and insight into how it arrived at them.

But we can go further. In the next chapter, you'll learn how to make your agent check its own work and refine its answers. You'll discover techniques for getting the agent to review its reasoning, consider alternatives, and improve its responses through self-reflection. These approaches build on chain-of-thought to create even more reliable agents.

The key takeaway: when you need your agent to handle complex problems, don't just ask for an answer. Ask it to think through the problem step by step. That simple change transforms a pattern-matching system into something that can reason.

Key Takeaways

  • Chain-of-thought reasoning improves accuracy by making the model show its work instead of jumping to conclusions
  • The phrase "Let's think step by step" is often all you need to trigger explicit reasoning
  • Use chain-of-thought for complex problems where accuracy matters more than speed
  • Combine with few-shot prompting to teach specific reasoning patterns
  • Verify the reasoning, not just the answer, to catch errors and build trust
  • Save effective patterns to build a library of proven chain-of-thought prompts

With chain-of-thought reasoning in your toolkit, your AI agent can handle problems that require real thinking. The next chapter builds on this foundation by teaching your agent to check and refine its own reasoning.

