
How to Build an Agentic Reasoning System Like Claude Extended Thinking or o3 Deep Research

  • May 13
  • 12 min read

Introduction: The Gap Between "Smart" and "Useful"

You've seen the demos. A model reasons through a hard problem, cites live sources, runs code to verify its math, and lands on a correct, grounded answer, all in one fluid session. That's Claude Extended Thinking. That's o3 Deep Research. That's Perplexity's reasoning mode.


Now you try to build something similar. You chain a few LLM calls, bolt on a Wikipedia lookup, and get a system that occasionally hallucinates its tool calls, loops forever on ambiguous queries, and confidently gives you a wrong answer backed by an outdated source.


The difference between those polished demos and your prototype is protocol design: specifically, how the model's chain-of-thought interleaves with tool calls, how the model reflects on failures, and how the loop terminates cleanly.


This blog post covers that architecture end-to-end. Here's what you can build with it:


  • Perplexity-style answer engines with live citations from the web

  • Deep-research assistants for analysts who need multi-step reasoning over dozens of sources

  • Code-fixing bots that read documentation, write patches, and run the test suite to verify

  • Customer-support agents with database lookup, calculator, and escalation logic

  • Compliance and audit bots that verify claims against original source documents

  • Personal research agents for academics, journalists, and legal professionals


This post covers the core architecture, a recommended tech stack, the five implementation phases, and the hardest engineering challenges. It does not include full source code; that's in the complete course on labs.codersarts.com.

📄 Before you dive in — grab the free PRD template that maps out this entire system: architecture, API spec, sprint plan, and system prompt. [Download the free PRD]


How It Works: The Think-Act-Observe Protocol

Why Pure Chain-of-Thought Fails

Plain chain-of-thought (CoT) reasoning is powerful for closed-world problems: math proofs, logical puzzles, structured writing. But as soon as your question requires current information, code execution, or multi-hop factual lookups, CoT hits a hard wall: the model can only reason over what's already in its context window. It can't look anything up. It can't run anything. When it doesn't know, it guesses, and it guesses confidently.


The naive fix is tool use without reasoning: a function-calling agent that picks a tool, gets a result, and returns an answer. But without a dedicated thinking phase, the agent makes shallow decisions. It picks the first plausible tool, misses edge cases, and can't recover gracefully when a tool fails or returns conflicting information.


The Think-Act-Observe Loop

The architecture that solves this is deceptively simple. Imagine a detective who, before making any move, writes private notes: hypotheses, plans, uncertainty flags. Those notes never go in the official report; they're just thinking scaffolding. Then the detective acts on those notes (calls a witness, checks a database), gets a result, folds it back into the private notes, and thinks again.


That's the think-act-observe loop:


User Question

      │
      ▼
┌──────────────────────────────────────────┐
│  THINK  (reasoning model generates       │
│  structured <think> block with plan,     │
│  hypotheses, and pending tool calls)     │
└──────────────────────────────────────────┘
      │
      ▼  (parser extracts <tool> calls)
┌──────────────────────────────────────────┐
│  ACT    (tool dispatcher runs calls in   │
│  parallel: web search, code exec, calc)  │
└──────────────────────────────────────────┘
      │
      ▼  (results injected as <observation> blocks)
┌──────────────────────────────────────────┐
│  OBSERVE  (model reads tool results,     │
│  reflects on failures, updates plan)     │
└──────────────────────────────────────────┘
      │
      ├── more tools needed? ──► THINK (loop)
      │
      └── answer ready? ──────► <answer> block → client

Every step is streamed to the client over Server-Sent Events (SSE), so the user sees live thoughts and tool results as they happen, not a blank spinner followed by a wall of text.


The key insight: the reasoning model controls its own loop. The runtime just parses tags, dispatches tools, and injects results. This separation keeps the system debuggable and replaceable at every layer.
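To make the division of labour concrete, here is a minimal sketch of the loop with a scripted stand-in for the model. Everything in it is an assumption for illustration: the `run_loop`, `scripted_model`, and `calc` names are hypothetical, and the tool arguments are placed in the tag body rather than in an attribute.

```python
import re
from typing import Callable

TOOL_RE = re.compile(r'<tool name="(\w+)">(.*?)</tool>', re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_loop(model: Callable[[str], str], tools: dict[str, Callable[[str], str]],
             question: str, max_iterations: int = 5) -> str:
    """Drive think -> act -> observe until the model emits an <answer> block."""
    context = f"<question>{question}</question>\n"
    for _ in range(max_iterations):
        output = model(context)              # THINK: the model sees everything so far
        context += output + "\n"
        answer = ANSWER_RE.search(output)
        if answer:                           # the model decided it is done
            return answer.group(1).strip()
        for name, arg in TOOL_RE.findall(output):   # ACT: run each requested tool
            try:
                result, status = tools[name](arg), "ok"
            except Exception as exc:         # unknown tool or tool failure
                result, status = str(exc), "error"
            # OBSERVE: feed the result back for the next thinking step
            context += (f'<observation tool="{name}" status="{status}">'
                        f"{result}</observation>\n")
    return "No answer within the iteration budget."


def scripted_model(context: str) -> str:
    """Stand-in for a real LLM: asks for one calculation, then answers."""
    if "<observation" not in context:
        return '<think>I need the product.</think><tool name="calc">6*7</tool>'
    value = re.search(r"<observation[^>]*>(.*?)</observation>", context).group(1)
    return f"<think>The tool returned {value}.</think><answer>{value}</answer>"


def calc(expr: str) -> str:
    """Toy calculator that only understands 'a*b'."""
    a, b = expr.split("*")
    return str(int(a) * int(b))
```

Calling `run_loop(scripted_model, {"calc": calc}, "What is 6*7?")` takes two iterations: one that requests the tool and one that answers. Note that the runtime never decides what to do next; it only parses, dispatches, and injects.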




System Architecture Deep Dive

Layer Overview

The system has five distinct layers:


Presentation Layer: A web client or API consumer that renders the SSE stream. Receives live <think>, <tool_call>, <observation>, and <answer> events.

API Layer: A FastAPI server that accepts user queries, manages session state, and streams events back. The single POST /v1/reason endpoint is the main entry point.

Reasoning Layer: The HuggingFace Transformers model (or any compatible LLM endpoint) wrapped in a loop controller that manages context assembly, loop iteration counting, and termination detection.

Tool Dispatch Layer: A tool registry with Pydantic-derived JSON schemas. At runtime, the parallel dispatcher fans out tool calls within a single thinking block, manages timeout/retry logic, and assembles <observation> blocks from results.

Execution Layer: The actual tool implementations: a Playwright-based web search + page reader, a Docker-sandboxed code executor, and lightweight utilities (calculator, date resolver, URL fetcher).

Component Table

| Component | Role | Options |
|---|---|---|
| Reasoning model | Generates <think> blocks with tool calls | Qwen2.5, DeepSeek-R1, Llama-3.1 via HuggingFace; Claude API |
| Context manager | Assembles prompt from conversation + observations | Custom Python class; LangChain Memory (heavier) |
| Tag parser | Extracts <tool> blocks from model output | Regex + fallback XML parser; custom state machine |
| Tool registry | Maps tool names → schemas + callables | Pydantic BaseModel + JSON Schema; OpenAI function spec |
| Parallel dispatcher | Fans out tool calls concurrently | asyncio.gather; ThreadPoolExecutor for sync tools |
| Web tool | Live search + full-page read | Playwright + DuckDuckGo API; SerpAPI; Tavily |
| Code executor | Sandboxed Python/shell execution | Docker container with resource limits; E2B sandbox |
| Streaming server | SSE event source for live thought streaming | FastAPI + SSE-Starlette; Flask-SSE |
| Evaluation harness | Scores final answers + tool traces | Custom GAIA-style scorer; LLM-as-judge |
| Rate limiter | Throttles external API calls | Token bucket in Redis; in-memory for prototypes |

Data Flow: Step by Step

  1. Client sends POST /v1/reason with { "query": "...", "session_id": "..." }.

  2. The API server creates or retrieves a session context (conversation history + previous observations).

  3. The loop controller calls the reasoning model with the current context. The model generates a <think> block.

  4. The tag parser scans the output for <tool name="..."> blocks. If there are none, it checks for an <answer> block; if one is found, it emits the answer and exits.

  5. All identified tool calls are dispatched in parallel via asyncio.gather. Each call runs with a timeout.

  6. Results (or errors) are wrapped in <observation tool="..." status="ok|error"> blocks.

  7. The observation blocks are appended to the context. The loop counter increments.

  8. If the loop counter exceeds max_iterations or a loop-detection hash match fires, the controller forces a summarize_and_answer call.

  9. Every token emitted by the model and every observation injected is streamed to the client as an SSE event.

  10. The final <answer> block is emitted, and citation grounding maps claims to specific observation indices.


Two Non-Obvious Design Decisions

Why parallel tool dispatch matters more than it looks. A multi-step research query typically needs 5–12 tool calls. Sequential dispatch at 2 seconds per call means 10–24 seconds of dead time per reasoning step. Parallel dispatch inside a single thinking block cuts this to the latency of the slowest call in that batch, usually 2–3 seconds total. This is the difference between a system that feels alive and one that feels broken.
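The latency argument is easy to check with a toy measurement. The sketch below uses `asyncio.sleep` as a stand-in for real tool latency; `fake_tool`, `run_sequential`, `run_parallel`, and the latencies in `CALLS` are illustrative assumptions, not figures from a real system.

```python
import asyncio
import time


async def fake_tool(name: str, latency: float) -> str:
    """Pretend tool call: sleeps for `latency` seconds, then returns."""
    await asyncio.sleep(latency)
    return f"{name} done"


async def run_sequential(calls: list[tuple[str, float]]) -> tuple[list[str], float]:
    """Await each call in turn; total time is roughly the sum of latencies."""
    start = time.perf_counter()
    results = [await fake_tool(name, latency) for name, latency in calls]
    return results, time.perf_counter() - start


async def run_parallel(calls: list[tuple[str, float]]) -> tuple[list[str], float]:
    """Run all calls concurrently; total time is roughly the slowest latency."""
    start = time.perf_counter()
    # gather returns results in the same order as its arguments
    results = await asyncio.gather(*(fake_tool(n, lat) for n, lat in calls))
    return list(results), time.perf_counter() - start


# Scaled-down latencies so the demo finishes quickly
CALLS = [("search", 0.05), ("read_page", 0.10), ("calc", 0.02)]
```

Running both with `asyncio.run(...)` on `CALLS` should show the sequential version taking about the sum of the three latencies and the parallel version about as long as the slowest one.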


Why you need a separate reflection-and-replan module. When a tool call fails (rate limit, timeout, empty result), the naive approach is to inject <observation status="error"> and hope the model recovers. In practice, models often re-call the same failing tool with identical arguments. A dedicated reflection module detects this pattern, summarises what's known so far, and prompts the model to either try an alternative tool or acknowledge the gap, which prevents infinite retry loops.




Tech Stack Recommendation

Stack A: Prototype (Weekend Build)

| Layer | Technology | Why |
|---|---|---|
| Language | Python 3.10+ | Async support, rich ecosystem |
| Reasoning model | DeepSeek-R1-Distill-Llama-8B (via HuggingFace) | Free, strong reasoning, runs on a single A100 |
| Tag parser | Regex + xml.etree fallback | No dependencies, fast enough for prototypes |
| Web tool | requests + BeautifulSoup | No browser needed; good enough for static pages |
| Code exec | subprocess with timeout | Simple; NOT safe for production |
| API server | FastAPI + SSE-Starlette | Minimal boilerplate, instant SSE support |
| Tool schemas | Pydantic BaseModel | Auto-generates JSON Schema with one method call |
| Frontend | Basic HTML + EventSource API | Zero framework; works immediately |


Estimated monthly cost: ~$20–50 (single GPU instance on RunPod or Lambda Labs for inference; free for hosted model APIs like Together AI's free tier).

Stack B: Production-Ready

| Layer | Technology | Why |
|---|---|---|
| Language | Python 3.11+ | Faster async, better type hints |
| Reasoning model | Qwen2.5-72B (vLLM) or Claude API | Higher quality reasoning, reliable tool-call format |
| Tag parser | State machine + Pydantic validation | Handles streaming tokens, rejects malformed calls |
| Web tool | Playwright + SerpAPI | JavaScript rendering, reliable structured results |
| Code exec | Docker with seccomp + cgroups | Safe multi-language execution, resource limits |
| API server | FastAPI + Redis (session store) | Stateless workers; sessions survive restarts |
| Tool schemas | JSON Schema Registry | Versioned schemas; validation before dispatch |
| Eval harness | GAIA-style custom scorer | Measures answer accuracy + tool trace quality |
| Observability | OpenTelemetry + Grafana | Trace every reasoning step for debugging |
| Auth | API key + JWT | Multi-tenant SaaS deployment |


Estimated monthly cost: $200–600 depending on inference provider, traffic volume, and Docker host. Self-hosted vLLM on a 4×A10G cluster runs ~$400/month; Claude API at moderate usage is roughly $150–300/month.


Implementation Phases

Phase 1: Tag Protocol and Parser

Before writing a single line of model inference code, design the tag format. This is the contract between the model and the runtime. Get it wrong here and every downstream component breaks.


You need to define the schema for <think>, <tool name="..." args="...">, <observation>, and <answer> tags. The format must be unambiguous to parse from a streaming token buffer, handle nested JSON argument strings safely, and be injected cleanly into few-shot prompts so the model learns to emit it reliably.


Key decisions: Will you use XML-style tags or JSON-wrapped blocks? How do you handle tool calls with multi-line argument payloads? Do you validate args at parse time or at dispatch time? How do you render the protocol in the system prompt to minimise hallucinated tag formats?
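One way to explore these decisions is with a small regex parser. The sketch below assumes one particular answer to them, which is not necessarily the one the course uses: XML-style tags, with arguments carried as a JSON object in the tag body (so multi-line payloads and quotes need no attribute escaping), validated at parse time. `ToolCall` and `parse_tool_calls` are hypothetical names.

```python
import json
import re
from dataclasses import dataclass

TOOL_RE = re.compile(r'<tool\s+name="([A-Za-z_]\w*)"\s*>(.*?)</tool>', re.DOTALL)


@dataclass
class ToolCall:
    name: str
    args: dict


def parse_tool_calls(text: str) -> tuple[list[ToolCall], list[str]]:
    """Return (valid calls, error messages) found in one model output."""
    calls, errors = [], []
    for name, body in TOOL_RE.findall(text):
        try:
            args = json.loads(body)
        except json.JSONDecodeError as exc:
            errors.append(f"{name}: invalid JSON arguments ({exc.msg})")
            continue
        if not isinstance(args, dict):
            errors.append(f"{name}: arguments must be a JSON object")
            continue
        calls.append(ToolCall(name, args))
    # More opening tags than closing tags usually means truncated output.
    if text.count("<tool") > text.count("</tool>"):
        errors.append("unterminated <tool> block")
    return calls, errors
```

Returning errors alongside valid calls, instead of raising, lets the runtime inject a correction prompt for the broken call while still dispatching the good ones. A regex like this works on a complete output; parsing from a live token stream needs a state machine instead.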


Getting the parser to handle every edge case the model throws at it (malformed tags, truncated JSON, mixed-format output) is covered in detail in the full course with working, tested code.


Phase 2: Tool Registry and Dispatch Engine

The tool registry is where Pydantic earns its keep. Define each tool as a Pydantic BaseModel subclass; the JSON Schema is auto-generated and injected into the system prompt so the model knows exactly what arguments each tool expects.
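The shape of such a registry can be sketched without any dependencies. The example below is a stand-in, not Pydantic: it uses stdlib dataclasses and a hand-rolled type map that only covers flat, primitive fields. With Pydantic v2, calling `model_json_schema()` on a `BaseModel` subclass produces the schema for you and also validates types. `WebSearchArgs`, `web_search`, `REGISTRY`, and `call_tool` are hypothetical names.

```python
from dataclasses import MISSING, dataclass, fields

JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def json_schema(args_cls: type) -> dict:
    """Build a flat JSON Schema object from a dataclass of primitive fields."""
    props, required = {}, []
    for f in fields(args_cls):
        props[f.name] = {"type": JSON_TYPES[f.type]}
        if f.default is MISSING and f.default_factory is MISSING:
            required.append(f.name)      # no default: the model must supply it
    return {"type": "object", "properties": props, "required": required}


@dataclass
class WebSearchArgs:                     # hypothetical tool arguments
    query: str
    top_k: int = 5


def web_search(args: WebSearchArgs) -> str:
    """Placeholder tool body; a real tool would call a search backend."""
    return f"searched {args.query!r}, top {args.top_k}"


# tool name -> (argument class, callable)
REGISTRY = {"web_search": (WebSearchArgs, web_search)}


def call_tool(name: str, raw_args: dict) -> str:
    """Construct the argument object, then dispatch to the tool."""
    args_cls, func = REGISTRY[name]
    # Raises TypeError on missing or unknown keys; does NOT check value types.
    return func(args_cls(**raw_args))
```

Here `json_schema(WebSearchArgs)` yields an object schema with `query` required and `top_k` optional, which is the piece that would be rendered into the system prompt.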


The dispatch engine wraps the registry with asyncio.gather for parallel execution. Each tool call gets a configurable timeout. Failures return structured error observations. The engine tracks which tools have been called with which arguments in the current session to power loop detection.
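A minimal version of that engine might look like the following sketch. It assumes every tool is an async callable taking keyword arguments and returning a string; `run_one`, `dispatch`, and the demo tools `add` and `slow` are hypothetical, and call tracking for loop detection is left out.

```python
import asyncio
from typing import Awaitable, Callable

Tool = Callable[..., Awaitable[str]]


async def run_one(tools: dict[str, Tool], name: str, args: dict, timeout: float) -> str:
    """Run a single tool call and always return an <observation> block."""
    try:
        result = await asyncio.wait_for(tools[name](**args), timeout)
        status = "ok"
    except asyncio.TimeoutError:
        result, status = f"timed out after {timeout}s", "error"
    except Exception as exc:             # unknown tool, bad args, tool crash
        result, status = f"{type(exc).__name__}: {exc}", "error"
    return f'<observation tool="{name}" status="{status}">{result}</observation>'


async def dispatch(tools: dict[str, Tool], calls: list[tuple[str, dict]],
                   timeout: float = 2.0) -> list[str]:
    """Fan out all calls from one thinking block; results keep call order."""
    return await asyncio.gather(*(run_one(tools, n, a, timeout) for n, a in calls))


async def add(a: int, b: int) -> str:    # hypothetical demo tool
    return str(a + b)


async def slow() -> str:                 # hypothetical demo tool that times out
    await asyncio.sleep(1.0)
    return "too late"
```

Because `run_one` converts every failure into an error observation, `asyncio.gather` never sees an exception, so one failing tool cannot cancel its siblings.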


Key decisions: How do you handle tool dependencies (Tool B needs Tool A's output)? Do you implement a dependency graph or rely on the model to sequence calls correctly? What's your retry policy for rate-limit errors? How do you truncate oversized tool outputs before injecting them as observations?


The parallel dispatch engine with dependency analysis and loop detection is covered in detail in the full course with working, tested code.


Phase 3: Web Search and Code Execution Tools

These are the two highest-value tools and also the most complex to implement safely.


The web tool needs to fetch search results, pick the most relevant pages, and read their full content, not just titles and snippets. Playwright handles JavaScript-rendered pages that BeautifulSoup can't reach. You need to handle paywalls, redirects, and pages that return 200 but render no useful content. A content-extraction heuristic (Readability-style scoring) filters out boilerplate.


The code execution tool wraps a Docker container with strict resource limits: CPU cap, memory limit, no network egress, no privileged syscalls via seccomp profiles. The model submits Python or shell code; the sandbox executes it and returns stdout/stderr within a timeout. Jailbreak attempts via os.system, subprocess, or socket calls are blocked at the kernel level.
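As a rough illustration of those limits, the sketch below builds a `docker run` command line without executing it. The flags are standard Docker CLI options; the specific limit values, the image name, and the seccomp profile path passed in are placeholder assumptions, and `sandbox_command` is a hypothetical helper.

```python
def sandbox_command(image: str, code: str, seccomp_profile: str,
                    memory: str = "256m", cpus: str = "0.5") -> list[str]:
    """Build (but do not run) a `docker run` argv for one untrusted snippet."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                    # no network egress
        "--memory", memory,                     # hard memory limit
        "--cpus", cpus,                         # CPU cap
        "--pids-limit", "64",                   # bounds fork bombs
        "--read-only",                          # immutable root filesystem
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        "--security-opt", f"seccomp={seccomp_profile}",
        image, "python", "-c", code,
    ]
```

The resulting list can be handed to `subprocess.run(cmd, capture_output=True, text=True, timeout=...)` to collect stdout and stderr. Passing an argv list rather than a shell string keeps the model-generated code from being interpreted by a host shell. Note that a client-side timeout stops the `docker` CLI process; making sure the container itself is stopped needs separate handling.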


Key decisions: Which Playwright browser profile minimises bot detection? How do you handle search rate limits across multiple concurrent sessions? What Docker base image gives you a useful Python environment without exposing attack surface?


The full Playwright web tool with content scoring and the Docker sandbox with seccomp profiles are provided as ready-to-run components in the full course with working, tested code.


Phase 4: Reasoning Loop Controller and Reflection Module

The loop controller is the orchestration heart. It assembles the context for each model call, tracks the iteration count, enforces the maximum-steps policy, and detects when the model is spinning.


The reflection module fires when a tool call has failed twice with the same arguments, or when the loop has run more than half its allowed iterations without making progress. It injects a structured reflection prompt asking the model to (a) summarise what it knows, (b) identify what's missing, and (c) decide whether to try an alternative approach or answer with acknowledged uncertainty.
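The trigger condition and the three-part prompt can be sketched as follows. This is an illustration under assumptions: `should_reflect` and `REFLECTION_PROMPT` are hypothetical names, and "progress" is reduced to a boolean the caller supplies, which sidesteps the hard question of how to measure it.

```python
from collections import Counter

REFLECTION_PROMPT = (
    "<reflection>\n"
    "(a) Summarise what you know so far.\n"
    "(b) Identify what is still missing.\n"
    "(c) Either try a different tool or approach, or answer and state the uncertainty.\n"
    "</reflection>"
)


def should_reflect(failed_calls: list[tuple[str, str]], iteration: int,
                   max_iterations: int, made_progress: bool) -> bool:
    """Decide whether to inject REFLECTION_PROMPT before the next model call.

    failed_calls holds one (tool_name, args_json) pair per failed call so far.
    """
    # Trigger 1: the same call has failed at least twice with identical arguments
    repeated_failure = any(count >= 2 for count in Counter(failed_calls).values())
    # Trigger 2: more than half the iteration budget is spent without progress
    stalled = iteration > max_iterations / 2 and not made_progress
    return repeated_failure or stalled
```

When `should_reflect(...)` returns true, the controller would append `REFLECTION_PROMPT` to the context instead of simply looping again.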


Key decisions: How do you define "progress" for loop detection? How do you compress long observation histories to stay within the context window? When should the controller force an answer versus allow more iterations?


The full loop controller with configurable termination policies and the reflection-and-replan module are covered in detail in the full course with working, tested code.


Phase 5: SSE Streaming API and GAIA-Style Evaluation

The FastAPI server wraps the loop controller and streams every event (model tokens, tool calls, observation injections, and the final answer) over Server-Sent Events. The client receives a live, structured event stream it can render progressively.
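On the wire, each SSE frame is an `event:` line and a `data:` line followed by a blank line. The sketch below shows only that serialisation step, without a web framework; the JSON payload shape and the closing `done` event are assumptions for illustration, and `sse_event` and `stream` are hypothetical names.

```python
import json
from typing import Iterable, Iterator


def sse_event(event: str, payload: dict) -> str:
    """Serialise one event in the text/event-stream wire format."""
    # json.dumps escapes newlines inside strings, so the data stays on one line
    data = json.dumps(payload, ensure_ascii=False)
    return f"event: {event}\ndata: {data}\n\n"


def stream(steps: Iterable[tuple[str, dict]]) -> Iterator[str]:
    """Turn (event_type, payload) pairs into SSE frames, then signal completion."""
    for event, payload in steps:
        yield sse_event(event, payload)
    yield sse_event("done", {})
```

In a FastAPI app, a generator like `stream(...)` can be returned through `StreamingResponse` with `media_type="text/event-stream"`. On the browser side, a listener registered per event type (for example for `think` or `answer`) is what lets the client route thoughts, tool calls, and answers to separate UI sections.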


The evaluation harness scores the system on GAIA-style tasks: multi-step questions with verifiable answers. The harness measures answer correctness, tool call efficiency (did the agent use the minimum necessary calls?), citation grounding (are answer claims traceable to specific observations?), and loop efficiency (did it terminate cleanly?).


Key decisions: What SSE event schema enables the client to render thoughts, tools, and answers in separate UI sections? How do you handle client disconnects mid-stream? How do you implement LLM-as-judge scoring for subjective answer quality?


The SSE streaming server with full event typing and the GAIA evaluation harness with worked examples are covered in detail in the full course with working, tested code.


Common Challenges

Building this system is an exercise in handling failure modes that don't show up in small demos but become blocking problems the moment you test with real, diverse queries.


1. Malformed tag hallucination under load. The root cause: the model's tag format degrades when the context window fills up with long observation blocks. The fix: inject the tag format reminder into every context assembly, implement a streaming token validator that detects malformed openings before the full tag is emitted, and add a correction prompt that rescues partially valid calls.


2. Infinite retry loops. Root cause: after a tool failure, the model re-calls the same tool with the same arguments because the error observation doesn't give it enough signal to change strategy. Fix: the loop detector hashes (tool_name, args_json) and fires the reflection module after two identical calls in the same session.
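A sketch of such a detector, assuming arguments arrive as a JSON-serialisable dict; the `LoopDetector` class and its method names are hypothetical.

```python
import hashlib
import json
from collections import Counter


class LoopDetector:
    """Counts identical (tool_name, args) calls within one session."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.seen = Counter()

    @staticmethod
    def fingerprint(tool_name: str, args: dict) -> str:
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} hash identically
        canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(f"{tool_name}:{canonical}".encode()).hexdigest()

    def record(self, tool_name: str, args: dict) -> bool:
        """Record a call; return True when the reflection module should fire."""
        key = self.fingerprint(tool_name, args)
        self.seen[key] += 1
        return self.seen[key] >= self.threshold
```

Canonicalising the JSON before hashing matters because models often reorder keys between retries; without it, a retry with the same arguments in a different order would slip past the detector.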


3. Parallel tool call race conditions. Root cause: Tool B reads a resource that Tool A writes, but they're dispatched simultaneously. Fix: a simple dependency declaration in the tool schema ("depends_on": ["tool_a"]) lets the dispatcher build a minimal execution graph and sequence dependent calls correctly.
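With the standard library's `graphlib`, turning `depends_on` declarations into parallel batches takes only a few lines. This is a sketch under the assumption that dependencies are given as a mapping from each call to the calls it depends on; `execution_batches` and the tool names in the usage note are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_batches(depends_on: dict[str, list[str]]) -> list[list[str]]:
    """Group tool calls into batches; calls within a batch can run in parallel."""
    sorter = TopologicalSorter(depends_on)   # maps node -> its predecessors
    sorter.prepare()                         # raises CycleError on circular deps
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)                  # unlocks calls that waited on these
    return batches
```

For a plan where `run_tests` depends on `write_file` and `summarise` depends on both `run_tests` and `web_search`, this yields three batches: the two independent calls first, then `run_tests`, then `summarise`. Each batch can be handed to the parallel dispatcher as-is.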


4. Context-window blowup. Root cause: long web pages injected as raw observations can consume 8,000–20,000 tokens in a single observation block, leaving no room for further reasoning. Fix: observation summarisation. The dispatcher runs a fast summarisation pass over any observation exceeding a configurable token threshold before injection.
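The threshold check itself is simple to sketch. The example below substitutes plain head-and-tail truncation for a real summarisation pass and estimates tokens by counting whitespace-separated words, so treat both as placeholders; `clip_observation` is a hypothetical name, and `keep_tail` is assumed to be smaller than `max_tokens`.

```python
def clip_observation(text: str, max_tokens: int = 1000, keep_tail: int = 100) -> str:
    """Keep the head and tail of an oversized observation, dropping the middle."""
    words = text.split()                     # crude token estimate: whitespace words
    if len(words) <= max_tokens:
        return text                          # under the threshold: inject unchanged
    head = words[: max_tokens - keep_tail]
    tail = words[len(words) - keep_tail:]
    dropped = len(words) - len(head) - len(tail)
    return " ".join(head) + f" [... {dropped} words omitted ...] " + " ".join(tail)
```

Leaving an explicit omission marker in the observation tells the model that content is missing, so it can decide to re-read the page with a narrower query instead of assuming it saw everything.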


5. Code sandbox jailbreaks. Root cause: even with subprocess blocked at the Python level, models sometimes generate code that uses ctypes, socket, or os.fork to escape. Fix: seccomp profiles at the Docker level block the relevant syscalls regardless of the code path.


6. Citation grounding failures. Root cause: the model's final answer references facts that don't appear in any observation; it's filling gaps with parametric memory. Fix: a post-answer citation checker maps each claim in the answer against the observation log and flags ungrounded statements for the client to highlight.
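A very reduced version of such a checker is sketched below. It assumes the system prompt asks the model to end each claim with a bracketed, zero-based observation index such as `[0]`; that marker format and the `check_citations` name are assumptions for illustration. It only verifies that a valid marker exists, not that the cited observation actually supports the claim, which would need a text-overlap or LLM-as-judge step.

```python
import re

CITE_RE = re.compile(r"\[(\d+)\]")


def check_citations(answer: str, observations: list[str]) -> list[dict]:
    """Label each sentence as grounded (all [n] markers valid) or ungrounded."""
    report = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        cited = [int(n) for n in CITE_RE.findall(sentence)]
        valid = [n for n in cited if n < len(observations)]
        report.append({
            "sentence": sentence,
            "observations": valid,
            # grounded = at least one marker, and none point outside the log
            "grounded": bool(valid) and len(valid) == len(cited),
        })
    return report
```

Sentences reported with `"grounded": False` are the ones the client would highlight.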


7. Knowing when to stop. Root cause: some models are overly cautious and keep searching long after they have enough information to answer. Fix: a confidence signal computed from the model's own reasoning text (look for phrases like "I now have enough information") combined with a minimum-confidence threshold that allows early termination.
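A phrase-based signal like this can be prototyped with a regular expression. In the sketch below the phrase list is a hypothetical starting point that would need tuning per model, and the "threshold" is simplified to a minimum number of successful observations, which is an assumption rather than the scoring the course uses.

```python
import re

READY_PATTERNS = [                 # hypothetical phrase list; tune per model
    r"\bI now have enough information\b",
    r"\bI have enough information\b",
    r"\bI can now answer\b",
    r"\bno further (?:search|lookup)(?:es|s)? (?:is|are) needed\b",
]
READY_RE = re.compile("|".join(READY_PATTERNS), re.IGNORECASE)


def ready_to_answer(think_text: str, successful_observations: int,
                    min_observations: int = 1) -> bool:
    """True when the model signals readiness and has evidence to back it."""
    signalled = bool(READY_RE.search(think_text))
    return signalled and successful_observations >= min_observations
```

Requiring at least one successful observation guards against the opposite failure: a model that declares itself ready before any tool has returned anything useful.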


Solving these issues took us over 120 hours of testing across dozens of query types. The course walks you through each fix with working, production-tested code so you don't have to discover them the hard way.


Ready to Build This Yourself?

Understanding the architecture is one thing. Shipping a working, tested, deployed system is another. There's a long way between "I understand how this works" and "my reasoning agent is live and serving real users."


The Agentic Tool-Using Reasoning Models from Scratch course on Codersarts Labs bridges that gap. Here's exactly what you get:


  • Full annotated source code: every component described in this post, production-quality

  • Video walkthroughs: step-by-step explanations of every architectural decision

  • Docker sandbox images: pre-configured code execution environment, ready to run

  • Playwright web tool: full implementation with content scoring and anti-detection configuration

  • GAIA evaluation harness: test your agent against graded multi-step benchmarks

  • SSE streaming API: FastAPI server with full event typing and client reference implementation

  • Lifetime access: all future updates included at no extra cost

  • Tested configurations: every stack configuration has been validated end-to-end


$49.99. Everything above.



Building something more complex? Integrating a reasoning agent into an enterprise product or deploying for a specific vertical? The 1:1 Guided Session ($199.99 for three 1-hour live sessions) includes a tool-architecture review tailored to your use case and hands-on deployment help. Book a session on labs.codersarts.com.




Conclusion

The architecture described here (a structured think-act-observe tag protocol, a Pydantic-powered tool registry, a parallel dispatch engine, a reflection-and-replan module, and an SSE-streaming FastAPI server) is what separates toy chatbots from production reasoning agents. It's the same pattern behind Claude Extended Thinking, o3 Deep Research, and Perplexity's reasoning mode, implemented from scratch with full control over every layer.


If you're starting fresh, begin with Stack A: a distilled 8B DeepSeek-R1 model on a single GPU, a simple regex parser, a requests-based web tool, a subprocess code executor, and basic FastAPI SSE. Get the loop running and the tools responding. You can swap in production-grade components one layer at a time once the protocol is stable.


When you're ready to move beyond architecture diagrams and into working, deployable code, the full course on labs.codersarts.com has everything you need.


 
 
 
