How to Build an Agentic Reasoning System Like Claude Extended Thinking or o3 Deep Research
- May 13

Introduction: The Gap Between "Smart" and "Useful"
You've seen the demos. A model reasons through a hard problem, cites live sources, runs code to verify its math, and lands on a correct, grounded answer, all in one fluid session. That's Claude Extended Thinking. That's o3 Deep Research. That's Perplexity's reasoning mode.
Now you try to build something similar. You chain a few LLM calls, bolt on a Wikipedia lookup, and get a system that occasionally hallucinates its tool calls, loops forever on ambiguous queries, and confidently gives you a wrong answer backed by an outdated source.
The difference between those polished demos and your prototype is protocol design: specifically, how the model's chain-of-thought interleaves with tool calls, how the model reflects on failures, and how the loop terminates cleanly.
This blog post covers that architecture end-to-end. Here's what you can build with it:
Perplexity-style answer engines with live citations from the web
Deep-research assistants for analysts who need multi-step reasoning over dozens of sources
Code-fixing bots that read documentation, write patches, and run the test suite to verify
Customer-support agents with database lookup, calculator, and escalation logic
Compliance and audit bots that verify claims against original source documents
Personal research agents for academics, journalists, and legal professionals
This post covers the core architecture, a recommended tech stack, the five implementation phases, and the hardest engineering challenges. It does not include full source code; that's in the complete course on labs.codersarts.com.
📄 Before you dive in — grab the free PRD template that maps out this entire system: architecture, API spec, sprint plan, and system prompt. [Download the free PRD]
How It Works: The Think-Act-Observe Protocol
Why Pure Chain-of-Thought Fails
Plain chain-of-thought (CoT) reasoning is powerful for closed-world problems: math proofs, logical puzzles, structured writing. But as soon as your question requires current information, code execution, or multi-hop factual lookups, CoT hits a hard wall: the model can only reason over what's already in its context window. It can't look anything up. It can't run anything. When it doesn't know, it guesses, and it guesses confidently.
The naive fix is tool use without reasoning: a function-calling agent that picks a tool, gets a result, and returns an answer. But without a dedicated thinking phase, the agent makes shallow decisions. It picks the first plausible tool, misses edge cases, and can't recover gracefully when a tool fails or returns conflicting information.
The Think-Act-Observe Loop
The architecture that solves this is deceptively simple. Imagine a detective who, before making any move, writes private notes: hypotheses, plans, uncertainty flags. Those notes never go in the official report; they're just thinking scaffolding. Then the detective acts on those notes (calls a witness, checks a database), gets a result, folds it back into the private notes, and thinks again.
That's the think-act-observe loop:
User Question
│
▼
┌──────────────────────────────────────────┐
│ THINK (reasoning model generates │
│ structured <think> block with plan, │
│ hypotheses, and pending tool calls) │
└──────────────────────────────────────────┘
│
▼ (parser extracts <tool> calls)
┌──────────────────────────────────────────┐
│ ACT (tool dispatcher runs calls in │
│ parallel: web search, code exec, calc) │
└──────────────────────────────────────────┘
│
▼ (results injected as <observation> blocks)
┌──────────────────────────────────────────┐
│ OBSERVE (model reads tool results, │
│ reflects on failures, updates plan) │
└──────────────────────────────────────────┘
│
├── more tools needed? ──► THINK (loop)
│
└── answer ready? ──────► <answer> block → client
Every step is streamed to the client over Server-Sent Events (SSE), so the user sees live thoughts and tool results as they happen, not a blank spinner followed by a wall of text.
The key insight: the reasoning model controls its own loop. The runtime just parses tags, dispatches tools, and injects results. This separation keeps the system debuggable and replaceable at every layer.
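To make the loop concrete, here is a minimal sketch of a runtime that only parses tags, dispatches tools, and injects results. It rests on several assumptions that are not from the original design: tool arguments travel as a JSON object in the tag body, `scripted_model` is a hypothetical stand-in for a real LLM call, and `multiply` is an illustrative tool.

```python
import json
import re

TOOL_RE = re.compile(r'<tool name="(\w+)">(.*?)</tool>', re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def run_loop(model, tools, question, max_iterations=5):
    """Drive think -> act -> observe until the model emits <answer>."""
    context = [f"<question>{question}</question>"]
    for _ in range(max_iterations):
        output = model("\n".join(context))               # THINK
        context.append(output)
        calls = TOOL_RE.findall(output)
        answer = ANSWER_RE.search(output)
        if not calls and answer:                          # no tools pending: done
            return answer.group(1).strip()
        for name, raw_args in calls:                      # ACT
            try:
                result, status = tools[name](**json.loads(raw_args)), "ok"
            except Exception as exc:                      # failures become observations
                result, status = f"{type(exc).__name__}: {exc}", "error"
            context.append(                               # OBSERVE
                f'<observation tool="{name}" status="{status}">{result}</observation>'
            )
    return "No answer within the iteration budget."

def scripted_model(prompt):
    """Hypothetical stand-in for an LLM: requests one tool, then answers."""
    if "<observation" not in prompt:
        return '<think>I need the product.</think><tool name="multiply">{"a": 6, "b": 7}</tool>'
    value = re.search(r"<observation[^>]*>(.*?)</observation>", prompt).group(1)
    return f"<think>The tool returned {value}.</think><answer>{value}</answer>"

final_answer = run_loop(scripted_model, {"multiply": lambda a, b: a * b}, "What is 6 x 7?")
print(final_answer)
```

Swapping `scripted_model` for a real model client should not require touching `run_loop`, which is the layer separation the paragraph above describes.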
System Architecture Deep Dive
Layer Overview
The system has five distinct layers:
Presentation Layer: A web client or API consumer that renders the SSE stream. Receives live <think>, <tool_call>, <observation>, and <answer> events.
API Layer: A FastAPI server that accepts user queries, manages session state, and streams events back. The single POST /v1/reason endpoint is the main entry point.
Reasoning Layer: The HuggingFace Transformers model (or any compatible LLM endpoint) wrapped in a loop controller that manages context assembly, loop iteration counting, and termination detection.
Tool Dispatch Layer: A tool registry with Pydantic-derived JSON schemas. At runtime, the parallel dispatcher fans out tool calls within a single thinking block, manages timeout/retry logic, and assembles <observation> blocks from results.
Execution Layer: The actual tool implementations: a Playwright-based web search + page reader, a Docker-sandboxed code executor, and lightweight utilities (calculator, date resolver, URL fetcher).
Component Table
Component | Role | Options |
Reasoning model | Generates <think> blocks with tool calls | Qwen2.5, DeepSeek-R1, Llama-3.1 via HuggingFace; Claude API |
Context manager | Assembles prompt from conversation + observations | Custom Python class; LangChain Memory (heavier) |
Tag parser | Extracts <tool> blocks from model output | Regex + fallback XML parser; custom state machine |
Tool registry | Maps tool names → schemas + callables | Pydantic BaseModel + JSON Schema; OpenAI function spec |
Parallel dispatcher | Fans out tool calls concurrently | asyncio.gather; ThreadPoolExecutor for sync tools |
Web tool | Live search + full-page read | Playwright + DuckDuckGo API; SerpAPI; Tavily |
Code executor | Sandboxed Python/shell execution | Docker container with resource limits; E2B sandbox |
Streaming server | SSE event source for live thought streaming | FastAPI + SSE-Starlette; Flask-SSE |
Evaluation harness | Scores final answers + tool traces | Custom GAIA-style scorer; LLM-as-judge |
Rate limiter | Throttles external API calls | Token bucket in Redis; in-memory for prototypes |
Data Flow: Step by Step
Client sends POST /v1/reason with { "query": "...", "session_id": "..." }.
The API server creates or retrieves a session context (conversation history + previous observations).
The loop controller calls the reasoning model with the current context. The model generates a <think> block.
The tag parser scans the output for <tool name="..."> blocks. If there are none, it checks for an <answer> block; if one is found, it emits the answer and exits.
All identified tool calls are dispatched in parallel via asyncio.gather. Each call runs with a timeout.
Results (or errors) are wrapped in <observation tool="..." status="ok|error"> blocks.
The observation blocks are appended to the context. The loop counter increments.
If the loop counter exceeds max_iterations or a loop-detection hash match fires, the controller forces a summarize_and_answer call.
Every token emitted by the model and every observation injected is streamed to the client as an SSE event.
The final <answer> block is emitted, and citation grounding maps claims to specific observation indices.
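The observation-wrapping step above can be sketched as follows. The function name `wrap_observation`, the `max_chars` cap, and the decision to XML-escape tool output are illustrative assumptions; escaping is one way to stop a fetched page that happens to contain `</observation>` from breaking the protocol.

```python
from xml.sax.saxutils import escape, quoteattr

def wrap_observation(tool, result=None, error=None, max_chars=2000):
    """Render one tool result (or error) as an <observation> block."""
    status = "error" if error is not None else "ok"
    body = str(error if error is not None else result)
    if len(body) > max_chars:
        # Crude cap so a single page cannot flood the context window.
        body = body[:max_chars] + " [truncated]"
    return (
        f"<observation tool={quoteattr(tool)} status={quoteattr(status)}>"
        f"{escape(body)}</observation>"
    )

print(wrap_observation("calculator", result="1 < 2"))
print(wrap_observation("web_search", error="HTTP 429"))
```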
Two Non-Obvious Design Decisions
Why parallel tool dispatch matters more than it looks. A multi-step research query typically needs 5–12 tool calls. Sequential dispatch at 2 seconds per call means 10–24 seconds of dead time per reasoning step. Parallel dispatch inside a single thinking block cuts this to the latency of the slowest call in that batch, usually 2–3 seconds total. This is the difference between a system that feels alive and one that feels broken.
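The latency argument can be checked with a small experiment. This sketch uses `asyncio.sleep` as a stand-in for real tool calls, with delays scaled down from seconds to fractions of a second; the tool names and delays are made up for illustration.

```python
import asyncio
import time

async def fake_tool(name, delay):
    """Simulates an I/O-bound tool call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return name

async def sequential(calls):
    # Total time is roughly the sum of all delays.
    return [await fake_tool(name, delay) for name, delay in calls]

async def parallel(calls):
    # Total time is roughly the slowest single delay.
    return await asyncio.gather(*(fake_tool(name, delay) for name, delay in calls))

def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

calls = [("search", 0.2), ("read_page", 0.3), ("calc", 0.1)]
seq_result, seq_time = timed(sequential(calls))
par_result, par_time = timed(parallel(calls))
print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

`asyncio.gather` returns results in the order the calls were submitted, so observations can still be injected in a stable order even though the calls finish at different times.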
Why you need a separate reflection-and-replan module. When a tool call fails (rate limit, timeout, empty result), the naive approach is to inject <observation status="error"> and hope the model recovers. In practice, models often re-call the same failing tool with identical arguments. A dedicated reflection module detects this pattern, summarises what's known so far, and prompts the model to either try an alternative tool or acknowledge the gap, which prevents infinite retry loops.
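Detecting "same tool, identical arguments" comes down to fingerprinting each call. The sketch below is one possible approach, not the course's implementation: the class name `LoopDetector` and the threshold of two repeats are assumptions. Arguments are serialised with sorted keys so that key order does not change the hash.

```python
import hashlib
import json
from collections import Counter

class LoopDetector:
    def __init__(self, max_repeats=2):
        self.max_repeats = max_repeats
        self.seen = Counter()

    @staticmethod
    def fingerprint(tool_name, args):
        # Canonical JSON: sorted keys and fixed separators give a stable hash.
        canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(f"{tool_name}:{canonical}".encode()).hexdigest()

    def record(self, tool_name, args):
        """Return True once this exact call has been seen max_repeats times."""
        key = self.fingerprint(tool_name, args)
        self.seen[key] += 1
        return self.seen[key] >= self.max_repeats

detector = LoopDetector()
first = detector.record("web_search", {"q": "python 3.13 release", "n": 5})
second = detector.record("web_search", {"n": 5, "q": "python 3.13 release"})
other = detector.record("web_search", {"q": "python 3.12 release", "n": 5})
print(first, second, other)
```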
Tech Stack Recommendation
Stack A: Prototype (Weekend Build)
Layer | Technology | Why |
Language | Python 3.10+ | Async support, rich ecosystem |
Reasoning model | DeepSeek-R1 distilled 8B variant (via HuggingFace) | Free, strong reasoning, runs on a single A100 |
Tag parser | Regex + xml.etree fallback | No dependencies, fast enough for prototypes |
Web tool | requests + BeautifulSoup | No browser needed; good enough for static pages |
Code exec | subprocess with timeout | Simple; NOT safe for production |
API server | FastAPI + SSE-Starlette | Minimal boilerplate, instant SSE support |
Tool schemas | Pydantic BaseModel | Auto-generates JSON Schema from each model class |
Frontend | Basic HTML + EventSource API | Zero framework; works immediately |
Estimated monthly cost: ~$20–50 (single GPU instance on RunPod or Lambda Labs for inference; free for hosted model APIs like Together AI's free tier).
Stack B: Production-Ready
Layer | Technology | Why |
Language | Python 3.11+ | Faster async, better type hints |
Reasoning model | Qwen2.5-72B (vLLM) or Claude API | Higher quality reasoning, reliable tool-call format |
Tag parser | State machine + Pydantic validation | Handles streaming tokens, rejects malformed calls |
Web tool | Playwright + SerpAPI | JavaScript rendering, reliable structured results |
Code exec | Docker with seccomp + cgroups | Safe multi-language execution, resource limits |
API server | FastAPI + Redis (session store) | Stateless workers; sessions survive restarts |
Tool schemas | JSON Schema Registry | Versioned schemas; validation before dispatch |
Eval harness | GAIA-style custom scorer | Measures answer accuracy + tool trace quality |
Observability | OpenTelemetry + Grafana | Trace every reasoning step for debugging |
Auth | API key + JWT | Multi-tenant SaaS deployment |
Estimated monthly cost: $200–600 depending on inference provider, traffic volume, and Docker host. Self-hosted vLLM on a 4×A10G cluster runs ~$400/month; Claude API at moderate usage is roughly $150–300/month.
Implementation Phases
Phase 1: Tag Protocol and Parser
Before writing a single line of model inference code, design the tag format. This is the contract between the model and the runtime: get it wrong here and every downstream component breaks.
You need to define the schema for <think>, <tool name="..." args="...">, <observation>, and <answer> tags. The format must be unambiguous to parse from a streaming token buffer, handle nested JSON argument strings safely, and be injected cleanly into few-shot prompts so the model learns to emit it reliably.
Key decisions: Will you use XML-style tags or JSON-wrapped blocks? How do you handle tool calls with multi-line argument payloads? Do you validate args at parse time or at dispatch time? How do you render the protocol in the system prompt to minimise hallucinated tag formats?
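As one possible answer to these questions, here is a minimal regex-based parser sketch. It assumes XML-style tags with the arguments as a JSON object in the tag body (rather than in an `args` attribute, which sidesteps quote escaping) and validates at parse time. `ToolCall`, `ParseError`, and `parse_tool_calls` are hypothetical names.

```python
import json
import re
from dataclasses import dataclass

TOOL_RE = re.compile(r'<tool\s+name="([A-Za-z_][\w-]*)"\s*>(.*?)</tool>', re.DOTALL)

@dataclass
class ToolCall:
    name: str
    args: dict

@dataclass
class ParseError:
    name: str
    reason: str

def parse_tool_calls(text):
    """Split model output into valid tool calls and reportable parse errors."""
    calls, errors = [], []
    for name, raw in TOOL_RE.findall(text):
        try:
            args = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(ParseError(name, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(args, dict):
            errors.append(ParseError(name, "arguments must be a JSON object"))
            continue
        calls.append(ToolCall(name, args))
    return calls, errors

sample = (
    "<think>Need two lookups."
    '<tool name="web_search">{"query": "GAIA benchmark", "top_k": 3}</tool>'
    '<tool name="calculator">{"expression": "12 * (3 + 4)"</tool>'  # truncated JSON
    "</think>"
)
calls, errors = parse_tool_calls(sample)
print(calls)
print(errors)
```

Returning errors alongside valid calls lets the runtime feed a malformed call back to the model as an error observation instead of silently dropping it. A regex like this only works on complete output; parsing from a streaming token buffer would need a state machine.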
Getting the parser to handle every edge case the model throws at it, including malformed tags, truncated JSON, and mixed-format output, is covered in detail in the full course with working, tested code.
Phase 2: Tool Registry and Dispatch Engine
The tool registry is where Pydantic earns its keep. Define each tool as a Pydantic BaseModel subclass; the JSON Schema is auto-generated and injected into the system prompt so the model knows exactly what arguments each tool expects.
The dispatch engine wraps the registry with asyncio.gather for parallel execution. Each tool call gets a configurable timeout. Failures return structured error observations. The engine tracks which tools have been called with which arguments in the current session to power loop detection.
Key decisions: How do you handle tool dependencies (Tool B needs Tool A's output)? Do you implement a dependency graph or rely on the model to sequence calls correctly? What's your retry policy for rate-limit errors? How do you truncate oversized tool outputs before injecting them as observations?
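A minimal sketch of the dispatch side, assuming async tool functions held in a plain dictionary (a real registry would also carry the Pydantic schemas). The tool names, the dictionary shape of the result, and the 0.1-second timeout are illustrative assumptions. Every failure mode is converted into a structured error rather than raised.

```python
import asyncio

async def web_search(query):
    await asyncio.sleep(0.01)          # stand-in for a real network call
    return f"results for {query}"

async def slow_reader(url):
    await asyncio.sleep(1.0)           # deliberately slower than the timeout
    return "page text"

REGISTRY = {"web_search": web_search, "slow_reader": slow_reader}

async def call_tool(name, args, timeout):
    """Run one tool with a timeout; never raise, always return a result dict."""
    if name not in REGISTRY:
        return {"tool": name, "status": "error", "content": "unknown tool"}
    try:
        content = await asyncio.wait_for(REGISTRY[name](**args), timeout)
        return {"tool": name, "status": "ok", "content": content}
    except asyncio.TimeoutError:
        return {"tool": name, "status": "error", "content": f"timed out after {timeout}s"}
    except Exception as exc:           # bad arguments, tool bugs, network errors
        return {"tool": name, "status": "error", "content": f"{type(exc).__name__}: {exc}"}

async def dispatch(calls, timeout=0.1):
    return await asyncio.gather(*(call_tool(n, a, timeout) for n, a in calls))

results = asyncio.run(dispatch([
    ("web_search", {"query": "vLLM"}),
    ("slow_reader", {"url": "https://example.com"}),
    ("missing_tool", {}),
]))
for r in results:
    print(r)
```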
The parallel dispatch engine with dependency analysis and loop detection is covered in detail in the full course with working, tested code.
Phase 3: Web Search and Code Execution Tools
These are the two highest-value tools and also the most complex to implement safely.
The web tool needs to fetch search results, pick the most relevant pages, and read their full content, not just titles and snippets. Playwright handles JavaScript-rendered pages that BeautifulSoup can't reach. You need to handle paywalls, redirects, and pages that return 200 but render no useful content. A content-extraction heuristic (Readability-style scoring) filters out boilerplate.
The code execution tool wraps a Docker container with strict resource limits: CPU cap, memory limit, no network egress, no privileged syscalls via seccomp profiles. The model submits Python or shell code; the sandbox executes it and returns stdout/stderr within a timeout. Jailbreak attempts via os.system, subprocess, or socket calls are blocked at the kernel level.
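The resource limits described above map onto standard `docker run` flags. This sketch only builds the argument list; the helper name `docker_run_argv`, the default limits, the `seccomp.json` profile path, and the `python:3.12-slim` image are assumptions chosen for illustration.

```python
def docker_run_argv(image, code, seccomp_profile="seccomp.json",
                    memory="256m", cpus="0.5", pids=64):
    """Build a locked-down `docker run` command for one code snippet."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                            # no network egress
        "--memory", memory,                             # memory limit
        "--cpus", cpus,                                 # CPU cap
        "--pids-limit", str(pids),                      # limits fork bombs
        "--read-only",                                  # read-only root filesystem
        "--cap-drop", "ALL",                            # drop Linux capabilities
        "--security-opt", "no-new-privileges",
        "--security-opt", f"seccomp={seccomp_profile}", # syscall filter
        image, "python", "-c", code,
    ]

argv = docker_run_argv("python:3.12-slim", "print(1 + 1)")
print(" ".join(argv[:-1]), repr(argv[-1]))
```

The list can be passed to `subprocess.run(argv, capture_output=True, text=True, timeout=...)` to collect stdout and stderr. Passing the code as a single list element, rather than through a shell string, avoids shell injection on the host side.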
Key decisions: Which Playwright browser profile minimises bot detection? How do you handle search rate limits across multiple concurrent sessions? What Docker base image gives you a useful Python environment without exposing attack surface?
The full Playwright web tool with content scoring and the Docker sandbox with seccomp profiles are provided as ready-to-run components in the full course with working, tested code.
Phase 4: Reasoning Loop Controller and Reflection Module
The loop controller is the orchestration heart. It assembles the context for each model call, tracks the iteration count, enforces the maximum-steps policy, and detects when the model is spinning.
The reflection module fires when a tool call has failed twice with the same arguments, or when the loop has run more than half its allowed iterations without making progress. It injects a structured reflection prompt asking the model to (a) summarise what it knows, (b) identify what's missing, and (c) decide whether to try an alternative approach or answer with acknowledged uncertainty.
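The two triggers and the three-part prompt can be sketched as below. How "progress" is tracked is left open in the text, so the boolean `progress_made` flag here is a placeholder assumption, as are the names `LoopState`, `should_reflect`, and the prompt wording.

```python
from dataclasses import dataclass, field

REFLECTION_PROMPT = (
    "Before calling any more tools:\n"
    "(a) Summarise what you know so far.\n"
    "(b) Identify what is still missing.\n"
    "(c) Either try an alternative approach or answer with acknowledged uncertainty."
)

@dataclass
class LoopState:
    max_iterations: int
    iteration: int = 0
    progress_made: bool = False                      # placeholder progress signal
    failures: dict = field(default_factory=dict)     # (tool, args_json) -> count

    def record_failure(self, tool, args_json):
        key = (tool, args_json)
        self.failures[key] = self.failures.get(key, 0) + 1

def should_reflect(state):
    """True if a call failed twice with the same args, or the loop has stalled."""
    repeated_failure = any(count >= 2 for count in state.failures.values())
    stalled = state.iteration > state.max_iterations / 2 and not state.progress_made
    return repeated_failure or stalled

state = LoopState(max_iterations=8)
state.record_failure("web_search", '{"q": "x"}')
once = should_reflect(state)
state.record_failure("web_search", '{"q": "x"}')
twice = should_reflect(state)
print(once, twice)
```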
Key decisions: How do you define "progress" for loop detection? How do you compress long observation histories to stay within the context window? When should the controller force an answer versus allow more iterations?
The full loop controller with configurable termination policies and the reflection-and-replan module are covered in detail in the full course with working, tested code.
Phase 5: SSE Streaming API and GAIA-Style Evaluation
The FastAPI server wraps the loop controller and streams every event (model tokens, tool calls, observation injections, and the final answer) over Server-Sent Events. The client receives a live, structured event stream it can render progressively.
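On the wire, an SSE event is a few `field: value` lines terminated by a blank line. The helper below is a framework-independent sketch of that framing; the event names (`think`, `answer`) and payload shape are assumptions, not a schema from the course.

```python
import json

def sse_event(event, data, event_id=None):
    """Format one Server-Sent Event frame with a JSON payload."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the payload always fits on one data line.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"

frame = sse_event("think", {"text": "plan"}, event_id=1)
print(frame, end="")
print(sse_event("answer", {"text": "42", "citations": [0, 2]}), end="")
```

Using a distinct `event:` name per block type lets a browser client attach separate `EventSource` listeners for thoughts, tool calls, and answers.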
The evaluation harness scores the system on GAIA-style tasks: multi-step questions with verifiable answers. The harness measures answer correctness, tool call efficiency (did the agent use the minimum necessary calls?), citation grounding (are answer claims traceable to specific observations?), and loop efficiency (did it terminate cleanly?).
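A rough sketch of what such a scorer could compute for a single run. The normalisation rules and the ratio definitions below are illustrative assumptions, not the official GAIA scoring method or the course's harness.

```python
import re

def normalize(answer):
    """Lowercase, drop punctuation except decimal points, collapse whitespace."""
    text = re.sub(r"[^\w\s.]", "", answer.strip().lower())
    return re.sub(r"\s+", " ", text)

def score_run(predicted, expected, tool_calls, min_calls, claims, grounded_claims):
    """Score one run on correctness, tool efficiency, and citation grounding."""
    return {
        "correct": normalize(predicted) == normalize(expected),
        # 1.0 means the agent used no more calls than the reference minimum.
        "tool_efficiency": min(1.0, min_calls / tool_calls) if tool_calls else 0.0,
        # Share of answer claims traceable to an observation.
        "citation_grounding": grounded_claims / claims if claims else 1.0,
    }

report = score_run(" 1,024 Items ", "1024 items",
                   tool_calls=8, min_calls=4, claims=4, grounded_claims=3)
print(report)
```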
Key decisions: What SSE event schema enables the client to render thoughts, tools, and answers in separate UI sections? How do you handle client disconnects mid-stream? How do you implement LLM-as-judge scoring for subjective answer quality?
The SSE streaming server with full event typing and the GAIA evaluation harness with worked examples are covered in detail in the full course with working, tested code.
Common Challenges
Building this system is an exercise in handling failure modes that don't show up in small demos but become blocking problems the moment you test with real, diverse queries.
1. Malformed tag hallucination under load. The root cause: the model's tag format degrades when the context window fills up with long observation blocks. The fix: inject the tag format reminder into every context assembly, implement a streaming token validator that detects malformed openings before the full tag is emitted, and add a correction prompt that rescues partially valid calls.
2. Infinite retry loops. Root cause: after a tool failure, the model re-calls the same tool with the same arguments because the error observation doesn't give it enough signal to change strategy. Fix: the loop detector hashes (tool_name, args_json) and fires the reflection module after two identical calls in the same session.
3. Parallel tool call race conditions. Root cause: Tool B reads a resource that Tool A writes, but they're dispatched simultaneously. Fix: a simple dependency declaration in the tool schema ("depends_on": ["tool_a"]) lets the dispatcher build a minimal execution graph and sequence dependent calls correctly.
4. Context-window blowup. Root cause: long web pages injected as raw observations can consume 8,000–20,000 tokens in a single observation block, leaving no room for further reasoning. Fix: observation summarisation. The dispatcher runs a fast summarisation pass over any observation exceeding a configurable token threshold before injection.
5. Code sandbox jailbreaks. Root cause: even with subprocess blocked at the Python level, models sometimes generate code that uses ctypes, socket, or os.fork to escape. Fix: seccomp profiles at the Docker level block the relevant syscalls regardless of the code path.
6. Citation grounding failures. Root cause: the model's final answer references facts that don't appear in any observation; it's filling gaps with parametric memory. Fix: a post-answer citation checker maps each claim in the answer against the observation log and flags ungrounded statements for the client to highlight.
7. Knowing when to stop. Root cause: some models are overly cautious and keep searching long after they have enough information to answer. Fix: a confidence signal computed from the model's own reasoning text (look for phrases like "I now have enough information") combined with a minimum-confidence threshold that allows early termination.
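For the race-condition fix in item 3, the `depends_on` declarations can be turned into an execution plan with the standard library's `graphlib`. This is a sketch under assumptions: the tool names are made up, and the dictionary below stands in for whatever the tool schemas actually declare. Each batch holds calls that are safe to run concurrently.

```python
from graphlib import TopologicalSorter

def execution_batches(depends_on):
    """Group tools into batches; each batch depends only on earlier batches.

    `depends_on` maps a tool name to the list of tools it must wait for.
    Raises graphlib.CycleError if the declarations contain a cycle.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches

batches = execution_batches({
    "web_search": [],
    "calculator": [],
    "read_page": ["web_search"],
    "summarise": ["read_page", "calculator"],
})
print(batches)
```

Each inner list could be handed to `asyncio.gather` in turn, which keeps independent calls parallel while still sequencing dependent ones.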
Solving these issues took us over 120 hours of testing across dozens of query types. The course walks you through each fix with working, production-tested code so you don't have to discover them the hard way.
Ready to Build This Yourself?
Understanding the architecture is one thing. Shipping a working, tested, deployed system is another. There's a long way between "I understand how this works" and "my reasoning agent is live and serving real users."
The Agentic Tool-Using Reasoning Models from Scratch course on Codersarts Labs bridges that gap. Here's exactly what you get:
✅ Full annotated source code: every component described in this post, production-quality
✅ Video walkthroughs: step-by-step explanations of every architectural decision
✅ Docker sandbox images: pre-configured code execution environment, ready to run
✅ Playwright web tool: full implementation with content scoring and anti-detection configuration
✅ GAIA evaluation harness: test your agent against graded multi-step benchmarks
✅ SSE streaming API: FastAPI server with full event typing and client reference implementation
✅ Lifetime access: all future updates included at no extra cost
✅ Tested configurations: every stack configuration has been validated end-to-end
$49.99. Everything above.
Building something more complex? Integrating a reasoning agent into an enterprise product or deploying for a specific vertical? The 1:1 Guided Session ($199.99 for three 1-hour live sessions) includes a tool-architecture review tailored to your use case and hands-on deployment help. Book a session on labs.codersarts.com.
Conclusion
The architecture described here (a structured think-act-observe tag protocol, a Pydantic-powered tool registry, a parallel dispatch engine, a reflection-and-replan module, and an SSE-streaming FastAPI server) is what separates toy chatbots from production reasoning agents. It's the same pattern behind Claude Extended Thinking, o3 Deep Research, and Perplexity's reasoning mode, implemented from scratch with full control over every layer.
If you're starting fresh, begin with Stack A: DeepSeek-R1 on a single GPU, simple regex parser, requests-based web tool, subprocess code executor, and basic FastAPI SSE. Get the loop running and the tools responding. You can swap in production-grade components one layer at a time once the protocol is stable.
When you're ready to move beyond architecture diagrams and into working, deployable code, the full course on labs.codersarts.com has everything you need.


