
How to Build a Multi-Agent Research Assistant with LangGraph, FastAPI, and Next.js



Research is the hidden tax on every decision that matters. A founder sizing a market, an engineer evaluating a technology, a consultant briefing on an unfamiliar industry — they all face the same bottleneck. Hours spent searching, cross-referencing contradictory sources, deciding what to trust, and formatting the results into something usable. And when they finally look up, they've spent a morning to produce two paragraphs.

The obvious answer seems to be "just use ChatGPT." It can search. It can summarise. But it can't do what a skilled research team actually does: decompose a question into parallel tracks of investigation, evaluate evidence quality across sources, detect gaps and follow up on them, resolve conflicts between contradictory claims, and synthesise everything into a structured, cited report — all while you watch the work happen in real time.

That's not a prompt engineering problem. It's an architecture problem.

The Multi-Agent Research Assistant solves it. You submit a natural language research query. A LangGraph-orchestrated team of specialised agents deploys: a Planner decomposes your question into sub-questions, parallel Researcher agents retrieve and rank sources for each one, a Critic evaluates evidence quality and identifies gaps, a Synthesiser merges findings into a coherent narrative, and a Formatter produces a structured Markdown report with clickable citations — streamed to your browser in real time, agent by agent, as it happens.

Real-world use cases this application handles:

  • Technical founders running competitive and market research without a research team

  • AI/ML engineers building and benchmarking multi-agent LangGraph pipelines

  • Data scientists conducting systematic literature reviews across web and document sources

  • Consulting professionals briefing quickly on unfamiliar industries before client calls

  • CS students studying stateful agent graphs, RAG patterns, and real-time streaming architectures

  • Content researchers producing structured first-draft reports at scale

  • Full-stack developers learning production-grade LangGraph + Next.js + FastAPI integration

This article covers the core concept, the system architecture, the LangGraph agent graph design, the implementation phases you will work through, and the most common challenges you'll encounter building it. Full source code is available in the complete course at labs.codersarts.com.




📄 Before you dive in — grab the free PRD template that maps out this entire system: architecture, API spec, sprint plan, and system prompt. [Download the free PRD]


How It Works: Core Concept

The concept powering this system is stateful multi-agent orchestration with parallel retrieval-augmented generation.

A single LLM call is a one-shot transformation: input goes in, output comes out. It has no persistent memory of prior steps, no ability to delegate work to specialised processes running simultaneously, and no mechanism for a quality-checking pass before the output reaches the user. These are structural limitations — they cannot be fixed with better prompts.

LangGraph solves this by modelling the entire research process as a directed graph. Nodes are agent functions. Edges define execution flow. State is a typed dictionary that every node reads from and writes to, persisting across the entire pipeline. This means the Synthesiser can see exactly what the Planner decided, what each Researcher found, and what the Critic flagged — without any of that context being manually stitched together in prompts.
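In code, that wiring is compact. Here is a minimal sketch, assuming the node functions described later in this article (planner_node, critic_node, and so on; the names are illustrative) already exist:

from langgraph.graph import StateGraph, END

builder = StateGraph(ResearchState)                 # typed state shared by every node
builder.add_node("planner", planner_node)           # each node: state in, state slice out
builder.add_node("researcher_fan_out", researcher_fan_out)
builder.add_node("critic", critic_node)
builder.add_node("synthesiser", synthesiser_node)
builder.add_node("formatter", formatter_node)

builder.set_entry_point("planner")
builder.add_edge("planner", "researcher_fan_out")
builder.add_edge("researcher_fan_out", "critic")
builder.add_conditional_edges(                      # gap-detection loop, detailed below
    "critic", route_after_critic,
    {"research_more": "researcher_fan_out", "synthesise": "synthesiser"},
)
builder.add_edge("synthesiser", "formatter")
builder.add_edge("formatter", END)

research_graph = builder.compile()                  # compiled once, invoked per session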

Why parallel matters. A research question like "What are the competitive dynamics of the AI coding assistant market?" naturally decomposes into five or six sub-questions: who are the players, what is the pricing model, what is the user adoption data, what do developers think, what are the technical differentiators, what is the VC investment trend. Answering them one after another takes 5–6 minutes of wall-clock time. Running them in parallel — one Researcher Agent per sub-question, firing simultaneously via asyncio.gather — collapses that to the latency of the slowest single agent. The difference between sequential and parallel research at this scale is the difference between a usable tool and one nobody opens twice.
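A sketch of that fan-out, collapsing the fan-out and join into a single node function for brevity; run_researcher is an assumed coroutine wrapping one Researcher Agent:

import asyncio

async def researcher_fan_out(state: ResearchState) -> dict:
    semaphore = asyncio.Semaphore(3)            # cap concurrent LLM calls (see Phase 1)

    async def run_one(sub_question: str) -> ResearchResult:
        async with semaphore:
            return await run_researcher(sub_question)

    results = await asyncio.gather(
        *(run_one(sq) for sq in state["plan"].sub_questions)
    )
    return {"research_results": list(results)}  # merged into graph state on return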

Why the Critic node exists. LLMs are optimistic synthesisers. Given insufficient evidence, they fill gaps with plausible-sounding text rather than acknowledging uncertainty. The Critic Agent is the adversarial check: it reviews every ResearchResult, scores confidence per sub-question, flags factual conflicts between sources, and identifies sub-questions where fewer than two credible sources were found. When gaps are detected, the graph loops back — Researcher Agents are re-invoked for the specific gap sub-questions before synthesis begins. This is the graph's conditional edge in action: real iterative reasoning, not a simulation of it.



RESEARCH PIPELINE — HIGH LEVEL:

  User submits natural language query
          │
          ▼
  [PLANNER AGENT]
  Decomposes query into 3–7 sub-questions
  Assigns depth label (broad / deep / verify)
          │
          ▼
  [RESEARCHER FAN-OUT]
  Spawns one Researcher Agent per sub-question
          │
    ┌─────┼─────┐
    ▼     ▼     ▼  (parallel via asyncio.gather)
  [R-0] [R-1] [R-N]
  Each agent: Tavily search → rank sources
            → extract claims with citations
            → return ResearchResult
    └─────┬─────┘
          ▼
  [RESEARCHER JOIN]
  Aggregates all ResearchResults into state
          │
          ▼
  [CRITIC AGENT]
  Scores evidence quality per sub-question
  Detects factual conflicts
  Flags gap sub-questions
          │
          ├──(gaps found)──→ [RESEARCHER FAN-OUT]  (loop, max 1 iteration)
          │
          └──(no gaps / max iterations reached)
          ▼
  [SYNTHESISER AGENT]
  Merges findings into coherent narrative
  Streams section-by-section to frontend
          │
          ▼
  [FORMATTER AGENT]
  Applies Markdown structure + citation list
          │
          ▼
  REPORT_COMPLETE event → browser renders report



System Architecture Deep Dive

The Multi-Agent Research Assistant has seven layers. Each has a specific responsibility and a clear boundary.

Layer 1 — Frontend (Next.js 15 + React 19 + Tailwind CSS). The UI has three areas: a query input panel with focus mode selection, a real-time agent activity feed that shows every agent as it starts and completes (with timestamps and status badges), and a report viewer that renders the final Markdown output with syntax-highlighted citations. The agent feed is the differentiator — users aren't watching a spinner, they're watching a research team work. The frontend communicates with the backend over WebSocket for the live event stream, with SSE as a fallback.

Layer 2 — API Gateway (FastAPI + WebSocket). The backend exposes a REST endpoint to create a research session and a WebSocket endpoint to stream its events. FastAPI's async-native architecture means multiple concurrent sessions run on a single event loop without blocking. Session state is maintained in PostgreSQL (production) or SQLite (prototype).

Layer 3 — LangGraph Orchestration Engine. This is the application's core. The graph is defined as a StateGraph with a typed state dictionary. Nodes are registered with add_node(). Edges — including the conditional gap-detection loop — are declared with add_conditional_edges(). The graph is compiled once at startup and invoked per session. stream_mode="values" ensures the full state is emitted to the WebSocket handler after every node transition.
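A sketch of the handler side; astream and stream_mode are real LangGraph APIs, while state_to_event is an assumed helper that maps a state snapshot to a frontend event:

async def run_session(query: str, websocket, session_id: str) -> None:
    config = {"configurable": {"thread_id": session_id}}   # required with a checkpointer
    async for state in research_graph.astream(
        {"query": query}, config, stream_mode="values"
    ):
        await websocket.send_json(state_to_event(state))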

Layer 4 — Agent Layer (five specialised agents). Each agent is a Python function that takes the current graph state, calls an OpenAI model with a focused system prompt, validates the output with a PydanticOutputParser, and returns an updated state slice. No agent is aware of what other agents do — they only see the state fields they need. This isolation makes each agent independently testable and replaceable.
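Every agent follows the same shape. A sketch of the Critic node under those conventions, assuming a critic_prompt template string and the CriticReport model from the state schema below:

from langchain_core.output_parsers import PydanticOutputParser
from langchain_openai import ChatOpenAI

critic_parser = PydanticOutputParser(pydantic_object=CriticReport)
critic_llm = ChatOpenAI(model="gpt-4o", temperature=0.0)

async def critic_node(state: ResearchState) -> dict:
    # Reads only the fields it needs; writes only the fields it produces
    prompt = critic_prompt.format(
        plan=state["plan"],
        results=state["aggregated_results"],
        format_instructions=critic_parser.get_format_instructions(),
    )
    response = await critic_llm.ainvoke(prompt)
    return {
        "critic_report": critic_parser.parse(response.content),
        "iteration": state["iteration"] + 1,    # counts gap-detection passes
    }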

Layer 5 — AI Services (OpenAI). Three models are used at different points for cost and quality reasons. gpt-4o handles Planner, Critic, and Synthesiser — the steps requiring deep reasoning. gpt-4o-mini handles Researcher extraction — a high-frequency, lower-complexity step where cost efficiency matters. text-embedding-3-small handles document chunking and embedding for the optional RAG feature.

Layer 6 — Retrieval Layer (Tavily + Pinecone). Tavily is the primary web search provider — it returns structured results with credibility signals and integrates natively with LangChain. DuckDuckGo is the no-key fallback. For sessions with uploaded documents, Pinecone stores chunk embeddings in a namespace scoped to the session ID, and Researcher Agents query it alongside the web.

Layer 7 — Persistence Layer (PostgreSQL / SQLite). Session state, research history, and source credibility overrides are stored here. LangGraph's built-in checkpointer (SqliteSaver for prototype, PostgresSaver for production) saves full graph state after each node — enabling fault tolerance and reconnect replay.



Architecture Table

Layer | Component | Role
------|-----------|------
1 | Next.js 15 + React 19 + Tailwind | Query input, live agent feed, report viewer, session history, export
2 | FastAPI + WebSocket | Session creation, event streaming, auth middleware
3 | LangGraph StateGraph | Graph definition, node execution, conditional branching, checkpointing
4 | 5 Specialised Agents | Planner, Researcher ×N, Critic, Synthesiser, Formatter
5 | OpenAI (gpt-4o / mini / embedding) | LLM generation and vectorisation
6 | Tavily + Pinecone | Web retrieval and document vector search
7 | PostgreSQL / SQLite | Session state, history, credibility overrides



LangGraph Graph Design

The graph topology is the most important architectural decision in this project. Get it wrong and you end up with a pipeline that looks like a graph but behaves like a sequential chain — losing all the benefits of parallel execution and conditional branching.

Node Definitions

Node | Agent | Input State Fields | Output State Fields
-----|-------|--------------------|--------------------
planner | Planner Agent | query, focus_mode | plan
researcher_fan_out | asyncio fan-out | plan | Spawns N researcher nodes
researcher_N | Researcher Agent | sub_question (one) | Appends to research_results
researcher_join | State merge | research_results (list) | aggregated_results
critic | Critic Agent | aggregated_results, plan | critic_report
synthesiser | Synthesiser Agent | aggregated_results, critic_report | synthesis
formatter | Formatter Agent | synthesis | final_report

The Conditional Edge

After the Critic node, the graph branches. If critic_report.gap_questions is non-empty AND state.iteration < 2, the edge points back to researcher_fan_out. The Researcher Agents run again for the gap sub-questions only, append their results to research_results, and the Critic re-evaluates. After the second pass (or if no gaps are found), execution flows unconditionally to the Synthesiser.

This is the key LangGraph pattern to learn: not every edge is static. add_conditional_edges() takes a routing function that inspects the current state and returns the next node name. The graph doesn't "decide" anything — it evaluates a pure function against the state. This makes the logic testable in isolation without running the full pipeline.
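A sketch of that routing function and its registration; it assumes the CriticReport schema guarantees gap_questions is always a list (see challenge 8 in Common Challenges):

def route_after_critic(state: ResearchState) -> str:
    report = state["critic_report"]
    # Pure function over state: trivially unit-testable with hand-built fixtures
    if report.gap_questions and state["iteration"] < 2:
        return "research_more"
    return "synthesise"

builder.add_conditional_edges(
    "critic",
    route_after_critic,
    {"research_more": "researcher_fan_out", "synthesise": "synthesiser"},
)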



State Schema

The shared state dictionary is a TypedDict with these fields:



from typing import Annotated, TypedDict
import operator

class ResearchState(TypedDict):
    query:            str
    focus_mode:       str                    # comprehensive | quick | technical
    plan:             ResearchPlan           # Planner output
    # Appended by each Researcher Agent; the reducer merges parallel writes
    research_results: Annotated[list[ResearchResult], operator.add]
    aggregated_results: list[ResearchResult] # Researcher join output
    critic_report:    CriticReport           # Critic output
    synthesis:        SynthesisDocument      # Synthesiser output
    final_report:     str                    # Formatter output — Markdown string
    iteration:        int                    # Gap-detection loop counter
    stream_events:    Annotated[list[StreamEvent], operator.add]  # Append-only — replayed on reconnect
    error:            str | None             # First error; halts graph on non-None

Every agent reads only the fields it needs. Every agent writes only the fields it produces. No agent touches another agent's output fields. This isolation is what makes the graph composable, debuggable, and swappable — replace the Critic Agent's implementation without touching any other node.


Implementation Phases


Phase 1: LangGraph Graph Skeleton

Set up the monorepo, install LangGraph, and define the full graph topology with mocked agent functions before writing a single real LLM call. This phase proves the graph structure is correct — state flows in the right order, conditional edges fire as expected, and stream_mode="values" emits events after each node — before any API costs are incurred.

Key decisions to make:

  • State schema: which fields are immutable after creation (query) vs append-only (research_results) vs replaced (critic_report)

  • Checkpointer choice: SqliteSaver (zero-config prototype) vs PostgresSaver (persistent, multi-worker production)

  • Streaming: stream_mode="values" vs "updates" — values emits full state on every transition; updates emits diffs. Values is simpler to map to frontend events.

  • Fan-out pattern: asyncio.gather with asyncio.Semaphore(3) to cap parallel LLM calls within OpenAI TPM limits

Wiring the conditional gap-detection edge and verifying that the iteration counter prevents infinite loops — with full trace output to confirm — is covered in detail in the full course with working, tested code.
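A sketch of the mocked-node pattern with a SQLite checkpointer; FAKE_PLAN is an illustrative fixture, and the SqliteSaver constructor has varied across langgraph-checkpoint-sqlite releases, so check your installed version:

import sqlite3
from langgraph.checkpoint.sqlite import SqliteSaver

def mock_planner(state: ResearchState) -> dict:
    return {"plan": FAKE_PLAN}        # canned output: no API cost, deterministic tests

conn = sqlite3.connect("checkpoints.db", check_same_thread=False)
research_graph = builder.compile(checkpointer=SqliteSaver(conn))

for state in research_graph.stream(
    {"query": "smoke test"},
    {"configurable": {"thread_id": "smoke-1"}},   # thread_id required with checkpointing
    stream_mode="values",
):
    print(sorted(state.keys()))       # confirm fields appear in the expected node order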


Phase 2: Planner and Researcher Agents

Implement the two most critical agents. The Planner must produce schema-valid JSON reliably — its output drives every downstream step, so a malformed plan breaks the entire pipeline. The Researcher must run in parallel without hitting rate limits and produce credibility-scored ResearchResults that the Critic can evaluate meaningfully.

Key decisions to make:

  • Planner temperature: 0.0 for deterministic plan JSON — creativity is not needed at the decomposition step

  • Sub-question count: the system prompt should enforce 3–7; validate the count in PydanticOutputParser and retry if out of range

  • Tavily vs DuckDuckGo: Tavily returns structured results with relevance scores natively; DuckDuckGo requires additional parsing but has no API key requirement

  • Credibility scoring: domain-authority heuristics (Wikipedia, .gov, .edu = high) are fast and free; LLM scoring is more nuanced but adds latency and cost — the right approach is hybrid

  • Researcher model: gpt-4o-mini for extraction is ~8× cheaper than gpt-4o at this step; quality difference is negligible for claim extraction from short snippets

Building the asyncio fan-out with Semaphore-guarded parallelism and testing it against Tavily's rate limits is covered in detail in the full course with working, tested code.
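A sketch of the count validation, assuming a ResearchPlan model whose sub_questions field is a plain list of strings:

from pydantic import BaseModel, field_validator

class ResearchPlan(BaseModel):
    sub_questions: list[str]

    @field_validator("sub_questions")
    @classmethod
    def enforce_count(cls, v: list[str]) -> list[str]:
        # A ValidationError here triggers the retry-with-correction path
        if not 3 <= len(v) <= 7:
            raise ValueError(f"expected 3-7 sub-questions, got {len(v)}")
        return v

plan_parser = PydanticOutputParser(pydantic_object=ResearchPlan)

On a parse failure, re-invoke the Planner with the validator's error message appended as a correction; LangChain's OutputFixingParser is one packaged way to do that.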


Phase 3: Critic and Synthesiser Agents

The Critic is the system's quality gate. The Synthesiser is where the value is delivered. Both require careful prompt engineering — the Critic must produce structured, actionable JSON that the conditional edge can route on; the Synthesiser must produce coherent prose that doesn't repeat itself across sections despite being generated section-by-section.

Key decisions to make:

  • Critic gap threshold: sub-questions with fewer than 2 sources scoring credibility_score ≥ 0.6 trigger gap questions — calibrate this against your test queries before going to production

  • Synthesiser section order: generate sections in the order they appear in the plan, injecting a 3-sentence running summary of prior sections into each subsequent prompt to prevent repetition

  • Streaming: the Synthesiser should stream tokens to the WebSocket handler via SYNTHESIS_CHUNK events — users see the report being written word by word, not a blank screen followed by a wall of text

  • Output format options: executive summary (150–250 words), detailed report (full sections), or bullet brief — controlled by a user-selectable output_format parameter passed through the state

Tuning the Critic's confidence threshold to avoid both false gaps (re-running research unnecessarily) and missed gaps (passing thin evidence to the Synthesiser) is covered in detail in the full course with worked examples.
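A sketch of the section loop with the running summary injected; write_section and summarise_sections are assumed helpers wrapping gpt-4o and gpt-4o-mini calls respectively:

async def synthesise_sections(state: ResearchState, send_chunk) -> list[str]:
    sections: list[str] = []
    running_summary = ""                      # empty for the first section
    for sub_q in state["plan"].sub_questions:
        text = await write_section(
            sub_q,
            state["aggregated_results"],
            prior_summary=running_summary,    # explicit "do not repeat" context
            on_token=send_chunk,              # emits SYNTHESIS_CHUNK events
        )
        sections.append(text)
        running_summary = await summarise_sections(sections)  # 3-sentence recap
    return sections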


Phase 4: WebSocket Streaming and Frontend

Build the real-time agent activity feed that makes this application visually compelling and functionally transparent. The gap between a tool that shows a spinner and one that shows you agents working is the gap between something users tolerate and something they trust.

Key decisions to make:

  • Event schema: each StreamEvent has type, agent, timestamp, payload — strictly typed so the frontend can render the right component per event without conditionals everywhere

  • Reconnect replay: store all stream_events in session state (append-only); on WebSocket reconnect, client sends { last_event_id } and server replays only the missed events

  • SSE fallback: implement GET /api/research/{session_id}/stream as an SSE endpoint for environments that don't support WebSocket (Safari on older iOS, some corporate proxies)


  • Agent feed UI: each agent gets a card with a status badge (waiting / running / complete / error), elapsed time, and a collapsible detail view showing what it found

Building the event replay mechanism that makes the feed survive mobile network drops without losing state is covered in detail in the full course with working, tested code.
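A sketch of the event schema and replay filter; type, agent, timestamp, and payload come from the bullet above, while the monotonically increasing id is the piece the replay mechanism needs:

from typing import Any
from pydantic import BaseModel

class StreamEvent(BaseModel):
    id: int                  # monotonic per session; client echoes the last one seen
    type: str                # e.g. AGENT_STARTED | AGENT_COMPLETED | SYNTHESIS_CHUNK
    agent: str
    timestamp: float
    payload: dict[str, Any]

def events_since(events: list[StreamEvent], last_event_id: int) -> list[StreamEvent]:
    # On reconnect, replay only what the client missed
    return [e for e in events if e.id > last_event_id]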


Phase 5: RAG Document Upload, Export, and Deployment

Add document upload for RAG augmentation, one-click Markdown and PDF export, session history, and production deployment. This phase turns a technically impressive demo into a tool people return to.

Key decisions to make:

  • Chunk size: 512 tokens with 50-token overlap is the standard starting point — validate against your expected document types (dense academic PDFs may need smaller chunks; sparse reports may tolerate larger ones)

  • Pinecone namespace: scope to session_id so document vectors from one session never pollute another; delete the namespace when the session is deleted

  • Export: Markdown export is a file download from the API; PDF export uses the browser print API with a clean @media print stylesheet — no server-side PDF generation needed

  • Deployment: Docker Compose for local; Railway for production (persistent disk for SQLite, or swap to Neon for PostgreSQL)

Setting up LangSmith tracing to get per-agent token costs and latency breakdowns in production is covered in detail in the full course with a full tracing walkthrough.
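For the chunking decision in the first bullet above, a sketch using LangChain's token-based splitter; document_text is assumed to be the extracted text of an uploaded file:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",   # tokenizer family used by text-embedding-3-small
    chunk_size=512,                # tokens per chunk
    chunk_overlap=50,              # overlap preserves context across chunk boundaries
)
chunks = splitter.split_text(document_text)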



Common Challenges

1. The Planner generates too many or too few sub-questions.


Root cause: Without a hard count constraint, gpt-4o at temperature=0.0 tends to over-decompose specific queries (8–10 sub-questions) and under-decompose broad ones (2).


Fix: Add an explicit count instruction to the system prompt — "generate exactly 3 to 7 sub-questions, no more, no fewer." Validate the count in PydanticOutputParser and trigger a retry with a stricter correction prompt on violation.


2. Parallel Researcher Agents hit OpenAI's TPM rate limit.


Root cause: Five simultaneous gpt-4o-mini calls, each processing 3–5 retrieved documents, can spike to 50,000+ tokens per minute — above the Tier 1 account limit.


Fix: Wrap the fan-out with asyncio.Semaphore(3) to cap simultaneous LLM calls at three. Add tenacity retry with exponential backoff on RateLimitError. Upgrade to Tier 2 before production deployment.


3. The gap-detection loop runs indefinitely.


Root cause: If the Critic's confidence threshold is set too aggressively, it flags gaps even after a second research pass on topics with genuinely sparse public coverage.


Fix: Hard-cap iteration at 2 in the conditional edge routing function. After the second iteration, the graph proceeds to the Synthesiser regardless — the final report notes flagged gaps as "areas requiring further research."


4. The Synthesiser repeats content from earlier sections.


Root cause: Each section prompt is an independent LLM call. Without explicit context about prior sections, the Synthesiser re-introduces background already covered in Section 1 by the time it reaches Section 3.


Fix: After each section is generated, run a lightweight gpt-4o-mini call to produce a 3-sentence summary. Inject this running summary into subsequent section prompts as an explicit "do not repeat" block.


5. The WebSocket drops during long synthesis runs.


Root cause: Synthesis can take 15–20 seconds on a complex query. Mobile connections and some corporate proxies drop idle WebSockets after 10–15 seconds of no new frames.


Fix: The server should send a heartbeat frame every 5 seconds during synthesis. Store all stream_events in session state. On reconnect, the client sends its last_event_id and the server replays only the events it missed.
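A sketch of the heartbeat as a background task alongside the graph run:

import asyncio

async def heartbeat(websocket, interval: float = 5.0) -> None:
    # Keeps idle-intolerant proxies from closing the socket mid-synthesis
    while True:
        await asyncio.sleep(interval)
        await websocket.send_json({"type": "HEARTBEAT"})

# usage: task = asyncio.create_task(heartbeat(ws)); cancel it when the run completes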


6. LangGraph state grows too large for long sessions.


Root cause: Storing full source text in research_results for a 7-sub-question query with 5 sources each produces a state dictionary approaching 1–2MB. This adds serialisation overhead to every checkpoint.


Fix: Store only source URL, title, credibility score, and extracted claims (not full text) in LangGraph state. Retrieve full source text from the database by ID when needed for display.


7. Credibility scores are inconsistent across agents.


Root cause: When credibility scoring is done inside each Researcher Agent (a separate LLM call per agent), the same source URL receives different scores from parallel agents due to temperature variation and slightly different context.


Fix: Move credibility scoring to a post-join step that runs once on the aggregated source list. Use a deterministic domain-authority lookup first (Wikipedia, .gov, .edu = 0.85+), and reserve LLM scoring only for sources not in the lookup table.


8. The Critic produces valid JSON but the conditional edge routes incorrectly.


Root cause: The routing function checks state["critic_report"]["gap_questions"] — but if the Critic returns an empty list ([]) rather than None, the truthiness check if gap_questions returns False, correctly routing to the Synthesiser. If the Critic returns null or omits the field, the routing function throws a KeyError.


Fix: Use .get("gap_questions", []) in the routing function. Enforce gap_questions: list[str] = [] (not Optional) in the CriticReport Pydantic schema.
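A sketch of the schema-level fix; the two fields above gap_questions are illustrative:

from pydantic import BaseModel, Field

class CriticReport(BaseModel):
    confidence_scores: dict[str, float]                  # illustrative field
    conflicts: list[str] = Field(default_factory=list)   # illustrative field
    # Never Optional: an empty list is falsy, so the routing check stays correct
    gap_questions: list[str] = Field(default_factory=list)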

Solving these issues took significant testing across query types and edge cases — the course walks through each fix with working code and the test scenarios that surface the bug.


Ready to Build This Yourself?

Understanding an architecture is not the same as shipping it. The gap between this article and a working, deployed multi-agent research system is filled with LangGraph graph debugging, prompt tuning sessions, rate limit negotiations, WebSocket reconnect edge cases, and Pinecone namespace management.

The Multi-Agent Research Assistant course on labs.codersarts.com gives you everything you need to go from zero to deployed:

✅ Full source code for all 6 sprints — LangGraph backend + Next.js frontend, fully commented

✅ Step-by-step tutorials walking through every architectural decision in each sprint

✅ All five agent system prompts with the exact configurations that produce reliable structured output

✅ LangSmith tracing setup walkthrough — see per-agent token costs and latency in production

✅ asyncio fan-out with Semaphore rate-limit guard — tested against OpenAI Tier 1 and Tier 2 limits

✅ WebSocket reconnect replay implementation — no missed events on mobile network drops

✅ Docker Compose setup for reproducible local development

✅ Deployment walkthrough for Railway and AWS ECS

✅ Lifetime access — including all future updates as LangGraph releases new versions

✅ Community support via the Codersarts Discord

$30.00. Everything above.

Already have a team or a project in motion and need a faster path? Book a 1:1 guided session at $20/hour — build it alongside the Codersarts team with your own research use case, your own stack decisions, and your own LangGraph graph reviewed live. Session recording included.



Conclusion

The Multi-Agent Research Assistant is a seven-layer system: a Next.js streaming frontend, a FastAPI WebSocket gateway, a LangGraph stateful graph engine, five specialised agents, OpenAI for generation and embedding, Tavily and Pinecone for retrieval, and PostgreSQL for persistence. The key architectural insight is the agent graph topology — not a sequential chain with a fancy name, but a genuine directed graph with parallel fan-out, stateful join, and a conditional loop that runs until evidence quality passes a programmatic quality gate.

The simplest place to start is Stack A: LangGraph + FastAPI + gpt-4o + Tavily + SQLite. No Pinecone, no Redis, no PostgreSQL. You can have a working multi-agent research pipeline — with real parallel execution and a real conditional Critic loop — running locally in a weekend.

When you are ready to move from architecture to working code, the full course is waiting at labs.codersarts.com — complete source, all five agent prompts, LangSmith tracing, and a full deployment walkthrough included.
