Build an Agentic RAG System with LangGraph | Major Project

CS / AI Engineering — Graduate Assignment
Difficulty: Advanced | Total Points: 100 | Duration: 2–3 Weeks | Format: Individual / Pairs
Overview
Retrieval-Augmented Generation (RAG) has become one of the most widely adopted patterns in applied NLP, enabling language models to answer questions grounded in external knowledge rather than relying solely on parametric memory. However, naive RAG pipelines have a fundamental weakness: they retrieve blindly, generate unconditionally, and have no ability to recognize or recover from failure. If the top-k retrieved documents are irrelevant, the generator hallucinates. If the question is complex and multi-hop, a single retrieval pass is insufficient. If the query is ambiguous, entirely the wrong knowledge source may be consulted.
Agentic RAG addresses these shortcomings by transforming the retrieval-generation pipeline into a reasoning loop. Rather than a fixed sequence of steps, an agentic system can pause, evaluate, decide, and retry. It can ask: Are these documents actually relevant to my question? Is my generated answer grounded in what I retrieved? Should I search the web instead of my vector store? Does this question even require retrieval at all? These are the questions that distinguish a naive RAG pipeline from a production-grade intelligent system.
In this assignment, you will design and implement a fully agentic RAG pipeline in Python using LangGraph — a stateful, graph-based orchestration framework built on top of LangChain. Unlike linear chains, LangGraph allows you to define nodes (processing steps), edges (transitions), and conditional branches (decisions), making it the natural substrate for building systems that loop, backtrack, and adapt. Your implementation must incorporate all four core architectural patterns that define modern agentic RAG: query routing, retrieval grading, corrective RAG (CRAG), and adaptive RAG.
By the end of this assignment, you will have built a system capable of handling diverse, real-world queries with far greater robustness than a baseline RAG pipeline — and you will have developed a deep, practical understanding of why each architectural component matters.
Background & Motivation
Before diving into implementation, it is worth understanding why each component of agentic RAG was invented and what problem it solves.
Why query routing? A single vectorstore is not always the right knowledge source. Some questions are best answered by a structured database. Others are time-sensitive and require live web search. Still others are so general that no retrieval is needed. A router acts as the intelligent dispatcher of your pipeline — it reads the question and decides where to look.
Why retrieval grading? Embedding-based similarity is a useful but imperfect signal. A document can be topically related to a query without actually containing the information needed to answer it. Retrieval graders use an LLM to apply a more nuanced judgment: does this specific document help answer this specific question? Filtering on this criterion before generation dramatically reduces hallucination.
Why corrective RAG? Even with a grader, retrieval sometimes fails entirely. The vectorstore may not contain the answer. The query may have been poorly formed. Corrective RAG (introduced by Yan et al., 2024) builds a recovery mechanism directly into the graph: when graded retrieval quality falls below a threshold, the system rewrites the query and tries again, or escalates to a web search fallback. Failure becomes a recoverable state rather than a terminal one.
Why adaptive RAG? Not all questions are equally complex. Asking "What is the capital of France?" does not require retrieval. Asking "Summarize the key arguments made by three different authors on the ethics of AI alignment" requires iterative, multi-source retrieval. Adaptive RAG (Jeong et al., 2024) classifies questions by complexity and selects the appropriate retrieval strategy — none, single-pass, or iterative — before the pipeline begins. This makes the system both more efficient and more capable.
Core Concepts You Will Implement
⇄ Query Routing Direct queries to the most appropriate retrieval source — vectorstore, web search, or structured database — based on query intent. The router must be LLM-powered with structured output, returning a typed routing decision that drives a conditional edge in the graph. You should think carefully about the classification taxonomy: what types of questions belong to each route, and how will you handle ambiguous cases?
✓ Retrieval Grading Use an LLM-as-judge to score each retrieved document for relevance before passing it downstream. The grader must evaluate each document independently against the original question and return a binary yes/no judgment. Irrelevant documents are discarded before generation begins. Your grader prompt design is critical — a poorly written prompt will either over-filter (discarding useful documents) or under-filter (letting noise through).
↺ Corrective RAG (CRAG) When the grader determines that retrieved documents are insufficient — either because all were filtered out or because the average relevance score is too low — the system must not simply give up. Instead, it must trigger a correction strategy: rewrite the query using an LLM to better express the underlying information need, fall back to a web search API (Tavily), or both in sequence. The graph must loop and recover rather than terminate on bad retrieval. This is the most complex component of the assignment and carries the most points.
◈ Adaptive RAG Before any retrieval occurs, classify the incoming question by complexity. Simple, self-contained factual questions are answered directly without retrieval. Standard questions follow the normal single-pass retrieval path. Complex, multi-hop questions that require synthesizing information from multiple sources trigger an iterative retrieval mode, where the graph loops through multiple retrieve-grade-generate cycles, accumulating context before producing a final answer. Demonstrate each mode with concrete examples.
Assignment Tasks
Task 01 — Graph Setup & State Schema Define a GraphState TypedDict capturing at minimum: the original question, the list of retrieved documents, the current generation, a routing decision field, a grading outcome field, and a loop counter to prevent infinite loops. Initialise a LangGraph StateGraph using this schema. Define all nodes and edges upfront before implementing logic. Your graph structure must be visualisable — use graph.get_graph().draw_mermaid_png() or draw_ascii() to produce a diagram and include it in your report.
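To make the starting point concrete, here is a minimal sketch of one possible state schema and graph skeleton. The field names (route, grade, loop_count) and the placeholder retrieve node are illustrative choices, not requirements; LangGraph only requires that nodes read state and return partial state updates.

```python
from typing import List, TypedDict

from langgraph.graph import END, START, StateGraph


class GraphState(TypedDict):
    """Shared state passed between every node in the graph."""
    question: str          # the original user question
    documents: List[str]   # retrieved (and later filtered) document texts
    generation: str        # the current draft answer
    route: str             # the router's decision
    grade: str             # outcome of the most recent grading step
    loop_count: int        # retry counter that guarantees termination


def retrieve(state: GraphState) -> dict:
    """Placeholder node: a real version would query the vectorstore."""
    return {"documents": ["..."]}  # nodes return partial state updates


workflow = StateGraph(GraphState)
workflow.add_node("retrieve", retrieve)
workflow.add_edge(START, "retrieve")
workflow.add_edge("retrieve", END)
app = workflow.compile()

# For the report's diagram (needs the optional drawing dependencies installed):
# open("graph.png", "wb").write(app.get_graph().draw_mermaid_png())
```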
Task 02 — Query Router Node Implement a router node using an LLM with structured output (use LangChain's .with_structured_output()) to classify each incoming query into one of your defined routing categories. Your router must return a typed Pydantic model, not a raw string. Wire the output to a conditional edge that dispatches the graph to the appropriate retrieval node. Log every routing decision with the query and reasoning for your evaluation dataset.
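A sketch of how such a router might look, reusing the GraphState from Task 01. The three route names and the prompt wording are placeholder assumptions to adapt to your own taxonomy:

```python
from typing import Literal

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class RouteQuery(BaseModel):
    """Typed routing decision returned by the router LLM."""
    datasource: Literal["vectorstore", "web_search", "direct"] = Field(
        description="The knowledge source best suited to the question."
    )


router_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Route the question: 'vectorstore' for questions about <your corpus topic>, "
     "'web_search' for recent or time-sensitive questions, 'direct' for general "
     "questions that need no retrieval."),
    ("human", "{question}"),
])
router = router_prompt | ChatOpenAI(
    model="gpt-3.5-turbo", temperature=0
).with_structured_output(RouteQuery)


def route_question(state) -> str:  # state: GraphState from Task 01
    """Conditional-edge function: returns the name of the next node."""
    decision = router.invoke({"question": state["question"]})
    print(f"ROUTE {decision.datasource}: {state['question']}")  # log the decision
    return decision.datasource

# workflow.add_conditional_edges(START, route_question,
#     {"vectorstore": "retrieve", "web_search": "web_search", "direct": "generate"})
```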
Task 03 — Retrieval Grader Node After retrieval, implement a grader node that iterates over each retrieved document and evaluates it independently. Your grader must use a carefully engineered prompt that asks the LLM to assess whether the document contains information directly useful for answering the question. Return a binary GradeDocuments Pydantic object with a binary_score field. Filter the document list before passing state to the generation node. Log all grading decisions.
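One possible shape for the grader, again with an illustrative prompt; the strict "directly useful" wording is a design choice you should tune against both over- and under-filtering:

```python
from typing import Literal

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class GradeDocuments(BaseModel):
    """Binary relevance judgment for a single document."""
    binary_score: Literal["yes", "no"] = Field(
        description="'yes' only if the document is directly useful for answering the question."
    )


grader_prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a strict relevance grader. Answer 'yes' only if the document "
     "contains information directly useful for answering the question; "
     "topical similarity alone is not enough."),
    ("human", "Document:\n{document}\n\nQuestion: {question}"),
])
grader = grader_prompt | ChatOpenAI(
    model="gpt-4o", temperature=0
).with_structured_output(GradeDocuments)


def grade_documents(state) -> dict:  # state: GraphState from Task 01
    """Node: keep only documents the grader marks as relevant."""
    kept = []
    for doc in state["documents"]:
        verdict = grader.invoke({"document": doc, "question": state["question"]})
        print(f"GRADE {verdict.binary_score}: {doc[:60]}...")  # log every decision
        if verdict.binary_score == "yes":
            kept.append(doc)
    return {"documents": kept}
```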
Task 04 — Corrective RAG Loop Implement the CRAG correction mechanism as a conditional edge: if the filtered document list is empty or below a minimum threshold, route to a correction node rather than generation. The correction node must (a) use an LLM to rewrite the original query into a more effective search formulation, and (b) call the Tavily Search API to retrieve web results. Re-grade the web results using the same grader before generation. Include a loop counter in state to cap retries at a configurable maximum (e.g. 3) to prevent infinite loops.
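A sketch of the correction branch, assuming the GraphState and node names from the earlier tasks. TavilySearchResults is the LangChain community tool wrapper and reads TAVILY_API_KEY from the environment; the assumption here is that each result dict exposes a "content" field:

```python
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

MAX_RETRIES = 3  # configurable cap on correction loops

rewriter = ChatPromptTemplate.from_messages([
    ("system",
     "Rewrite the question as a concise, keyword-rich web search query that "
     "better expresses the underlying information need."),
    ("human", "{question}"),
]) | ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
web_search = TavilySearchResults(max_results=3)  # requires TAVILY_API_KEY


def decide_after_grading(state) -> str:
    """Conditional edge: correct on empty retrieval, but respect the retry cap."""
    if not state["documents"] and state["loop_count"] < MAX_RETRIES:
        return "correct"
    return "generate"


def correct(state) -> dict:
    """Node: rewrite the query, fall back to web search, bump the counter."""
    better_query = rewriter.invoke({"question": state["question"]}).content
    results = web_search.invoke({"query": better_query})
    return {
        "documents": [r["content"] for r in results],
        "loop_count": state["loop_count"] + 1,
    }

# workflow.add_conditional_edges("grade_documents", decide_after_grading,
#     {"correct": "correct", "generate": "generate"})
# workflow.add_edge("correct", "grade_documents")  # web results are re-graded
```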
Task 05 — Hallucination & Answer Grader After generation, implement two sequential grader nodes. The first checks whether the generated answer is grounded in the retrieved documents (hallucination check). The second checks whether the answer actually addresses the original question (usefulness check). If either check fails, loop back to the generation node with a refined prompt that explicitly instructs the model to stay grounded. Document the rate at which each check triggers in your evaluation.
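The two post-generation checks can share one conditional-edge function. A sketch, with illustrative prompt wording and edge labels:

```python
from typing import Literal

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class GradeGrounding(BaseModel):
    binary_score: Literal["yes", "no"] = Field(
        description="'yes' if every claim in the answer is supported by the documents."
    )


class GradeUsefulness(BaseModel):
    binary_score: Literal["yes", "no"] = Field(
        description="'yes' if the answer actually addresses the question."
    )


llm = ChatOpenAI(model="gpt-4o", temperature=0)
grounding_grader = ChatPromptTemplate.from_messages([
    ("system", "Judge whether the answer is fully supported by the documents."),
    ("human", "Documents:\n{documents}\n\nAnswer: {generation}"),
]) | llm.with_structured_output(GradeGrounding)
usefulness_grader = ChatPromptTemplate.from_messages([
    ("system", "Judge whether the answer addresses the question."),
    ("human", "Question: {question}\n\nAnswer: {generation}"),
]) | llm.with_structured_output(GradeUsefulness)


def check_generation(state) -> str:
    """Conditional edge: 'useful' ends the graph; failures loop back to generate."""
    grounded = grounding_grader.invoke({
        "documents": "\n\n".join(state["documents"]),
        "generation": state["generation"],
    })
    if grounded.binary_score == "no":
        return "not_grounded"  # regenerate with an explicit grounding instruction
    useful = usefulness_grader.invoke({
        "question": state["question"], "generation": state["generation"],
    })
    return "useful" if useful.binary_score == "yes" else "not_useful"
```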
Task 06 — Adaptive RAG Logic Extend your query router to output a three-way classification: no_retrieval, single_pass, or iterative. The no_retrieval path bypasses the vectorstore and calls the LLM directly. The single_pass path follows the standard retrieve-grade-generate flow. The iterative path enters a loop that retrieves, grades, generates a partial answer, identifies gaps in that answer using an LLM, reformulates the query to address those gaps, and retrieves again — repeating until the LLM judges the accumulated context sufficient or the loop cap is reached. Demonstrate each mode on at least two example queries and include traces in your notebook.
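One way to express the three-way classifier and the gap check that drives the iterative loop. The strategy labels match the task; the prompt wording and the GapCheck schema are illustrative:

```python
from typing import Literal

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class ComplexityDecision(BaseModel):
    strategy: Literal["no_retrieval", "single_pass", "iterative"] = Field(
        description="Retrieval strategy matched to the question's complexity."
    )


classifier = ChatPromptTemplate.from_messages([
    ("system",
     "Classify the question: 'no_retrieval' for simple self-contained facts, "
     "'single_pass' for standard lookups, 'iterative' for multi-hop questions "
     "that require synthesizing several sources."),
    ("human", "{question}"),
]) | ChatOpenAI(
    model="gpt-3.5-turbo", temperature=0
).with_structured_output(ComplexityDecision)


class GapCheck(BaseModel):
    """Drives the iterative loop: stop, or retrieve again with a new query."""
    sufficient: bool = Field(
        description="True if the accumulated context fully answers the question."
    )
    follow_up_query: str = Field(
        description="Reformulated query targeting the remaining gap, if any."
    )

# In the iterative path, a gap-check node invokes an LLM with GapCheck as its
# structured output; a conditional edge then ends the loop when sufficient is
# True (or the loop cap is hit), and otherwise loops back to retrieve with
# follow_up_query as the new question.
```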
Task 07 — Evaluation & Analysis Construct a test set of at least 20 queries spanning: simple factual questions, questions requiring vectorstore knowledge, questions requiring up-to-date web information, and complex multi-hop questions. Run your full pipeline on all queries. For each query, record: the routing decision, the number of documents retrieved and the number retained after grading, whether a correction loop was triggered, whether the hallucination or usefulness grader failed, and a final quality score. Report aggregate metrics: routing accuracy (manually labeled), retrieval precision@5, correction trigger rate, hallucination rate, and mean answer quality. Use RAGAS if possible; a custom LLM-as-judge scoring script is also acceptable with justification.
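If you use RAGAS, the core call is a one-liner over a dataset built from your per-query logs. A minimal sketch, assuming a ragas version whose evaluate() accepts a HuggingFace Dataset with these column names; the schema has changed across releases, so check the docs for your pinned version:

```python
# pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# One row per test query, assembled from the logs described above.
records = {
    "question": ["What problem does corrective RAG solve?"],
    "answer": ["It recovers from failed retrieval by rewriting the query and falling back to web search."],
    "contexts": [["CRAG (Yan et al., 2024) adds a correction step when graded retrieval quality is low..."]],
    "ground_truth": ["Recovery from poor retrieval via query rewriting and web search fallback."],
}
result = evaluate(
    Dataset.from_dict(records),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # aggregate score per metric
```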
Technical Requirements
Framework: LangGraph (≥ 0.1) + LangChain. A simple linear LangChain chain is not acceptable — the graph structure with conditional edges, state management, and loops is mandatory and will be verified during grading.
LLM: GPT-4o, Claude 3 Sonnet, or Gemini 1.5 Pro for graders and generators. A smaller model (e.g. GPT-3.5-turbo or Claude Haiku) may be used for the router and query rewriter to reduce API costs.
Vector Store: Chroma or FAISS for local development; Pinecone or Weaviate for cloud deployment. Your vectorstore must be populated with a real document corpus of at least 50 documents on a coherent topic of your choice (see the population sketch after this list).
Embeddings: OpenAI text-embedding-3-small, Cohere embed-v3, or a local HuggingFace sentence-transformer model.
Web Search Fallback: Tavily Search API (free tier available at tavily.com). Bing Search API or SerpAPI are acceptable alternatives with justification in your report.
Structured Outputs: All LLM-powered graders and routers must use Pydantic models via .with_structured_output(). Raw string parsing is not acceptable.
Code Quality: Type hints throughout, docstrings on all node functions, a requirements.txt with pinned versions, a .env.example file listing required API keys, and a runnable main.py or top-level notebook demo. The project must run end-to-end after pip install -r requirements.txt without manual edits.
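As referenced under Vector Store above, a minimal population sketch for the local Chroma option. The corpus path, chunking parameters, and k are illustrative defaults to tune for your corpus:

```python
from langchain_chroma import Chroma
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the raw corpus (>= 50 documents) from a local folder of .txt files.
docs = DirectoryLoader("corpus/", glob="**/*.txt", loader_cls=TextLoader).load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

vectorstore = Chroma.from_documents(
    chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_db",  # persists locally between runs
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
```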
Grading Rubric
| Component | Criteria | Points |
| --- | --- | --- |
| Graph architecture & state design | Clean TypedDict state schema; correctly wired nodes and conditional edges; no dead-end states; graph is visualisable | 15 |
| Query routing | Router correctly classifies queries; Pydantic structured output used; routing logic is explainable and logged | 15 |
| Retrieval grading | Grader evaluates documents independently; binary judgment returned; irrelevant docs filtered before generation | 15 |
| Corrective RAG loop | Query rewriting implemented; Tavily fallback working; loop terminates safely; correction rate logged | 20 |
| Adaptive RAG modes | All three retrieval strategies implemented; demonstrated on ≥2 examples each; traces included in notebook | 15 |
| Evaluation & results | ≥20 queries tested; routing accuracy, precision@5, correction rate, hallucination rate reported; results discussed | 10 |
| Code quality & documentation | Typed, documented, runnable code; README with setup instructions; .env.example included | 10 |
| **Total** | | **100** |
Deliverables
1. Source Code (GitHub Repository) Submit a link to a public or private GitHub repository. The repository must contain a clean, well-organised Python project with a descriptive README that covers: project overview, architecture diagram, setup instructions, environment variables required, how to populate the vectorstore, and how to run the demo. Code must be committed with meaningful commit messages — a single monolithic commit will be penalised.
2. Demo Notebook A Jupyter notebook (demo.ipynb) with a complete end-to-end walkthrough of the system. It must include: the rendered graph diagram using LangGraph's visualisation tools, annotated traces of at least five representative query runs (one per routing path, one triggering CRAG correction, one triggering adaptive iterative retrieval), and inline commentary explaining what the graph is doing at each step and why. The notebook must be fully executed with outputs visible — do not submit an unrun notebook.
3. Evaluation Report A PDF or well-formatted Markdown report of minimum 5 pages covering: system architecture and design rationale (why did you make the component choices you did?), quantitative evaluation results with tables and charts, error analysis (which query types fail and why?), ablation discussion (what happens if you remove the grader or the correction loop?), and limitations and future improvements. The report should be written as if for a technical reader who has not seen your code.
4. Demo Video A 5–8 minute screen recording walking through your system live. The video must show: your graph structure and explain the architecture, at least one query that follows the standard path end-to-end, at least one query that triggers the corrective RAG loop (show the rewritten query and the Tavily fallback), and at least one example of each adaptive retrieval mode. Narrate your explanation clearly. Upload to YouTube (unlisted) or Google Drive and include the link in your README.
Academic Integrity Note: You are permitted and encouraged to reference tutorials, the LangGraph documentation, and published research papers. However, submitting code copied wholesale from a tutorial without substantial modification, or code generated entirely by an AI assistant without genuine understanding, constitutes a violation of academic integrity policy. During grading, you may be asked to explain any part of your submission. If you cannot explain a component, it will receive zero credit regardless of whether it functions correctly.
Bonus Challenges (+10 pts each, max +20)
★ Self-RAG Integration Replace the binary yes/no grader with the Self-RAG paper's fine-grained critique token framework. Implement the three critique dimensions — IsREL (is the retrieved passage relevant?), IsSUP (is the generated statement supported by the retrieved passage?), and IsUSE (is the overall response useful?) — as separate grader prompts. Use the combined scores to make more nuanced routing decisions rather than a hard binary threshold. Reference: Asai et al., Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection, ICLR 2024.
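A possible shape for the three critique outputs, following the paper's label sets. The weighting function is a hypothetical starting point, not the paper's learned critique-token decoding scheme:

```python
from typing import Literal

from pydantic import BaseModel


class IsREL(BaseModel):
    """Is the retrieved passage relevant to the question?"""
    score: Literal["relevant", "irrelevant"]


class IsSUP(BaseModel):
    """Is the generated statement supported by the passage?"""
    score: Literal["fully_supported", "partially_supported", "no_support"]


class IsUSE(BaseModel):
    """How useful is the overall response, on a 1-5 scale?"""
    score: Literal[1, 2, 3, 4, 5]


def combined_score(rel: IsREL, sup: IsSUP, use: IsUSE) -> float:
    """Hypothetical weighting; tune the weights on validation queries."""
    rel_w = 1.0 if rel.score == "relevant" else 0.0
    sup_w = {"fully_supported": 1.0, "partially_supported": 0.5, "no_support": 0.0}[sup.score]
    return 0.4 * rel_w + 0.4 * sup_w + 0.2 * (use.score / 5)
```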
★ LangGraph Studio / Cloud Deployment Deploy your graph as a persistent, stateful API endpoint using LangGraph Studio (local) or LangGraph Cloud. The deployed system must accept a query via HTTP POST and return the full traced execution including all intermediate states. Share the deployment URL or a working local Studio screenshot in your submission. Provide a curl example in your README demonstrating a live query.
★ Multi-agent Collaboration Refactor your system so that retrieval and generation are handled by separate sub-agents, each with their own LangGraph graph, that communicate through LangGraph's inter-graph message passing. The orchestrator graph should dispatch to the retrieval agent, receive graded documents, and dispatch to the generation agent. Document the communication protocol between agents and discuss the tradeoffs of this architecture versus a monolithic graph.
Recommended Reading
Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS 2020 — the original RAG paper; read this first.
Shi et al., REPLUG: Retrieval-Augmented Black-Box Language Models, 2023 — on treating retrieval as a plug-in module for frozen language models.
Asai et al., Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection, ICLR 2024 — introduces fine-grained critique tokens for grading.
Yan et al., Corrective Retrieval Augmented Generation, arXiv 2024 — the paper your CRAG implementation is based on; read carefully.
Jeong et al., Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity, arXiv 2024 — the basis for your adaptive routing component.
LangGraph documentation — Agentic RAG tutorial at langchain-ai.github.io/langgraph — essential reference for implementation.
LangChain blog — LangGraph: Multi-Agent Workflows — good overview of graph design patterns.
Frequently Asked Questions
Can I use a different graph framework, such as LlamaIndex Workflows or AutoGen? No. LangGraph is mandatory for this assignment. One of the learning objectives is gaining hands-on experience with stateful graph orchestration as implemented in LangGraph specifically.
What should my vectorstore document corpus be about? Any coherent topic with enough publicly available text to populate at least 50 documents. Suggested options: arXiv abstracts on a subfield of ML, Wikipedia articles on a historical period, SEC 10-K filings for a sector, or a set of technical documentation pages. Avoid trivial corpora (e.g. a single short document split into chunks).
How strict is the 20-query evaluation minimum? It is a floor, not a target. A stronger submission will test on 40–50 queries and include stratified analysis across query types. Diversity of queries matters more than raw count.
Can I use LangSmith for tracing? Yes, and it is strongly encouraged. LangSmith traces make it far easier to debug grading decisions and routing logic. Include LangSmith trace links or screenshots in your report where relevant.
Need Help With This Assignment?
CodersArts can help you build it.
Agentic RAG is one of the most technically demanding topics in applied AI engineering right now. The combination of LangGraph's stateful graph model, multi-step LLM grading, corrective loops, and RAGAS evaluation involves a lot of moving parts — and it is completely normal to get stuck. Whether you are struggling with conditional edge wiring, debugging an infinite loop in your CRAG implementation, getting structured outputs to parse correctly, or interpreting your evaluation metrics, CodersArts provides expert, one-on-one guidance from engineers with hands-on experience building production RAG systems.
CodersArts does not just finish your assignment for you — they help you genuinely understand the architecture, debug your specific code, explain the research papers your implementation is based on, and prepare you to discuss your design decisions confidently. Students who work with CodersArts consistently submit stronger reports, produce more robust implementations, and come away with skills they can actually use in industry.
CodersArts can help with:
LangGraph state graph design and debugging
Retrieval grading prompt engineering
Corrective RAG and adaptive routing implementation
Tavily and vectorstore integration
RAGAS evaluation setup and interpretation
Python AI/ML project architecture
NLP and LLM coursework across all levels
End-to-end AI pipeline development and deployment
Similar assignments CodersArts has helped with: LangChain agent development, LlamaIndex RAG pipelines, fine-tuning with LoRA/QLoRA, vector database integration (Pinecone, Weaviate, Chroma), LLM evaluation frameworks, multi-agent system design, and production ML deployment on AWS and GCP.
Visit codersarts.com to get expert assignment help today — and build something you are genuinely proud of.