
Debugging, Incident Response, and Postmortem for LLM Systems




Course: LLM Observability — From Traces to Incident Response

Chapters Covered: 7 – 10 (Trace-Based Debugging & Replay, Incident Response, LangSmith vs Langfuse in Real Teams, Final Lab & Packaging)

Level: Medium → Advanced

Type: Individual Assignment

Duration: 7 – 10 days

Prerequisite: Familiarity with trace schemas, metrics, and tracing concepts from Chapters 1–6





Objective

By the end of this assignment you will be able to:


  1. Load, inspect, and diagnose production trace failures across multiple failure modes (retrieval mismatch, hallucination, wrong prompt version, tool timeout, latency SLA breach).

  2. Implement a trace replay engine that re-runs a request with overrides (different prompt version, different model) and compares results side-by-side.

  3. Classify LLM-specific incidents by type and severity using the five incident classes and three severity levels from the course.

  4. Design circuit breakers and mitigation strategies for stop-the-bleeding scenarios.

  5. Write a production-grade postmortem following the course template.

  6. Execute a complete end-to-end incident drill — from alert to diagnosis to mitigation to postmortem.





Problem Statement

Your team's RAG customer-support bot has been in production for three weeks. The observability stack you built (Assignment 1) is now generating traces, metrics, and alerts. This morning, three alerts fired in quick succession:


  1. Quality proxy rate dropped from 92% to 68% over the last hour.

  2. p95 latency spiked from 1,800 ms to 9,200 ms.

  3. Tool success rate dropped from 97% to 72%.


You are the on-call engineer. Your job is to diagnose the root causes, mitigate the impact, and write a postmortem so this class of failure doesn't recur.




Provided Assets

You will use the following assets from the course repository:



| Asset | Location | Relevance |
|---|---|---|
| Pre-built trace files (6 traces) | traces/ch07_debug/ | Real failure scenarios to diagnose |
| Replay module | src/replay/replay_trace.py | Reference for replay functions |
| Incident runbook | runbooks/incident_runbook.md | 5-step triage + incident classes |
| Postmortem template | runbooks/postmortem_template.md | Structure for your final postmortem |
| Example postmortem | eval/postmortem_example.json | Reference for tone and detail level |
| Failure injection guide | eval/failure_injection_demo.md | Scenarios for injecting failures |
| Course notebooks (ch07–ch10) | notebooks/ | Concepts and worked examples |
| Observability contract | runbooks/observability_contract.json | Naming and severity definitions |





Tasks & Requirements




Task 1: Trace Inspection & Failure Diagnosis (20 Marks)


Context: Chapter 7 introduced the six trace files in traces/ch07_debug/, each representing a different failure mode that can occur in production RAG systems.


What to do:


  1. Load all six trace files from traces/ch07_debug/:


  • trace-a-normal-success.json (healthy baseline)

  • trace-b-retrieval-mismatch.json

  • trace-c-wrong-prompt-ver.json

  • trace-d-tool-timeout.json

  • trace-e-hallucination.json

  • trace-f-slow-response.json



  2. For each trace, extract and display in a summary table:


  • trace_id

  • User query

  • Number of retrieved docs and their relevance scores

  • Prompt version used

  • Model name

  • Latency (ms)

  • Whether an error occurred

  • A 1-line diagnosis of what went wrong (or "Healthy" for trace-a)



  3. Deep-dive analysis — for each of the five failure traces (b through f), write a detailed diagnosis (150–250 words each) that answers:


  • What is the symptom? (what does the user experience?)

  • What is the root cause? (what went wrong technically?)

  • Which observability signal reveals the problem? (retrieval scores? prompt version? latency? token count?)

  • Would traditional HTTP monitoring have caught this? (yes/no and why)



  4. Rank the five failures by severity (most impactful to least) and justify your ranking in one paragraph.



Deliverable: Trace loading code, summary table, five written diagnoses, severity ranking with justification.
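The loading-and-summary step can be sketched as below. The field names (trace_id, user_query, retrieved_docs, prompt_version, latency_ms, error) are assumptions about the trace schema — adjust them to whatever the files in traces/ch07_debug/ actually contain.

```python
import json
from pathlib import Path

# NOTE: the keys read here are assumed field names, not the guaranteed
# schema -- open one trace file first and rename as needed.
def summarize_trace(trace: dict) -> dict:
    """Flatten one trace into a summary-table row, tolerating missing keys."""
    docs = trace.get("retrieved_docs", [])
    return {
        "trace_id": trace.get("trace_id", "unknown"),
        "query": trace.get("user_query", ""),
        "n_docs": len(docs),
        "min_relevance": min((d.get("score", 0.0) for d in docs), default=None),
        "prompt_version": trace.get("prompt_version"),
        "model": trace.get("model"),
        "latency_ms": trace.get("latency_ms"),
        "error": trace.get("error") is not None,
    }

def load_summaries(trace_dir: str) -> list[dict]:
    """Load every *.json trace in a directory and summarize each one."""
    return [summarize_trace(json.loads(p.read_text()))
            for p in sorted(Path(trace_dir).glob("*.json"))]
```

Rendering load_summaries("traces/ch07_debug/") as a table (for example via pandas) gives most of the required columns; the 1-line diagnosis column is added by hand after inspecting each trace.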




Task 2: Trace Replay Engine (20 Marks)

Context: Chapter 7 demonstrated how replay lets you re-run a failed request with modifications (different prompt version, different model) to verify your hypothesis about the root cause — without needing the original user to repeat their query.


What to do:


  1. Implement the following functions (you may reference src/replay/replay_trace.py but must write your own working version):


  • save_trace_record(trace, filepath) — serialize a trace to JSON.

  • load_trace_record(filepath) — load a trace from JSON.


  • replay_trace(trace, overrides) — re-execute the LLM call from a saved trace with optional overrides:


    • prompt_version (swap to a different prompt)

    • model (e.g., swap gpt-4o to gpt-4o-mini)

    • temperature (adjust generation parameters)


  • compare_traces(original, replayed) — produce a side-by-side comparison table showing differences in: prompt version, model, latency, token count, response text (first 200 chars), and quality score.



  2. Replay Experiment 1 — Fix the wrong prompt version:


  • Load trace-c-wrong-prompt-ver.json.

  • Replay with prompt_version = "v2.0" (the correct version).

  • Compare original vs replayed trace side-by-side.

  • Confirm the response quality improves.



  3. Replay Experiment 2 — Fix the slow response:


  • Load trace-f-slow-response.json.

  • Replay with model = "gpt-4o-mini" (cheaper, faster model).

  • Compare latency and cost between the two.

  • Discuss: does the model downgrade affect answer quality?



  4. Bulk Replay for Regression Testing:


  • Implement a bulk_replay(trace_dir, overrides) function that:


    • Loads all traces from a directory.

    • Replays each with the given overrides.

    • Produces a summary table showing pass/fail for each trace.


  • Run bulk replay on all six traces with prompt_version = "v2.0".

  • Identify which traces improve, which stay the same, and which degrade.



Note: If you do not have an OpenAI API key, you may implement replay as a simulation that modifies the trace metadata and generates a mock response. The function signatures, comparison logic, and bulk replay flow must still be fully implemented. State this clearly in your report.



Deliverable: All four functions implemented, two replay experiments with comparison tables, bulk replay summary, and analysis.





Task 3: Incident Classification & Severity Assignment (15 Marks)

Context: Chapter 8 defined five LLM-specific incident classes (RETRIEVAL_OUTAGE, QUALITY_DEGRADATION, PERFORMANCE_DEGRADATION, COST_SPIKE, DATA_INCIDENT) and three severity levels (SEV-1, SEV-2, SEV-3) with response time expectations.


What to do:


  1. Reproduce the incident classification framework from the course by creating a data structure (dict, dataclass, or Pydantic model) that defines all five incident classes with:


  • Class name and description

  • At least 3 symptoms for each class

  • Typical root causes

  • Detection method (which metric/alert reveals it)



  2. Classify 8 simulated scenarios. For each scenario below, determine the incident class and severity level. Justify each decision in 2–3 sentences.


| # | Scenario |
|---|---|
| S1 | The vector database returns an empty result set for every query across all customers. |
| S2 | A prompt version rollback caused the LLM to start responses with "As an AI assistant, I'd be happy to help!" — violating brand tone guidelines. |
| S3 | A new model deployment increased p95 latency from 1.5s to 4.2s, still within the 5s warning threshold. |
| S4 | Token usage per request tripled overnight due to a retry loop in the tool-calling middleware. |
| S5 | A customer's email address appeared in a trace record stored in LangSmith. |
| S6 | The SQL tool is returning timeout errors for 30% of queries involving the financials table. |
| S7 | Hallucination rate increased from 3% to 28% after switching to a fine-tuned model variant. |
| S8 | Latency for enterprise-tier customers is 200ms higher than free-tier due to an extra retrieval step — this is by design and within SLA. |



  3. Design a triage decision tree (text-based or diagram) that an on-call engineer could follow to classify any incoming alert into the correct incident class and severity within 5 minutes.


Deliverable: Incident classification data structure, 8 scenario classifications with justifications, triage decision tree.
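One way to structure the framework is a dataclass per incident class. The entry shown below is a worked example with illustrative symptoms and root causes — they are not the course's canonical definitions, so check them against runbooks/incident_runbook.md.

```python
from dataclasses import dataclass

@dataclass
class IncidentClass:
    name: str
    description: str
    symptoms: list[str]      # at least 3 per class
    root_causes: list[str]
    detection: str           # which metric/alert reveals it

# One fully worked entry; fill in the remaining four classes
# (QUALITY_DEGRADATION, PERFORMANCE_DEGRADATION, COST_SPIKE,
# DATA_INCIDENT) the same way.
INCIDENT_CLASSES = {
    "RETRIEVAL_OUTAGE": IncidentClass(
        name="RETRIEVAL_OUTAGE",
        description="Vector search returns no (or irrelevant) documents.",
        symptoms=["empty result sets",
                  "relevance scores near 0.0",
                  "answers that ignore the knowledge base"],
        root_causes=["index corruption", "embedding service outage",
                     "misconfigured collection name"],
        detection="retrieval hit-rate / relevance-score alert",
    ),
}
```

Keeping the framework as data rather than prose pays off in Task 6, where the triage step can look classes up programmatically.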




Task 4: Circuit Breakers & Mitigation Strategies (15 Marks)

Context: Chapter 8 introduced circuit breakers as immediate "stop-the-bleeding" actions that can be taken before the root cause is fully understood.


What to do:


  1. For each of the five incident classes, design a circuit breaker that can be activated within minutes. Specify:


  • Trigger condition (what metric value or alert activates it)

  • Action (what the circuit breaker does — e.g., disable a tool, fall back to cache, downgrade model)

  • User impact (what the user experiences while the circuit breaker is active)

  • Rollback procedure (how to deactivate the circuit breaker once the root cause is fixed)


Present this as a table:


| Incident Class | Trigger | Action | User Impact | Rollback |
|---|---|---|---|---|
| RETRIEVAL_OUTAGE | ... | ... | ... | ... |
| QUALITY_DEGRADATION | ... | ... | ... | ... |
| PERFORMANCE_DEGRADATION | ... | ... | ... | ... |
| COST_SPIKE | ... | ... | ... | ... |
| DATA_INCIDENT | ... | ... | ... | ... |



  2. Implement 2 circuit breakers in code:


Choose any two incident classes and implement their circuit breakers as Python functions. Each function should:


  • Accept the current metric values (or a trace object) as input.

  • Check against the trigger condition.

  • Return a decision object: {"activate": True/False, "action": "description", "fallback_response": "..."}


Example:



def retrieval_circuit_breaker(retrieved_docs, min_required=1):
    """Activate if retrieval returns fewer than min_required documents."""
    activate = len(retrieved_docs) < min_required
    return {"activate": activate,
            "action": "serve cached fallback answer" if activate else "none",
            "fallback_response": "Sorry, I can't look that up right now." if activate else None}



  3. Simulate a circuit breaker activation:


  • Generate a batch of 20 traces where 8 exhibit the failure condition.

  • Run your circuit breaker function on each trace.

  • Show how many traces would have been intercepted and what fallback the user would have received.


Deliverable: Circuit breaker design table (all 5 classes), 2 implemented circuit breaker functions, simulation output with analysis.
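The simulation step can be sketched as follows, using a retrieval-outage breaker and an assumed retrieved_docs trace field. The trigger condition and fallback text are illustrative choices, not prescribed values.

```python
import random

def retrieval_breaker(trace: dict, min_docs: int = 1) -> dict:
    """Decision object for a retrieval-outage circuit breaker."""
    activate = len(trace.get("retrieved_docs", [])) < min_docs
    return {"activate": activate,
            "action": "serve cached FAQ answer" if activate else "none",
            "fallback_response": ("Sorry, our knowledge base is temporarily "
                                  "unavailable.") if activate else None}

# Build 20 synthetic traces, 8 of which exhibit the failure (empty retrieval).
random.seed(7)
traces = ([{"trace_id": f"t{i:02d}", "retrieved_docs": []} for i in range(8)]
          + [{"trace_id": f"t{i:02d}", "retrieved_docs": [{"score": 0.8}]}
             for i in range(8, 20)])
random.shuffle(traces)

decisions = [retrieval_breaker(t) for t in traces]
intercepted = sum(d["activate"] for d in decisions)
print(f"Circuit breaker intercepted {intercepted}/{len(traces)} traces")  # 8/20
```

The same loop works for any breaker that takes a trace and returns the decision object, so the second implemented breaker can reuse it unchanged.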




Task 5: Postmortem Writing (15 Marks)

Context: Chapter 8 and the runbooks/postmortem_template.md established that every SEV-1 and SEV-2 incident must result in a postmortem within 48 hours.


What to do:


Write a full postmortem document for the following incident scenario:


Incident: INC-2025-007 — Hallucination Spike After Model Update


On Monday at 09:15 UTC, the team deployed a fine-tuned variant of gpt-4o-mini to production. By 10:30 UTC, the quality proxy metric dropped from 91% to 64%. Customer support received 12 tickets reporting "the bot is making things up." At 11:00 UTC, the on-call engineer identified the issue via trace replay — the fine-tuned model was generating facts not present in the retrieved context. The team rolled back to the previous model at 11:20 UTC. Quality proxy recovered to 89% by 11:45 UTC.


Your postmortem must follow the course template and include:


  1. Header: Incident ID, title, severity, date, duration, author.

  2. Summary: 2–3 sentence overview.

  3. Impact: Number of affected users/requests, business impact, customer experience impact.

  4. Timeline: Minute-by-minute chronology from deployment to recovery (at least 8 entries).

  5. Root Cause Analysis: Technical explanation of why the fine-tuned model hallucinated (propose a plausible explanation).

  6. Detection: How was the incident detected? Which metric/alert fired first? What was the detection latency (time from incident start to first alert)?

  7. Mitigation: What was done to stop the bleeding? (Reference circuit breakers from Task 4.)

  8. Prevention — Action Items: At least 5 specific, actionable items to prevent recurrence. Each must have:


    • Description

    • Owner (use role names like "ML Engineer", "Platform Engineer")

    • Priority (P0 / P1 / P2)

    • Due date (relative, e.g., "within 1 week")


  9. Lessons Learned: At least 3 insights. What went well? What went poorly? What was lucky?

  10. Trace Evidence: Reference at least 2 specific traces (you may reference the course trace files or create synthetic ones) showing:

    • A trace from before the deployment (healthy)

    • A trace from during the incident (hallucinating)


Deliverable: A complete postmortem document in Markdown format (1,500–2,500 words).




Task 6: End-to-End Incident Drill (15 Marks)

Context: Chapter 10 (Final Lab) demonstrated the complete lifecycle: healthy request → injected failure → alert fires → diagnosis via replay → mitigation → postmortem. This task asks you to execute a mini version of that lifecycle independently.


What to do:


Design and execute a tabletop incident drill that simulates the following scenario end-to-end:


  1. Baseline State — Show your system operating normally:


  • Generate or load 10 healthy traces.

  • Compute metrics (p95 latency, failure rate, quality proxy) — all should be within green thresholds.

  • Print: "System status: HEALTHY ✓"



  2. Failure Injection — Introduce a failure:


  • Choose one failure mode: retrieval outage, latency spike, or quality degradation.

  • Modify 6 of the 10 traces to exhibit the chosen failure (e.g., set all retrieval scores to 0.0 for a retrieval outage, or set latency to 12,000 ms for a latency spike).

  • Recompute metrics — at least one metric should now breach the warning or critical threshold.

  • Print: "ALERT: [metric_name] breached [threshold_level] — current value: [value]"



  3. Triage — Follow your decision tree from Task 3:


  • Classify the incident (class and severity).

  • Print triage output: incident class, severity, recommended response time.



  4. Diagnosis via Replay — Use your replay engine from Task 2:


  • Select one of the failed traces.

  • Replay with an appropriate override (e.g., force retrieval from backup index, use different model).

  • Compare original vs replayed trace.

  • Print: "Root cause identified: [description]"



  5. Mitigation — Activate the relevant circuit breaker from Task 4:


  • Run the circuit breaker on all failed traces.

  • Show how many would be intercepted.

  • Print: "Circuit breaker activated — [X] requests redirected to fallback"



  6. Recovery — Remove the failure and show metrics returning to normal:


  • Restore the original healthy traces.

  • Recompute metrics — all should be green.

  • Print: "System status: RECOVERED ✓"



  7. Postmortem Reference — Reference your postmortem from Task 5 (or write a 200-word mini-postmortem for this drill scenario).


The entire drill should be executable in a single notebook section with clear output at each step.


Deliverable: Drill code in a single executable notebook section, with printed output at each stage, and a brief reflection (200–300 words) on what the drill revealed about your system's observability maturity.
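The metric/alert backbone of the drill can be sketched as below. The threshold values and the per-trace fields (latency_ms, error, quality) are assumptions — take the real thresholds from runbooks/observability_contract.json.

```python
# Assumed green thresholds -- replace with the values from
# runbooks/observability_contract.json.
GREEN = {"p95_latency_ms": 3000, "failure_rate": 0.05, "quality_proxy": 0.85}

def compute_metrics(traces: list[dict]) -> dict:
    """Assumed per-trace fields: latency_ms, error, quality (0-1)."""
    lat = sorted(t["latency_ms"] for t in traces)
    return {
        "p95_latency_ms": lat[min(len(lat) - 1, int(0.95 * len(lat)))],
        "failure_rate": sum(bool(t.get("error")) for t in traces) / len(traces),
        "quality_proxy": sum(t["quality"] for t in traces) / len(traces),
    }

def breaches(metrics: dict, thresholds: dict) -> list[str]:
    """Names of metrics outside their thresholds (empty list == healthy)."""
    out = []
    if metrics["p95_latency_ms"] > thresholds["p95_latency_ms"]:
        out.append("p95_latency_ms")
    if metrics["failure_rate"] > thresholds["failure_rate"]:
        out.append("failure_rate")
    if metrics["quality_proxy"] < thresholds["quality_proxy"]:
        out.append("quality_proxy")
    return out
```

Each drill stage then becomes compute_metrics followed by breaches: an empty list prints "System status: HEALTHY ✓", a non-empty one prints the ALERT line with the breached metric and its current value.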




Deliverables Summary

You must submit:




1. Code (Required)


  • Jupyter Notebook (.ipynb) with clearly labelled sections for each task (Task 1 through Task 6).

  • Well-commented code with meaningful variable names.

  • Each task should be runnable independently (Task 6 may depend on functions from Tasks 2–4).




2. Postmortem Document (Required)


  • postmortem_INC-2025-007.md — the full postmortem from Task 5.

  • Must follow the course template structure.

  • Format: Markdown (.md)




3. Report (Required)

A short report (4–6 pages) including:


  • Approach for each task.

  • Observations from trace analysis and replay experiments.

  • Triage decision tree (from Task 3 — may be embedded or attached).

  • Circuit breaker design rationale (from Task 4).

  • Drill reflection (from Task 6).

  • Key learnings about LLM-specific incident response vs traditional incident response.


Format: PDF or DOCX




4. Output Samples (Required)

Include:


  • Trace summary table from Task 1.

  • Replay comparison tables from Task 2.

  • Incident classification table from Task 3.

  • Circuit breaker simulation output from Task 4.

  • Drill output log from Task 6.





Submission Guidelines




Submit via your LMS (e.g., Moodle / Google Classroom / institutional portal).




File Naming Convention

<YourName>_LLM_Observability_Assignment2.zip




Inside the ZIP


  • /notebook.ipynb

  • /postmortem_INC-2025-007.md

  • /report.pdf

  • /screenshots/              (optional — trace screenshots, diagrams)




Deadline

Submit within 10 days from assignment release date.




Late Submission Policy

| Delay | Penalty |
|---|---|
| Up to 24 hours late | 10% deduction |
| 24–48 hours late | 20% deduction |
| Beyond 48 hours | Not accepted |





Important Instructions


  1. Do NOT copy-paste code from the course notebooks without understanding it. You must adapt and extend the examples.

  2. You are expected to use the six trace files provided in traces/ch07_debug/. Do not fabricate trace analysis — load the actual files.

  3. If you do not have an OpenAI API key for replay experiments, you may use simulated replay (modify trace metadata and generate mock responses). Clearly state this in your report.

  4. Your postmortem must be written in a professional, blameless tone — focus on systems and processes, not individuals.

  5. Plagiarism will result in disqualification and referral to the academic integrity committee.





Evaluation Rubric


| Criteria | Marks | What the Evaluator Looks For |
|---|---|---|
| Task 1: Trace Inspection & Diagnosis | 20 | All 6 traces loaded correctly; accurate diagnosis for each failure mode; clear symptom→root-cause reasoning; severity ranking justified. |
| Task 2: Trace Replay Engine | 20 | All functions implemented; both replay experiments show meaningful comparison; bulk replay summary complete; analysis discusses trade-offs. |
| Task 3: Incident Classification | 15 | Framework covers all 5 classes; all 8 scenarios classified correctly with justification; triage decision tree is practical and complete. |
| Task 4: Circuit Breakers | 15 | Design table covers all 5 classes; 2 circuit breakers implemented correctly; simulation shows interception logic working. |
| Task 5: Postmortem | 15 | Follows course template; all 10 sections present; timeline is detailed; action items are specific and actionable; blameless tone. |
| Task 6: End-to-End Drill | 15 | All 7 drill stages executed with clear output; metrics change correctly; triage and replay integrated; reflection is insightful. |
| Total | 100 | |




Bonus (Optional — up to +10 Marks)


  • +3 marks: Implement a LangSmith vs Langfuse comparison for your trace analysis — load the same trace into both data models and discuss which gives better debugging visibility (references Chapter 9 concepts).

  • +3 marks: Create a runbook addendum for a novel incident class not covered in the course (e.g., prompt injection attack, multi-turn session contamination) with symptoms, detection, and mitigation.

  • +4 marks: Build a mini dashboard (using matplotlib, plotly, or streamlit) that visualises the incident drill in real-time — showing metrics transitioning from green → red → green as the drill progresses.





Guidance & Tips


  • Read the trace files first — understanding the six failure modes is the foundation for everything else.

  • Tasks 1 and 2 build on each other — diagnose first, then replay to verify your diagnosis.

  • Tasks 3 and 4 build on each other — classify the incident, then design the response.

  • Task 5 is a writing task — budget time for a well-structured, detailed postmortem. The best postmortems read like incident reports that a VP of Engineering would share with the team.

  • Task 6 ties everything together — think of it as a dress rehearsal. If your drill runs smoothly end-to-end, you've demonstrated production-readiness.

  • Refer to runbooks/incident_runbook.md for the triage framework and runbooks/postmortem_template.md for the postmortem structure.

  • Refer to eval/course_completion_checklist.md to verify you've covered all expected skills.





Instructor Note

This assignment is designed to simulate the operational reality of running an LLM application in production. Coding skill alone is not sufficient — you must demonstrate:


  • Diagnostic reasoning: Can you look at a trace and identify what went wrong?

  • Systematic response: Can you follow a structured process (triage → diagnose → mitigate → postmortem) under pressure?

  • Communication quality: Can you write a postmortem that a cross-functional team (engineering, product, leadership) can understand?

  • Systems thinking: Do you understand how instrumentation, metrics, debugging, and incident response connect into a single lifecycle?


There is no single correct answer for most tasks. What matters is the quality of your reasoning, the clarity of your communication, and the completeness of your approach.


The best submissions will demonstrate that you can be trusted as the on-call engineer for a production LLM system.





Call to Action

Ready to transform your business with AI-powered intelligence that accelerates insights, enhances decision-making, and unlocks the full value of your data?


Codersarts is here to help you turn complex data workflows into efficient, scalable, and evidence-driven AI systems that empower teams to make smarter, faster, and more confident decisions.


Whether you’re a startup looking to build AI-driven products, an enterprise aiming to optimize operations through data science, or a research organization advancing innovation with intelligent data solutions, we bring the expertise and experience needed to design, develop, and deploy impactful AI systems that drive measurable business outcomes.




Get Started Today



Schedule an AI & Data Science Consultation:

Book a 30-minute discovery call with our AI strategists and data science experts to discuss your challenges, identify high-impact opportunities, and explore how intelligent AI solutions can transform your workflows and performance.




Request a Custom AI Demo:

Experience AI in action with a personalized demonstration built around your business use cases, datasets, operational environment, and decision workflows — showcasing practical value and real-world impact.









Transform your organization from data accumulation to intelligent decision enablement — accelerating insight generation, improving operational efficiency, and strengthening competitive advantage.


Partner with Codersarts to build scalable AI solutions including RAG systems, predictive analytics platforms, intelligent automation tools, recommendation engines, and custom machine learning models that empower your teams to deliver exceptional results.


Contact us today and take the first step toward next-generation AI and data science capabilities that grow with your business ambitions.



