
Instrumenting and Monitoring an LLM Application for Production



Course: LLM Observability — From Traces to Incident Response

Chapters Covered: 1–6 (Why LLM Observability, Environment Setup, LangSmith Setup, Langfuse Setup, Instrumentation Design, Metrics & Dashboards)

Level: Medium → Advanced

Type: Individual Assignment

Duration: 7 – 10 days





Objective

By the end of this assignment you will be able to:


  1. Articulate why traditional monitoring fails for LLM applications and identify the observability gap.

  2. Design a Pydantic-based trace schema that captures every critical signal (retrieval scores, prompt version, token usage, tool I/O, latency).

  3. Implement PII redaction so no sensitive data leaves your application boundary.

  4. Wire tracing into a FastAPI RAG endpoint using LangSmith and/or Langfuse SDKs.

  5. Collect the five north-star metrics (p95 latency, failure rate, tool success rate, cost per request, quality proxy rate) from a batch of traces.

  6. Define alert thresholds and reason about SLO trade-offs for a production system.





Problem Statement

You are the first LLM Observability Engineer hired at a startup that has shipped a Retrieval-Augmented Generation (RAG) customer-support bot. The bot is live and returns HTTP 200 on every request, yet customers are complaining about wrong answers, slow replies, and occasional data leaks. The traditional dashboards show "all green."


Your job is to retrofit a full observability stack onto the existing FastAPI application so the team can detect, measure, and triage quality problems before customers notice them.





Provided Assets

You may use any code, notebooks, and data files from the course repository:


Asset | Location | Relevance

Baseline RAG app | src/app.py | The FastAPI app you will instrument

LangSmith setup module | src/observability/langsmith_setup.py | Reference for init_langsmith()

Langfuse setup module | src/observability/langfuse_setup.py | Reference for init_langfuse()

Prometheus metrics module | src/utils/metrics.py | Reference for setup_metrics(), MetricsCollector

Course notebooks (ch01–ch06) | notebooks/ | Concepts and worked examples

Observability contract example | runbooks/observability_contract.json | Naming conventions and sampling rules

Course completion checklist | eval/course_completion_checklist.md | Self-check against expected skills


You are expected to reference and build upon these assets, not to start from scratch.





Tasks & Requirements




Task 1: Diagnosing the Observability Gap (10 Marks)

Context: Chapter 1 demonstrated that two prompt versions can both return HTTP 200 while delivering vastly different answer quality.


What to do:


  1. Create two prompt versions (v1 — verbose/chatty, v2 — concise/fact-based) for a domain of your choice (e.g., product pricing, refund policy, API docs).

  2. Send the same user query to both prompt versions via the OpenAI API (or any LLM API).

  3. Capture the raw HTTP status code and the actual response text for each.

  4. Write a quality scoring function that checks the response for specific expected facts (minimum 3 facts). Score each response out of 100.

  5. Present a comparison table showing:

    • HTTP status (both should be 200)

    • Quality score (should differ meaningfully)

    • At least 3 missing signals that traditional monitoring would not capture.


Deliverable: A clearly commented code section + a short write-up (200–300 words) explaining what signals are invisible to traditional monitoring and why this matters.
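The fact-based scorer in step 4 can be sketched as follows. This is a minimal illustration; the refund-policy facts and the two sample responses are invented for the example, not taken from the course repository:

```python
# Hypothetical quality scorer: checks a response for expected facts.
# Substring matching is the simplest possible check; a real scorer
# might use normalisation or fuzzy matching.

def score_response(response: str, expected_facts: list[str]) -> float:
    """Return a 0-100 score: the fraction of expected facts present."""
    if not expected_facts:
        return 0.0
    hits = sum(1 for fact in expected_facts if fact.lower() in response.lower())
    return 100.0 * hits / len(expected_facts)

facts = ["30-day refund window", "original payment method", "no restocking fee"]
verbose = "Thanks for asking! Refunds go back to the original payment method."
concise = ("Refunds are issued within the 30-day refund window, to the "
           "original payment method, with no restocking fee.")

print(score_response(verbose, facts))   # only one of three facts present
print(score_response(concise, facts))   # all three facts present
```

Both calls would have returned HTTP 200 from the LLM API, but the scores differ sharply, which is exactly the gap this task asks you to expose.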




Task 2: Trace Schema Design & Validation (15 Marks)

Context: Chapter 5 introduced TraceSchema as a Pydantic BaseModel with required fields, enums, and validators to guarantee every trace leaving your application carries the necessary metadata.


What to do:


  1. Define a Pydantic TraceSchema model that includes at minimum:


  • trace_id (UUID)

  • timestamp (ISO-8601)

  • environment (enum: dev, staging, production)

  • customer_tier (enum: free, pro, enterprise)

  • prompt_version (string, e.g., "v2.1")

  • model_name (string)

  • input_text (string, the user query after redaction)

  • output_text (string, the LLM response)

  • retrieved_docs (list of objects with content, relevance_score, source)

  • tool_calls (list of objects with tool_name, input, output, latency_ms, success)

  • total_tokens (int)

  • latency_ms (float)

  • error (optional string)

  • tags (list of strings)



  2. Write field validators for:


  • environment must be one of the allowed values.

  • latency_ms must be non-negative.

  • relevance_score for each retrieved doc must be between 0.0 and 1.0.



  3. Demonstrate the schema by:


  • Creating 3 valid trace objects (one healthy, one with a tool error, one with low retrieval scores).

  • Creating 2 invalid trace objects and showing the Pydantic validation errors.


Deliverable: Schema code, validation demos with printed outputs, and a brief explanation (150–200 words) of why schema enforcement matters in production.
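The validation logic the schema must enforce can be sketched without any dependencies, as below. In your submission, implement this with pydantic.BaseModel and field validators as the task requires; this dataclass version only illustrates the checks (UUID format, allowed environments, non-negative latency, bounded relevance scores) and uses a reduced field set:

```python
# Dependency-free sketch of the checks a Pydantic TraceSchema would
# enforce. Field names follow the task description; the full schema
# has many more fields than shown here.
import uuid
from dataclasses import dataclass, field

ALLOWED_ENVS = {"dev", "staging", "production"}

@dataclass
class Trace:
    trace_id: str
    environment: str
    latency_ms: float
    retrieved_docs: list = field(default_factory=list)

    def __post_init__(self):
        uuid.UUID(self.trace_id)  # raises ValueError if not a valid UUID
        if self.environment not in ALLOWED_ENVS:
            raise ValueError(f"environment must be one of {ALLOWED_ENVS}")
        if self.latency_ms < 0:
            raise ValueError("latency_ms must be non-negative")
        for doc in self.retrieved_docs:
            if not 0.0 <= doc["relevance_score"] <= 1.0:
                raise ValueError("relevance_score must be in [0.0, 1.0]")

# A healthy trace validates; a negative latency is rejected.
ok = Trace(str(uuid.uuid4()), "production", 950.0,
           [{"content": "refund policy", "relevance_score": 0.82, "source": "kb"}])
try:
    Trace(str(uuid.uuid4()), "production", -5.0)
except ValueError as e:
    print("rejected:", e)
```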




Task 3: PII Redaction Pipeline (15 Marks)

Context: Chapter 5 showed that traces can accidentally contain API keys, emails, phone numbers, and user IDs. Redaction must happen before traces leave the application.


What to do:


  1. Implement the following redaction functions:


  • redact_api_keys(text) — detect and mask patterns for OpenAI, LangSmith, and Langfuse API keys.

  • redact_pii(text) — detect and mask email addresses, phone numbers, and SSN-like patterns.

  • hash_user_id(user_id) — produce a deterministic SHA-256 hash so the ID can still be correlated across traces without exposing the raw value.



  2. Create a wrapper function strip_pii(payload: dict) -> dict that applies all redaction steps to every string field in a trace payload (including nested fields in retrieved_docs and tool_calls).



  3. Demonstrate with a before/after table on a realistic trace payload that contains:


  • At least 1 API key embedded in a retrieval document.

  • At least 1 email address in the user query.

  • At least 1 phone number in an LLM response.

  • A raw user ID.



  4. Write 3 unit tests (using pytest or plain assert statements) that verify:


  • API keys are fully masked.

  • Emails are redacted.

  • The same raw user ID always produces the same hash.



Deliverable: Redaction code, before/after demo, unit tests with passing output.
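A sketch of the redaction pipeline is shown below. The regular expressions are assumptions about key formats (e.g. that OpenAI keys start with "sk", LangSmith keys with "ls__"/"lsv2", and Langfuse keys with "pk-lf"/"sk-lf") and should be tuned against real traffic before relying on them:

```python
# Illustrative redaction sketch. Patterns are assumptions, not a
# complete PII taxonomy; SSNs are redacted before phone numbers so the
# broader phone pattern does not consume them first.
import hashlib
import re

API_KEY_RE = re.compile(r"\b(sk|ls__|lsv2|pk-lf|sk-lf)[-_A-Za-z0-9]{8,}")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_api_keys(text: str) -> str:
    return API_KEY_RE.sub("[REDACTED_API_KEY]", text)

def redact_pii(text: str) -> str:
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    text = SSN_RE.sub("[REDACTED_SSN]", text)
    return PHONE_RE.sub("[REDACTED_PHONE]", text)

def hash_user_id(user_id: str) -> str:
    # Deterministic: the same raw ID always yields the same hash,
    # so traces can still be correlated per user.
    return hashlib.sha256(user_id.encode()).hexdigest()[:16]

def strip_pii(payload):
    # Recursively apply redaction to every string field, including
    # nested structures such as retrieved_docs and tool_calls.
    if isinstance(payload, str):
        return redact_pii(redact_api_keys(payload))
    if isinstance(payload, dict):
        return {k: strip_pii(v) for k, v in payload.items()}
    if isinstance(payload, list):
        return [strip_pii(v) for v in payload]
    return payload

clean = strip_pii({"input_text": "My email is jane@example.com",
                   "retrieved_docs": [{"content": "key: sk-abc123def456ghi"}]})
print(clean)
```

Running API-key redaction before the generic PII pass avoids the phone pattern partially mangling a key's digit runs.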





Task 4: Tracing with LangSmith and/or Langfuse (20 Marks)

Context: Chapters 3 & 4 introduced the two tools' data models and SDKs. Chapter 5 wired them into the instrumented endpoint.


What to do:


Choose one of the following options:


Option A — LangSmith Tracing


  1. Initialize the LangSmith client using environment variables (LANGCHAIN_TRACING_V2, LANGCHAIN_API_KEY, LANGCHAIN_PROJECT).

  2. Create a trace hierarchy for a single RAG request:

    • Root run (the incoming /ask request)

    • Child run 1 — Retrieval step (capture query, retrieved docs, relevance scores)

    • Child run 2 — LLM call (capture prompt text, model name, token usage, response)

    • Child run 3 (optional) — Tool call (capture tool name, input, output, latency)

  3. Attach metadata to every run: environment, prompt_version, customer_tier, release.

  4. Show a screenshot or printed trace tree confirming the hierarchy.



Option B — Langfuse Tracing


  1. Initialize the Langfuse client using environment variables (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST).

  2. Create a trace with observations for a single RAG request:

    • Trace (one per request)

    • SPAN — Retrieval step

    • GENERATION — LLM call (with model, usage object for tokens)

    • EVENT (optional) — any notable marker (e.g., "cache_miss")

  3. Attach metadata: environment, prompt_version, customer_tier.

  4. Show a screenshot or printed trace structure confirming the observations.



Option C — Both (Recommended for advanced learners)

Implement both Option A and Option B, then write a 300-word comparison of the experience:


  • What was easier to set up?

  • How do the data models differ in practice?

  • Which gives you better visibility into token costs?


Note: If you do not have API keys for either platform, you may use the course setup helpers in src/observability/ and demonstrate with mock/simulated traces that follow the correct SDK structure. Clearly state this in your report.
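In the spirit of that note, a mock trace tree mirroring the root-run/child-run hierarchy of Option A can be sketched with plain dictionaries. The field names here are illustrative, not the actual LangSmith SDK schema:

```python
# Mock trace hierarchy shaped like a run tree (root request with
# retrieval, LLM, and tool children), for when no API keys are
# available. Metadata values are invented examples.
import uuid

def make_run(name, run_type, metadata, children=None):
    return {"id": str(uuid.uuid4()), "name": name, "run_type": run_type,
            "metadata": metadata, "children": children or []}

meta = {"environment": "production", "prompt_version": "v2.1",
        "customer_tier": "pro", "release": "2024.06"}

root = make_run("/ask", "chain", meta, [
    make_run("retrieval", "retriever", meta),
    make_run("llm_call", "llm", meta),
    make_run("kb_lookup", "tool", meta),
])

def print_tree(run, depth=0):
    # Indented printout serves as the "printed trace tree" deliverable.
    print("  " * depth + f"{run['name']} ({run['run_type']})")
    for child in run["children"]:
        print_tree(child, depth + 1)

print_tree(root)
```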


Deliverable: Tracing code, trace output (screenshot or printed), and comparison write-up (if Option C).




Task 5: Metrics Collection & Dashboard Design (25 Marks)

Context: Chapter 6 defined the five north-star metrics and implemented a MetricsCollector that processes a batch of traces.


What to do:


  1. Generate a synthetic trace batch of at least 200 traces with realistic distributions:


  • Latency: mostly 800–2000 ms with occasional outliers up to 10,000 ms.

  • Failure rate: ~3–5% of traces should have an error.

  • Tool success rate: ~90–95% tool calls succeed.

  • Token usage: vary between 200–2000 total tokens per request.

  • Quality proxy: ~85–90% of traces pass a quality check.



  2. Implement a MetricsCollector class (or extend the one from the course) with a compute(traces) method that returns:


  • p95_latency_ms

  • p50_latency_ms

  • failure_rate (%)

  • tool_success_rate (%)

  • avg_cost_per_request ($) — use GPT-4o-mini pricing: $0.15/1M input tokens, $0.60/1M output tokens (or state your pricing assumption)

  • quality_proxy_rate (%)



  3. Define alert thresholds for each metric in a table:


Metric | Target (Green) | Warning (Yellow) | Critical (Red)

p95 latency | < 3,000 ms | > 5,000 ms | > 10,000 ms

Failure rate | < 2% | > 5% | > 15%

... | ... | ... | ...


Complete the table for all five metrics. Justify each threshold in 1–2 sentences.


  4. Visualise the metrics in at least 2 charts (e.g., latency distribution histogram, failure rate bar chart). Use matplotlib, plotly, or any plotting library.



  5. (Optional but recommended) Register Prometheus metrics using the patterns from src/utils/metrics.py — Histogram for latency, Counter for failures, Gauge for quality proxy.



Deliverable: Synthetic trace generator code, MetricsCollector code, threshold table with justification, at least 2 charts, and a short analysis (200–300 words) discussing what the dashboard reveals about system health.
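The generator and collector can be sketched together as below. The distributions and the GPT-4o-mini pricing ($0.15/1M input tokens, $0.60/1M output tokens) follow the task description; the 70/30 input/output token split and the nearest-rank percentile method are assumptions you should state in your own report:

```python
# Synthetic trace batch plus a MetricsCollector computing the five
# north-star metrics. Rates follow the task's target distributions.
import random
import statistics

random.seed(42)  # reproducible demo

def make_trace():
    latency = random.uniform(800, 2000)
    if random.random() < 0.02:          # occasional outlier up to 10s
        latency = random.uniform(2000, 10000)
    return {
        "latency_ms": latency,
        "error": "timeout" if random.random() < 0.04 else None,
        "tool_success": random.random() < 0.93,
        "total_tokens": random.randint(200, 2000),
        "quality_pass": random.random() < 0.88,
    }

class MetricsCollector:
    def compute(self, traces):
        lat = sorted(t["latency_ms"] for t in traces)
        n = len(traces)
        # Assumed 70% input / 30% output token split at GPT-4o-mini rates.
        cost = [t["total_tokens"] * (0.7 * 0.15 + 0.3 * 0.60) / 1_000_000
                for t in traces]
        return {
            "p50_latency_ms": lat[int(0.50 * n)],
            "p95_latency_ms": lat[int(0.95 * n)],
            "failure_rate": 100 * sum(t["error"] is not None for t in traces) / n,
            "tool_success_rate": 100 * sum(t["tool_success"] for t in traces) / n,
            "avg_cost_per_request": statistics.mean(cost),
            "quality_proxy_rate": 100 * sum(t["quality_pass"] for t in traces) / n,
        }

traces = [make_trace() for _ in range(200)]
print(MetricsCollector().compute(traces))
```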





Task 6: Observability Contract & Sampling Strategy (15 Marks)

Context: Chapter 5 introduced the observability contract (a JSON document defining naming conventions, required fields, and sampling rules) and discussed sampling strategies for different environments.


What to do:


  1. Write an observability contract (observability_contract.json) for your application that defines:


  • Naming conventions: trace naming pattern (e.g., {service}.{endpoint}.{version}), tag taxonomy (e.g., env:production, tier:enterprise), required metadata fields.

  • Sampling rules:

    • Development: 100% of traces.

    • Staging: 100% of traces.

    • Production: 10% random sample + 100% of error traces + 100% of traces where latency_ms > threshold.

  • Retention policy: how long traces are kept per environment.

  • Required fields: list of fields that must be present on every trace (reference your TraceSchema from Task 2).



  2. Implement a should_sample(trace) function that takes a trace object and returns True/False based on the sampling rules in your contract.



  3. Demonstrate the sampling function on a batch of 50 synthetic traces across all three environments. Show the sampling rate achieved for each environment and verify it matches the contract.



  4. Write a short paragraph (150–200 words) explaining why 100% sampling in production is impractical and how your sampling strategy ensures you still capture all critical events.



Deliverable: JSON contract file, sampling function code, demo output, and write-up.
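One way to pair the contract with should_sample() is to read the rules straight from the contract dictionary, as in the sketch below. The 5,000 ms latency threshold and the contract field names are assumptions for illustration; yours should match the conventions in runbooks/observability_contract.json:

```python
# Sketch of should_sample(): 100% in dev/staging; in production, keep
# all error traces, all slow traces, plus a 10% random sample.
import random

CONTRACT = {
    "sampling": {
        "dev": {"rate": 1.0},
        "staging": {"rate": 1.0},
        "production": {"rate": 0.10, "always_on_error": True,
                       "latency_threshold_ms": 5000},
    }
}

def should_sample(trace, contract=CONTRACT, rng=random.random):
    rules = contract["sampling"][trace["environment"]]
    if trace.get("error") and rules.get("always_on_error"):
        return True  # critical events are never dropped
    threshold = rules.get("latency_threshold_ms")
    if threshold is not None and trace["latency_ms"] > threshold:
        return True  # slow traces are always kept
    return rng() < rules["rate"]

# Error and slow production traces are always kept; dev keeps everything.
assert should_sample({"environment": "production", "error": "timeout",
                      "latency_ms": 900})
assert should_sample({"environment": "production", "error": None,
                      "latency_ms": 8000})
assert should_sample({"environment": "dev", "error": None, "latency_ms": 100})
```

Injecting the random source (rng) makes the 10% branch easy to unit-test deterministically.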





Deliverables Summary

You must submit:




1. Code (Required)


  • Jupyter Notebook (.ipynb) with clearly labelled sections for each task (Task 1 through Task 6).

  • Well-commented code with meaningful variable names.

  • Each task should be runnable independently.




2. Report (Required)

A short report (4–6 pages) including:


  • Approach for each task (what you chose and why).

  • Observations (what worked, what surprised you).

  • Architecture diagram of your instrumentation pipeline (can be a simple box-and-arrow diagram — hand-drawn is acceptable).

  • Key learnings about LLM observability vs traditional monitoring.


Format: PDF or DOCX




3. Output Samples (Required)

Include:


  • Sample trace objects (valid and invalid) from Task 2.

  • Before/after redaction table from Task 3.

  • Trace hierarchy printout or screenshot from Task 4.

  • Dashboard charts from Task 5.

  • Sampling demo output from Task 6.




4. Observability Contract File (Required)

  • observability_contract.json — the JSON document from Task 6.





Submission Guidelines


Submit via your LMS (e.g., Moodle / Google Classroom / institutional portal).




File Naming Convention

<YourName>_LLM_Observability_Assignment1.zip




Inside the ZIP


  • /notebook.ipynb

  • /report.pdf

  • /observability_contract.json

  • /screenshots/           (optional — trace screenshots)

  • /data/                  (optional — sample data files)




Deadline

Submit within 10 days from assignment release date.




Late Submission Policy


Delay | Penalty

Up to 24 hours late | 10% deduction

24–48 hours late | 20% deduction

Beyond 48 hours | Not accepted




Important Instructions


  1. Do NOT copy-paste code from the course notebooks without understanding it. You must adapt and extend the examples to your own use case.

  2. You must explain your design choices clearly in comments and in the report.

  3. Use of third-party libraries is permitted (e.g., pydantic, tiktoken, openai, langsmith, langfuse, prometheus_client, matplotlib), but core logic must be implemented by you.

  4. If you do not have API keys for LangSmith/Langfuse/OpenAI, you may work with mock/simulated data — clearly state this in your report. You will not be penalised for using mocks as long as the code structure and logic are correct.

  5. Plagiarism will result in disqualification and referral to the academic integrity committee.





Evaluation Rubric


Criteria | Marks | What the Evaluator Looks For

Task 1: Observability Gap Diagnosis | 10 | Correct identification of missing signals; meaningful quality scoring function; clear write-up.

Task 2: Trace Schema Design | 15 | Complete schema with validators; valid/invalid demos; explanation of why schema enforcement matters.

Task 3: PII Redaction Pipeline | 15 | All redaction functions implemented; nested field handling; unit tests passing; before/after demo.

Task 4: Tracing (LangSmith / Langfuse) | 20 | Correct SDK usage; proper trace hierarchy; metadata attached; screenshot or printed output. Option C earns up to 5 bonus marks.

Task 5: Metrics & Dashboard | 25 | Realistic synthetic data; all 5 metrics computed correctly; justified thresholds; at least 2 charts; analysis.

Task 6: Contract & Sampling | 15 | Well-structured JSON; correct sampling implementation; environment-specific rates verified; write-up.

Total | 100 |





Bonus (Optional — up to +10 Marks)


  • +4 marks: Implement tracing with both LangSmith and Langfuse (Task 4, Option C) and write the comparison.

  • +3 marks: Register all five metrics as Prometheus metrics with appropriate types (Histogram, Counter, Gauge) and demonstrate scraping.

  • +3 marks: Create a Grafana-style dashboard mockup (screenshot or config file) showing how you would lay out the five metrics in a single view.





Guidance & Tips


  • Start with Tasks 1–3 — they require no external API keys and build the foundation.

  • Task 4 may require API key setup — plan ahead and test connectivity early.

  • Task 5 is the largest task (25 marks) — allocate time for the synthetic data generator, metrics computation, and visualisation.

  • Reuse your TraceSchema from Task 2 as the data structure for Tasks 4, 5, and 6. Consistency across tasks demonstrates production thinking.

  • Think about why each metric matters, not just how to compute it.

  • Refer to eval/course_completion_checklist.md as a self-check before submission.





Instructor Note

This assignment is designed to simulate the real-world workflow of an observability engineer joining a team that has shipped an LLM application without adequate monitoring.


There is no single correct implementation. What matters is:


  • Clarity of reasoning — can you explain why you made each design choice?

  • Quality of implementation — does your code work, and is it structured for a production codebase?

  • Depth of analysis — do you understand the trade-offs, not just the mechanics?


The best submissions will read like a technical design document that a team lead would approve for production deployment.





Call to Action

Ready to transform your business with AI-powered intelligence that accelerates insights, enhances decision-making, and unlocks the full value of your data?


Codersarts is here to help you turn complex data workflows into efficient, scalable, and evidence-driven AI systems that empower teams to make smarter, faster, and more confident decisions.


Whether you’re a startup looking to build AI-driven products, an enterprise aiming to optimize operations through data science, or a research organization advancing innovation with intelligent data solutions, we bring the expertise and experience needed to design, develop, and deploy impactful AI systems that drive measurable business outcomes.




Get Started Today



Schedule an AI & Data Science Consultation:

Book a 30-minute discovery call with our AI strategists and data science experts to discuss your challenges, identify high-impact opportunities, and explore how intelligent AI solutions can transform your workflows and performance.




Request a Custom AI Demo:

Experience AI in action with a personalized demonstration built around your business use cases, datasets, operational environment, and decision workflows — showcasing practical value and real-world impact.









Transform your organization from data accumulation to intelligent decision enablement — accelerating insight generation, improving operational efficiency, and strengthening competitive advantage.


Partner with Codersarts to build scalable AI solutions including RAG systems, predictive analytics platforms, intelligent automation tools, recommendation engines, and custom machine learning models that empower your teams to deliver exceptional results.


Contact us today and take the first step toward next-generation AI and data science capabilities that grow with your business ambitions.



