How to Build a Vernacular Contact-Center QA Platform with Saaras v3 and Sarvam-105B
- May 13

1. The Problem: 98% of Indian Customer Calls Are Never Reviewed
Picture a collections team at a mid-size Indian bank. Every day, hundreds of agents call borrowers in Hinglish, Bhojpuri-Hindi, Tanglish, and Marathinglish. Every one of those calls is recorded — and stored. Yet the QA team manually reviews fewer than two percent of them. The rest disappear into a storage bucket, reviewed only when a customer complaint or RBI audit forces someone to dig in.
This is the normal state of contact-center quality assurance across Indian BFSI, telecom, and BPO operations. The gap isn't laziness — it's a tooling failure. Whisper and Deepgram produce garbled transcripts on code-mixed phone audio. Gong, CallMiner, and Verint are built entirely for English. Internal GPT-4 pipelines raise data-residency flags. The result is regulatory exposure, undetected mis-selling, agent quality drift, and zero visibility into the 98% of calls that are never reviewed.
This post explains how to build a Vernacular Contact-Center QA & Compliance Analytics Platform using Saaras v3 for code-mixed transcription and Sarvam-105B for scoring and compliance checking — replacing the 1–2% sample-audit model with 100% coverage.
Real-world use cases this platform addresses:
Banks scoring collections calls for RBI fair-practice and harassment violations
Life and health insurers detecting IRDAI mis-selling and missing free-look disclosures
Telcos (Jio, Airtel, Vi) auditing customer-care quality and CSAT drivers
BPOs reporting BFSI client-side compliance KPIs
Neo-banks meeting RBI digital-lending recovery rules
Brokerages and AMCs scoring advisor calls for SEBI suitability compliance
What this post covers: the core architecture, technology stack, data model, implementation phases, and the non-obvious engineering challenges you will hit in production. What it does not cover: full source code — that is in the complete course on labs.codersarts.com.
📄 Before you dive in — grab the free PRD template that maps out this entire system: architecture, API spec, sprint plan, and system prompt. [Download the free PRD]
2. How It Works: The Core Technical Concept
Why the naive approach fails
The obvious path — pipe call audio into Whisper, send the transcript to GPT-4, ask "did the agent comply?" — breaks down in three ways on Indian contact-center audio:
Transcription accuracy collapses. Whisper struggles with 8 kHz telephony audio, regional accents, and code-switching. A Hinglish phrase like "aapka EMI bounce hua hai, toh please NACH authorise karo" comes out mangled or partially in transliterated Roman script, making downstream analysis unreliable.
Context window limits kill long calls. A 40-minute insurance sales call is 6,000–8,000 words. Naive single-pass scoring hits context limits and misses early or late compliance events.
Data residency is a hard constraint. Indian banks and insurers cannot send customer call recordings to US-hosted models. Sarvam's models are hosted in India (sovereign AI stack), eliminating this blocker.
The architecture that solves it
Think of this platform as a factory line for call intelligence: raw audio goes in one end; a scored, summarised, indexed QA report comes out the other.
┌─────────────────────────────────────────────────────────────────┐
│  INGESTION LAYER                                                │
│  S3 Bucket / Genesys Export / NICE Export / Avaya SFTP          │
└───────────────────────┬─────────────────────────────────────────┘
                        │ audio files (WAV/MP3/OPUS, mono or stereo)
                        ▼
┌─────────────────────────────────────────────────────────────────┐
│  TRANSCRIPTION LAYER                                            │
│  Saaras v3 (codemix mode) → Speaker Diarisation                 │
│  (Hinglish / Tanglish / Marathinglish + finance jargon prompts) │
└───────────────────────┬─────────────────────────────────────────┘
                        │ diarised transcript JSON
                        ▼
┌─────────────────────────────────────────────────────────────────┐
│  SCORING LAYER                                                  │
│  Sarvam-105B (chunked scoring, structured JSON output)          │
│  Domain rubric: RBI / IRDAI / SEBI / CSAT / Empathy             │
└───────────┬───────────────────────────┬─────────────────────────┘
            │ scores + evidence quotes  │ compliance flags
            ▼                           ▼
┌─────────────────────┐     ┌─────────────────────────────────────┐
│  SUMMARY LAYER      │     │  ALERTING LAYER                     │
│  Mayura → English   │     │  Near-real-time alert (collections) │
│  manager summary    │     │  or overnight batch (insurance)     │
└──────────┬──────────┘     └─────────────┬───────────────────────┘
           │                              │
           ▼                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  DATA + ANALYTICS LAYER                                         │
│  PostgreSQL (transcripts, scores) + S3 (audio, clips)           │
│  Apache Superset / Metabase → React QA Cockpit                  │
└─────────────────────────────────────────────────────────────────┘
The analogy: think of Saaras v3 as a bilingual court reporter who can capture every code-switched word accurately. Sarvam-105B is the senior compliance officer who reads the transcript against a regulatory checklist and highlights every violation. Mayura is the executive assistant who writes the one-paragraph summary for the branch manager. The React cockpit is the glass-panel dashboard the QA team monitors all day.
3. System Architecture Deep Dive
Layer Overview
The platform is organised into five logical layers: Frontend, Backend orchestration, AI/ML services, Data persistence, and external integrations.
Component | Role | Technology Options |
Call Ingestion | Pull audio from recorder exports or S3 | AWS S3, Genesys Cloud API, NICE CXone SFTP, Avaya SFTP |
Transcription | Code-mixed ASR with speaker diarisation | Saaras v3 (codemix), Whisper (fallback, lower accuracy) |
Speaker Diarisation | Separate agent vs customer turns | Saaras built-in, pyannote.audio (fallback) |
Scoring Engine | Rubric-based QA scoring + compliance checks | Sarvam-105B (primary), Sarvam-30B (bulk summaries) |
Summary Generation | Vernacular → English manager summaries | Mayura / Sarvam-Translate |
Coaching Clip TTS | Re-synthesize coaching clips for training | Bulbul v3 |
Backend Orchestration | Async batch job management | FastAPI + Celery + Redis |
Data Store | Transcripts, scores, agent profiles, consent | PostgreSQL + S3 |
BI Dashboard | Aggregated KPIs and drill-downs | Apache Superset or Metabase |
QA Cockpit (Frontend) | Analyst-facing synced replay + scoring UI | React 18 + Tailwind + Vite |
DPDP Compliance | Retention, purge, right-to-be-forgotten | Custom purge service (PostgreSQL + S3) |
Data Flow: Step by Step
Audio lands in S3 (direct upload, Genesys export, or NICE SFTP sync). The S3 event triggers a Celery task.
Celery worker pulls metadata — call ID, agent ID, queue, language hint, customer consent flag — and creates a call_record row in PostgreSQL with status queued.
Saaras v3 is invoked in codemix mode with a domain prompt (e.g., "BFSI collections, Hindi-English codemix, finance terminology"). The response is a timestamped JSON transcript with word-level confidence scores.
Speaker diarisation assigns each turn to AGENT or CUSTOMER. If stereo audio is available, left-channel = agent is assumed. Otherwise, pyannote.audio or Saaras built-in diarisation is used.
The transcript is chunked into overlapping 3,000-token segments (with 200-token overlap to preserve context across chunk boundaries).
Sarvam-105B scores each chunk against a domain rubric passed as a structured system prompt. The model returns a JSON object with: dimension_scores (empathy, script adherence, compliance, product knowledge), compliance_flags (mandatory phrases detected / missing), evidence_quotes (verbatim, with timestamp anchors), and coachable_moments.
Chunk scores are rolled up — compliance flags are union-merged; dimension scores are averaged with recency weighting; evidence quotes are de-duplicated by semantic similarity.
Mayura generates a 3-sentence English summary from the rollup, suitable for a manager dashboard card.
PostgreSQL is updated with the full score record; S3 stores coaching clips.
Superset/Metabase polls the materialised view for dashboard refresh. The React cockpit loads transcript + scores on demand for replay.
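The rollup step above can be sketched roughly as follows. This is a minimal illustration, not the course's implementation: the `ChunkScore` fields, the `rollup` function, and the `recency_decay` parameter are all hypothetical names, and the sketch assumes that violations and detected mandatory phrases are tracked as separate sets.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkScore:
    """Score for one transcript chunk (field names are illustrative)."""
    dimension_scores: dict[str, float]
    violations: set[str] = field(default_factory=set)
    phrases_detected: set[str] = field(default_factory=set)


def rollup(chunks: list[ChunkScore], required_phrases: set[str],
           recency_decay: float = 0.9) -> dict:
    """Merge per-chunk scores into one call-level record.

    Violations and detected phrases are union-merged. Each dimension is a
    weighted mean where the last chunk has weight 1.0 and every earlier
    chunk is multiplied by `recency_decay` once more.
    """
    if not chunks:
        raise ValueError("need at least one chunk score")
    n = len(chunks)
    violations: set[str] = set()
    detected: set[str] = set()
    totals: dict[str, float] = {}
    weight_sums: dict[str, float] = {}
    for i, chunk in enumerate(chunks):
        weight = recency_decay ** (n - 1 - i)
        violations |= chunk.violations
        detected |= chunk.phrases_detected
        for dim, score in chunk.dimension_scores.items():
            totals[dim] = totals.get(dim, 0.0) + weight * score
            weight_sums[dim] = weight_sums.get(dim, 0.0) + weight
    return {
        "dimension_scores": {d: totals[d] / weight_sums[d] for d in totals},
        "violations": violations,
        # A phrase counts as missing only if no chunk contained it.
        "missing_phrases": required_phrases - detected,
    }
```

One design point worth noting: "missing phrase" flags are computed after the merge rather than union-merged per chunk, because a disclosure absent from chunk 1 may still appear in chunk 3.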
Two Non-Obvious Design Decisions
Decision 1: Overlapping chunk boundaries, not hard splits. Splitting a 45-minute call at hard token boundaries is dangerous: a mandatory RBI disclosure that starts near the end of chunk 3 and finishes at the start of chunk 4 will be detected as incomplete in both chunks. With a 200-token overlap, any disclosure no longer than 200 tokens is fully contained in at least one chunk; longer mandatory scripts need a larger overlap.
Decision 2: Sarvam-30B for summaries, Sarvam-105B only for compliance scoring. Running the 105B model on every chunk of every call is expensive at scale. Bulk manager summaries are semantically simpler than compliance detection; routing them to the 30B model reduces cost-per-hour-of-audio by approximately 60% while keeping compliance-critical scoring on the stronger model.
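The overlapping split from Decision 1 can be sketched as a sliding window. This is a simplified illustration: `chunk_tokens` is a hypothetical helper, and it assumes the transcript has already been tokenised into a list (real token counts depend on the scoring model's tokenizer, which is not shown here).

```python
def chunk_tokens(tokens: list[str], size: int = 3000,
                 overlap: int = 200) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens.

    Consecutive windows start `size - overlap` tokens apart, so any span
    of at most `overlap` tokens lies entirely inside at least one window.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not tokens:
        return []
    stride = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the call
        start += stride
    return chunks
```

For example, with `size=4` and `overlap=1`, ten tokens produce three windows, and the last token of each window is repeated as the first token of the next.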
4. Tech Stack Recommendation
Stack A — Beginner / Prototype (Build in a Weekend)
Layer | Technology | Why |
Ingestion | Local file system + Python watchdog | Zero setup, great for local dev |
Transcription | Saaras v3 API (codemix) | Same API as production |
Scoring | Sarvam-105B API | Same API as production |
Summary | Mayura API | Single REST call |
Backend | FastAPI (single process, no Celery) | Simple, fewer moving parts |
Database | SQLite | No install required |
Frontend | Streamlit | 50-line dashboard |
Estimated monthly cost (100 calls/day at 10 min avg): ~$40–80 in API costs, $0 infrastructure (runs on laptop or free-tier VPS).
Stack B — Production-Ready (Designed to Scale)
Layer | Technology | Why |
Ingestion | AWS S3 + SQS event triggers | Reliable at millions of files |
Transcription | Saaras v3 API (codemix) | Sovereign, codemix-native |
Diarisation | Saaras built-in + pyannote fallback | Handles mono and stereo |
Scoring | Sarvam-105B (compliance) + Sarvam-30B (summary) | Cost-tiered routing |
Summary | Mayura API | Fast, Indian-language-native |
Backend | FastAPI + Celery + Redis | Horizontal scaling, retries |
Database | PostgreSQL (RDS) + S3 (audio + clips) | ACID compliance + cheap storage |
BI Dashboard | Apache Superset | Self-hosted, no per-seat cost |
Frontend | React 18 + Tailwind + Vite | Modern, fast, component-ready |
Deployment | Docker Compose → ECS Fargate | Containerised, auto-scales |
Estimated monthly cost (10,000 calls/day at 10 min avg): ~$1,200–2,000 (API costs dominant). Infrastructure adds ~$300–500 for RDS + ECS + Redis.
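As a quick sanity check on the two estimates, the call volumes can be converted into audio hours and an implied cost per hour. The helper names below are hypothetical, and the 30-day month is an assumption.

```python
def monthly_audio_hours(calls_per_day: int, avg_minutes: float,
                        days: int = 30) -> float:
    """Total hours of call audio processed in one month."""
    return calls_per_day * avg_minutes * days / 60


def cost_per_audio_hour(monthly_cost_usd: float, hours: float) -> float:
    """Implied spend per hour of audio."""
    return monthly_cost_usd / hours


stack_a_hours = monthly_audio_hours(100, 10)       # 500 hours/month
stack_b_hours = monthly_audio_hours(10_000, 10)    # 50,000 hours/month

# Implied API cost per audio hour from the ranges quoted above.
stack_a_range = (cost_per_audio_hour(40, stack_a_hours),
                 cost_per_audio_hour(80, stack_a_hours))
stack_b_range = (cost_per_audio_hour(1200, stack_b_hours),
                 cost_per_audio_hour(2000, stack_b_hours))
```

The two ranges work out to roughly $0.08–0.16 per audio hour for Stack A and $0.024–0.04 for Stack B, so it is worth re-deriving both from current API pricing and your own routing mix before budgeting.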
5. Implementation Phases
Phase 1: Data Ingestion and Audio Normalisation
The first thing to build is a reliable pipeline that gets audio files from wherever they live — S3, Genesys, NICE, Avaya — into a normalised working format, with metadata correctly parsed. This means writing an adapter layer per recorder type, handling authentication, pagination, and format conversion (MP3/OPUS/G.711 → 16 kHz WAV for Saaras).
Key technical decisions:
Pull vs event-driven ingestion: Overnight batch (cron pull) is simpler but adds latency; S3 event triggers + SQS enable near-real-time processing for collections alerting use cases.
Audio format normalisation: Saaras v3 prefers 16 kHz mono WAV. You will need ffmpeg for conversion; decide whether to do it at ingestion or as a pre-processing step inside the worker.
Call metadata schema: Define what mandatory fields must be present (call_id, agent_id, queue, language_hint, customer_consent) and what happens when they are missing — reject, infer, or default.
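For the audio normalisation decision, a minimal sketch of the ffmpeg step might look like this. The function names are hypothetical; the flags used (`-ac`, `-ar`, `-c:a pcm_s16le`) are standard ffmpeg options for channel count, sample rate, and 16-bit PCM output. It assumes `ffmpeg` is installed and on the PATH.

```python
import subprocess
from pathlib import Path


def ffmpeg_normalise_cmd(src: Path, dst: Path) -> list[str]:
    """Build the ffmpeg command that converts any input to 16 kHz mono WAV."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", str(src),       # input: MP3, OPUS, G.711 WAV, ...
        "-ac", "1",           # downmix to a single channel
        "-ar", "16000",       # resample to 16 kHz
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        str(dst),
    ]


def normalise(src: Path, dst: Path) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_normalise_cmd(src, dst), check=True,
                   capture_output=True)
```

Note that downmixing discards the agent/customer channel separation of stereo recordings, so if you plan to use channel-based speaker assignment, split the channels before this step rather than after.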
Setting up the S3 adapter, Celery task pipeline, and ffmpeg normalisation worker is covered in detail in the full course with working, tested code.
Phase 2: Transcription and Speaker Diarisation
This is the hardest phase technically. Saaras v3 in codemix mode delivers strong baseline accuracy on Hinglish and Tanglish, but phone-quality 8 kHz audio with overlapping speech, heavy regional accents, and dense BFSI jargon requires additional work.
Key technical decisions:
Domain prompting strategy: Saaras v3 accepts a domain hint that improves jargon accuracy. Experiment with vocabulary hints (EMI, NACH mandate, LRN, free-look, KYC) vs generic BFSI domain labels.
Diarisation method: Stereo recordings (agent on left channel, customer on right) allow near-perfect speaker separation. For mono recordings, use the built-in diarisation; on difficult calls (rapid turn-taking, interruptions), fall back to pyannote.audio.
Post-processing: Regex normalisation of number words (ek lakh → 100000), currency (paanch hazaar rupees → ₹5,000), and product codes improves downstream scoring accuracy.
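The number-word post-processing can be sketched as below. This is a toy illustration with a deliberately tiny romanised-Hindi vocabulary; the spellings, the `normalise_numbers` name, and the rule of only rewriting runs that contain a multiplier are all assumptions, and currency formatting (₹5,000) is left out.

```python
import re

UNITS = {"ek": 1, "do": 2, "teen": 3, "chaar": 4, "paanch": 5,
         "chhe": 6, "saat": 7, "aath": 8, "nau": 9, "das": 10}
MULTIPLIERS = {"sau": 100, "hazaar": 1_000, "lakh": 100_000,
               "crore": 10_000_000}

_WORD = "|".join(sorted(UNITS | MULTIPLIERS, key=len, reverse=True))
# One or more number words separated by whitespace.
_RUN = re.compile(rf"\b(?:{_WORD})(?:\s+(?:{_WORD}))*\b", re.IGNORECASE)


def _value(words: list[str]) -> int:
    """Evaluate a run such as ['ek', 'lakh', 'paanch', 'hazaar']."""
    total, current = 0, 0
    for word in words:
        if word in UNITS:
            current += UNITS[word]
        elif word == "sau":
            current = (current or 1) * 100
        else:
            total += (current or 1) * MULTIPLIERS[word]
            current = 0
    return total + current


def normalise_numbers(text: str) -> str:
    """Replace number-word runs with digits, leaving other text as-is."""
    def repl(match: re.Match) -> str:
        words = match.group(0).lower().split()
        # Bare 'ek' or 'do' are ordinary words ("de do"), so only rewrite
        # runs that contain at least one multiplier.
        if not any(w in MULTIPLIERS for w in words):
            return match.group(0)
        return str(_value(words))
    return _RUN.sub(repl, text)
```

With this vocabulary, `normalise_numbers("ek lakh")` gives `"100000"` and `normalise_numbers("paanch hazaar rupees")` gives `"5000 rupees"`, while a phrase like `"please de do"` is left unchanged.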
The domain prompting strategy, diarisation fallback logic, and jargon post-processing rules for BFSI, insurance, and telco are all covered in detail in the full course with working, tested code.
Phase 3: Compliance Scoring with Sarvam-105B
This is where the platform earns its value. You will build a scoring service that takes a diarised, chunked transcript and sends each chunk to Sarvam-105B with a domain-specific rubric. The rubric encodes what a compliant call looks like for your vertical: RBI fair-practice for collections, IRDAI mis-selling for insurance sales, SEBI suitability for advisory.
Key technical decisions:
Structured output enforcement: Sarvam-105B must return deterministic JSON for downstream BI. You will need a JSON schema, a structured output prompt pattern, and a validation layer that retries on schema violations.
Fuzzy matching for verbatim compliance: Mandatory phrases (e.g., "This call may be recorded for quality purposes") are rarely spoken word-for-word. Implement phonetic fuzzy matching and semantic similarity checks alongside exact match.
Rollup logic: When aggregating scores across chunks, decide how to handle conflicting compliance signals — a compliance flag raised in chunk 2 should persist in the final record even if subsequent chunks are clean.
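The validation layer for structured output could look roughly like this. It is a sketch under assumptions: `call_model` stands in for whatever client function sends a prompt to Sarvam-105B and returns its raw text (no real API is shown), and the required keys are the four fields named in the data flow above.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"dimension_scores", "compliance_flags",
                 "evidence_quotes", "coachable_moments"}


class SchemaError(ValueError):
    """Raised when a model reply is not the JSON object we expect."""


def parse_score(raw: str) -> dict:
    """Parse a model reply and check the top-level keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise SchemaError("top-level value must be an object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise SchemaError(f"missing keys: {sorted(missing)}")
    return data


def score_with_retry(call_model: Callable[[str], str], prompt: str,
                     max_attempts: int = 3) -> dict:
    """Call the model, re-asking with the error message on bad output."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return parse_score(raw)
        except SchemaError as exc:
            last_error = exc
            prompt = (f"{prompt}\n\nYour previous reply was rejected "
                      f"({exc}). Return only valid JSON.")
    raise SchemaError(f"gave up after {max_attempts} attempts: {last_error}")
```

A fuller version would validate value types against a JSON schema as well; checking only the top-level keys keeps the sketch short.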
The full Sarvam-105B rubric design, structured output schema, rollup algorithm, and domain-specific compliance checklists for RBI/IRDAI/SEBI are covered in detail in the full course with working, tested code.
Phase 4: Manager Summaries, React Cockpit, and Coaching Clips
With scores in the database, you now build the interfaces that QA teams and managers actually use. The React cockpit must support: synced audio + transcript replay (click a word, audio jumps to that timestamp), a score breakdown panel, and one-click coaching clip extraction.
Key technical decisions:
Waveform sync: Use WaveSurfer.js or a custom HTML5 <audio> + timestamp index. The transcript JSON must carry word-level timestamps from Saaras v3 for accurate sync.
Coaching clip extraction: Define heuristics for "best" and "worst" moments — highest empathy score in 30-second window, lowest compliance score, escalation keyword detected. Clips are stored in S3 and delivered as pre-signed URLs.
Mayura summary placement: The 3-sentence manager summary appears as a dashboard card above the full transcript. Decide language policy — always English, or match the manager's language preference.
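One possible version of the "worst moment" heuristic is to find the 30-second window with the lowest mean score. The sketch below assumes per-turn scores are available (the pipeline described above scores chunks, so this is an extra assumption), and `ScoredTurn` and `worst_window` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredTurn:
    start: float   # seconds from the start of the call
    end: float
    score: float   # e.g. a compliance score for this turn


def worst_window(turns: list[ScoredTurn],
                 window_sec: float = 30.0) -> tuple[float, float, float]:
    """Return (clip_start, clip_end, mean_score) of the lowest-scoring window.

    A window opens at each turn's start and includes every turn that
    begins within `window_sec` of it.
    """
    if not turns:
        raise ValueError("need at least one scored turn")
    ordered = sorted(turns, key=lambda t: t.start)
    best = None
    for i, first in enumerate(ordered):
        limit = first.start + window_sec
        inside = [t for t in ordered[i:] if t.start < limit]
        mean = sum(t.score for t in inside) / len(inside)
        clip_end = min(max(t.end for t in inside), limit)
        if best is None or mean < best[2]:
            best = (first.start, clip_end, mean)
    return best
```

The returned start and end offsets can then be handed to the clip cutter; the "best moment" heuristic is the same search with the comparison reversed.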
Building the WaveSurfer-synced transcript player, the coaching clip extraction heuristic, and the Mayura summary integration is covered in detail in the full course with working, tested code.
Phase 5: DPDP Compliance, Deployment, and Integration
The final phase makes the platform production-safe. India's Digital Personal Data Protection (DPDP) Act requires that customer data be deleted on request — and contact-center transcripts containing PII are firmly in scope.
Key technical decisions:
Right-to-be-forgotten scope: A purge request must delete the audio file (S3), the transcript (PostgreSQL), the score record (PostgreSQL), and any coaching clips (S3), and must invalidate Superset materialised view caches. S3 and PostgreSQL cannot share a single transaction, so run the purge as one idempotent, retryable job that records every step in an audit trail.
Retention policy engine: Define per-queue retention periods (collections: 90 days per RBI, insurance: 5 years per IRDAI) and implement a nightly purge job.
Docker Compose vs ECS: The provided Docker Compose setup runs all services locally for development; the ECS migration guide shows how to split services into separate task definitions with auto-scaling.
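The nightly retention check can be sketched as a simple date comparison. The retention values below are the ones quoted in the list above (with five years approximated as 5 × 365 days); the record shape and the `purge_candidates` name are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Per-queue retention in days, taken from the figures quoted above.
RETENTION_DAYS = {"collections": 90, "insurance": 5 * 365}


def purge_candidates(calls: list[dict], now: datetime) -> list[str]:
    """Return call_ids whose retention period has elapsed.

    Each call is a dict with 'call_id', 'queue', and a timezone-aware
    'recorded_at'. An unknown queue raises KeyError so that gaps in the
    policy table are noticed instead of silently skipped.
    """
    due = []
    for call in calls:
        limit = timedelta(days=RETENTION_DAYS[call["queue"]])
        if now - call["recorded_at"] > limit:
            due.append(call["call_id"])
    return due
```

The nightly job would pass the resulting IDs to the same purge routine that handles right-to-be-forgotten requests, so both paths share one deletion and audit code path.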
The DPDP purge service, retention policy engine, Docker Compose setup, and ECS deployment guide are covered in detail in the full course with working, tested code.
6. Common Challenges You Will Hit
Developers who have built similar pipelines — or attempted to use Whisper + GPT-4 for this problem — run into the same set of walls. Here are the seven most painful ones, with root causes and fixes.
1. Saaras v3 accuracy degrades on 8 kHz phone audio with heavy regional accents
Root cause: Telephone codecs (G.711/AMR) sample audio at 8 kHz, which discards everything above roughly 4 kHz, including the overtones that help distinguish similar sounds. Regional accents (Bhojpuri Hindi, Chettinad Tamil, Nagpuri Marathi) also differ significantly from the training distribution.
Fix: Apply a high-pass filter and loudness normalisation before sending audio to Saaras. Use domain vocabulary hints. For heavily accented queues, fine-tune with 50–100 representative calls.
2. Speaker diarisation fails on rapid turn-taking
Root cause: BFSI collections calls involve frequent interruptions and overlapping speech that confuse VAD-based diarisation.
Fix: On stereo recordings, use channel separation (no diarisation model needed). On mono, increase the minimum segment duration threshold to reduce over-segmentation, and apply a post-processing pass that merges segments from the same speaker within 500 ms.
3. Sarvam-105B returns malformed JSON on long chunks
Root cause: At the edge of the context window, the model occasionally truncates the JSON response.
Fix: Cap chunks at 2,500 tokens rather than the 3,000 used in the data flow above. Implement a retry-with-reduced-chunk-size fallback. Use a JSON repair library (json-repair) before validation.
4. Compliance verbatim matching misses spoken variations
Root cause: Agents say "recording ke liye call ho rahi hai" instead of the exact English phrase "This call may be recorded." Rules-based exact match fails.
Fix: Implement three-tier matching: exact string match (near-zero latency), fuzzy token match (normalised Levenshtein distance < 0.3), and semantic similarity via Sarvam-105B embeddings (cosine similarity > 0.85). Flag as compliant if any tier matches.
5. Aggregated dashboard metrics are stale after purge requests
Root cause: The dashboards read from materialised views, and Superset caches query results for performance. After a DPDP purge, old aggregate values persist in both.
Fix: After each purge, refresh the affected materialised views and invalidate the relevant Superset datasets via the Superset API. Recompute the affected agent's rolling 30-day metrics from the remaining records.
6. Cost blowout from routing all scoring to Sarvam-105B
Root cause: A 40-minute call produces ~12 chunks. At $X per 1M tokens, running 105B on all of them for 10,000 calls/day adds up fast.
Fix: Route manager summaries and CSAT-only scoring to Sarvam-30B. Reserve Sarvam-105B for compliance-flagged calls or compliance-critical scoring dimensions. This alone reduces per-call AI cost by ~60%.
7. DPDP purge leaves orphaned coaching clips in S3
Root cause: Coaching clips are stored with composite keys that reference the call_id but are not linked by a foreign key in PostgreSQL. A DELETE on call_records does not cascade to clip objects.
Fix: Maintain a clips table with a call_id FK and ON DELETE CASCADE. The purge service must explicitly list and delete all S3 objects prefixed with the call_id before deleting the PostgreSQL row.
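The tiered phrase matching from challenge 4 can be sketched as follows. The thresholds are the ones quoted above; everything else is an assumption for illustration. In particular, `semantic` is a hypothetical callable that returns a similarity score between 0 and 1 (how that score is obtained from a model is not shown), and the fuzzy tier compares the phrase against same-length word windows of the utterance.

```python
from typing import Callable, Optional


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalised_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length, in the range 0..1."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def match_tier(utterance: str, phrase: str,
               semantic: Optional[Callable[[str, str], float]] = None,
               fuzzy_max: float = 0.3,
               semantic_min: float = 0.85) -> Optional[str]:
    """Return 'exact', 'fuzzy', 'semantic', or None if no tier matches."""
    u, p = utterance.lower().strip(), phrase.lower().strip()
    if p in u:
        return "exact"
    words, n = u.split(), len(p.split())
    windows = [" ".join(words[i:i + n])
               for i in range(max(1, len(words) - n + 1))]
    if any(normalised_distance(w, p) < fuzzy_max for w in windows):
        return "fuzzy"
    if semantic is not None and semantic(u, p) > semantic_min:
        return "semantic"
    return None
```

Ordering the tiers from cheapest to most expensive means the semantic check, which needs a model call, only runs for utterances the first two tiers could not resolve.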
Solving these issues took us over 80 hours of testing across real Indian call-center audio — the course walks you through each fix with working code.
7. Ready to Build This Yourself?
Understanding the architecture is the easy part. The gap between a clear data-flow diagram and a production-ready pipeline that handles 8 kHz Hinglish audio, DPDP purge requests, and nightly compliance reports is where most projects stall.
The Vernacular Contact-Center QA & Compliance Analytics Platform course on Codersarts Labs gives you everything you need to ship:
✅ Full source code — FastAPI backend, React cockpit, Celery workers, purge service
✅ Video tutorials — step-by-step build walkthrough for all 5 phases
✅ Sample call audio — real Hinglish and Tanglish recordings for local testing
✅ Pre-built scoring rubrics — BFSI collections (RBI), insurance sales (IRDAI), telco CSAT, BPO reporting
✅ RBI + IRDAI + SEBI compliance checklist — ready to customise for your vertical
✅ Docker Compose setup — one command to run the full stack locally
✅ Deployment walkthrough — ECS Fargate + RDS + S3 production setup
✅ DPDP retention and purge service — compliant right-to-be-forgotten implementation
✅ Lifetime access — all future updates included
✅ Community support — ask questions, share rubrics, get feedback
$29. Everything above.
Need a custom rubric for your specific recorder (Genesys / NICE / Avaya / S3) or help integrating with your existing call-recording infrastructure? Book a 1:1 guided session — $99 and we will design your compliance rubric and wire up the integration together.
8. Conclusion
The core architecture of this platform is straightforward once you see it: Saaras v3 transcribes code-mixed calls with speaker diarisation, Sarvam-105B scores each call against a domain-specific compliance rubric and surfaces evidence-anchored coachable moments, and a React cockpit gives QA teams and managers full visibility into 100% of calls — not the 1–2% sampled today.
The simplest viable path to a working prototype is Stack A: Saaras v3 + Sarvam-105B + FastAPI + SQLite + Streamlit. You can have this running on local audio within a weekend. From there, the course guides you through every production hardening step — chunking strategy, DPDP purge, Docker deployment, and BI dashboard integration.
If you are building for a bank, insurer, telco, or BPO that processes vernacular Indian customer calls, this is the platform architecture that solves the problem English-only tools cannot.


