How to Build a Vernacular Citizen-Service Platform with Sarvam AI and the WhatsApp Business API

May 13

Section 1 — Introduction: The Pothole That Never Gets Fixed

Picture this: a resident of ward 47 in Bengaluru photographs a gaping pothole, sends it on WhatsApp in Kannada, and waits. The message arrives at a call-centre agent who types it into a legacy portal in English, assigns it to the wrong ward because the geotag was off, and closes it as a duplicate of a three-month-old ticket. No acknowledgement reaches the resident. The SLA clock was never started. The pothole is still there six weeks later.

This is not an edge case — it is the daily reality for municipal corporations, Discoms, and water utilities across India. Urban local bodies receive lakhs of complaints every month across 10+ Indian languages and dialects, and almost every 311-style portal in use today is English-first, photo-blind, and manually routed.

The Vernacular Citizen-Service Platform solves this end-to-end: it accepts complaints in any Indian language via WhatsApp, IVR, or a web PWA; uses Sarvam AI to classify photos, transcribe dialect voice notes, and deduplicate across open tickets; and routes each complaint to the right ward officer with a vernacular acknowledgement drafted automatically.

Real-world deployment targets include:

Municipal corporations (BMC, BBMP, MCD, GHMC) handling water, road, garbage, and streetlight complaints
Electricity Discoms (BESCOM, Tata Power-DDL, MSEDCL) managing outage and billing complaints
Water utilities taking citizen reports of pipe bursts and sewage overflows
State pollution-control boards receiving noise and air-quality violations
Smart Cities Mission Integrated Command and Control Centres (ICCC)
District-level grievance cells and CPGRAMS-style state portals

This post walks through the architecture, tech stack, and implementation phases of building such a platform. It does not include complete source code — that is inside the full course at labs.codersarts.com.

📄 Before you dive in — grab the free PRD template that maps out this entire system: architecture, API spec, sprint plan, and system prompt. [Download the free PRD]

Section 2 — How It Works: The Core Concept

At its heart, this platform is a multi-channel AI intake pipeline wired to a state-machine complaint engine. Before exploring the architecture, it helps to understand why the obvious approach fails — and what this design does differently.

Why naive approaches break down

BPO call centres can handle multiple languages but cost ₹80–150 per ticket, cannot scale to lakhs of complaints, and produce no structured data. Existing ULB portals are English-only, ignore photo evidence, and rely on manual triage. Generic LLM chatbots have no ward-mapping awareness, no SLA enforcement, and no dialect handling — and hallucinate routing decisions.

The real complexity is not language alone. It is the intersection of:

Dialect variation within a single official language (Bhojpuri, Marwari, and Awadhi are all "Hindi" to a language detector, but require different transcription models)
Low-quality photo evidence (phone-camera shots in poor lighting, with the pothole half out of frame)
Geofence ambiguity (citizen geotags are often 200 m off, spanning two wards)
Deduplication at scale (the same pothole generates 50 tickets in 24 hours)

How this architecture solves it

The solution chains four Sarvam AI models — Vision, Saaras v3, Sarvam-105B, and Bulbul v3 — over a FastAPI state machine backed by PostgreSQL + PostGIS.

  CITIZEN CHANNELS
  ─────────────────────────────────────────────────────────────
  WhatsApp Business API ──┐
  IVR (Twilio / Exotel)  ──┤──▶  Intake Service (FastAPI)
  Web PWA                ──┘         │
                                     │  photo?  ──▶  Sarvam Vision (classify + geotag)
                                     │  voice?  ──▶  Saaras v3 (dialect transcription)
                                     │  text    ──▶  pass through
                                     ▼
                              Sarvam-105B (Indus)
                              ├─ Deduplicate vs. PostGIS ward open tickets
                              ├─ Map to department + ward officer
                              └─ Draft vernacular acknowledgement
                                     │
                              ┌──────┴──────┐
                              ▼             ▼
                         State Machine   Bulbul v3
                         (SLA timers,    (IVR TTS
                          escalation)    responses)
                              │
                    ┌─────────┴──────────┐
                    ▼                    ▼
             Ward Officer          Supervisor
              Cockpit              Dashboard
             (React 18)            (React 18)

Think of it as a smart post office: every incoming complaint is automatically sorted by category and neighbourhood (Sarvam Vision + PostGIS), translated for the back office (Mayura), and handed to the correct officer — while the citizen immediately receives a response in their own language (Sarvam-105B + Bulbul).

Section 3 — System Architecture Deep Dive

Architecture layers

The platform is built in five layers, each with a distinct responsibility.

1. Citizen Channel Layer — WhatsApp (primary), IVR, and a Progressive Web App. All three converge on a single intake API that normalises payloads regardless of origin.

2. AI Processing Layer — Four Sarvam models working in sequence: Vision for photo classification, Saaras v3 for voice transcription, Sarvam-105B (Indus) for routing and deduplication, and Bulbul v3 for TTS responses.

3. Orchestration Layer — FastAPI handles the state machine, Celery + Redis manage async jobs (transcription, photo classification), and PostgreSQL + PostGIS stores complaints with ward polygons and SLA timestamps.

4. Dashboard Layer — React 18 + Tailwind serves two separate frontends: the ward-officer cockpit (geo-pinned complaint list, SLA countdown, status updates) and the supervisor dashboard (ward-level SLA compliance, escalation queue, trend charts).

5. Integration Layer — Mayura/Sarvam-Translate converts vernacular summaries to English for senior officers; Twilio or Exotel handles IVR DTMF and telephony.

Component reference table

Component	Role	Technology Options
Citizen intake channel	Receive text, photo, voice from citizens	WhatsApp Business API, IVR (Twilio/Exotel), React PWA
Photo classifier	Identify complaint category from image	Sarvam Vision (primary), fallback to manual triage queue
Voice transcription	Convert dialect voice notes to text	Saaras v3 (Bhojpuri, Marwari, Kumaoni, Awadhi, etc.)
Routing & deduplication	Map ticket to ward + officer; merge duplicates	Sarvam-105B (Indus) with PostGIS proximity queries
Reply generation	Draft citizen acknowledgement in source language	Sarvam-105B with template guardrails
IVR TTS	Speak status updates to phone-in citizens	Bulbul v3 with streaming output
Internal translation	English summaries for senior officers	Mayura / Sarvam-Translate
State machine & SLA	Track complaint lifecycle, trigger escalations	FastAPI + SQLAlchemy enums + Celery beat
Geo data store	Store ward polygons, complaint coordinates, spatial queries	PostgreSQL 15 + PostGIS 3
Async task queue	Non-blocking AI model calls, webhook deliveries	Redis 7 + Celery 5

Data flow walkthrough

Citizen sends a WhatsApp message (text, photo, voice, or combination) to the registered business number.
WhatsApp webhook fires a POST to the intake FastAPI endpoint; the router detects channel and payload type.
Photo path — image is forwarded to Sarvam Vision with a domain-specific prompt ("Classify into: pothole / water_leak / transformer / garbage / streetlight / other"). Vision returns a category and a confidence score; below 0.65 the ticket enters a manual triage queue.
Voice path — audio is forwarded to Saaras v3 with the dialect=auto flag; the transcription is returned with a detected-language tag and attached to the complaint record.
Text path — raw text is passed directly; language detection happens inside Sarvam-105B.
Sarvam-105B receives the normalised complaint (category, transcription or text, detected language, GPS coordinates if present) together with the open tickets that a PostGIS proximity query finds in the same ward (within 50 m by default). The pipeline then does one of two things: (a) if the model matches the complaint to an existing ticket, it increments reporter_count on that ticket and marks the new one DUPLICATE; (b) if the complaint is new, it looks up the responsible department and ward officer in the routing table and drafts an acknowledgement in the citizen's language.
Complaint record is written to PostgreSQL with status NEW, ward_id, officer_id, sla_deadline, and language_code.
Celery beat checks SLA deadlines every 15 minutes; missed SLAs trigger escalation to the zone supervisor and a status update to the citizen via Bulbul TTS (IVR) or WhatsApp template message.
Ward officer sees the complaint on the cockpit map, marks it IN_PROGRESS, adds a resolution note, and closes it.
Citizen receives a closing acknowledgement in their language; the record is archived with audit timestamps.

Two non-obvious design decisions

Decision 1: Confidence-threshold routing for Sarvam Vision. A hard classification is tempting (always trust the model), but phone-camera quality on civic complaints is notoriously poor. Setting a confidence floor of 0.65 and routing low-confidence photos to a manual triage queue prevents mis-classification from skewing SLA reporting — and gives you a labelled dataset to fine-tune the model over time.
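A minimal sketch of the confidence floor, assuming the classifier result has already been parsed into a category and a score. The VisionResult type, the route_photo function, and the queue names are illustrative and not part of any Sarvam SDK.

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.65  # starting point from the text; tune on local photos


@dataclass
class VisionResult:
    category: str      # e.g. "pothole", "water_leak"
    confidence: float  # 0.0 to 1.0, as reported by the classifier


def route_photo(result: VisionResult, floor: float = CONFIDENCE_FLOOR) -> dict:
    """Accept the model's label at or above the floor, otherwise queue for a human."""
    if result.confidence >= floor:
        return {"queue": "auto", "category": result.category}
    # Keep the model's guess so the human correction becomes a labelled example.
    return {
        "queue": "manual_triage",
        "category": None,
        "model_guess": result.category,
        "confidence": result.confidence,
    }
```

Storing the rejected guess next to the officer's eventual label is what turns the triage queue into the fine-tuning dataset mentioned above.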

Decision 2: Deduplication radius as a configuration parameter. The right deduplication radius varies by issue type: pothole duplicates cluster within 30–50 m, power outages can cover an entire substation zone (500 m+). Making this a per-category config value in the routing table — rather than a hardcoded query — lets each municipal client tune it to their ward geometry without a code change.

Section 4 — Tech Stack Recommendation

Stack A — Beginner / Prototype (build in a weekend)

Layer	Technology	Why
Backend	FastAPI + SQLite	Zero-config, hot-reload, async-ready
AI — photo	Sarvam Vision API	No GPU needed; REST call
AI — voice	Saaras v3 API	Dialect support out of box
AI — routing	Sarvam-105B API	Handles Hindi + 9 other Indian languages
AI — TTS	Bulbul v3 API	Streaming TTS, IVR-ready
Channel	WhatsApp webhook (local ngrok tunnel)	Test with real WhatsApp immediately
Frontend	React 18 + Vite (single cockpit)	Fast HMR, Tailwind works out of box
Queue	In-process (asyncio tasks)	No Redis needed for prototyping

Estimated cost: ~$30–50/month (Sarvam API credits for moderate volume, free SQLite, ngrok free tier for testing).

Stack B — Production-Ready (scales to a city)

Layer	Technology	Why
Backend	FastAPI + Gunicorn + PostgreSQL 15	ACID compliance, concurrent connections
Geo store	PostGIS 3	Ward polygon queries, ST_DWithin deduplication
Queue	Redis 7 + Celery 5	Reliable task delivery, beat scheduler for SLA checks
AI — photo	Sarvam Vision API + confidence fallback	Manual triage queue for low-confidence images
AI — voice	Saaras v3 API with dialect tagging	Dialect label stored for audit
AI — routing	Sarvam-105B API with prompt cache	Reduces latency on repeat complaint types
AI — TTS	Bulbul v3 streaming	Sub-800 ms IVR response
IVR	Twilio or Exotel	India DID coverage, DTMF webhooks
Frontend	React 18 + Vite + Tailwind (two apps)	Cockpit + supervisor dashboard
Auth	Keycloak (RBAC)	Ward officer / supervisor / senior officer roles
Infra	Docker Compose → GKE or EC2	Horizontal scaling for ingestion spikes

Estimated cost: ~$300–600/month (moderate city: 5,000 complaints/day, Sarvam API credits, managed PostgreSQL, two small VMs).

Section 5 — Implementation Phases

Phase 1: Complaint Data Model and Channel Scaffolding

Before any AI is wired up, you need a solid data model. This phase covers designing the complaints table (complaint_id, ward_id, channel, language_code, category, status enum, reporter_count, sla_deadline, officer_id, geom), the wards table (ward_id, ward_name, zone_id, geom polygon), and the routing_rules table (category × department × officer_id × sla_hours × dedup_radius_m).

Key technical decisions:

Choose the SLA clock model: does it pause on weekends? On public holidays? This needs to be a first-class config entity, not hardcoded.
Decide whether WhatsApp phone numbers are hashed or stored in plain text for DPDP compliance. Hashing prevents re-contact but complicates escalation flows.
Define the status enum states early: NEW → ASSIGNED → IN_PROGRESS → RESOLVED / ESCALATED / DUPLICATE. Adding states later requires a migration on a live production table.
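The status enum can be paired with an explicit transition table so that illegal jumps are rejected in one place. The sketch below uses the six states named above; the set of allowed transitions is an assumption for illustration, since the exact edges depend on each ULB's workflow.

```python
from enum import Enum


class Status(str, Enum):
    NEW = "NEW"
    ASSIGNED = "ASSIGNED"
    IN_PROGRESS = "IN_PROGRESS"
    RESOLVED = "RESOLVED"
    ESCALATED = "ESCALATED"
    DUPLICATE = "DUPLICATE"


# Assumed edges; RESOLVED and DUPLICATE are treated as terminal states.
ALLOWED = {
    Status.NEW: {Status.ASSIGNED, Status.DUPLICATE},
    Status.ASSIGNED: {Status.IN_PROGRESS, Status.ESCALATED},
    Status.IN_PROGRESS: {Status.RESOLVED, Status.ESCALATED},
    Status.ESCALATED: {Status.IN_PROGRESS, Status.RESOLVED},
    Status.RESOLVED: set(),
    Status.DUPLICATE: set(),
}


class InvalidTransition(ValueError):
    """Raised when a status change is not in the transition table."""


def can_transition(current: Status, target: Status) -> bool:
    return target in ALLOWED[current]


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if not can_transition(current, target):
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    return target
```

Keeping the table as data rather than scattered if-statements also makes it easier to review when a new state has to be added later.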

Designing a DPDP-compliant PII schema with hashed citizen identifiers, PostGIS ward geometries, and a state-machine-ready status enum is covered in detail in the full course with working, tested code.

Phase 2: Sarvam AI Integration (Vision, Saaras, 105B)

This phase wires the three core AI models into the intake pipeline. You will write a FastAPI route that handles multipart uploads (photo + optional voice + optional text), calls Sarvam Vision for photos, Saaras v3 for voice, and feeds the combined output to Sarvam-105B for routing.

Key technical decisions:

Vision confidence threshold: 0.65 is a reasonable starting point, but test against a local dataset of 200+ complaint photos before shipping.
Saaras v3 dialect parameter: using auto works for most cases but for a Discom deploying in Rajasthan, pre-setting dialect=marwari materially improves accuracy and should be a ward-level config.
Sarvam-105B prompt design for deduplication: the model must receive both the new complaint summary and the top-3 PostGIS-matched candidates. Prompt structure and few-shot examples matter enormously here.
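To make the third point concrete, here is a rough sketch of assembling a deduplication prompt from the new complaint and the three nearest candidates. The wording of the prompt, the build_dedup_prompt function, and the candidate dictionary keys are all hypothetical; this is not the course's prompt template and it omits few-shot examples.

```python
def build_dedup_prompt(new_summary: str, candidates: list[dict], max_candidates: int = 3) -> str:
    """Assemble a deduplication prompt from the new complaint and nearby tickets."""
    # Keep only the closest candidates so the prompt stays short.
    nearest = sorted(candidates, key=lambda c: c["distance_m"])[:max_candidates]
    lines = [
        "You are triaging civic complaints for a municipal ward.",
        "Decide whether the NEW complaint reports the same physical issue as a listed ticket.",
        "Answer with the matching ticket id, or NONE.",
        "",
        f"NEW: {new_summary}",
    ]
    for c in nearest:
        lines.append(
            f"CANDIDATE {c['complaint_id']} ({c['distance_m']:.0f} m away): {c['summary']}"
        )
    return "\n".join(lines)
```

Asking for a ticket id or NONE keeps the reply easy to parse; anything else in the response can be treated as a failed call and retried or sent to manual triage.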

The exact prompt templates for Sarvam-105B deduplication — including the ward-context injection and confidence calibration few-shots — are provided in the full course with working, tested code.

Phase 3: Multi-Channel Intake (WhatsApp + IVR + PWA)

Three channels feed the same normalised intake endpoint. WhatsApp Business API delivers webhooks that carry text, image URLs, and audio URLs. The IVR layer (Twilio or Exotel) transcribes DTMF or connects to Saaras v3 for voice. The PWA uses a standard multipart form.
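As a sketch of the normalisation step, the function below flattens a WhatsApp webhook body into channel-neutral records. It assumes the WhatsApp Cloud API webhook layout (entry, changes, value, messages), in which media arrive as IDs that are resolved to download URLs in a separate call; a BSP may deliver a different shape. The output field names are illustrative.

```python
def normalise_whatsapp(payload: dict) -> list[dict]:
    """Flatten a WhatsApp webhook body into channel-neutral complaint inputs."""
    items = []
    for entry in payload.get("entry", []):
        for change in entry.get("changes", []):
            for msg in change.get("value", {}).get("messages", []):
                kind = msg.get("type")
                item = {
                    "channel": "whatsapp",
                    "sender": msg.get("from"),
                    "kind": kind,
                    "text": None,
                    "media_id": None,
                    "location": None,
                }
                if kind == "text":
                    item["text"] = msg["text"]["body"]
                elif kind in ("image", "audio"):
                    # Media is referenced by ID; images may also carry a caption.
                    item["media_id"] = msg[kind]["id"]
                    item["text"] = msg[kind].get("caption")
                elif kind == "location":
                    loc = msg["location"]
                    item["location"] = (loc["latitude"], loc["longitude"])
                items.append(item)
    return items
```

The IVR and PWA adapters would emit the same record shape, so everything downstream of intake can ignore which channel a complaint came from.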

Key technical decisions:

WhatsApp messages sent outside the 24-hour customer-service window (SLA escalations, delayed closing notices) must use pre-approved template messages. Approval turnaround is 24–72 hours, so the template library for all 10 languages must be set up before go-live.
IVR latency: Bulbul v3 TTS must respond within 800 ms for a natural conversation feel. This requires streaming the TTS output and pre-caching the 5–6 most common prompts (welcome, confirmation, escalation status) as static audio files.
PWA offline: citizens in low-connectivity areas should be able to draft a complaint offline and submit when connectivity returns; this requires a service worker and an IndexedDB queue.

Configuring the WhatsApp template library for 10 Indian languages, handling Bulbul v3 streaming, and implementing the PWA offline queue are all covered in the full course with working, tested code.

Phase 4: SLA State Machine, Escalation, and Dashboards

The complaint lifecycle is managed by a FastAPI state machine and a Celery beat scheduler. This phase builds the SLA timer logic, the escalation rules (e.g., auto-escalate to zone supervisor if unacknowledged after 4 hours, to senior officer after 24 hours), and both React dashboards.
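The two example escalation rules can be written as a small lookup that a periodic task evaluates for each open complaint. This is a sketch: the role strings and the escalation_target function are illustrative, and it measures age in wall-clock time without any weekend or holiday pause.

```python
from datetime import datetime, timedelta, timezone

# (age threshold, role to notify), checked from the longest threshold down.
ESCALATION_TIERS = [
    (timedelta(hours=24), "senior_officer"),
    (timedelta(hours=4), "zone_supervisor"),
]


def escalation_target(created_at: datetime, acknowledged: bool, now: datetime):
    """Return the role to escalate to, or None if no escalation is due."""
    if acknowledged:
        return None
    age = now - created_at
    for threshold, role in ESCALATION_TIERS:
        if age >= threshold:
            return role
    return None
```

A scheduler such as Celery beat would call this for every unacknowledged complaint on each tick and send a notification only when the returned role changes.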

Key technical decisions:

The ward-officer cockpit must show complaints on a map (Leaflet.js over OpenStreetMap) with SLA countdown badges. The map tile provider must work in India without a VPN.
The supervisor dashboard needs ward-level SLA compliance as a heatmap — this is a PostGIS aggregate query that can be expensive at scale; materialise it as a 15-minute refresh view.
Role-based access (ward officer, zone supervisor, senior officer, admin) must be enforced at the API level, not just in the UI. Keycloak JWT claims are the recommended approach.

Building the PostGIS SLA heatmap query, the Celery beat escalation scheduler, and the Keycloak RBAC integration for three role tiers is covered in detail in the full course with working, tested code.

Phase 5: DPDP Compliance, Audit Trail, and SI-Grade Deployment

For any ULB or Discom deployment, DPDP (Digital Personal Data Protection Act) compliance is non-negotiable. This phase covers audit logging (every access to citizen PII triggers an immutable log entry), data-retention policies (complaint records archived after 180 days, photos after 90 days), and the Docker Compose → production Kubernetes migration path.
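The retention windows can be captured as data and checked by a scheduled job. The sketch below assumes the clock starts on the day the complaint is closed, which the text does not specify; the artefact names and the is_due_for_archive helper are illustrative.

```python
from datetime import date, timedelta

# Retention windows from the text, in days.
RETENTION_DAYS = {
    "complaint_record": 180,
    "photo": 90,
}


def is_due_for_archive(kind: str, closed_on: date, today: date) -> bool:
    """True once an artefact has outlived its retention window.

    Assumption: the window is counted from the complaint's closing date.
    """
    return today - closed_on > timedelta(days=RETENTION_DAYS[kind])
```

Because photos expire earlier than records, a job built on this check would archive a complaint's photos first and the record itself three months later.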

Key technical decisions:

Decide whether to deploy on NIC Cloud (GovCloud India), AWS Mumbai, or on-premises hardware — each has different data-residency implications for citizen photos.
The Docker Compose setup must be tested on a clean Ubuntu 24.04 machine before handing to the SI partner.
Load testing at 10× expected peak (many cities experience complaint spikes after monsoon flooding) must be run with Locust before go-live.

The DPDP-compliant audit schema, data-retention Celery tasks, Docker Compose setup, and Locust load-test scripts are all included in the full course with working, tested code.

Section 6 — Common Challenges

Building a citizen service platform for India surfaces a set of problems that are invisible until you hit them in production. Here are the most common — and how to fix them.

1. Dialect mismatch kills transcription accuracy

Problem: With dialect=auto, Saaras v3 routes Bhojpuri sentences to standard Hindi acoustic models, raising word error rate by 15–20 points.

Root cause: Auto-detection relies on the first 3 seconds of audio; short voice notes (< 5 seconds) don't give the model enough signal.

Fix: Pre-configure the dialect per ward using the routing_rules.dialect_hint column. Ward officers register their primary dialect at onboarding; that hint is passed to Saaras on every incoming complaint.

2. Vision misclassifies multi-subject photos

Problem: A photo of a broken streetlight near a pothole gets classified as streetlight 60% of the time and pothole 40% of the time, sometimes creating two tickets.

Root cause: Sarvam Vision, like most image classifiers, returns the dominant subject, and a single-label prompt forces one category onto a multi-subject photo.

Fix: Add a secondary tags[] array field to the complaint record and use Vision's top-3 classifications (not just top-1). Deduplication logic should match on any overlapping tag, not only the primary category.
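A sketch of the tag-overlap idea, assuming the classifier's output can be read as a mapping from label to score. The min_score cut-off is an added assumption, not from the text: without it, low-scoring labels in the top three would make almost any two photos overlap.

```python
def top_tags(scores: dict, k: int = 3, min_score: float = 0.2) -> list:
    """Return up to k labels, highest score first, dropping weak ones."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [label for label, score in ranked[:k] if score >= min_score]


def tags_overlap(tags_a: list, tags_b: list) -> bool:
    """True if two complaints share at least one tag."""
    return bool(set(tags_a) & set(tags_b))
```

With this in place, the streetlight-and-pothole photo carries both tags and can be matched against either an open streetlight ticket or an open pothole ticket instead of spawning a second one.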

3. Geofence ambiguity causes wrong-ward assignment

Problem: Citizen GPS is often 100–250 m off (phone GPS in dense urban areas), placing a complaint in the adjacent ward.

Root cause: Android GPS in high-rise corridors is notoriously inaccurate; WhatsApp sends the last known location, not a fresh fix.

Fix: Use a two-pass ward lookup: first try ST_Within(point, ward_polygon), then fall back to ST_DWithin(point, ward_polygon, 300) and pick the nearest ward. The 300 is only in metres if the columns are geography types or use a projected SRID; on plain SRID 4326 geometry the distance is in degrees. Log the fallback for audit; a human officer can correct the ward on the cockpit.
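The two-pass lookup can be illustrated without a database. The sketch below is a plain-Python stand-in for the PostGIS calls, assuming ward polygons are simple rings of (x, y) points in a projected coordinate system measured in metres; the function names and the ward IDs used in testing are illustrative.

```python
import math


def point_in_polygon(pt, poly):
    """Ray-casting test: count edge crossings to the right of the point."""
    x, y = pt
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def dist_to_segment(pt, a, b):
    """Shortest distance from a point to the segment a-b."""
    px, py = pt
    ax, ay = a
    dx, dy = b[0] - ax, b[1] - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def dist_to_polygon(pt, poly):
    n = len(poly)
    return min(dist_to_segment(pt, poly[i], poly[(i + 1) % n]) for i in range(n))


def lookup_ward(pt, wards, fallback_m=300.0):
    """Return (ward_id, used_fallback); ward_id is None if nothing is in range."""
    # Pass 1: exact containment, the equivalent of ST_Within.
    for ward_id, poly in wards.items():
        if point_in_polygon(pt, poly):
            return ward_id, False
    # Pass 2: nearest ward within the fallback distance, like ST_DWithin.
    nearest = min(wards, key=lambda w: dist_to_polygon(pt, wards[w]), default=None)
    if nearest is not None and dist_to_polygon(pt, wards[nearest]) <= fallback_m:
        return nearest, True
    return None, False
```

The used_fallback flag is the piece worth persisting: it marks exactly the complaints an officer may need to re-assign.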

4. WhatsApp template rejections block go-live

Problem: Meta rejects 20–30% of first-time template submissions for ULBs because government-branded messages need additional business verification.

Root cause: Meta's template review team flags messages that mention government body names without verified Official Business Account status.

Fix: Apply for Official Business Account status at least 3 weeks before go-live. Draft all 10-language templates in the approval queue simultaneously; do not wait for one language to clear before submitting the next.

5. Deduplication collapses unrelated complaints

Problem: Two different residents on the same street report a pothole and a broken sewer cover within 50 m of each other. The deduplication radius collapses them into one ticket.

Root cause: The deduplication query matches on proximity alone, ignoring category.

Fix: Add AND category = incoming_category to the ST_DWithin query. Also check that the candidate ticket's status is not RESOLVED; a resolved pothole should not absorb a fresh complaint.
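A sketch of the corrected query as a parametrised statement built in Python. It assumes the complaints.geom column is SRID 4326 geometry, so it is cast to geography to get a radius in metres, and it uses psycopg-style named placeholders. Excluding DUPLICATE tickets as well as RESOLVED ones is an extra assumption; the dedup_query helper is hypothetical.

```python
DEDUP_SQL = """
SELECT complaint_id
FROM complaints
WHERE category = %(category)s
  AND status NOT IN ('RESOLVED', 'DUPLICATE')
  AND ST_DWithin(
        geom::geography,
        ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326)::geography,
        %(radius_m)s
      )
"""


def dedup_query(category: str, lon: float, lat: float, radius_m: float):
    """Return the SQL text and bind parameters for a duplicate search."""
    params = {"category": category, "lon": lon, "lat": lat, "radius_m": radius_m}
    return DEDUP_SQL, params
```

Note that ST_MakePoint takes longitude first, then latitude; swapping them is a common source of silently empty result sets.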

6. Bulbul TTS latency spikes under load

Problem: IVR callers hear 2–3 seconds of silence during peak hours (Monday mornings, post-monsoon flood events).

Root cause: Bulbul v3 streaming has a cold-start overhead of ~400 ms; when multiple IVR calls arrive simultaneously, the first chunk is delayed further by API rate limits.

Fix: Pre-render the 6 most-used IVR prompts (welcome, complaint-received, status-in-progress, status-resolved, escalation-notice, office-closed) as static Bulbul audio files during deployment. Only dynamic responses (ticket number read-back, officer name) use real-time TTS.
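The split between pre-rendered and real-time audio can be reduced to a small lookup. This is a sketch only: the prompt keys mirror the six prompts listed in the fix, while the file layout and the audio_source function are hypothetical.

```python
# The six prompts that are pre-rendered at deployment time.
STATIC_PROMPTS = {
    "welcome",
    "complaint_received",
    "status_in_progress",
    "status_resolved",
    "escalation_notice",
    "office_closed",
}


def audio_source(prompt_key: str, language: str):
    """Return ("static", path) for cached prompts, else ("realtime_tts", None)."""
    if prompt_key in STATIC_PROMPTS:
        # Hypothetical layout: one pre-rendered file per language and prompt.
        return "static", f"ivr/{language}/{prompt_key}.wav"
    return "realtime_tts", None
```

Only the second branch touches the TTS API, so the peak-hour call volume that reaches Bulbul shrinks to the dynamic read-backs.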

7. DPDP audit trail grows unbounded

Problem: After 6 months of operation, the audit_log table has 40M rows and slows every dashboard query.

Root cause: Every status change, every officer access, and every AI model call is logged to the same table with no partitioning.

Fix: Partition audit_log by month using PostgreSQL range partitioning from day one. Archive partitions older than 180 days to object storage (S3 / NIC Object Store) using a Celery beat task.
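Monthly partitions have to exist before rows arrive, so a scheduled job usually creates them ahead of time. The helper below generates the DDL for one month; it assumes the parent table was created with PARTITION BY RANGE on a timestamp column, and the month_partition_ddl name and partition naming scheme are illustrative.

```python
from datetime import date


def month_partition_ddl(year: int, month: int, table: str = "audit_log") -> str:
    """Build the CREATE TABLE statement for one monthly range partition."""
    start = date(year, month, 1)
    # The upper bound is exclusive, so it is the first day of the next month.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"{table}_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {table} "
        f"FOR VALUES FROM ('{start.isoformat()}') TO ('{end.isoformat()}');"
    )
```

Archiving then works on whole partitions: a partition older than the retention window can be detached and exported without scanning or locking the live months.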

Solving these issues took us over 120 hours of testing across five municipal datasets — the course walks you through each fix with working code.

Section 7 — Ready to Build This Yourself?

Understanding the architecture is not the same as shipping production code. The gap between a whiteboard diagram and a platform that a real ULB or Discom can put in front of citizens involves hundreds of small decisions: the exact Sarvam-105B prompt that handles Marwari dialect routing, the PostGIS query that handles ward boundary edge cases, the Celery task that sends escalation WhatsApp templates without hitting rate limits.

The Vernacular Citizen-Service Platform course on labs.codersarts.com gives you everything you need to go from zero to a deployable, SI-grade system:

✅ Full source code for all five implementation phases
✅ 12+ video walkthroughs, one per major component
✅ Docker Compose setup — one command to run the full stack locally
✅ Pre-built ward-mapping templates for BMC, BBMP, and GHMC geofences
✅ Sample complaint dataset (5,000 labelled complaints across 5 categories, 6 languages)
✅ Sarvam AI prompt library — Vision, Saaras, 105B, and Bulbul configs for civic use cases
✅ RBAC role definitions for ward officer, zone supervisor, and senior officer
✅ IVR scripts in 10 Indian languages (Hindi, Kannada, Tamil, Telugu, Bengali, Marathi, Gujarati, Odia, Punjabi, Malayalam)
✅ DPDP-compliant audit schema and data-retention Celery tasks
✅ Deployment walkthrough: Docker Compose → AWS Mumbai / NIC Cloud
✅ Lifetime access + free updates as Sarvam AI models evolve
✅ Community Slack for questions, peer review, and client-deployment stories

$29. Everything above.

Get the Full Course → labs.codersarts.com

Need to adapt this for a specific ULB, Discom, or state portal? Book a 1:1 Guided Session at $99 — a live walkthrough of your exact deployment scenario, including custom workflow design and integration with your existing GIS or CRM.

Section 8 — Conclusion

The Vernacular Citizen-Service Platform chains four Sarvam AI models — Vision, Saaras v3, Sarvam-105B, and Bulbul v3 — over a FastAPI state machine backed by PostgreSQL + PostGIS, turning a chaotic multi-language complaint inbox into a structured, SLA-governed workflow that reaches citizens in their own language and dialect. The hardest problems — dialect variation, geofence ambiguity, deduplication at ward scale, and DPDP compliance — are solvable with the right architectural choices, and none of them require a 12-month SI build.

The simplest place to start: stand up Stack A (FastAPI + SQLite + Sarvam APIs + ngrok) over a single weekend, connect a test WhatsApp number, and send a photo of a pothole from your phone. Once you see Sarvam Vision classify it and Sarvam-105B draft a reply in your language, the rest of the architecture clicks into place.

The full build — source code, datasets, prompt library, Docker setup, and deployment guide — is waiting for you at labs.codersarts.com.
