top of page

How to Build an LLM Router Gateway with LiteLLM: Fallbacks, Semantic Caching, Per-Tenant Keys, and Cost Tracking

  • 2 hours ago
  • 15 min read


Introduction


You've just shipped a feature powered by GPT-4o. It works beautifully—until OpenAI's API goes down at 3 AM, your on-call engineer wakes up to a flood of errors, and your customer-facing chat interface shows a broken spinner for two hours. Meanwhile, you have no idea which internal team is responsible for burning through $3,000 in API credits last week, and you're manually switching between provider SDKs every time you want to test a new model. Sound familiar?


An LLM router gateway is a single proxy layer that sits between your application and multiple LLM providers, handling provider switching, fallbacks, retries, caching, and cost tracking automatically. Instead of writing brittle provider-switching logic in every service that calls an LLM, you send all requests to one endpoint and let the gateway handle the complexity.


Real-world use cases:


  • SaaS products that need provider redundancy so an OpenAI outage doesn't take down their entire product


  • Platforms with multiple customer tenants requiring isolated API key management and per-tenant spend caps


  • ML teams running cost-reduction experiments by routing cheaper or cached responses for repeated prompts


  • Startups gradually migrating from OpenAI to self-hosted models without changing application code


  • Internal developer platforms giving multiple teams access to LLMs under a single managed gateway with audit logging


  • Enterprises needing a single choke point for LLM traffic to enforce security policies and compliance logging


This blog covers the complete architecture of a production-grade LLM router gateway built with LiteLLM, including multi-provider routing, fallback chains, semantic caching, per-tenant virtual keys, and real-time cost tracking. It explains what to build and why each design decision matters, to build it for yourself please do reach out to us at contact@codersarts.com


How It Works: Core Concept


The fundamental problem with using multiple LLM providers directly is API fragmentation. OpenAI uses one SDK, Anthropic uses another, and your self-hosted vLLM endpoint has its own request format. Each provider returns errors differently: OpenAI sends RateLimitError exceptions, Anthropic returns HTTP 429 with specific headers, and vLLM might timeout silently. When one provider is down, your application code has to detect the failure, parse the error, decide which provider to try next, reformat the request, and retry—all while tracking which attempt you're on and ensuring you don't blow past your budget.


The naive approach fails because implementing this logic in application code means:


  • Duplicating retry/fallback logic across every microservice that calls an LLM


  • Maintaining provider-specific error handling in multiple repositories


  • No shared cache—identical prompts hit the API multiple times across different services


  • No central cost visibility—each team tracks spend independently (or not at all)


  • Deployment coupling—swapping providers or updating retry logic requires redeploying every dependent service


The LLM router gateway solves this by introducing a unified proxy layer that speaks OpenAI's API format on the frontend but abstracts away provider differences on the backend. Your application sends a standard OpenAI-compatible request to the gateway. The gateway consults a semantic cache in Redis to check if a similar prompt has been answered recently. If not, it routes the request to the highest-priority available provider (OpenAI, Anthropic, or vLLM) based on a configured fallback chain. If that provider times out or returns a rate-limit error, the gateway automatically retries and falls over to the next provider. Every request is tagged with a virtual tenant key, and token usage is logged to a cost-tracking database. Your application code never knows which provider actually served the response—it just gets a standard OpenAI-shaped reply.


Data Flow Diagram:

SETUP PHASE:
┌─────────────────┐
│ config.yaml     │  ← Define models, fallback order, cache threshold
└────────┬────────┘
         │
         v
┌─────────────────┐
│ LiteLLM Proxy   │  ← Reads config, connects to Redis + Postgres
└────────┬────────┘
         │
         v
┌─────────────────┐
│ Docker Compose  │  ← Starts Redis, Postgres, LiteLLM, vLLM
└─────────────────┘

RUNTIME PHASE (per request):
┌──────────────┐
│ Application  │  ← Sends OpenAI-compatible request with virtual key
└──────┬───────┘
       │
       v
┌──────────────────┐       ┌───────────┐
│ LiteLLM Proxy    │──────>│  Redis    │  ← Check semantic cache
└──────┬───────────┘       └───────────┘
       │                         │
       │ (cache miss)            │ (cache hit → return cached response)
       v                         v
┌──────────────────┐       ┌─────────────┐
│ Router Logic     │       │ Application │
│ Try: OpenAI      │       └─────────────┘
│ Fallback: Claude │
│ Fallback: vLLM   │
└──────┬───────────┘
       │
       v (on success)
┌──────────────────┐
│ Log to Postgres  │  ← Record tokens, cost, tenant_id
└──────┬───────────┘
       │
       v
┌──────────────────┐
│ Return response  │
└──────────────────┘

Analogy: Think of the gateway as a smart assistant at a hotel concierge desk. When you ask for dinner recommendations, the assistant first checks their notes (semantic cache) to see if they've answered a similar question today. If not, they try calling the head chef (primary provider). If the chef is unavailable, they call the sous chef (fallback 1), then the bartender (fallback 2). They log which staff member answered and how long it took (cost tracking), but you just get one consistent answer format. You never need to know the internal phone tree—you just ask the concierge.


System Architecture Deep Dive


The LLM router gateway is structured as five distinct layers, each handling a specific responsibility. This separation ensures that changes to provider configuration, caching strategy, or cost tracking logic can happen independently without touching application code.


Architecture Overview:


  1. Client Layer: Your application code, internal tools, or third-party integrations. This layer sends standard OpenAI SDK requests to the gateway endpoint and receives OpenAI-compatible responses. It has no awareness of which provider actually served the request.


  2. Gateway/Proxy Layer: The LiteLLM proxy server running in Docker. This is the single entry point for all LLM traffic. It handles authentication (virtual keys), request validation, semantic cache lookups, provider routing, retry logic, and response normalization.


  3. Caching Layer: A Redis instance storing embeddings of recent prompts and their responses. Before hitting any paid API, the gateway computes a vector embedding of the incoming prompt and checks if a semantically similar prompt exists in the cache (cosine similarity above a configured threshold). Cache hits return instantly at zero API cost.


  4. Provider Layer: Multiple LLM backends registered in the gateway's config. This includes OpenAI's hosted API, Anthropic's Claude API, and a self-hosted vLLM endpoint running a local model. The gateway maintains a priority-ordered list and handles fallbacks transparently.


  5. Persistence Layer: PostgreSQL or SQLite for cost tracking (token counts, estimated spend per request, tenant metadata) and virtual key management. The gateway writes to this database on every request, enabling real-time spend queries via a /cost endpoint.


Component Breakdown:

Component

Role

Options

LiteLLM Proxy

Request routing, retries, fallbacks, normalization

LiteLLM (OpenLIT fork also available)

Redis

Semantic cache storage, rate limiting state

Redis, Redis Stack (with vector search)

PostgreSQL/SQLite

Cost tracking, virtual key storage, audit logs

PostgreSQL, SQLite, MySQL (via LiteLLM adapter)

OpenAI API

Primary hosted model provider

gpt-4o, gpt-4o-mini, gpt-3.5-turbo

Anthropic API

Fallback hosted provider

claude-3-5-sonnet, claude-3-haiku

vLLM Endpoint

Self-hosted model backend

Llama 3.1, Mistral, Qwen, any GGUF-compatible model

Docker Compose

Orchestration, networking, volume management

Docker Compose, Kubernetes (for production scale)

Embedding Model

Semantic cache similarity computation

sentence-transformers (bge-small-en-v1.5), OpenAI embeddings API

Monitoring/Observability

Request tracing, error alerting, cost dashboards

Prometheus + Grafana, Datadog, custom /cost endpoint

Load Balancer (optional)

Distributes traffic across multiple proxy replicas

NGINX, Traefik, AWS ALB


Data Flow Walkthrough (Request Lifecycle):


  1. Application sends request: Your backend service calls the gateway at http://localhost:4000/chat/completions with a virtual API key in the Authorization header and a standard OpenAI request body (model, messages, temperature, etc.).


  2. Virtual key validation: LiteLLM checks the database to verify the key exists, is not expired, and has not exceeded its spend limit. If the key is invalid, a 401 Unauthorized response is returned immediately.


  3. Semantic cache lookup: The gateway extracts the user prompt, generates an embedding using a lightweight sentence-transformer model, and queries Redis for any cached embeddings with cosine similarity above the configured threshold (e.g., 0.92).


  4. Cache hit (fast path): If a match is found, the cached response is returned directly. The request is logged to Postgres with cache_hit=true and zero cost. Total latency: ~20ms. No external API call is made.


  5. Cache miss (slow path): If no similar prompt is found, the gateway proceeds to the router. It selects the highest-priority available model from the configured fallback chain (e.g., openai/gpt-4o → anthropic/claude-3-5-sonnet → vllm/llama-3.1-70b).


  6. Primary provider call: The gateway reformats the request into the selected provider's native format and makes an HTTP call. If the call succeeds, the response is normalized back to OpenAI format, cached in Redis (with the prompt embedding), and logged to Postgres with token counts and estimated cost.


  7. Retry and fallback (on failure): If the primary provider times out (30s default), returns a rate-limit error (HTTP 429), or throws a 5xx server error, LiteLLM automatically marks that provider as temporarily unavailable and retries with the next provider in the chain. This happens transparently—your application sees no retry logic.


  8. Response normalization: Regardless of which provider succeeded, the response is transformed into OpenAI's standard schema (choices, usage, model fields). Your application cannot distinguish between a response from GPT-4o, Claude, or vLLM without checking logs.


  9. Cost logging: The gateway writes a record to Postgres containing: tenant_id, model_used, prompt_tokens, completion_tokens, total_cost_usd, cache_hit, provider, timestamp. This powers the /cost/report endpoint.


  10. Application receives response: The client gets a standard OpenAI-shaped JSON response and continues execution. It has no idea a fallback happened, a cache was checked, or which provider was billed.


Non-Obvious Design Decisions:


  • Why embed prompts instead of hashing them:

    A cryptographic hash (MD5, SHA256) requires exact string matches, so "What is the capital of France?" and "what is the capital of france?" would miss the cache despite being semantically identical. Embedding-based similarity allows fuzzy matching on meaning, which dramatically improves cache hit rates for real-world prompts that vary slightly in phrasing or whitespace.


  • Why log costs in the proxy, not the application:

    Application-side cost tracking is unreliable because it requires every team to instrument their code correctly, use the same calculation logic, and report to a shared database. Centralizing cost logging in the proxy ensures consistency, prevents underreporting (forgot to add telemetry to the new microservice), and allows finance teams to audit spend without touching application code.


Tech Stack Recommendation


There is no one-size-fits-all stack for an LLM gateway. The right choice depends on whether you're prototyping over a weekend or preparing for production traffic at scale. Below are two opinionated recommendations: one optimized for speed of implementation, the other for operational reliability.


Stack A: Beginner/Prototype (Weekend Build)

Layer

Technology

Why

Proxy

LiteLLM (standalone Docker image)

Pre-built, zero custom code, YAML-configured

Cache

Redis (Docker, no persistence)

Simplest setup, in-memory only, fine for dev/testing

Database

SQLite (local file)

No separate server process, bundled with LiteLLM

Models

OpenAI API only

Fewest moving parts, no self-hosting

Embedding

sentence-transformers/all-MiniLM-L6-v2

Runs in LiteLLM process, no external API cost

Orchestration

Docker Compose

One docker-compose.yml, boot with docker-compose up

Observability

LiteLLM built-in /health + logs to stdout

Enough to debug basic issues during development


Estimated monthly cost: $0 infrastructure (runs on your laptop or a $5 DigitalOcean droplet) + OpenAI API usage (~$50–$200 depending on traffic). Total: ~$50–$200/month.


Stack B: Production-Ready (Designed to Scale)

Layer

Technology

Why

Proxy

LiteLLM (Kubernetes deployment, 3+ replicas)

Horizontal scaling, zero-downtime deploys, load balanced

Cache

Redis Cluster (managed AWS ElastiCache or self-hosted)

Persistent storage, automatic failover, AOF snapshots

Database

PostgreSQL (managed RDS or self-hosted with replication)

ACID guarantees, complex queries for cost analytics, backups

Models

OpenAI + Anthropic + self-hosted vLLM (A100 GPU instance)

Provider diversity, cost optimization via local inference

Embedding

OpenAI text-embedding-3-small API

Higher accuracy than local models, scales with traffic

Orchestration

Kubernetes (EKS, GKE, or self-managed)

Auto-scaling, health checks, rolling updates, Helm charts

Observability

Prometheus (metrics) + Grafana (dashboards) + Sentry (errors)

Production-grade monitoring, alerting, error tracking

Load Balancer

NGINX Ingress or cloud ALB

SSL termination, rate limiting, DDoS protection

Secrets Management

AWS Secrets Manager or HashiCorp Vault

Rotate API keys without redeployment, audit access

CI/CD

GitHub Actions or GitLab CI

Automated testing, config validation, blue/green deploys


Estimated monthly cost: $300–$800 infrastructure (Kubernetes cluster, Redis, Postgres, vLLM GPU instance) + API usage ($200–$1,000 depending on cache hit rate and provider mix). Total: $500–$1,800/month for a team of 20–50 engineers.


Implementation Phases


Building an LLM router gateway is not a monolithic task. Breaking it into phases allows you to validate each layer independently, test failure modes in isolation, and ship a minimal working gateway before adding advanced features like semantic caching or cost analytics.


Phase 1: Basic Multi-Provider Routing


What you're building: A LiteLLM proxy that accepts OpenAI-compatible requests and forwards them to either OpenAI or Anthropic based on a static configuration. No caching, no fallbacks yet—just request translation and response normalization.


Key technical decisions:


  • Which models to register (e.g., do you expose gpt-4o-mini and claude-3-haiku for cost-sensitive use cases, or only flagship models?)


  • How to structure your config.yaml model list (flat list vs. grouped by capability tier)


  • Whether to run LiteLLM in Docker or as a bare Python process (Docker is strongly recommended for dependency isolation)


Testing checkpoint: Send a curl request to the proxy with "model": "gpt-4o" and verify you get a valid OpenAI response. Change the model to "claude-3-5-sonnet" and confirm the proxy translates the request to Anthropic's format and normalizes the response back.


Configuring LiteLLM's model registry correctly—especially handling provider-specific quirks like Anthropic's required anthropic_version header or vLLM's custom endpoint paths can be done if you reach out to contact@codersarts.com


Phase 2: Fallback Chains and Retry Logic


What you're building: Extend the router to automatically fall back to a secondary provider when the primary one fails. Configure retry logic (e.g., retry up to 3 times with exponential backoff before giving up).


Key technical decisions:


  • Fallback order and cost implications (if Anthropic Claude is cheaper than GPT-4o, should it be primary or fallback?)


  • Retry strategy per error type (do you retry on timeouts but not on invalid API keys? Do you retry on HTTP 500 but not 400?)


  • How many allowed failures before a provider is temporarily disabled (LiteLLM's allowed_fails setting)


Testing checkpoint: Shut down the OpenAI API (or block api.openai.com in your firewall) and send a request. Verify that the gateway returns a response from Anthropic instead, with no manual intervention required. Check the logs to confirm the fallback attempt.


Tuning retry delays to avoid amplifying a provider's outage with thundering-herd retries, and understanding LiteLLM's cooldown_time parameter to prevent hot-looping on persistent failures, can be done if you reach out to contact@codersarts.com


Phase 3: Semantic Caching with Redis


What you're building: Integrate Redis as a semantic cache. Before routing a request to an LLM provider, the gateway generates an embedding of the prompt and checks if a sufficiently similar prompt has been answered recently. If yes, return the cached response immediately.


Key technical decisions:


  • Similarity threshold (0.85? 0.90? 0.95? Too low = wrong cache hits, too high = wasted cache misses)


  • Embedding model (local sentence-transformers for zero cost, or OpenAI embeddings API for better accuracy?)


  • Cache TTL (time-to-live: how long should responses stay in the cache before expiring?)


Testing checkpoint: Send the same prompt twice in a row. The first request should hit OpenAI; the second should return instantly from the cache with zero API cost. Then send a slightly rephrased version of the prompt (e.g., "What is the capital of France?" → "Tell me the capital of France") and verify it still hits the cache if above the threshold.


Debugging cache misses when prompts should match but don't (often caused by whitespace normalization issues or embedding model mismatch between cache writes and reads) can be done if you reach out to contact@codersarts.com

.

Phase 4: Virtual Keys and Per-Tenant Spend Limits


What you're building: Replace the single master API key with a system of virtual keys that map to tenant IDs or team names. Each virtual key can have a monthly spend limit, and requests exceeding that limit are rejected automatically.


Key technical decisions:


  • Key generation strategy (UUIDs? Prefixed strings like sk-tenant-abc123? Signed JWTs with embedded tenant metadata?)


  • Where to store virtual key metadata (LiteLLM's built-in SQLite database, or external Postgres for auditability?)


  • How to enforce spend limits (hard block at limit, or soft warning with alert?)


Testing checkpoint: Generate two virtual keys, each with a $10 monthly spend limit. Send requests using each key and verify that the /key/info endpoint shows incrementing spend. Make enough requests to exceed the limit on one key, and confirm subsequent requests with that key return HTTP 402 Payment Required.


LiteLLM's virtual key system stores state in its own database schema, and understanding the interaction between max_budget, budget_duration, and soft_budget flags to prevent accidental lockouts or runaway spend can be done if you reach out to contact@codersarts.com


Phase 5: Real-Time Cost Tracking and Analytics


What you're building: A /cost/report endpoint that surfaces per-tenant, per-model spend over arbitrary time windows (last 24 hours, last 7 days, current month). Implement custom cost calculation hooks to account for self-hosted vLLM inference costs (GPU time) that LiteLLM's built-in pricing table doesn't track.


Key technical decisions:


  • Database schema for cost logs (one row per request, or aggregated hourly/daily rollups for query performance?)


  • How to calculate vLLM costs (flat per-token rate based on your GPU cost, or dynamic based on measured inference time?)


  • Whether to expose raw logs or only aggregated metrics via the API (raw logs enable custom analysis but risk leaking prompt content)


Testing checkpoint: Send 100 requests split across three virtual keys and two models (50 to GPT-4o, 50 to Claude). Query the /cost/report endpoint and verify it returns accurate per-tenant spend breakdowns. Shut down the OpenAI API to force fallbacks to Claude, and confirm the cost report reflects the more expensive Claude tokens.


Hooking into LiteLLM's success_callback and failure_callback to log custom cost events (like GPU time for vLLM or embedding API costs for the semantic cache) requires understanding the callback signature and async context handling can be done if you reach out to contact@codersarts.com


Common Challenges


Even with LiteLLM handling most of the heavy lifting, building a production-ready gateway surfaces non-obvious edge cases that don't appear in toy examples. Here are the issues that took us the longest to debug, and the fixes that actually worked.


1. Semantic Cache False Positives


Problem: The cache returns a response to "What is the capital of France?" when the user asks "What is the capital of Italy?"—both prompts embed to similar vectors because they share most of their words.


Root cause: The embedding model (e.g., all-MiniLM-L6-v2) is optimized for general sentence similarity, not fine-grained factual distinctions. A cosine similarity of 0.88 might seem "close enough," but it can collapse semantically different questions.


Fix: Lower the similarity threshold to 0.94 or higher, or switch to a more powerful embedding model like OpenAI's text-embedding-3-small. Alternatively, include metadata tags (e.g., topic, language) in the cache key to partition the cache space and prevent cross-contamination between unrelated queries.


2. Fallback Cost Explosion


Problem: Your primary provider (OpenAI GPT-4o-mini at $0.15/1M input tokens) goes down, and all traffic falls back to Anthropic Claude 3.5 Sonnet ($3/1M input tokens)—20× more expensive. Your bill spikes from $200 to $4,000 before anyone notices.


Root cause: Fallback chains prioritize reliability over cost. LiteLLM doesn't warn you when a fallback provider is significantly more expensive than the primary.


Fix: Set per-model budget caps in the router_settings config, or implement a custom middleware that checks estimated request cost before routing. Monitor the /model/info endpoint to track which model is serving most requests—a sudden shift to the fallback provider should trigger an alert.


3. Virtual Key Scope Confusion


Problem: You create a virtual key with a $100 budget, but it seems to share that budget across multiple tenants, or the budget resets unexpectedly mid-month.


Root cause: LiteLLM's max_budget field is scoped to the key itself, but the budget_duration parameter (e.g., "30d") resets on a rolling window from the key's creation date, not a calendar month boundary. If you create keys on different days, their reset cycles are staggered.


Fix: Use the budget_reset_at field (Unix timestamp) to align all keys to the same monthly reset time (e.g., midnight UTC on the 1st of each month). Document this clearly in your internal key provisioning runbook.


4. Provider Error Normalization Gaps


Problem: When vLLM times out, the error is logged as a generic "Connection error," but when OpenAI times out, you get a detailed APITimeoutError with the exact stage of the request (DNS, TCP handshake, response read). This makes debugging vLLM issues much harder.


Root cause: LiteLLM normalizes most provider errors but not all. Self-hosted backends like vLLM often have custom error formats that don't map cleanly to OpenAI's exception hierarchy.


Fix: Wrap the vLLM provider call in a custom error handler that catches raw HTTP exceptions and logs structured metadata (endpoint, payload size, timeout duration). Consider contributing these normalizations back to the LiteLLM open-source project.


5. Redis Boot Race Condition


Problem: Running docker-compose up starts the LiteLLM proxy before Redis finishes initializing. The proxy crashes with "Connection refused" and the whole stack fails to boot.


Root cause: Docker Compose's default depends_on only waits for the container to start, not for the service inside the container to be ready. Redis takes 2–3 seconds to accept connections after the container boots.


Fix: Add a healthcheck to your Redis service in docker-compose.yml and use depends_on with condition: service_healthy. This forces LiteLLM to wait for Redis to pass its healthcheck before starting.


6. Cost Calculation Drift for Self-Hosted Models


Problem: LiteLLM reports that your vLLM endpoint costs $0 per request, even though you're paying $1.50/hour for GPU time on your cloud provider.


Root cause: LiteLLM's built-in pricing table only includes hosted API providers (OpenAI, Anthropic, Cohere, etc.). It has no knowledge of your infrastructure costs.


Fix: Implement a success_callback function that calculates GPU cost based on measured inference time and writes it to your Postgres cost log. For example, if your A100 instance costs $1.50/hour and a request takes 2 seconds, log an additional $0.00083 cost for that request.


7. Concurrent Request Queueing Under Load


Problem: When traffic spikes, requests start queueing and response times balloon from 1 second to 30+ seconds, even though your GPU utilization is only at 40%.


Root cause: LiteLLM's default concurrency settings are conservative to prevent overwhelming provider APIs. The rpm (requests per minute) and tpm (tokens per minute) limits in router_settings might be set too low for your actual capacity.


Fix: Increase rpm and tpm limits gradually while monitoring provider error rates. For self-hosted vLLM, set rpm to null (unlimited) and rely on vLLM's own queuing and batching logic instead.


Solving these issues took us 40+ hours of testing across different traffic patterns and failure scenarios—can be done if you reach out to contact@codersarts.com


Ready to Build This Yourself?


You now understand the architecture, the stack options, the implementation phases, and the pitfalls to avoid. But there's a gap between knowing the design and shipping a working gateway. The difference is tested code, validated configs, and the 100+ micro-decisions that don't fit in a blog post.


Conclusion


An LLM router gateway turns a fragile, provider-coupled architecture into a robust, centrally managed system. By routing all LLM traffic through a single LiteLLM proxy, you get automatic fallbacks when providers go down, zero-cost responses from a semantic cache, per-tenant spend isolation, and real-time cost visibility—all without changing a single line of application code.


Start with the beginner stack (Docker Compose, SQLite, OpenAI only) to validate the concept in a weekend. Once the architecture clicks, layer in Redis caching, self-hosted vLLM for cost reduction, and Postgres-backed analytics for production. The hardest parts config syntax, error normalization, cache threshold tuning, and cost calculation hooks are already solved do reach out to contact@codersarts.com


Ship your LLM gateway this month, not this quarter.

 
 
 

Comments


bottom of page