top of page

How to Deploy vLLM in Production: OpenAI-Compatible API, Tensor Parallelism on 2 GPUs, and Docker — Complete Guide

  • 22 hours ago
  • 14 min read


Introduction


You've finally convinced your team to self-host an LLM. You've chosen a 7B parameter model, spun up a cloud instance with two A100s, and written a basic Python script to load the model and generate text. Then reality hits: your inference server processes one request at a time, leaving 90% of your GPU compute idle. Concurrent users wait in line. Memory overflows mid-generation. And worst of all, migrating your existing OpenAI client code to hit your new server requires rewriting half your application.


This guide shows you how to deploy a production-grade vLLM inference server — a containerized, multi-GPU LLM endpoint that exposes an OpenAI-compatible REST API, handles concurrent requests efficiently via continuous batching, and scales to real-world production workloads.


Real-world use cases:


  • Teams self-hosting an LLM to avoid OpenAI API costs at scale

  • Enterprises with data-privacy requirements that prevent sending data to third-party APIs

  • ML engineers benchmarking open-source models before committing to a model in production

  • Startups building LLM-powered products on a budget using consumer or cloud A100/H100 instances

  • Researchers needing a high-throughput inference endpoint for large-scale evaluation runs

  • Developers migrating an existing OpenAI-based application to a self-hosted model with zero client-side code changes


This blog covers the core architecture, technology stack, implementation phases, and critical challenges you'll encounter when deploying vLLM in production. What this is NOT: a full source code tutorial. We focus on architecture, design decisions, and the non-obvious gotchas that take days to debug. For working, tested code and configurations, see the course link at the end.

📄 Before you dive in — grab the free PRD template that maps out this entire system: architecture, API spec, sprint plan, and system prompt. [Download the free PRD]

How It Works: Core Concept


Most LLM inference tutorials show you how to load a model and generate a single response. This naive approach is catastrophically inefficient in production. When you process requests one at a time, your GPU sits idle during memory transfers, tokenization overhead, and I/O waits. A $40,000 H100 running at 10% utilization is an expensive space heater.


The fundamental problem: Transformer models operate on batches of sequences, but traditional serving frameworks treat each HTTP request as an independent job. Request A arrives, gets fully processed (prompt encoding → autoregressive generation → response), then Request B starts. This serial processing leaves massive GPU throughput on the table.


How vLLM solves this: vLLM implements continuous batching (also called iteration-level batching). Instead of waiting for Request A to finish, vLLM groups all pending requests into a single batch on every forward pass. As soon as Request A generates one token, that slot in the batch becomes available for a new request. If Request C arrives mid-generation, it joins the batch immediately — no waiting for the current batch to finish.

Think of it like airport security. Traditional batching is "wait until this group of 20 passengers clears security, then start the next group." Continuous batching is "as soon as someone exits, pull the next person from the queue" — the conveyor belt never stops.


Additionally, vLLM introduces:


  • Tensor parallelism to split a model across multiple GPUs when a single GPU can't hold it

  • Prefix caching to store the KV-cache of repeated prompt prefixes (system prompts, few-shot examples) so they're not recomputed on every request

  • PagedAttention to manage GPU memory in 4KB blocks like an operating system manages RAM, eliminating fragmentation


ASCII data flow diagram:


SETUP PHASE:
  [Docker container start]
      ↓
  [Download model from HuggingFace Hub] → [Cache to mounted volume]
      ↓
  [vLLM engine init: split model across GPU 0 & GPU 1 via tensor parallelism]
      ↓
  [FastAPI server starts on port 8000]

RUNTIME PHASE (per request):
  [HTTP POST /v1/chat/completions]
      ↓
  [Request enters continuous batching queue]
      ↓
  [vLLM scheduler: group pending requests into micro-batch]
      ↓
  [Check prefix cache: system prompt already computed?]
      ↓
  [Forward pass on GPU 0 & GPU 1 (tensor-parallel sync)]
      ↓
  [Generate 1 token, add to KV-cache]
      ↓
  [Is sequence done? No → loop | Yes → stream response chunk]
      ↓
  [Release batch slot, pull next request from queue]

System Architecture Deep Dive


Architecture Overview


A production vLLM deployment consists of five layers:


1. Client LayerAny HTTP client that speaks OpenAI's API spec — existing apps using openai Python SDK, LangChain, custom REST clients — hit your vLLM endpoint with zero code changes beyond swapping the base_url.


2. Reverse Proxy / Ingress LayerNginx or Traefik terminates SSL, handles rate limiting, and routes traffic to the vLLM container. In cloud environments, this is your load balancer.


3. Application LayerThe vLLM FastAPI server exposes /v1/completions and /v1/chat/completions endpoints. This layer translates OpenAI-format requests into vLLM engine calls and streams responses back.


4. Inference Engine LayerThe vLLM engine itself — manages the continuous batching scheduler, prefix cache, KV-cache allocation via PagedAttention, and orchestrates tensor-parallel execution across GPUs.


5. Model & Compute LayerPyTorch model weights loaded from Hugging Face Hub, split across two GPUs. CUDA kernels execute attention, FFN, and sampling operations. nvidia-docker runtime bridges the container to physical GPUs.


Component Breakdown

Component

Role

Options

Model Checkpoint

7B parameter LLM weights

Mistral-7B-Instruct-v0.2, Meta-Llama-3-8B-Instruct, Vicuna-7B

Inference Framework

Batching, memory management, serving

vLLM, TGI (Text Generation Inference), Triton, llama.cpp

GPU Runtime

Container access to NVIDIA GPUs

nvidia-docker, nvidia-container-toolkit

Parallelism Strategy

Split model across GPUs

Tensor parallelism (2–8 GPUs), Pipeline parallelism (8+ GPUs), Data parallelism (multiple replicas)

API Server

HTTP interface

FastAPI (vLLM built-in), custom Flask/Starlette wrapper

Reverse Proxy

SSL, load balancing, rate limiting

Nginx, Traefik, Caddy, cloud LB (AWS ALB, GCP LB)

Container Orchestration

Deploy, scale, restart

Docker Compose, Kubernetes, Nomad, ECS

Model Cache Storage

Persist HF downloads across restarts

Docker volume, EFS/NFS mount, local SSD

Quantization

Reduce VRAM usage

fp16, bfloat16, AWQ (4-bit), GPTQ (4-bit), none (fp32)

Monitoring

Metrics, logs, alerts

Prometheus + Grafana, CloudWatch, Datadog


Data Flow Walkthrough


Step 1: User sends POST /v1/chat/completions with messages array [{role: "user", content: "Explain tensors"}]


Step 2: FastAPI handler validates the request, extracts parameters (max_tokens, temperature, stream), and converts the OpenAI chat format into a vLLM-compatible prompt string


Step 3: Request enters vLLM's continuous batching queue with priority metadata (not FIFO — shorter prompts may get prioritized)


Step 4: Scheduler wakes on the next iteration (every ~10ms), checks pending requests, groups them into a micro-batch that fits GPU memory constraints (--max-num-batched-tokens)


Step 5: For each request, vLLM checks the prefix cache: "Has this system prompt / few-shot prefix been seen before?" If yes, reuse the stored KV-cache; if no, compute and cache it


Step 6: Batch is dispatched to GPU 0 and GPU 1 in tensor-parallel mode — each GPU computes half the attention heads, synchronizes with NCCL collectives, and produces the next token logits


Step 7: Sampling operation (top-p, temperature) runs on logits to select the next token ID


Step 8: If stream=true, the server emits a Server-Sent Event (SSE) chunk data: {"choices": [{"delta": {"content": "A tensor"}}]}; if stream=false, buffer until EOS


Step 9: KV-cache updated with the new token's key/value vectors, stored in PagedAttention blocks


Step 10: Is the sequence done (hit max_tokens or generated EOS token)? No → loop to Step 6. Yes → release the batch slot and pull the next waiting request into the batch


Step 11: Final response returned to client as {"choices": [{"message": {"content": "..."}}]}


Non-Obvious Design Decisions


Decision 1: Tensor parallelism over pipeline parallelism for 2 GPUs

With only two GPUs, tensor parallelism is the clear winner. Tensor-parallel splits each layer horizontally across GPUs — both GPUs work on every forward pass, so there's no pipeline bubble (idle time waiting for the previous stage). Pipeline parallelism shines at 8+ GPUs where communication overhead would otherwise dominate, but for 2 GPUs, the synchronization cost of tensor-parallel is negligible via NVLink/PCIe.


Decision 2: Continuous batching instead of static batching

Static batching (wait until you have N requests, then process the batch) is simpler to implement but catastrophic for tail latency. If you batch every 10 requests and Request #1 arrives alone, it waits until 9 more requests arrive — potentially seconds. Continuous batching sacrifices implementation complexity for predictable latency: every request starts processing within one scheduler iteration (~10ms).


Decision 3: OpenAI-compatible API instead of a custom interface

Exposing a vLLM-native API would be technically cleaner (no translation overhead), but real-world production apps are already written against OpenAI's spec. LangChain, Semantic Kernel, every LLM orchestration framework — they all assume OpenAI's JSON schema. By implementing the same API surface, you enable drop-in replacement: change one environment variable (OPENAI_BASE_URL) and the entire app switches to your self-hosted endpoint.


Tech Stack Recommendation


Stack A: Beginner / Prototype (Weekend Build)

Layer

Technology

Why

Model

Mistral-7B-Instruct-v0.2

Best instruction-following quality in the 7B class; permissive license

Inference

vLLM 0.4.0+

Single pip install, OpenAI-compatible endpoint out of the box

Container

Docker + Docker Compose

Simplest deployment — one docker-compose up command

GPU Runtime

nvidia-docker (nvidia-container-toolkit)

Standard NVIDIA-maintained runtime, works on Ubuntu 20.04+

Quantization

fp16 (default)

No setup required, halves VRAM vs. fp32, minimal quality loss

Proxy

None (direct port exposure)

Acceptable for internal testing; avoid in production

Monitoring

Docker logs + nvidia-smi polling

Zero-setup visibility into GPU util and request logs


Estimated monthly cost: $600–$800 (e.g., AWS p3.8xlarge with 4×V100 16GB, using 2 GPUs, reserved pricing) or $0 if running on existing on-prem hardware.


Stack B: Production-Ready (Scale to Thousands of Users)

Layer

Technology

Why

Model

Meta-Llama-3-8B-Instruct (AWQ quantized)

State-of-the-art 8B model; AWQ 4-bit drops VRAM to ~5GB per GPU with <2% quality loss

Inference

vLLM 0.4.2+ with prefix caching enabled

Latest version has critical memory leak fixes and improved batching

Container

Kubernetes with GPU node pools

Auto-scaling, rolling updates, pod restarts on OOM

GPU Runtime

nvidia-device-plugin for Kubernetes

Schedules pods to GPU nodes, handles device allocation

Quantization

AWQ 4-bit

4× VRAM reduction vs. fp16; enables larger batches or longer contexts

Proxy

Nginx Ingress Controller + Let's Encrypt

SSL termination, rate limiting (X requests/min per IP), path-based routing

Load Balancer

Cloud provider LB (AWS ALB, GCP LB)

Health checks, auto-failover, geographic distribution

Monitoring

Prometheus (vLLM metrics exporter) + Grafana

Real-time dashboards: tokens/sec, batch utilization, queue depth, P95 latency

Logging

Fluentd → CloudWatch / ELK stack

Centralized request logs, error tracking, audit trail

Autoscaling

Horizontal Pod Autoscaler (HPA) on request queue depth

Spin up additional replicas when queue >100 requests


Estimated monthly cost: $1,200–$2,000 (2× A100 40GB instances, load balancer, monitoring infrastructure, assuming 70% utilization on reserved/spot pricing).


Implementation Phases


Phase 1: Environment Setup & Model Download


What you're building: A GPU-enabled Docker container that successfully downloads a 7B model from Hugging Face Hub and verifies GPU visibility.


Key technical decisions:


  • Which model checkpoint? Mistral-7B-Instruct-v0.2 for quality, or Llama-3-8B for the latest architecture. Check the license: Mistral is Apache 2.0 (fully permissive); Llama-3 requires Meta's acceptable use policy agreement.


  • Where to cache the model? Inside the container (bloats image size to 15+ GB), or in a mounted volume (recommended)? Volume mounts persist across container restarts and enable sharing the cache across multiple vLLM containers.


  • Which CUDA version? Match your host's NVIDIA driver version. vLLM Docker images are tagged by CUDA version (e.g., vllm/vllm-openai:latest-cuda12.1). Mismatched CUDA/driver versions cause silent failures or degraded performance.



Phase 2: vLLM Configuration & Tensor Parallelism


What you're building: A vLLM engine that splits the 7B model across two GPUs and successfully runs a test inference.


Key technical decisions:


  • Tensor-parallel size: Must be a divisor of the model's number of attention heads. Mistral-7B has 32 heads, so valid values are 1, 2, 4, 8. For 2 GPUs, set --tensor-parallel-size=2.


  • GPU memory utilization: --gpu-memory-utilization=0.9 reserves 90% of VRAM for KV-cache. Setting this too high (0.95+) risks OOM under load; too low (0.7) wastes capacity. The optimal value depends on your batch size and max sequence length.


  • Max model length vs. max num batched tokens: --max-model-len=4096 is the per-sequence context window. --max-num-batched-tokens=8192 is the total tokens across all sequences in a batch. If you allow 10 concurrent 4K-context requests, you need 10 × 4096 = 40960 batched tokens.



Phase 3: OpenAI-Compatible API Integration


What you're building: A FastAPI server (vLLM's built-in server) that exposes /v1/completions and /v1/chat/completions endpoints and correctly handles streaming responses.


Key technical decisions:


  • Which OpenAI API version to target? vLLM implements the v1 API. Clients using the legacy Completion vs. new ChatCompletion classes need different endpoints.


  • Streaming vs. non-streaming? Streaming (Server-Sent Events) gives users immediate feedback for long generations but complicates error handling (you can't return a 500 status mid-stream). Non-streaming buffers the entire response, which risks timeouts for long outputs.


  • How to handle unsupported parameters? OpenAI's API has ~20 parameters (frequency_penalty, presence_penalty, logit_bias, etc.). vLLM supports a subset. Do you silently ignore unsupported params (risks user confusion) or return a 400 error (breaks compatibility)?



Phase 4: Continuous Batching & Prefix Caching


What you're building: A vLLM engine configured to maximize throughput via continuous batching and minimize redundant compute via prefix caching.


Key technical decisions:


  • Enable prefix caching? --enable-prefix-caching flag. This is a huge win if your workload has repeated system prompts (chatbots, agents) but adds memory overhead for the cache itself. Benchmark with and without.


  • Scheduler policy: --scheduler-policy=fcfs (first-come first-served) vs. shortest-job-first. FCFS is fairer but can lead to head-of-line blocking when a long request delays short ones.


  • Swap space: --swap-space=4 (GB) allows vLLM to offload KV-cache blocks to CPU RAM when GPU memory is full. Useful for spiky traffic but adds latency (PCIe transfers).



Phase 5: Docker Compose & Deployment


What you're building: A docker-compose.yml file that orchestrates the vLLM container, mounts the model cache volume, maps GPU devices, and optionally includes an Nginx reverse proxy.


Key technical decisions:


  • Volume mount strategy: Named volume (vllm-cache:/root/.cache/huggingface) vs. bind mount (./model-cache:/root/.cache/huggingface). Named volumes are Docker-managed (simpler) but harder to inspect; bind mounts give you direct filesystem access.


  • Port mapping: Expose vLLM directly on 8000:8000 (simple) or route through Nginx on 80:80 → vllm:8000 (production-ready). The latter enables SSL, rate limiting, and load balancing.


  • Restart policy: restart: unless-stopped ensures the container restarts after crashes or host reboots, critical for production uptime.



Common Challenges


Challenge 1: NCCL Initialization Hangs on Multi-GPU Setup


Problem: When you run vLLM with --tensor-parallel-size=2, the server starts but hangs indefinitely at "Initializing distributed environment." No error message, no timeout — just silence.


Root cause: NCCL (NVIDIA's collective communication library) requires peer-to-peer GPU communication. If your host has NVLink, this works out of the box. If not (e.g., consumer GPUs over PCIe), NCCL falls back to host memory transfers, which requires specific environment variables (NCCL_P2P_DISABLE=1) to prevent it from trying direct P2P and hanging.


Fix: Set environment variable NCCL_P2P_DISABLE=1 in your Docker Compose file or runtime command. For cloud instances, verify that the instance type supports GPU-to-GPU communication (AWS p3/p4 instances yes, g4dn instances no).


Challenge 2: Out-of-Memory Errors Mid-Request Under Load


Problem: vLLM runs fine for small workloads, but when you hit it with 20 concurrent requests, you get CUDA out of memory errors mid-generation, even though your GPUs have "enough" VRAM according to nvidia-smi.


Root cause: vLLM's memory estimator (--gpu-memory-utilization=0.9) reserves space for the model weights and a fixed KV-cache pool. But the actual KV-cache size depends on the number of sequences in the batch AND their lengths. If you set --max-num-batched-tokens too high, vLLM tries to allocate more KV-cache blocks than fit in VRAM, and the allocation fails mid-inference after some blocks are already committed.


Fix: Lower --max-num-batched-tokens from the default (often 8192+) to a value that fits your VRAM budget. Formula: available_VRAM = total_VRAM × gpu-memory-utilization - model_size. Then max_batched_tokens = available_VRAM / (bytes_per_token × num_layers × 2). Or just set it to 2048 and increase gradually while load testing.


Challenge 3: Model Download Timeout on First Container Start


Problem: docker-compose up runs for 15 minutes, then fails with "Connection timeout" while downloading the model from Hugging Face Hub.


Root cause: Hugging Face Hub rate-limits unauthenticated downloads and may throttle large files (a 7B model in fp16 is ~14 GB). Without a HF token, downloads can stall. Additionally, Docker's default network timeout is 60 seconds, too short for multi-GB files.


Fix: (1) Set environment variable HF_TOKEN=your_huggingface_token in Docker Compose to authenticate and bypass rate limits. (2) Pre-download the model outside the container using huggingface-cli download and mount it as a volume, or (3) Use a Docker image that already includes the model weights (bloated but reliable for production).


Challenge 4: Client Receives Malformed JSON in Streaming Mode


Problem: When using stream=true, your OpenAI Python client crashes with JSONDecodeError: Expecting value: line 1 column 1 even though non-streaming requests work fine.


Root cause: vLLM's streaming implementation sends Server-Sent Events (SSE) in the format data: {json}\n\n. Some HTTP clients (especially older versions of the requests library) don't handle SSE correctly and try to parse the raw stream as a single JSON object instead of line-delimited events.


Fix: Use the official openai Python SDK v1.0+, which has built-in SSE support, or manually parse the stream: split on \n\n, strip data: prefix, parse each chunk as JSON. Never use response.json() on a streaming response.


Challenge 5: Inconsistent Latency Spikes Every ~30 Seconds


Problem: Your vLLM endpoint usually responds in 200ms, but every ~30 seconds, a random request takes 2+ seconds. Load is constant, GPU utilization looks normal.


Root cause: Python's garbage collector (GC) runs periodically and can freeze the process for 100–500ms when cleaning up large objects (like KV-cache blocks). vLLM's PagedAttention allocates thousands of small tensors, triggering frequent GC.


Fix: Disable automatic GC and manually trigger it during idle periods, or increase the GC threshold (gc.set_threshold(10000, 10, 10)). In production, monitor GC pause time with gc.get_stats() and tune accordingly.


Challenge 6: vLLM API Returns Different Tokens Than OpenAI for Same Prompt


Problem: You send the same prompt to OpenAI's API and your vLLM endpoint with identical parameters (temperature=0), and get completely different outputs.


Root cause: "Identical parameters" aren't actually identical. OpenAI uses proprietary sampling logic, undocumented post-processing, and often applies content filters that modify logits. Even at temperature=0 (greedy decoding), implementation details like floating-point precision, tie-breaking in top-k/top-p, and the exact tokenizer vocab can cause divergence.


Fix: Accept that vLLM and OpenAI will never be byte-for-byte identical. For reproducibility, pin temperature=0, set a fixed seed, and compare outputs on quality (are they semantically equivalent?) rather than exact string match. If you need OpenAI-level consistency, you need OpenAI's API.


Challenge 7: Docker Container Sees 0 GPUs Despite nvidia-smi Working on Host


Problem: nvidia-smi works perfectly on your host machine, showing 2 GPUs, but inside the Docker container, nvidia-smi says "No devices found."


Root cause: The NVIDIA Container Toolkit isn't installed, or Docker isn't configured to use the nvidia runtime. Simply installing nvidia-docker2 isn't enough — you must edit /etc/docker/daemon.json to set the default runtime or explicitly pass --gpus all to docker run.


Fix: Install nvidia-container-toolkit: sudo apt-get install -y nvidia-container-toolkit, then edit /etc/docker/daemon.json to add "default-runtime": "nvidia", restart Docker daemon (sudo systemctl restart docker), and verify with docker run --rm nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi.



Ready to Build This Yourself?


Understanding the architecture is one thing. Shipping production code that actually works when 100 users hit your endpoint simultaneously is another.


Here's what you get in the full course ($24.99):


Complete Docker Compose setup — copy-paste docker-compose.yml with all GPU configs, volume mounts, and environment variables already tuned

Tested vLLM configuration files — production-ready --tensor-parallel-size, --gpu-memory-utilization, and --max-num-batched-tokens values for Mistral-7B and Llama-3-8B

OpenAI-compatible client examples — Python scripts using the openai SDK to test completions, chat, and streaming against your endpoint

Prefix caching setup guide — enable and verify that repeated system prompts are hitting the cache, not recomputing

Load testing scripts — Locust configurations to simulate 50–200 concurrent users and measure P50/P95/P99 latency

Quantization comparison — side-by-side benchmarks of fp16 vs. AWQ vs. GPTQ on quality, VRAM, and throughput

Nginx reverse proxy config — SSL termination, rate limiting (100 req/min per IP), and health check routing

Video walkthroughs — 3 hours of screencasts showing every docker-compose up, every debug step, every config tweak

Lifetime access — all future updates, new vLLM versions, additional model support

Private community — Slack workspace where you can ask questions and share benchmarks with other users

Deployment checklists — pre-flight checks before going live, monitoring setup, cost optimization tips

Troubleshooting playbook — decision trees for diagnosing OOM errors, NCCL hangs, slow inference, API mismatches


$24.99. Everything above.


👉 [Get the Full Course → labs.codersarts.com]


Need hands-on help? Book a 1:1 guided session ($99) where a Codersarts engineer pair-programs with you to get your specific model, GPU setup, and cloud environment running end-to-end. Includes up to 2 hours of live debugging, architecture review, and production-readiness checklist. [Schedule your session →]


Conclusion

Deploying vLLM in production is not just "install a library and run a script." It's tensor-parallel configuration, continuous batching tuning, VRAM budgeting, nvidia-docker runtime debugging, and OpenAI API compatibility — each with non-obvious failure modes that take hours to diagnose.


The architecture described here — OpenAI-compatible endpoint, tensor parallelism across 2 GPUs, continuous batching, prefix caching, and Docker Compose orchestration — is the foundation every self-hosted LLM deployment needs. Start with Stack A (Mistral-7B, Docker Compose, no quantization) to prove the concept, then graduate to Stack B (AWQ quantization, Kubernetes, monitoring) when you're ready to scale.


Ready to ship it? The full course gives you the tested code, configurations, and video walkthroughs to go from git clone to production in a weekend. [Start building → labs.codersarts.com]

 
 
 

Comments


bottom of page