Inference Serving (vLLM)

Act 4 · ~5 min

Theory

Why naive deployment fails: a single GPU processes the prompt quickly (prefill), then decodes just one token per forward pass. Without batching, each request occupies the GPU exclusively — utilization drops to 20–30% under concurrent load.
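
To make the bottleneck concrete, here is a toy sketch of a naive serving loop. It is not real model code (`forward_pass` is a stand-in for one GPU forward pass); it only illustrates that requests are handled strictly one at a time:

```python
from typing import List

def forward_pass(tokens: List[int]) -> int:
    """Stand-in for one GPU forward pass; returns the next token id (placeholder logic)."""
    return (tokens[-1] + 1) % 50_000

def serve_naively(queue: List[List[int]], max_new_tokens: int = 16) -> List[List[int]]:
    results = []
    for prompt in queue:                 # one request at a time; the rest of the queue waits
        tokens = list(prompt)            # prefill: the whole prompt is processed up front
        for _ in range(max_new_tokens):  # decode: exactly one token per forward pass
            tokens.append(forward_pass(tokens))
        results.append(tokens)
    return results

print(serve_naively([[101, 7, 42], [101, 9]]))
```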

vLLM's two key innovations:

  • Continuous batching: new requests join the active batch mid-generation; the GPU stays saturated
  • PagedAttention: the KV cache is stored in non-contiguous pages (like OS virtual memory); eliminates fragmentation
Request lifecycle (naive vs. continuous batching): a request arrives → prefill (prompt tokens processed) → decode, batched (tokens generated; new requests join continuously).
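
Inside the engine, these mechanisms are applied automatically. As a minimal sketch using vLLM's offline API (the model name and sampling values are illustrative assumptions), several prompts handed to one engine are prefilled and decoded together rather than one at a time:

```python
from vllm import LLM, SamplingParams

# Model choice and sampling settings are assumptions for this sketch.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = [
    "Summarize PagedAttention in one sentence.",
    "Why does continuous batching raise GPU utilization?",
]

# One call, many requests: the engine schedules prefill and decode
# for all prompts together instead of serving them sequentially.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```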

OpenAI-compatible API — vllm serve starts an HTTP server that exposes /v1/chat/completions. Existing OpenAI clients need only a base_url change.
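
A minimal sketch, assuming vllm serve is running locally on its default port 8000 and hosting the model named below; only base_url (plus a placeholder api_key) differs from a call against the hosted OpenAI API:

```python
from openai import OpenAI

# Point an unmodified OpenAI client at the local vLLM server.
# The port and model name are assumptions for this sketch.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain continuous batching briefly."}],
)
print(resp.choices[0].message.content)
```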

Deployment options:

  • Bare vllm serve command on any CUDA host
  • FastAPI wrapper adding auth, routing, or RAG context injection (see the sketch after this list)
  • SageMaker HuggingFace endpoint (managed autoscaling)
  • Kubernetes deployment with GPU node selectors
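
As a sketch of the FastAPI-wrapper option, the proxy below checks an API key and forwards chat requests to a vLLM server assumed to run on localhost:8000; the key store and handler names are illustrative, not a fixed recipe:

```python
import httpx
from fastapi import FastAPI, Header, HTTPException, Request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed local vLLM server
API_KEYS = {"demo-key"}  # hypothetical key store; use a real secret backend in practice

app = FastAPI()

@app.post("/v1/chat/completions")
async def proxy_chat(request: Request, authorization: str = Header(default="")):
    # Reject callers that do not present a known Bearer token.
    token = authorization.removeprefix("Bearer ").strip()
    if token not in API_KEYS:
        raise HTTPException(status_code=401, detail="invalid API key")

    # Forward the body unchanged; RAG context injection would modify it here.
    payload = await request.json()
    async with httpx.AsyncClient(timeout=120) as client:
        upstream = await client.post(VLLM_URL, json=payload)
    return upstream.json()
```

Run the wrapper with uvicorn on a different port than the vLLM server itself, so clients talk to the proxy and never reach the engine directly.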

Metrics to instrument: requests/sec, time-to-first-token (TTFT), tokens/sec. These feed LLMOps alerting — the topic covered next.
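
As a rough client-side way to observe two of these numbers, the sketch below streams a response from the same assumed local endpoint; streamed chunks are used as a proxy for tokens, so the throughput figure is approximate:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
ttft = None
chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model name
    messages=[{"role": "user", "content": "Describe PagedAttention in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if ttft is None:
            ttft = time.perf_counter() - start  # time-to-first-token
        chunks += 1

elapsed = time.perf_counter() - start
print(f"TTFT: {ttft:.3f}s, ~{chunks / elapsed:.1f} chunks/s (proxy for tokens/s)")
```

In production, prefer server-side numbers: vLLM's OpenAI-compatible server also exposes a Prometheus-format /metrics endpoint that LLMOps alerting can scrape.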