Load balancing web APIs vs LLM APIs
Same skeleton, different sport
Two system designs, side by side: load balancing a traditional web API and load balancing an LLM (Large Language Model) inference API. The diagrams look almost the same. The operational reality behind the boxes is what makes LLM Ops a different sport, and mapping it onto familiar territory is a useful exercise for any SRE who knows web ops cold.
Hardware first, because everything below follows from it. A traditional web API replica is a language-runtime process (Gunicorn, Node, Go binary) inside a Docker container, scheduled by Kubernetes onto a commodity x86 box – think AWS c6i.4xlarge or m6i.4xlarge, 16 vCPU and 32–64 GB RAM, around $0.70/hr on-demand, fungible, ready in seconds. An LLM API replica is a dedicated inference engine (vLLM, TGI – Text Generation Inference, TensorRT-LLM) inside a container with GPU devices passed through, scheduled onto a GPU node – AWS p5.48xlarge with 8× NVIDIA H100 80 GB or p4d.24xlarge with 8× A100, roughly $30–100/hr on-demand depending on instance type and region, 30 s–3 min to load weights into HBM (High Bandwidth Memory), and scarce enough that capacity is something to queue for. Same word "replica," different cost class, different scarcity, different blast radius when one dies.
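A quick back-of-envelope sketch of where a 30 s–3 min weight-loading window can come from. The 70B-parameter fp16 model and both storage bandwidth figures are illustrative assumptions, not measurements, and real cold starts add container pull, engine initialization, and warm-up on top.

```python
# Back-of-envelope cold-start estimate: time to stream model weights into HBM.
# Every input here is an illustrative assumption, not a measurement.

def weight_bytes(params_billion: float, bytes_per_param: int = 2) -> float:
    """Size of the weights in bytes (fp16 = 2 bytes per parameter)."""
    return params_billion * 1e9 * bytes_per_param


def load_seconds(size_bytes: float, gb_per_second: float) -> float:
    """Seconds to move size_bytes at a sustained rate (1 GB = 1e9 bytes)."""
    return size_bytes / (gb_per_second * 1e9)


if __name__ == "__main__":
    size = weight_bytes(70)  # hypothetical 70B-parameter model in fp16
    # Hypothetical sustained read rates for two storage tiers.
    for label, rate in [("network storage at ~1 GB/s", 1.0),
                        ("local NVMe at ~5 GB/s", 5.0)]:
        print(f"{label}: {load_seconds(size, rate):.0f} s")
```

Under these assumptions the transfer alone lands between roughly half a minute and a bit over two minutes, which is why where the weights are staged matters as much as how fast the pod schedules.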
The web API design uses familiar building blocks:
The LLM API design rhymes – box names look similar, but every layer carries extra responsibility:
The mental map – web concept → LLM concept
- NGINX → Smart router (Envoy + LiteLLM). Same job, but it now reads the request body. It hashes prompt prefixes for sticky routing, counts tokens (not requests) for rate limits, and routes on per-replica queue depth. The L7 (Layer 7) LB just gained opinions about content.
- Gunicorn worker → vLLM/TGI process. A web worker handled one request to completion. The inference engine runs one process per GPU shard and interleaves tokens from many in-flight requests in each forward pass – continuous batching. The queue that used to sit in front of the worker now lives inside it, invisible to standard health probes.
- Sticky sessions → KV-cache affinity. Sticky sessions were the exception (auth state, file uploads). For LLMs, prefix affinity is the default – same conversation to the same replica reuses the KV-cache (Key-Value cache) and cuts TTFT (time-to-first-token) drastically. Hash-on-prefix first, fall back to least-loaded only when the prefix is cold.
- DB connection pool → GPU memory budget. The new "pool" is HBM. Weights eat ~140 GB for a 70B-param fp16 (16-bit float) model; every active request burns KV-cache on top (hundreds of MB to several GB depending on context length). Overcommit and the GPU OOMs (Out Of Memory) – there is no graceful "wait in queue" at this layer, only at the router.
- Redis response cache → Prefix cache. Same intent (skip recomputation), different unit. The cache stores attention states by prompt prefix, not JSON by URL. A prefix-cache hit skips prefill for the shared prefix but still needs the GPU to decode; only an exact-match response cache at the router can skip the GPU entirely.
- Scale on CPU/RPS → Scale on TTFT and queue depth. RPS (requests per second) is meaningless when one request can be 1000× another in compute. Scaling signals become p95 TTFT, in-flight tokens, and batch fullness. The painful part: cold start to load weights into a fresh GPU is 30 s–3 min vs. seconds for a Gunicorn process – HPA (Horizontal Pod Autoscaler) windows and pre-warmed standby capacity become a design problem, not a tuning detail.
- Rolling deploy → Long drains and cascade failover. A streaming completion can run for minutes, so terminationGracePeriodSeconds jumps to 10+ min and the router has to stop new traffic to a draining pod without breaking in-flight streams. A failed request retried on a peer with no GPU headroom just fails again, so production failover cascades down a model tier ladder – big model → smaller model → cached canned response. There's also a new failure class with no web equivalent: semantic failures (gibberish, infinite loops, hallucinated tool calls) that pass every health probe.
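The routing rules in the map above (prefix affinity first, least-loaded fallback when the prefix is cold, load counted in tokens rather than requests) can be sketched in a few lines. This is a minimal illustration, not how Envoy, LiteLLM, or any particular router implements it: Replica, PrefixRouter, the per-replica token budget, and hashing the first 256 characters are all hypothetical simplifications. Real routers typically hash token blocks and learn cache state from the engines.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Replica:
    name: str
    in_flight_tokens: int = 0  # load signal: tokens, not requests
    max_tokens: int = 100_000  # hypothetical per-replica token budget

    def has_headroom(self, tokens: int) -> bool:
        return self.in_flight_tokens + tokens <= self.max_tokens


def prefix_key(prompt: str, prefix_chars: int = 256) -> str:
    """Hash only the head of the prompt so later turns of a conversation share a key."""
    return hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).hexdigest()


class PrefixRouter:
    def __init__(self, replicas: List[Replica]) -> None:
        self.replicas: Dict[str, Replica] = {r.name: r for r in replicas}
        # prefix hash -> replica believed to hold that prefix's KV-cache
        self.warm: Dict[str, str] = {}

    def route(self, prompt: str, est_tokens: int) -> Optional[Replica]:
        key = prefix_key(prompt)
        name = self.warm.get(key)
        if name is not None and self.replicas[name].has_headroom(est_tokens):
            chosen = self.replicas[name]  # warm prefix: reuse the cache
        else:
            # Cold prefix (or warm replica is full): least-loaded with headroom.
            candidates = [r for r in self.replicas.values()
                          if r.has_headroom(est_tokens)]
            if not candidates:
                return None  # nobody has room: queue or shed at the router
            chosen = min(candidates, key=lambda r: r.in_flight_tokens)
            self.warm[key] = chosen.name
        chosen.in_flight_tokens += est_tokens
        return chosen

    def release(self, replica: Replica, est_tokens: int) -> None:
        """Call when a request finishes so its tokens stop counting as load."""
        replica.in_flight_tokens = max(0, replica.in_flight_tokens - est_tokens)
```

Returning None instead of picking an overloaded replica reflects the point that waiting has to happen at the router, since the GPU itself has no graceful queue. A production version would also expire entries in the warm map as engines evict cache blocks.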
The skeleton is the same. Every box gained a constraint that web infrastructure never had to worry about. The good news for SREs: most of the muscle memory transfers – autoscaling, blue/green, observability, rate limiting, circuit breakers all still apply. The stack just gains a layer that has to reason about what's inside the request and what's resident on which GPU, neither of which a web LB ever cared about.
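To make "what's resident on which GPU" concrete, here is a rough HBM budget calculation. The model shape (80 layers, 8 KV heads, head dimension 128, fp16) is an assumption in the style of published 70B models with grouped-query attention, and the 10% reserve for activations and fragmentation is a guess. Treat it as a sketch of the arithmetic, not a capacity plan.

```python
from dataclasses import dataclass

GB = 1e9  # decimal gigabytes, matching how HBM sizes are usually quoted


@dataclass
class ModelShape:
    params_billion: float
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # fp16

    def weight_bytes(self) -> float:
        return self.params_billion * 1e9 * self.bytes_per_value

    def kv_bytes_per_token(self) -> int:
        # 2x for the separate key and value tensors, per layer, per KV head.
        return 2 * self.layers * self.kv_heads * self.head_dim * self.bytes_per_value


def max_cached_tokens(shape: ModelShape, gpus: int,
                      hbm_gb_per_gpu: float = 80.0,
                      reserve_fraction: float = 0.10) -> int:
    """Tokens of KV-cache that fit after the weights, summed across all GPUs."""
    usable = gpus * hbm_gb_per_gpu * GB * (1.0 - reserve_fraction)
    free = usable - shape.weight_bytes()
    return max(0, int(free // shape.kv_bytes_per_token()))


if __name__ == "__main__":
    # Hypothetical 70B-class shape; see the assumptions in the lead-in.
    shape = ModelShape(params_billion=70, layers=80, kv_heads=8, head_dim=128)
    print("KV bytes per token:", shape.kv_bytes_per_token())
    for gpus in (1, 2, 4, 8):
        print(f"{gpus} GPU(s): {max_cached_tokens(shape, gpus)} cacheable tokens")
```

With this shape each token costs 327,680 bytes of KV-cache, so a single 4,096-token request holds about 1.3 GB. One 80 GB GPU cannot even hold the weights, and two leave almost no room for requests, which is why the admission decision has to account for tokens rather than connections.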