Eugene Chernenko

AI, Engineering Management, Distributed Systems, SRE, Productivity

Load balancing web APIs vs LLM APIs

2026-05-08

Same skeleton, different sport

Two system designs, side by side: load balancing a traditional web API and load balancing an LLM (Large Language Model) inference API. The diagrams look almost the same. The operational reality behind the boxes is what makes LLM Ops a different sport – and mapping it onto familiar territory is a useful exercise for any SRE who knows web ops cold.

Hardware first, because everything below follows from it. A traditional web API replica is a language-runtime process (Gunicorn, Node, Go binary) inside a Docker container, scheduled by Kubernetes onto a commodity x86 box – think AWS c6i.4xlarge or m6i.4xlarge, 16 vCPU and 32–64 GB RAM, around $0.70/hr on-demand, fungible, ready in seconds. An LLM API replica is a dedicated inference engine (vLLM, TGI – Text Generation Inference, TensorRT-LLM) inside a container with GPU devices passed through, scheduled onto a GPU node – AWS p5.48xlarge with 8× NVIDIA H100 80 GB or p4d with 8× A100, $40–98/hr on-demand, 30 s–3 min to load weights into HBM (High Bandwidth Memory), and scarce enough that capacity is something to queue for. Same word "replica," different cost class, different scarcity, different blast radius when one dies.
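To put the cost classes side by side, here is a small back-of-the-envelope sketch. It uses only the on-demand prices quoted above; the 730-hour month and the assumption that a replica runs around the clock are mine, and `monthly_cost` is just an illustrative helper.

```python
HOURS_PER_MONTH = 730  # assumption: average month, replica running 24/7


def monthly_cost(hourly_usd: float, hours: float = HOURS_PER_MONTH) -> float:
    """On-demand cost in USD of keeping one replica up for a month."""
    return round(hourly_usd * hours, 2)


web = monthly_cost(0.70)       # commodity x86 replica
gpu_low = monthly_cost(40.0)   # low end of the quoted GPU node range
gpu_high = monthly_cost(98.0)  # high end of the quoted GPU node range

print(f"web replica:  ${web:,.0f}/month")
print(f"GPU replica:  ${gpu_low:,.0f} to ${gpu_high:,.0f}/month")
print(f"ratio:        {gpu_low / web:.0f}x to {gpu_high / web:.0f}x")
```

Under those assumptions a single idle GPU replica costs roughly as much as 57 to 140 web replicas, which is why over-provisioning "just in case" stops being a cheap habit.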

The web API design uses familiar building blocks:

[Diagram: Load balancing a traditional web API. Client → load balancer (NGINX / ALB / HAProxy; L7, round-robin) → three API replicas (Gunicorn / Node / Go; stateless, CPU-bound) → cache + database (Redis + Postgres; shared state, sub-ms).]
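The round-robin policy in this design is simple enough to fit in a few lines. The sketch below is illustrative only (the class and replica names are hypothetical, and real balancers such as NGINX add health checks and weights); the point is that the balancer never looks inside the request.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand each request to the next replica in a fixed rotation."""

    def __init__(self, replicas):
        if not replicas:
            raise ValueError("need at least one replica")
        self._rotation = cycle(list(replicas))

    def pick(self):
        # The request itself is never inspected: stateless replicas are interchangeable.
        return next(self._rotation)


balancer = RoundRobinBalancer(["api-1", "api-2", "api-3"])
picks = [balancer.pick() for _ in range(5)]
print(picks)  # ['api-1', 'api-2', 'api-3', 'api-1', 'api-2']
```

This works because any replica can serve any request at roughly the same cost, an assumption the LLM design cannot make.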

The LLM API design rhymes – box names look similar, but every layer carries extra responsibility:

[Diagram: Load balancing an LLM inference API. Client → smart router (Envoy / LiteLLM / custom; token + prefix-cache aware) → three GPU replicas (vLLM / TGI on H100; continuous batching + KV-cache) → prefix cache + fallback (Redis + smaller model; reuse, cascade degradation).]
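A minimal sketch of what "token + prefix-cache aware" routing could mean in practice. Everything here is an assumption for illustration: the class names, the 64-character prefix hash (a stand-in for hashing token blocks), the token budget, and the bookkeeping are hypothetical and are not how Envoy, LiteLLM, or vLLM implement it.

```python
import hashlib
from dataclasses import dataclass, field


def prefix_key(prompt: str, prefix_chars: int = 64) -> str:
    """Hash the leading characters of a prompt (stand-in for hashing token blocks)."""
    return hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).hexdigest()


@dataclass
class GpuReplica:
    name: str
    inflight_tokens: int = 0                           # estimated tokens queued or decoding
    cached_prefixes: set = field(default_factory=set)  # prefixes believed resident in KV-cache


class PrefixAwareRouter:
    def __init__(self, replicas, max_inflight_tokens: int = 8000):
        self.replicas = list(replicas)
        self.max_inflight_tokens = max_inflight_tokens  # hypothetical per-replica budget

    def route(self, prompt: str, est_tokens: int) -> GpuReplica:
        key = prefix_key(prompt)
        # Prefer replicas that already hold this prefix and still have token headroom.
        warm = [
            r for r in self.replicas
            if key in r.cached_prefixes
            and r.inflight_tokens + est_tokens <= self.max_inflight_tokens
        ]
        # Otherwise fall back to every replica and balance on tokens, not request count.
        candidates = warm or self.replicas
        chosen = min(candidates, key=lambda r: r.inflight_tokens)
        chosen.inflight_tokens += est_tokens
        chosen.cached_prefixes.add(key)
        return chosen

    def complete(self, replica: GpuReplica, est_tokens: int) -> None:
        """Release the token budget once a request finishes."""
        replica.inflight_tokens = max(0, replica.inflight_tokens - est_tokens)


router = PrefixAwareRouter([GpuReplica("gpu-a"), GpuReplica("gpu-b")], max_inflight_tokens=1000)
system_prompt = "You are a helpful assistant. " * 4  # shared prefix longer than 64 chars

r1 = router.route(system_prompt + "question 1", est_tokens=300)  # cold: least loaded
r2 = router.route(system_prompt + "question 2", est_tokens=300)  # warm prefix: same replica
r3 = router.route("Totally different prompt", est_tokens=300)    # cold: least loaded
r4 = router.route(system_prompt + "question 3", est_tokens=500)  # warm replica is over budget
print([r.name for r in (r1, r2, r3, r4)])
```

The two inputs a round-robin balancer never needs, the content of the request and the state of each backend, are exactly what drive every decision here.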

The mental map – web concept → LLM concept

The skeleton is the same, but every box gained a constraint that web infrastructure never had to worry about. The good news for SREs: most of the muscle memory transfers – autoscaling, blue/green deploys, observability, rate limiting, and circuit breakers all still apply. The stack just gains a layer that has to reason about what's inside the request and what's resident on which GPU, neither of which a web LB ever cared about.