Continuous Batching
A serving technique that inserts new requests into running batches as soon as other requests complete, instead of waiting for batch completion.
By filling freed GPU slots immediately, continuous batching delivers roughly 2-5x higher inference throughput than static batching.
Explanation
With static batching, short requests must wait until the longest request in their batch finishes. Continuous batching inserts new requests the moment a slot frees up. The result: higher GPU throughput, lower latency for short requests, and better utilization.
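A minimal sketch of the scheduling loop, with the model call replaced by a decode_step stub that emits one token per running request per step (all names here are illustrative, not any server's real API):

```python
import collections
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

def continuous_batching_loop(waiting, max_batch_size, decode_step):
    running, finished = [], []
    while waiting or running:
        # Refill freed slots immediately instead of waiting for the batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode step appends one token to every running request.
        for req in running:
            req.generated.append(decode_step(req))
        # Retire finished requests; their slots are reusable on the next step.
        finished += [r for r in running if len(r.generated) >= r.max_new_tokens]
        running = [r for r in running if len(r.generated) < r.max_new_tokens]
    return finished

# The short request finishes early and its slot is refilled while the long one runs.
queue = collections.deque([Request("long", 8), Request("short", 2), Request("next", 3)])
done = continuous_batching_loop(queue, max_batch_size=2, decode_step=lambda r: "tok")
```

Static batching would instead leave the freed slot idle until the long request also finished.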
Marketing Relevance
Continuous batching is standard in modern inference servers (vLLM, TGI) and enables 2-5x higher throughput for production LLM APIs.
Example
vLLM with continuous batching achieves ~2000 tokens/s on an A100, compared to ~500 tokens/s with static batching (same model).
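Actual throughput depends on model, GPU, and sequence lengths. As a usage sketch, vLLM's offline batch API looks like this; continuous batching is the engine's default scheduling mode, so no extra flag is needed (model name and sampling values are illustrative):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # illustrative model choice
params = SamplingParams(temperature=0.8, max_tokens=128)

# All prompts are handed to the engine at once; it admits each sequence into
# the running batch as soon as a slot frees up, rather than batch by batch.
outputs = llm.generate(["Explain batching.", "Write a haiku."], params)
for out in outputs:
    print(out.outputs[0].text)
```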
Common Pitfalls
Requires careful KV-cache management (e.g., PagedAttention). The implementation is more complex than static batching, and many short-lived requests can fragment KV-cache memory if it is allocated contiguously.
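To illustrate why block-based KV-cache management helps, here is a toy allocator in the spirit of PagedAttention, not vLLM's actual implementation: the cache is split into fixed-size blocks, so blocks freed by finished requests are immediately reusable and fragmentation from many short requests is avoided (class and method names are hypothetical):

```python
class PagedKVCache:
    """Toy block allocator: KV memory is handed out in fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of block ids
        self.seq_lens = {}      # request id -> tokens stored so far

    def append_token(self, request_id: str) -> None:
        n = self.seq_lens.get(request_id, 0)
        if n % self.block_size == 0:  # last block is full: grab another block
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; queue or preempt the request")
            self.block_tables.setdefault(request_id, []).append(self.free_blocks.pop())
        self.seq_lens[request_id] = n + 1

    def release(self, request_id: str) -> None:
        # A finished request returns all of its blocks for immediate reuse.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.seq_lens.pop(request_id, None)
```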
Origin & History
Continuous batching was popularized in 2022-2023 by Orca (Seoul National University, OSDI 2022) and vLLM (UC Berkeley). It is now standard for production LLM serving.
Comparisons & Differences
Continuous Batching vs. Static Batching
Static batching waits until every request in the batch has finished before admitting new ones; continuous batching inserts new requests as soon as individual slots free up.
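A back-of-the-envelope model of the difference, counting idealized decode steps and ignoring prefill and scheduler overhead (numbers are illustrative):

```python
import math

def total_steps_static(lengths, batch_size):
    # Each static batch runs until its longest request finishes.
    batches = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]
    return sum(max(batch) for batch in batches)

def total_steps_continuous(lengths, batch_size):
    # Freed slots are refilled on the next step, so runtime is bounded by
    # total tokens divided by slots, or by the single longest request.
    return max(math.ceil(sum(lengths) / batch_size), max(lengths))

lengths = [100, 10, 10, 10]  # one long request, three short ones
print(total_steps_static(lengths, batch_size=2))      # 110 steps
print(total_steps_continuous(lengths, batch_size=2))  # 100 steps
```

With two slots, the short requests finish alongside the long one instead of serializing behind it.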