
    Continuous Batching

    Also known as:
    Dynamic Batching
    In-Flight Batching
    Iteration-Level Batching
    Updated: 2/9/2026

    A serving technique that inserts new requests into running batches as soon as other requests complete, instead of waiting for batch completion.

    Quick Summary

Continuous batching fills freed GPU batch slots immediately, typically delivering 2-5x higher inference throughput than static batching.

    Explanation

With static batching, short requests wait for the longest request in the batch to finish. Continuous batching inserts new requests as soon as slots free up. The result: higher GPU throughput, lower latency for short requests, and better utilization.
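    A minimal sketch of the idea, with hypothetical names (`Request`, `serve`, `model_step` are illustrative placeholders, not any engine's real API): after every decode iteration the scheduler evicts finished requests and backfills the freed slots from the waiting queue.

    ```python
    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class Request:
        prompt: str
        max_new_tokens: int
        generated: list = field(default_factory=list)

        def finished(self) -> bool:
            return len(self.generated) >= self.max_new_tokens

    def serve(model_step, waiting: deque, max_batch_size: int = 8):
        """Iteration-level scheduling: after every decode step, finished requests
        are evicted and waiting requests are admitted immediately, instead of
        letting the whole batch drain first (static batching)."""
        running = []
        while waiting or running:
            # Backfill free slots before the next iteration.
            while waiting and len(running) < max_batch_size:
                running.append(waiting.popleft())

            # One decode iteration over the current batch. `model_step` stands in
            # for the engine's forward pass and returns one new token per request.
            for req, token in zip(running, model_step(running)):
                req.generated.append(token)

            # Evict finished requests right away so their slots can be reused.
            running = [r for r in running if not r.finished()]
    ```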

    Marketing Relevance

    Continuous batching is standard in modern inference servers such as vLLM and TGI, and it enables 2-5x higher throughput for production LLM APIs.

    Example

    vLLM with continuous batching achieves roughly 2,000 tokens/s on an A100, compared to roughly 500 tokens/s with static batching for the same model.
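    A quick way to try this yourself, assuming vLLM is installed and the model fits on your GPU (the model name is just an example; actual throughput depends on hardware, model size, and request mix):

    ```python
    from vllm import LLM, SamplingParams

    # vLLM's engine applies continuous batching internally; no extra flag is needed.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.8, max_tokens=256)

    # 64 requests of varying effective length are scheduled iteration by iteration.
    prompts = ["Explain continuous batching in one sentence."] * 64
    outputs = llm.generate(prompts, params)
    print(outputs[0].outputs[0].text)
    ```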

    Common Pitfalls

    Continuous batching requires careful KV-cache management (e.g., PagedAttention), is more complex to implement than static batching, and can suffer memory fragmentation when many short requests churn through the batch.
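    The KV-cache pitfall comes from requests entering and leaving the batch at arbitrary times. A toy allocator in the spirit of PagedAttention (names and sizes are illustrative, not vLLM's actual code) shows the basic idea: the cache is split into fixed-size blocks handed out on demand, so a finished request's memory is reusable immediately.

    ```python
    BLOCK_TOKENS = 16            # tokens stored per KV block (illustrative)
    NUM_BLOCKS = 1024            # blocks that fit in GPU memory (illustrative)

    free_blocks = list(range(NUM_BLOCKS))
    block_tables = {}            # request id -> list of block ids
    token_counts = {}            # request id -> tokens written so far

    def append_token(req_id):
        """Reserve a new block only when the current one is full."""
        count = token_counts.get(req_id, 0)
        if count % BLOCK_TOKENS == 0:   # first token, or current block just filled
            if not free_blocks:
                raise MemoryError("KV cache exhausted; preempt or queue the request")
            block_tables.setdefault(req_id, []).append(free_blocks.pop())
        token_counts[req_id] = count + 1

    def release(req_id):
        """Return all of a finished request's blocks to the free pool."""
        free_blocks.extend(block_tables.pop(req_id, []))
        token_counts.pop(req_id, None)
    ```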

    Origin & History

    Continuous batching was popularized in 2022-2023 through Orca (Seoul National University / FriendliAI, OSDI 2022) and vLLM (UC Berkeley). It is now standard for production LLM serving.

    Comparisons & Differences

    Continuous Batching vs. Static Batching

    Static batching holds every slot until the entire batch has finished; continuous batching inserts new requests as soon as slots free up. The toy calculation below shows where the gap comes from.
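    A toy calculation with made-up request lengths (the numbers are purely illustrative, not a benchmark):

    ```python
    # 4 slots, requests needing 10, 20, 50 and 200 decode steps.
    lengths = [10, 20, 50, 200]
    slots = len(lengths)

    # Static batching: every slot is held until the longest request finishes.
    static_steps = max(lengths)                     # 200 iterations for the batch
    useful_tokens = sum(lengths)                    # 280 tokens actually produced
    static_util = useful_tokens / (slots * static_steps)   # 280 / 800 = 35%

    # Continuous batching: a freed slot is refilled on the next iteration, so with
    # enough queued requests the slots stay ~100% busy; throughput improves by
    # roughly 1 / static_util (~2.9x) for this particular mix of lengths.
    print(f"static slot utilization: {static_util:.0%}")
    ```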
