Large Language Models (LLMs) like Meta's Llama 3, Mistral's Mixtral, and Cohere's Command-R+ offer powerful text generation capabilities. However, delivering fast and efficient LLM inference requires smart choices around batch inference strategies. This blog explores the difference between static batching and continuous batching for LLM inference, and when to use each for maximum speed and throughput. We also share key findings from our benchmark using vLLM on NVIDIA A100 GPUs: increasing batch size boosts throughput, but returns diminish beyond a saturation point, so the best approach depends on your specific workload requirements.
Understanding Batching
Batching refers to grouping multiple LLM inference requests to improve GPU utilisation. This increases tokens-per-second (TPS) throughput during text generation tasks.
For example, by processing prompts together, LLM batch inference can significantly reduce idle time on GPUs. However, larger batch sizes also increase latency: the more prompts you wait to collect into a group, the longer each user may wait for a response.
When considering LLM inference batching, we must distinguish between two workload types:
- Offline inference: Text generation in non-interactive applications, such as nightly batch jobs that produce insights for the next morning. In this case, latency is not an issue, and throughput (tokens per second) is the primary metric to optimise.
- Online inference: Text generation for interactive sessions like chatbots. Here, latency matters because it relates to user experience. In this case, both latency and throughput should be balanced.
In both cases, choosing the right batch size in LLM workloads is crucial.
Static vs. Continuous Batching in LLM Inference: Key Differences
Our analysis focuses on the differences between static and continuous batching for LLM inference:
Static Batching
In static batching, prompts are collected into a batch and then processed simultaneously by the model. The inputs are passed as multi-dimensional tensors to the LLM, which is instantiated in memory, commonly using an inference engine like vLLM.
This approach works best for offline workloads, where response time isn't a concern, and maximum throughput is the goal.
For our tests, we ran vLLM in controlled offline scenarios to evaluate performance across increasing batch sizes.
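To make this concrete, here is a minimal sketch of offline static batching with vLLM's Python API. The prompts, sampling settings, and model name are illustrative assumptions rather than our exact benchmark configuration.

```python
from vllm import LLM, SamplingParams

# Illustrative prompt set; in a real offline job these might come from a file or queue.
prompts = [
    "Summarise the key risks in this quarterly report:",
    "Write a short product description for a smart thermostat:",
    "Explain tensor parallelism in one paragraph:",
]

# Cap each response at 1024 tokens, mirroring the limit used in our experiments.
sampling_params = SamplingParams(temperature=0.7, max_tokens=1024)

# Instantiate the model in memory; tensor_parallel_size=2 assumes two GPUs are available.
llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct", tensor_parallel_size=2)

# The whole prompt list is submitted in one call, so the engine processes it as a single batch.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text[:80])
```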
Continuous Batching
In continuous batching, prompts are sent individually to the inference engine but are dynamically grouped during processing. This is a form of real-time LLM inference batching, where requests are scheduled and preempted in live systems. It’s ideal for scenarios where workloads vary dynamically.
Unlike static batching, continuous batching works well in both offline and online environments. It simulates production-like behaviour, with clients interacting through APIs such as OpenAI-compatible endpoints, and it requires robust backend systems to manage real-time batch inferencing.
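As an illustration, the sketch below shows how a client might exercise continuous batching through vLLM's OpenAI-compatible server. The server address, model name, and prompt are assumptions for the example; requests arriving from many such clients are grouped dynamically by the engine.

```python
# Assumes a vLLM OpenAI-compatible server is already running, started with something like:
#   vllm serve meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 2
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM endpoint (the API key is unused locally).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Each request is sent individually; the engine batches in-flight requests on the fly.
response = client.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    prompt="List three ways to reduce LLM inference latency.",
    max_tokens=256,
)
print(response.choices[0].text)
```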
Experimental Setup
For benchmarking, we used Meta’s Llama3-70B Instruct model with open-ended generation. The model was deployed on a 2x NVIDIA A100 setup using tensor parallelism.
We tested with prompt limits of up to 1024 tokens per response, representing real-world LLM inference batching where outputs vary in length. These experiments help compare static vs. continuous batching approaches under load.
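To illustrate the kind of measurement involved, here is a rough sketch of a throughput sweep over batch sizes using vLLM's offline API. The prompt pool and batch sizes are placeholders, not our exact benchmark harness.

```python
import time

from vllm import LLM, SamplingParams

# Illustrative prompt pool; a real benchmark would use a representative dataset.
prompts = ["Write a short story about a lighthouse keeper."] * 64

sampling_params = SamplingParams(temperature=0.8, max_tokens=1024)
llm = LLM(model="meta-llama/Meta-Llama-3-70B-Instruct", tensor_parallel_size=2)

for batch_size in (1, 4, 16, 64):
    batch = prompts[:batch_size]
    start = time.perf_counter()
    outputs = llm.generate(batch, sampling_params)
    elapsed = time.perf_counter() - start

    # Count generated tokens across the batch to report tokens-per-second (TPS).
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch_size={batch_size}: {generated / elapsed:.1f} tokens/s")
```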
Static vs. Continuous Batching: Performance Results from vLLM
The key findings from our experiments are as follows:
- Batch size boosts throughput: From batch size 1 (no batching) to 64, we saw consistent increases in tokens-per-second (TPS). This confirms that larger batch sizes in LLM inference improve throughput until saturation occurs.
- Saturation beyond batch size 64: Beyond 64 prompts, the system showed diminishing returns, especially with continuous batching. We also observed KV cache memory pressure and reduced GPU efficiency in the logs.
- Static vs. continuous, mixed results: Static batching performed better at higher batch sizes (e.g., 64), suggesting that it offers better peak performance for offline use. The scheduling overhead of continuous batching (especially with uneven prompt lengths) caused performance drops.
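If KV cache pressure becomes the bottleneck at larger batch sizes, vLLM exposes engine arguments that bound memory use and concurrency. The values below are illustrative assumptions, not the settings used in our benchmark.

```python
from vllm import LLM

# Illustrative engine settings for containing KV cache pressure; tune for your own hardware.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,  # fraction of GPU memory the engine is allowed to reserve
    max_model_len=4096,           # shorter context lengths leave more room for KV cache blocks
    max_num_seqs=64,              # cap on sequences scheduled concurrently in each step
)
```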
Conclusion
The decision between static and continuous batching comes down to workload type. If your use case is offline (like nightly generation), choose static batching for higher throughput and better GPU efficiency.
For real-time applications like chat or assistants, use continuous batching for dynamic scheduling and better latency control. Both approaches are essential components of an optimised LLM batch inference pipeline.
Want to test your own batching strategy? Sign up today to deploy LLMs on high-end NVIDIA GPUs with Hyperstack.
FAQs
What is LLM batch inference?
LLM batch inference refers to processing multiple language model requests simultaneously to improve GPU efficiency and overall throughput during inference.
What’s the difference between static and continuous batching?
Static batching groups requests together before inference begins, ideal for offline workloads. Continuous batching dynamically schedules incoming requests, making it suitable for both online and offline use.
When should I use static batching for LLMs?
Static batching is best for offline tasks where latency isn’t a concern, such as summarising large documents or running batch jobs overnight.
Is continuous batching suitable for real-time applications?
Yes, continuous batching inference is ideal for chatbots, assistants, or any workload that requires low latency and dynamic request handling.
How does batch size affect LLM performance?
Larger batch sizes in LLM inference typically improve throughput, but past a certain point, system saturation can reduce efficiency and increase memory pressure.
What is vLLM and how does it support batching?
vLLM is an inference engine optimised for LLMs. It supports both static and continuous batching and enables high-throughput performance using modern GPU hardware.
Which batching strategy works better for Llama 3 or Mixtral models?
For long or complex generations with Llama 3 or Mixtral, static batching often yields higher throughput. Continuous batching provides more flexibility for mixed, real-time requests.
Can I switch between static and continuous batching in the same system?
Yes, many systems can support both strategies. The choice depends on whether you're prioritising throughput, latency, or workload flexibility.