The model passes the evaluation. The latency report comes back and it is 3× too slow for production. This is where most inference work actually begins.
Getting a model to work and getting it to run efficiently are two different problems. The first is a research problem. The second is an engineering one and it has well-understood solutions.
Our latest guide covers five techniques that reliably close the gap between a validated model and a production-ready deployment: precision reduction, graph optimisation, request batching, multi-GPU parallelism and production serving.
Key Takeaways
A quick-reference summary before you read on. Each technique below maps to a row here.
| Technique | Latency | Throughput | Complexity |
|---|---|---|---|
| Mixed Precision (FP16/BF16/FP8) | Neutral | ↑↑ ~2× | Medium |
| INT8 Quantisation | ↓ Faster | ↑↑ 2–4× | High |
| TensorRT / Graph Fusion | ↓ Lower | ↑↑↑ | High |
| Dynamic Batching | Slight ↑ | ↑↑ | Medium |
| Multi-GPU Parallelism | ↓ Lower | ↑↑↑ | High |
1. Mixed Precision and Quantisation
Most models are trained in FP32, full 32-bit floating point. At inference time, that precision is almost never necessary, and carrying it forward costs memory bandwidth and compute cycles you don't need to spend.
Switching to FP16 or BF16 activates Tensor Core acceleration on NVIDIA GPUs (A100, H100) and typically delivers around a 2× throughput improvement with no meaningful accuracy loss. On NVIDIA H100s, FP8 inference via TensorRT-LLM pushes this further, useful for LLMs where memory bandwidth is the primary constraint.
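The memory side of that claim is easy to verify without a GPU. A minimal NumPy sketch (illustrative only; the actual throughput gain also depends on Tensor Core utilisation):

```python
import numpy as np

# A weight tensor roughly the size of a large linear layer.
w_fp32 = np.zeros((4096, 4096), dtype=np.float32)
w_fp16 = w_fp32.astype(np.float16)

print(w_fp32.nbytes // 2**20, "MiB in FP32")   # 64 MiB
print(w_fp16.nbytes // 2**20, "MiB in FP16")   # 32 MiB
```

Half the bytes means half the memory traffic per forward pass, which is where much of the ~2× throughput on bandwidth-bound models comes from.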
INT8 quantisation takes the reduction further: 2–4× faster inference and roughly half the memory footprint. The trade-off is real: INT8 requires a calibration step on representative data, and skipping it introduces accuracy degradation that compounds across layers. Use post-training quantisation when the model is fixed, or quantisation-aware training when you have room to retrain.
Tools: TensorRT, TensorRT-LLM, ONNX Runtime quantisation tools, NVIDIA Triton
Converting an ONNX model to a TensorRT FP16 engine:
```shell
trtexec --onnx=model.onnx --saveEngine=model_fp16.trt --fp16 --workspace=8192
```
This produces a compiled engine with FP16 arithmetic and an 8GB build workspace. Load it directly into Triton or a C++ inference loop. The resulting model_fp16.trt runs on GPU with fused, precision-reduced kernels; no further changes to serving code are required.
One thing to watch: INT8 calibration should use inputs representative of your production distribution. Calibrating on synthetic data and deploying on real-world inputs is a common source of silent accuracy regressions.
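The mechanics behind that warning can be shown in plain Python. The sketch below is not a quantisation library — it just reproduces the affine scale/zero-point arithmetic that calibrators compute internally, to show why inputs outside the calibrated range get clipped:

```python
def compute_scale_zero_point(calib_values):
    """Derive an 8-bit affine scale/zero-point from calibration data."""
    qmin, qmax = 0, 255                         # unsigned 8-bit grid
    lo, hi = min(calib_values), max(calib_values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)         # range must include zero
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(x, scale, zp):
    return max(0, min(255, round(x / scale) + zp))   # clamp to the grid

def dequantize(q, scale, zp):
    return (q - zp) * scale

# Calibrate on data spanning the expected input range...
scale, zp = compute_scale_zero_point([-1.0, -0.5, 0.0, 0.5, 1.0])

# ...then an in-range production input round-trips with tiny error,
x = 0.3
err_in_range = abs(dequantize(quantize(x, scale, zp), scale, zp) - x)

# while an input far outside the calibrated range is clipped badly.
y = 4.0
err_out_of_range = abs(dequantize(quantize(y, scale, zp), scale, zp) - y)
print(err_in_range, err_out_of_range)
```

If the calibration set had been synthetic and narrower than production traffic, every real input near the tails would suffer the second kind of error — silently.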
Beyond quantisation, model pruning (removing low-impact weights) and knowledge distillation (training a smaller student model to mimic the original) can further reduce model size and inference cost. These require offline retraining but pair well with quantisation when model size is extreme and slight accuracy loss is acceptable.
2. Graph Compilation and Kernel Fusion
A model isn't just a set of weights; it's a graph of operations. Out of the box, frameworks execute those operations sequentially, with overhead at each boundary: memory reads, kernel launches, synchronisation points. Graph compilers collapse that overhead.
TensorRT is the reference implementation for NVIDIA GPUs. It analyses your model graph, fuses compatible operations (e.g. convolution + bias + activation into a single kernel), selects the fastest algorithm for each fused op based on your hardware and compiles the result into an optimised engine. The performance delta between an un-optimised PyTorch model and its TensorRT equivalent is often 3–5× on the same hardware.
ONNX Runtime with the TensorRT Execution Provider gives you most of the same gains through a more portable interface. Enable full graph optimisation and FP16 in Python:
Tools: TensorRT, ONNX Runtime (TensorRT EP, CUDA EP), XLA (TensorFlow), torch.compile (PyTorch 2.0)
```python
import onnxruntime as ort

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.enable_profiling = True

# Execution providers are passed to the session, in priority order.
providers = [
    ('TensorrtExecutionProvider', {
        'device_id': 0,
        'trt_fp16_enable': True,
        'trt_int8_enable': False,
        'trt_max_workspace_size': 1 << 30,
    }),
    'CUDAExecutionProvider',  # fallback for ops TensorRT cannot handle
]
session = ort.InferenceSession('model.onnx', sess_options=opts, providers=providers)
```
The session now runs with full graph fusion and FP16 Tensor Core acceleration on GPU 0. Profiling is enabled; the output JSON will show per-op timing, which is useful for identifying any remaining bottlenecks.
For TensorFlow workloads, XLA JIT compilation achieves similar fusion. Enable it per-session with tf.config.optimizer.set_jit(True) or globally with the environment variable TF_XLA_FLAGS=--tf_xla_auto_jit=1. PyTorch 2.0's torch.compile() offers a Python-native path to kernel fusion without leaving the framework.
One caveat: graph compilers add startup compilation overhead. For models with highly dynamic input shapes or very low total inference volume, the fixed compilation cost may not pay off. Benchmark end-to-end before committing.
3. Request Batching: Static and Dynamic
A GPU is a massively parallel processor. Running a single inference request at a time underutilises it by an order of magnitude. Batching is how you close that gap.
The principle is straightforward: process multiple inputs in a single kernel launch. Where a single NVIDIA H100 might handle one image in 2ms, it can often process 32 images in 4ms, 16× the throughput for 2× the wall-clock time per request. The ceiling is GPU memory; find the largest batch that fits and run there.
- Static batching works well for offline or near-real-time pipelines where you control input aggregation. Set it at the framework level (max_batch_size in Triton's config, batch_size in TorchServe's YAML) and the framework handles the rest. The trade-off: a fixed batch size either idles the GPU when requests are sparse or forces queuing when they arrive faster than the batch fills.
- Dynamic batching solves this for online serving. Instead of waiting for a fixed batch, the server holds incoming requests for a configured maximum delay (e.g. 10ms) and dispatches whatever has arrived. NVIDIA Triton implements this natively:
Tools: NVIDIA Triton (dynamic_batching), TorchServe (batch_delay), vLLM (continuous batching for LLMs)
```
# Triton config.pbtxt — BERT with dynamic batching
name: "bert"
platform: "onnxruntime_onnx"
max_batch_size: 16
input { name: "input_ids" data_type: TYPE_INT32 dims: [ -1 ] }
output { name: "output" data_type: TYPE_FP32 dims: [ -1, 768 ] }
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 5000
}
instance_group [{ count: 2, kind: KIND_GPU }]
```
This config tells Triton to target sub-batch sizes of 4, 8, or 16, hold requests for up to 5ms to form a batch, and run two concurrent GPU model instances. Under moderate concurrency, GPU utilisation goes from 20–30% to 70–85% without any change to the model.
The latency-throughput trade-off here is deliberate. A 5ms queue delay is fine for batch classification. It's unacceptable for an interactive chatbot with a 100ms P99 SLA. Set max_queue_delay_microseconds to match your latency budget, not to maximise throughput.
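The shape of that trade-off can be simulated without a GPU. The sketch below is a toy event-driven model, not Triton's actual scheduler: requests arrive at fixed intervals, and the server dispatches a batch either when it fills or when the oldest queued request has waited out the configured delay.

```python
def simulate(arrival_interval_ms, max_batch, max_delay_ms, n_requests, batch_cost_ms=2.0):
    """Toy dynamic-batching model: returns (mean latency in ms, number of batches)."""
    arrivals = [i * arrival_interval_ms for i in range(n_requests)]
    latencies, num_batches, i = [], 0, 0
    while i < n_requests:
        deadline = arrivals[i] + max_delay_ms       # oldest queued request's max wait
        j = i + 1
        while j < n_requests and j - i < max_batch and arrivals[j] <= deadline:
            j += 1                                  # admit arrivals until full or timed out
        dispatch = arrivals[j - 1] if j - i == max_batch else deadline
        finish = dispatch + batch_cost_ms           # one kernel launch per batch
        latencies += [finish - arrivals[k] for k in range(i, j)]
        num_batches += 1
        i = j
    return sum(latencies) / len(latencies), num_batches

# Zero queue delay: every request runs alone — lowest latency, most kernel launches.
lat_fast, batches_fast = simulate(1.0, 8, 0.0, 32)
# 5 ms delay: requests coalesce into batches — far fewer launches, higher latency.
lat_batched, batches_batched = simulate(1.0, 8, 5.0, 32)
print(lat_fast, batches_fast)        # 2.0 32
print(lat_batched, batches_batched)
```

Sweeping max_delay_ms in a model like this mirrors what you see when tuning max_queue_delay_microseconds in practice: throughput efficiency rises with delay while per-request latency climbs toward delay + compute time.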
Batching isn't the only way to keep GPUs busy. Overlapping CPU-GPU data transfers with computation using CUDA streams, pinned memory (pin_memory=True in PyTorch DataLoaders), and flags like --useCudaGraph in TensorRT reduces idle time between kernel launches and can meaningfully improve throughput without changing batch size.
4. Multi-GPU Inference: Parallelism That Scales
Single-GPU optimisations have a ceiling. Once you've reduced precision, fused kernels, and filled the batch queue, the next gains come from spreading work across multiple GPUs.
Data parallelism is the starting point and the simplest to implement. Run N independent model copies on N GPUs, each handling a subset of incoming requests. In Triton, this is a single config change:
```
instance_group [
  { kind: KIND_GPU, count: 1, gpus: [ 0, 1 ] }
]
```
Two model instances, two GPUs, roughly 2× throughput. Linear scaling holds until the bottleneck shifts to the load balancer or network stack.
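Outside Triton, the dispatching half of data parallelism is simple enough to sketch by hand. The class below is a hypothetical round-robin dispatcher, not any framework's API — each "replica" stands in for an independent model copy pinned to one GPU:

```python
import itertools

class RoundRobinDispatcher:
    """Cycle incoming requests across independent model replicas (one per GPU)."""
    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)

    def infer(self, request):
        replica = next(self._cycle)     # pick the next replica in turn
        return replica(request)

# Stand-in replicas that tag which device served the request.
gpu0 = lambda req: ('gpu0', req)
gpu1 = lambda req: ('gpu1', req)
dispatcher = RoundRobinDispatcher([gpu0, gpu1])

served = [dispatcher.infer(i)[0] for i in range(4)]
print(served)  # ['gpu0', 'gpu1', 'gpu0', 'gpu1']
```

Real deployments put this logic in the load balancer or serving framework, but the principle is the same: requests are independent, so replicas never need to communicate.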
Tensor parallelism is the right tool when the model doesn't fit in a single GPU's VRAM. The weight matrices are sharded across GPUs; each processes its slice in parallel and the results are aggregated via an All-Reduce operation. vLLM implements this cleanly for large language models:
Tools: vLLM (tensor_parallel_size), DeepSpeed, Megatron-LM, HuggingFace Accelerate, NVIDIA MIG
```python
from vllm import LLM

llm = LLM(model='facebook/opt-30b', tensor_parallel_size=4)
outputs = llm.generate('Summarise the following:')
```
This distributes the 30B parameter model across 4 GPUs. vLLM handles the sharding, the inter-GPU communication, and the KV cache management. From the calling code, the interface is identical to single-GPU inference.
Hardware choice matters here. NVLink-connected (SXM form factor) GPUs offer 600GB/s of GPU-to-GPU bandwidth on the NVIDIA A100 SXM and 900GB/s on the NVIDIA H100 SXM. PCIe variants are limited to PCIe bandwidth, roughly an order of magnitude lower. For tensor parallelism, All-Reduce operations run constantly; on PCIe hardware, that communication overhead can erase the compute gains. If you're running distributed inference on PCIe GPUs, prefer pipeline parallelism, which passes activations forward in a single direction and reduces round-trip communication.
For scenarios where you need to share a single physical GPU across multiple isolated workloads, NVIDIA MIG partitions an A100 or H100 into up to 7 independent instances, each with dedicated VRAM and compute. Each MIG instance behaves like an independent GPU with guaranteed QoS, no noisy-neighbour effects, no memory bleed between tenants.
5. Production Serving: Frameworks, Containers and Benchmarking
The four techniques above are model-level and runtime-level optimisations. This one is about the serving layer: the infrastructure that takes an optimised model and puts it in front of production traffic reliably.
- NVIDIA Triton Inference Server is the standard for GPU inference at scale. It handles dynamic batching, concurrent model execution, model ensembles, and supports multiple backends (TensorRT, ONNX, PyTorch, TensorFlow) from a single deployment. It runs in Docker via NVIDIA's NGC container registry and reads models from a filesystem repository, straightforward to version and redeploy.
- TorchServe is the PyTorch-native alternative. Simpler setup, .mar model archives, YAML configuration, built-in benchmarking suite. The recommended worker configuration: set torch.set_num_threads(1) per worker to avoid CPU thread contention, and configure workers as (num_GPUs / num_models) for GPU deployments.
Tools: NVIDIA Triton, TorchServe, FastAPI (custom serving), NVIDIA NGC base images, Kubernetes + KEDA
Containerisation is non-negotiable for reproducibility. The NVIDIA NGC images pin CUDA, cuDNN, and NCCL versions to known-good combinations. A minimal Triton Dockerfile:
```dockerfile
FROM nvcr.io/nvidia/tritonserver:23.02-py3
COPY my_model /models/my_model
ENV NVIDIA_VISIBLE_DEVICES=all
ENTRYPOINT ["tritonserver", "--model-repository=/models"]
```
The model repository maps directly to Triton's config.pbtxt structure. Build once, deploy anywhere the host driver is compatible.
Benchmarking before shipping is the discipline that holds all of this together.
Treat benchmarking as a gate, not an afterthought.
Run NVIDIA Triton Perf Analyser to stress the server across a range of batch sizes and concurrencies before any configuration change goes to production:
```shell
perf_analyzer -m resnet50 \
  --concurrency-range 1:16 \
  --max-batch-size 32 \
  --measurement-interval 500
```
This sweeps concurrency from 1 to 16 simultaneous clients, with batches up to 32, and reports P50/P90/P99 latency and throughput at each setting. The output shows where throughput plateaus and where latency starts degrading; the inflection point is your operating configuration.
A few benchmarking disciplines worth enforcing: always warm the model with a few dummy inferences before measuring (cold-start compilation skews results); use representative input sizes from your actual production distribution; separate model load time from inference time in your metrics; and record GPU utilisation alongside latency, a fast P99 at 40% GPU utilisation means you have headroom, a fast P99 at 98% means you're one traffic spike from degradation.
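These disciplines translate into a small harness. The sketch below times a stand-in workload (swap in your real inference call, e.g. a lambda wrapping session.run) with an explicit warm-up phase whose timings are discarded, then reports steady-state percentile latencies:

```python
import time
import statistics

def benchmark(infer_fn, warmup=10, iters=100):
    """Warm up, then measure steady-state latency percentiles in milliseconds."""
    for _ in range(warmup):
        infer_fn()                      # cold-start/compilation runs, discarded
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        infer_fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    q = statistics.quantiles(timings, n=100)   # 99 cut points
    return {'p50': q[49], 'p90': q[89], 'p99': q[98],
            'mean': statistics.mean(timings)}

# Stand-in CPU workload; replace with your real inference function.
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print({k: round(v, 3) for k, v in stats.items()})
```

Pair the output with GPU utilisation readings from nvidia-smi taken during the same run to see how much headroom the measured P99 actually has.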
Technique Comparison
You can use the table below as a decision aid, not a ranking. The right technique depends on your model size, SLA, hardware and traffic pattern.
| Technique | Latency Impact | Throughput | Cost Efficiency | Complexity | Risk |
|---|---|---|---|---|---|
| Mixed Precision (FP16/BF16/FP8) | Neutral | ↑↑ (~2×) | High | Medium | Minor accuracy loss |
| INT8 Quantisation | ↓ Reduced | ↑↑ (2–4×) | High | High | Accuracy drop if miscalibrated |
| TensorRT / Graph Fusion | ↓ Lower | ↑↑↑ | High | High | Compile overhead at startup |
| Static Batching | ↑ Higher/request | ↑↑↑ | High | Low | Slower per-request latency |
| Dynamic Batching | Slight ↑ (queue) | ↑↑ | High | Medium | Added queueing delay |
| Multi-GPU (Data Parallel) | ↓ Lower | ↑↑↑ (N×) | High | Medium | Higher cost if underused |
| Model/Pipeline Parallelism | ↓ Lower | ↑↑ | Medium | High | Programming complexity |
Run These Techniques on Hardware That Doesn't Fight You
Precision reduction can deliver 2× throughput. Graph fusion can add another 3–5×. Multi-GPU parallelism scales from there.
Hyperstack GPU VMs give you NVIDIA H100 and NVIDIA A100 infrastructure with NVLink-class interconnect for multi-GPU workloads and predictable performance benchmarks you can actually plan around. Spin up a VM, run these configurations, and measure on hardware where the numbers are yours.
FAQs
What is the fastest way to improve inference performance without retraining the model?
Switching to mixed precision such as FP16 or BF16 is usually the quickest win. It activates Tensor Core acceleration on GPUs like NVIDIA A100 and NVIDIA H100 and often delivers around 2× throughput with minimal or no accuracy impact.
When should I use INT8 quantisation instead of FP16 or BF16?
INT8 is best when you need maximum throughput and memory efficiency, especially for large-scale deployments. It requires proper calibration on real data. If accuracy is critical and you cannot validate calibration quality, FP16 or BF16 is the safer option.
Does TensorRT always improve performance over native frameworks?
In most cases yes, especially on NVIDIA GPUs. NVIDIA TensorRT applies graph fusion and kernel optimisation that can deliver 3–5× speedups. However, for highly dynamic models or very low traffic workloads, the compilation overhead may not be worth it.
How do I choose between static and dynamic batching?
Static batching works well when you control input flow, such as offline jobs. Dynamic batching is better for real-time systems with unpredictable traffic. Tools like NVIDIA Triton Inference Server let you balance latency and throughput by configuring queue delay based on your SLA.
When is multi-GPU inference necessary?
Use multiple GPUs when a single GPU is fully utilised or when the model does not fit in memory. Data parallelism scales throughput linearly, while tensor parallelism is required for very large models. High-speed interconnects like NVLink on GPUs such as NVIDIA H100 SXM significantly improve multi-GPU efficiency.