Key Takeaways
- Distributed inference is not a default step in AI deployment. It is a response to specific, observable constraints.
- Three signals tell you your team needs it: single-node VRAM is exhausted, p99 latency climbs under concurrency, or your model cannot fit on one GPU.
- The techniques that address those constraints (tensor parallelism, pipeline parallelism, KV cache management, and prefill/decode disaggregation) each place specific demands on the infrastructure underneath them.
- Teams that underspecify infrastructure discover the gaps under production pressure, not during planning.
- The infrastructure decision arrives earlier in the process than most teams expect.
Most teams do not start thinking about distributed inference at the right moment. They start thinking about it when something breaks.
p99 latency spikes. A model too large to load on a single GPU. Request queues that back up the moment traffic climbs past a threshold the team thought was comfortable. By that point, the infrastructure decision is no longer a planning question. It is an incident.
In our latest blog, we cover what distributed inference is, the three signals that tell you when your team needs it, and what happens when the infrastructure is not built for the job.
What Is Distributed Inference?
Distributed inference is the practice of splitting a model computation across multiple GPUs, nodes, or servers to serve requests at scale. Rather than loading an entire model onto a single device and processing requests sequentially, the workload is distributed so that inference can run faster, handle more concurrent requests, and operate models too large to fit on a single GPU.
It is not the same problem as that of distributed training. Training distributes gradient computation across GPUs to reduce wall-clock time on a bounded job with a finish line. Inference distributes live request handling to maintain latency and throughput under continuous, unpredictable load. The failure modes are different. The tooling is different. The infrastructure requirements are different.
A team that ran distributed training successfully should not assume that experience transfers directly to production inference at scale. The overlap is smaller than it looks.
Three Signals Your Team Actually Needs It
Distributed inference is not a default step in AI deployment. It is a response to specific, observable constraints. Before committing to the architecture, it helps to know which constraint you are actually solving for.
1. Your model no longer fits on a single GPU
This is the clearest trigger, but it is also the one most often misdiagnosed. Modern frontier models exceed the VRAM capacity of a single GPU at practical inference precision. Llama 3.1 405B needs roughly 810 GB at FP16, 405 GB at FP8, and around 203 GB at INT4. An H100 with 80 GB or H200 with 141 GB cannot hold it at any reasonable precision. Even a B200 with 192 GB is borderline at INT4.
Before jumping to distribution, every team should first answer one question: which precision and quantisation schemes have already been tried (FP8, INT8, INT4, AWQ, GPTQ)? Quantisation is cheaper operationally than distribution, and for many workloads it closes the gap. When it does not, and for frontier models at long context it usually does not, distribution stops being a design choice. The question then becomes which distribution strategy fits the model architecture.
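As a rough sanity check before any architecture discussion, weight memory is parameter count times bytes per parameter; activations and the KV cache add more on top, so treat these as lower bounds. A minimal sketch, using the Llama 3.1 405B figures quoted above:

```python
# Rough lower-bound on weight memory at different precisions.
# Weight memory ~= parameter count * bytes per parameter; activations
# and KV cache come on top, so the real footprint is larger.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params: float, precision: str) -> float:
    """Lower-bound weight footprint in GB for a dense model."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

for precision in ("fp16", "fp8", "int4"):
    gb = weight_memory_gb(405e9, precision)
    print(f"Llama 3.1 405B @ {precision}: ~{gb:.0f} GB")
```

Comparing these numbers against a single GPU's VRAM (80 GB for an H100, 141 GB for an H200) makes the trigger concrete before any benchmarking happens.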
2. p99 latency rises with concurrency despite low GPU utilisation
This is the signal teams most often misread. Inference latency has two components that users feel separately: time to first token (TTFT), the delay before streaming starts, and inter-token latency (ITL), sometimes called time per output token (TPOT), which is the spacing between tokens once streaming begins.
If GPU utilisation is moderate but p99 TTFT or ITL degrades as request concurrency rises, the bottleneck is not compute. It is memory bandwidth, KV cache pressure, or request scheduling. Adding more compute to that problem does not fix it. Distributed inference with the right architecture does, specifically by separating prefill and decode, so they stop competing for the same tensor cores and HBM.
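Both metrics fall straight out of per-token timestamps, which makes them cheap to track per request. A minimal sketch (the timestamp values are illustrative, not measured):

```python
# Compute TTFT and mean ITL from token arrival timestamps.
# t0 is when the request was sent; token_times are the arrival
# times of each streamed token, all in seconds.

def ttft_and_itl(t0: float, token_times: list[float]) -> tuple[float, float]:
    ttft = token_times[0] - t0                    # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0  # mean inter-token latency
    return ttft, itl

# Illustrative: request sent at t=0.0, tokens stream at these times.
ttft, itl = ttft_and_itl(0.0, [0.42, 0.47, 0.53, 0.58, 0.64])
print(f"TTFT = {ttft * 1000:.0f} ms, mean ITL = {itl * 1000:.0f} ms")
```

Plotting the p99 of each metric against concurrent request count, rather than against GPU utilisation, is what exposes this signal.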
3. Throughput is capped and batching has hit its limit
Batching requests together is the standard first step to improve inference throughput on a single node. It works until it does not. When batch size is constrained by VRAM, or when latency targets prevent you from holding requests long enough to fill out a useful batch, you have reached the ceiling of single-node optimisation. Distributed inference is the architecture that exists on the other side of that ceiling.
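The ceiling is often plain arithmetic: per-request KV cache grows linearly with context length, and the batch size that fits is whatever VRAM remains after the weights. A rough sketch, using illustrative Llama-70B-like dimensions with grouped-query attention (the figures are assumptions, not measurements):

```python
# Per-request KV cache size, and the batch size that fits in leftover VRAM.
# Model dimensions below are illustrative (roughly Llama-70B-like with GQA).

def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache per request: K and V tensors for every layer and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

per_request = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=8192)
free_vram_gb = 20.0                      # VRAM left after weights (assumed)
max_batch = int(free_vram_gb // per_request)
print(f"KV cache per request: {per_request:.2f} GB -> max batch ~{max_batch}")
```

When that computed batch size is smaller than your concurrency target, no amount of single-node tuning recovers the difference.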
If none of these signals is present, distributed inference may be an architecture for a problem you do not yet have. That is worth knowing, too. The overhead of building and operating a distributed inference system is real, and teams that introduce it before they need it spend engineering capacity on complexity that is not yet earning its place.
The Techniques and What Each One Requires
Distributed inference is a family of techniques, not one. Each addresses a different constraint and places different demands on the infrastructure underneath.
Data parallelism replicates the whole model and load balances requests across copies, useful when the model fits on one GPU and the bottleneck is volume. Model parallelism splits the model itself across devices and has three subtypes: tensor, pipeline, and expert parallelism.
Production deployments typically combine several at once: data parallelism for throughput, tensor parallelism within a node for latency, pipeline parallelism across nodes for scale.
- Tensor parallelism splits the operations within each layer across GPUs. A single matrix multiplication is partitioned row or column-wise so every GPU computes a shard of the same layer, then synchronises via an all-reduce. It reduces per-request latency but requires extremely low latency, high bandwidth GPU-to-GPU communication, typically NVLink or NVSwitch within a node. If the fabric introduces variance, tensor parallelism degrades faster than its gains justify. It is only as good as the network it runs on.
- Pipeline parallelism assigns different layers to different GPUs, processing requests in stages. It tolerates higher inter-node latency and scales to very large models. The trade-off is pipeline bubbles and idle GPU time between stages, unless batching is managed carefully. Pipeline parallelism buys model scale, tensor parallelism buys speed. Most deployments combine both.
- Expert parallelism applies to mixture-of-experts (MoE) models such as DeepSeek V3 and recent Mixtral and Llama variants, where a router activates only a subset of experts per token. Experts are distributed across GPUs and every token triggers an all-to-all collective. It makes serving large MoE models practical but places extreme demands on collective performance, typically requiring high bandwidth InfiniBand or 800 GbE class fabrics.
- KV cache management addresses the memory constraint that emerges at scale. The KV cache stores intermediate attention computations so the model does not recompute them on every generation step. At scale, with long contexts and many concurrent users, it can consume most of the VRAM. When it fills, production engines preempt in-flight requests by pausing generation, swapping KV blocks to host memory, or recomputing on resume, which shows up as non-deterministic tail latency spikes. PagedAttention, used in vLLM, mitigates this by allocating KV memory in fixed-size blocks rather than contiguous buffers, reducing fragmentation and increasing concurrency before preemption.
- Prefill/decode disaggregation separates the two phases of inference that compete for different resources on shared hardware. Prefill processes the prompt in one large forward pass and is compute-bound. Decode generates tokens one at a time and is memory bandwidth bound. On the same node, a long prefill stalls decode for other in-flight requests and TTFT spikes unpredictably. Disaggregation routes each phase to a dedicated pool and can significantly improve goodput under bursty load.
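The tensor-parallel pattern described above can be simulated on CPU to build intuition: split the weight matrix along its input dimension across "ranks", have each rank compute a partial product, then combine the partials with a sum, which is exactly what the all-reduce does across real GPUs. A toy NumPy sketch, not tied to any serving framework:

```python
import numpy as np

# Toy simulation of row-parallel tensor parallelism for y = x @ W.
# W is split along its input (row) dimension across "GPUs"; each rank
# computes a partial product, and an all-reduce (here a plain sum)
# combines the partials into the full result.

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1024))        # batch of activations
W = rng.standard_normal((1024, 512))      # full weight matrix

world_size = 4
x_shards = np.split(x, world_size, axis=1)   # each rank sees a slice of x
W_shards = np.split(W, world_size, axis=0)   # matching rows of W

partials = [xs @ ws for xs, ws in zip(x_shards, W_shards)]
y_parallel = sum(partials)                # stands in for the all-reduce

y_reference = x @ W                       # single-device result
print("max error:", np.abs(y_parallel - y_reference).max())
```

In production the `sum(partials)` line is a network operation on every layer of every forward pass, which is why fabric latency and variance dominate tensor-parallel performance.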
One technique worth naming, even though it lives within a node, is continuous batching. Modern engines such as vLLM, TensorRT-LLM, SGLang, NVIDIA Dynamo, and TGI add and evict requests from a running batch mid-generation. Combined with PagedAttention, it delivers one of the largest single-node throughput gains and should be exhausted before taking on the operational cost of distribution.
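The continuous batching idea can be illustrated with a toy scheduler: at every generation step, finished sequences are evicted and queued requests are admitted into the freed slots, instead of waiting for the whole batch to drain. A simplified sketch (real engines such as vLLM also manage KV memory per step, which this ignores):

```python
from collections import deque

# Toy continuous batching: requests with different output lengths share
# a running batch; finished requests are evicted mid-generation and
# queued requests are admitted into the freed slots.

def continuous_batching(output_lengths, max_batch):
    queue = deque(enumerate(output_lengths))  # (request_id, tokens_remaining)
    running, steps, completions = [], 0, []
    while queue or running:
        # Admit queued requests into any free batch slots.
        while queue and len(running) < max_batch:
            running.append(list(queue.popleft()))
        steps += 1
        for req in running:
            req[1] -= 1                       # one token per request per step
        completions += [(req[0], steps) for req in running if req[1] == 0]
        running = [req for req in running if req[1] > 0]
    return steps, completions

steps, completions = continuous_batching([3, 10, 2, 5], max_batch=2)
print(steps, completions)
```

With these illustrative lengths the work finishes in 10 steps; static batching on the same slots would need 15, because each batch runs as long as its slowest member.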
| Technique | What it splits | Primary benefit | Key requirement |
|---|---|---|---|
| Data parallelism | Full model replicas | Throughput | Standard networking |
| Tensor parallelism | Ops within a layer | Per-request latency | NVLink or NVSwitch within the node |
| Pipeline parallelism | Layers across GPUs | Model scale | Moderate inter-node bandwidth |
| Expert parallelism | MoE experts | Serving large MoE | Fast all-to-all communication |
| KV cache management | Memory usage | Concurrency and long context | VRAM and fast offload |
| Prefill/decode disaggregation | Inference phase | Predictable TTFT | Dedicated hardware pools |
What Goes Wrong When Infrastructure Is Underspecified
The techniques above assume consistent, predictable access to compute, memory and network. On shared, multi-tenant infrastructure, that assumption does not hold reliably.
Tensor parallelism is sensitive to network variance. Co-located workloads on shared infrastructure introduce contention at exactly the layer where consistency matters most. Benchmark results from a quiet cluster and a loaded cluster are not the same number. Engineering teams making architecture decisions based on throughput figures gathered on shared infrastructure are making those decisions on data that will not be reproduced in production.
KV cache behaviour is equally affected. Memory pressure from other tenants changes eviction patterns in ways that are opaque and outside your control. What presents as a performance tuning problem is often a tenancy problem wearing a different mask.
For teams in regulated industries, the exposure extends beyond performance. Shared infrastructure means shared risk boundaries. Isolation controls that clear an InfoSec review at the policy level may not hold at the infrastructure layer. Audit evidence that references shared-tenancy environments raises questions that take time to answer. That time tends to appear at the worst moments: during security reviews, procurement cycles, or client due diligence conversations that were supposed to close a deal.
None of this means shared infrastructure cannot run inference workloads. It means the techniques that make distributed inference perform well require conditions that shared infrastructure does not consistently provide. The gap between what you test and what you run in production is where SLA commitments get missed.
What the Right Infrastructure Looks Like
Hyperstack's Secure Private Cloud is ideal for distributed inference workloads at enterprise scale.
Single-tenant, dedicated GPU clusters remove the source of the variance described above. The network bandwidth your inference architecture was designed to use is the bandwidth it gets under load at peak traffic, with no competing workloads changing the conditions underneath your benchmarks. Tensor parallelism performs as measured. KV cache eviction behaves as configured. Prefill and decode disaggregation operate without resource contention from other tenants.
Networking is selected to match the communication patterns of your inference architecture specifically. RoCE Ethernet or InfiniBand fabrics are chosen based on workload scale and the latency profile your distribution strategy requires. NVIDIA ConnectX-8 SuperNICs are deployed where GPU-to-GPU communication bandwidth is the limiting factor. The right fabric for tensor parallelism is not the same as the right fabric for pipeline parallelism across many nodes. The deployment is designed around your architecture, not the other way around.
Storage is specified to match inference and pipeline demands. Local NVMe handles high-throughput KV-cache and fast checkpointing; persistent volumes store model weights across sessions, and secure object storage ensures durable retention. NVIDIA CMX extends GPU memory capacity with a dedicated, pod-level KV cache tier, reducing latency and improving throughput for long-context inference workloads.
The Secure Private Cloud deployment model is bespoke and not self-serve. In practice, your requirements are reviewed, the architecture is designed against your workload and compliance obligations, and the environment is built and validated against agreed acceptance criteria before production traffic touches it. For engineering directors accountable for delivery dates and audit readiness, that process is not an overhead. It is the basis on which commitments to the rest of the organisation can be made with confidence.
The decision to move to distributed inference is not a single moment. It is a sequence of signals that accumulate until the question becomes unavoidable. The teams that manage it well are the ones who read those signals early and make the infrastructure decision before it is forced on them by a production incident.
Secure Your Infrastructure with Hyperstack Private Cloud
If your team is approaching any of the three thresholds described here, the infrastructure conversation is worth having now.
Speak to the Hyperstack Team about our Secure Private Cloud deployments to ensure your scaling remains predictable and protected.
FAQs
What is distributed inference and why is it used?
Distributed inference splits model execution across GPUs or nodes to improve latency, throughput, and enable models too large for one device.
How is distributed inference different from distributed training?
Training distributes compute to finish jobs faster, while inference distributes live requests to maintain latency, throughput, and reliability under unpredictable demand.
When should a team consider distributed inference?
When models exceed single-GPU memory, latency increases under concurrency, or throughput plateaus despite batching, indicating architectural limits of single-node inference systems.
What causes p99 latency spikes during inference?
Latency spikes often result from memory bandwidth limits, KV-cache pressure, or scheduling contention rather than insufficient compute, especially under concurrent request loads.
What is the KV cache and why does it matter?
KV cache stores intermediate attention states, reducing recomputation, but at scale it consumes VRAM heavily, requiring efficient management to avoid latency spikes.
What is prefill and decode disaggregation?
It separates prompt processing and token generation onto different resources, preventing contention and ensuring consistent time-to-first-token under varying workload conditions.
Why does infrastructure choice matter for distributed inference?
Distributed inference depends on predictable network, memory, and compute performance; shared infrastructure introduces variability that impacts latency, throughput, and system reliability.
What role does storage play in inference performance?
Storage supports KV-cache operations, model persistence, and data durability, while technologies like NVIDIA CMX extend memory capacity to reduce long-context latency.