Key Takeaways
- Distributed inference bottlenecks are usually caused by infrastructure limitations, not model architecture or software configuration issues.
- High-speed interconnects like InfiniBand significantly reduce all-reduce latency and improve multi-node inference scalability.
- Paged attention dramatically improves KV cache utilisation and reduces GPU memory fragmentation during long-context inference workloads.
- Excessive tensor parallelism can reduce GPU efficiency due to increased synchronisation and communication overhead between devices.
- Predictable inference performance requires purpose-built infrastructure with reliable networking, storage bandwidth, failover capacity and operational visibility.
The model passes every test on a single node. Latency is within target, throughput is acceptable, and GPU utilisation looks clean. Then you scale to a 64-GPU cluster, and the numbers fall apart. Latency doubles. Throughput per GPU drops. A single node failure stalls the entire job. The model did not change. The infrastructure assumptions did.
Distributed inference challenges do not announce themselves during development. They surface in production at scale under real load. The gap between single-node performance and multi-node reality is where engineering schedules slip and SLA commitments become difficult conversations. This article names the most common distributed inference issues, explains the mechanism behind each one and sets out what a resolution looks like in practice.
Challenge 1: All-Reduce Communication Overhead
In tensor-parallel inference, model weights are split across multiple GPUs. Each GPU processes a portion of the computation. Before the output of each layer can move forward, the partial results from all GPUs must be aggregated. It is a collective operation called all-reduce.
All-reduce requires every participating GPU to send and receive data from every other GPU in the group. On a ring topology, total cluster traffic scales linearly with the number of devices and per-step latency grows with the ring length, both of which degrade performance as GPU count increases. At 16-bit precision with a hidden dimension of 8,192, the two all-reduces per transformer layer (one after self-attention, one after the MLP) move on the order of 1 GB of activation data combined at typical prefill batch sizes. For a 96-layer model, that puts total cross-GPU traffic on the order of 100 GB per forward pass, not counting KV cache transfers.
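The arithmetic behind those figures is simple enough to sketch. The snippet below is a back-of-the-envelope estimate only, assuming roughly 32,000 prompt tokens in flight during prefill; real traffic depends on how the serving framework implements and overlaps its collectives.

```python
# Back-of-the-envelope all-reduce traffic for tensor-parallel inference.
# Illustrative only; actual traffic depends on the framework's collective
# implementation, overlap strategy and topology.

def allreduce_traffic_gb(
    hidden_dim: int = 8192,            # model hidden dimension
    num_layers: int = 96,              # transformer layers
    tokens_in_flight: int = 32_000,    # assumed prefill batch, in tokens
    bytes_per_value: int = 2,          # FP16/BF16
    allreduces_per_layer: int = 2,     # one after attention, one after the MLP
) -> float:
    """Approximate activation bytes moved by all-reduce per forward pass, in GB."""
    per_token_activation = hidden_dim * bytes_per_value        # ~16 KB at 8,192 / FP16
    per_allreduce = per_token_activation * tokens_in_flight    # payload of one all-reduce
    total = per_allreduce * allreduces_per_layer * num_layers
    return total / 1e9

print(f"~{allreduce_traffic_gb():.0f} GB of activations per forward pass")  # ~101 GB
```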
The right networking fabric is load-bearing here. InfiniBand NDR (400 Gb/s per port) or RoCE with RDMA support over lossless Ethernet reduces the per-all-reduce latency from milliseconds to microseconds. Beyond fabric choice, reducing tensor parallelism degree where batch size allows and using pipeline parallelism across nodes instead cuts all-reduce frequency. The correct degree of tensor parallelism depends on the model architecture and the available inter-GPU bandwidth, not a default setting.
Challenge 2: KV Cache Fragmentation Across Nodes
In multi-node inference, the KV cache for a single request may be spread across GPUs on different nodes. When requests complete and new ones arrive, the freed memory blocks are non-contiguous. New requests cannot always use the available space efficiently, leaving GPU memory underutilised while the system reports it as free.
GPU memory allocators were not designed for the append-only, variable-length access pattern of KV cache. A GQA-based 70B model at FP16, such as Llama 3 70B, costs roughly 320 KB of KV cache per token per request. At a batch size of 64 with sequences averaging 8,000 tokens, that is approximately 160 GB of KV cache demand. When requests finish at different times and memory is returned in fragments, the allocator cannot always coalesce those fragments into contiguous blocks large enough for incoming requests.
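As a rough sizing sketch, assuming Llama-3-70B-like cache shapes (80 layers, 8 KV heads, head dimension 128, FP16 keys and values), the per-token and per-batch figures fall out directly:

```python
# Approximate KV cache footprint for a GQA model, assuming Llama-3-70B-like
# shapes: 80 layers, 8 KV heads, head dimension 128, FP16 keys and values.

def kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2):
    # Both keys and values are cached, hence the final factor of 2.
    return num_layers * num_kv_heads * head_dim * bytes_per_value * 2  # 327,680 B ~= 320 KB

def kv_cache_demand_gib(batch_size=64, avg_tokens=8_000):
    return batch_size * avg_tokens * kv_bytes_per_token() / 2**30

print(f"{kv_bytes_per_token() / 1024:.0f} KB per token")            # 320 KB
print(f"~{kv_cache_demand_gib():.0f} GiB at batch 64, 8k tokens")   # ~156 GiB, i.e. roughly 160 GB
```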
Paged attention, the mechanism underlying vLLM and similar serving frameworks, addresses fragmentation by dividing KV cache into fixed-size blocks (pages), analogous to virtual memory in an operating system. Requests do not need contiguous memory; they hold a set of pages that can be scattered across the memory space. This raises effective GPU memory utilisation from around 40-60% under naive allocation to over 90% under paged allocation. The infrastructure implication is that the serving framework matters as much as the hardware.
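As a toy illustration of the idea (not vLLM's actual allocator), a paged KV cache amounts to a shared pool of fixed-size pages plus a per-request block table:

```python
# Minimal illustration of paged KV cache allocation: fixed-size pages drawn from
# a shared free list and tracked per request in a block table. A simplified
# model of the idea behind paged attention, not a real serving-engine allocator.

class PagedKVCache:
    def __init__(self, total_pages: int, page_size_tokens: int = 16):
        self.page_size = page_size_tokens
        self.free_pages = list(range(total_pages))   # physical page ids
        self.block_tables = {}                        # request_id -> list of page ids

    def allocate(self, request_id: str, num_tokens: int) -> None:
        pages_needed = -(-num_tokens // self.page_size)   # ceiling division
        if pages_needed > len(self.free_pages):
            raise MemoryError("no free KV cache pages; request must wait or be preempted")
        # Pages need not be contiguous: any free page can serve any request.
        self.block_tables[request_id] = [self.free_pages.pop() for _ in range(pages_needed)]

    def free(self, request_id: str) -> None:
        # Freed pages return straight to the shared pool, so nothing fragments.
        self.free_pages.extend(self.block_tables.pop(request_id))
```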
Challenge 3: Load Imbalance Across Replicas
Distributed inference typically runs multiple model replicas in parallel, with a load balancer routing incoming requests across them. If requests vary significantly in sequence length, some replicas process short prompts and sit idle while others are backed up on long-context requests for seconds at a time.
A naive round-robin or random load balancer has no visibility into the current state of each replica's KV cache, queue depth, or estimated completion time. A replica processing a 32,000-token request takes roughly 8 times longer to complete than one processing a 4,000-token request. During that time, it cannot accept new work, while other replicas cycle through multiple short requests. At scale, this produces a GPU utilisation profile that looks efficient in aggregate, but masks severe per-replica imbalance.
Continuous batching, where the serving engine slots new requests into a running batch as previous ones complete, rather than waiting for the entire batch to finish, resolves the static batching imbalance problem. Beyond that, request-aware routing that considers each replica's current queue depth and estimated time-to-completion reduces head-of-line blocking. Tools like vLLM and TGI implement continuous batching by default. This works when inter-node latency is low enough that the coordination overhead of dynamic batching does not erode the gains.
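A minimal sketch of request-aware routing is shown below, with an assumed cost model in which backlog grows with queued tokens; a production router would read live queue metrics from the serving engine rather than track them itself.

```python
# Sketch of request-aware routing: pick the replica with the smallest estimated
# backlog instead of round-robin. The replica state and cost model are
# illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    queued_tokens: int = 0   # tokens waiting or in flight on this replica

    def estimated_backlog(self) -> float:
        # Crude cost model: completion time grows roughly with queued tokens.
        return float(self.queued_tokens)

def route(replicas, prompt_tokens):
    # Choose the least-loaded replica, not the next one in line.
    target = min(replicas, key=lambda r: r.estimated_backlog())
    target.queued_tokens += prompt_tokens
    return target

replicas = [Replica("replica-0"), Replica("replica-1"), Replica("replica-2")]
for prompt in (32_000, 4_000, 4_000, 4_000):
    print(f"{prompt:>6}-token request -> {route(replicas, prompt).name}")
```

With this policy the 32,000-token request lands on one replica and the short requests are spread across the others instead of queuing behind it.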
Challenge 4: Tensor Parallelism Degree vs. Batch Efficiency
Increasing the tensor parallelism degree and splitting the model across more GPUs reduces per-device memory pressure and allows larger models to fit. It also increases all-reduce frequency and reduces the arithmetic intensity per GPU. Beyond a certain point, adding GPUs to a tensor-parallel group makes inference slower, not faster.
Tensor parallelism is communication-bound, not compute-bound, at small batch sizes. At batch size 1, a single forward pass through a 70B model split across 8 GPUs spends more time on all-reduce synchronisation than on matrix multiplication. The compute-to-communication ratio improves as batch size increases, because the all-reduce cost is amortised across more tokens. At batch size 1 with TP=8, measured GPU utilisation on NVIDIA A100 clusters often sits below 30% despite the model being too large to fit on fewer GPUs.
The optimal tensor parallelism degree is the minimum required to fit the model in GPU memory, not the maximum available. For a 70B model at FP16, such as Llama 3 70B, weights alone require approximately 140 GB, which can fit across 2 NVIDIA H100 SXM GPUs (each with 80 GB HBM3) — though practical serving deployments typically require additional memory headroom for KV cache, runtime buffers and long-context workloads, which is why TP=4 is more common in production. Pipeline parallelism across nodes, combined with data parallelism across replicas, typically delivers better throughput-per-GPU than pushing tensor parallelism to a degree higher than memory constraints require.
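A rough way to derive that minimum degree is sketched below, assuming a 30% memory headroom reservation for KV cache, activations and runtime buffers; the exact fraction is workload-dependent.

```python
import math

# Rough sizing for tensor parallelism degree: the smallest power-of-two GPU
# count whose combined memory fits the weights plus a headroom fraction.
# The 30% headroom figure is an assumption, not a rule.

def min_tensor_parallel_degree(
    params_billion: float = 70,
    bytes_per_param: int = 2,        # FP16
    gpu_memory_gb: float = 80,       # e.g. H100 SXM HBM3
    headroom_fraction: float = 0.3,  # reserved for KV cache, activations, buffers
) -> int:
    weights_gb = params_billion * bytes_per_param          # ~140 GB for a 70B FP16 model
    usable_per_gpu = gpu_memory_gb * (1 - headroom_fraction)
    degree = math.ceil(weights_gb / usable_per_gpu)
    # Tensor-parallel groups are normally powers of two.
    return 2 ** math.ceil(math.log2(degree))

print(min_tensor_parallel_degree())   # -> 4 with the assumptions above
```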
Challenge 5: Fault Tolerance When a Single Node Fails
In a tensor-parallel or pipeline-parallel inference deployment, all participating GPUs must be available for the forward pass to complete. A single GPU failure — such as a hardware fault, driver crash or NVLink error — renders the entire model replica unavailable until the failed node is replaced, or the workload is rescheduled.
Distributed inference is a tightly coupled operation. Each GPU holds a shard of the model weights and is a required participant in every forward pass. There is no partial result without all shards. Unlike distributed training where a checkpoint allows the job to restart from a known state, inference has no equivalent recovery mechanism for in-flight requests. Those requests are dropped and any SLA commitment around response time breaks.
The resolution operates at two levels. At the infrastructure level, the hardware must be reliable enough that node failures are rare events rather than expected operational conditions. This means dedicated, single-tenant GPU environments where hardware health monitoring is owned by the operator, not inferred from workload degradation. At the architecture level, model replicas should be sized to allow failover: if a replica fails, the load balancer must redirect to a surviving replica without cascading the failure into a queue backlog that overwhelms remaining capacity. Spare capacity is not waste; it is the difference between a degraded service and a dropped one.
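A minimal capacity check makes the point, assuming each replica sustains a known peak request rate; the figures below are placeholders.

```python
# Minimal failover capacity check. The numbers are placeholders: the point is
# that a fleet sized exactly to peak load cannot absorb the loss of one replica.

def survives_single_failure(num_replicas: int, per_replica_rps: float, peak_rps: float) -> bool:
    surviving_capacity = (num_replicas - 1) * per_replica_rps
    return surviving_capacity >= peak_rps

# 4 replicas at 10 req/s each, sized exactly to 40 req/s of peak load:
print(survives_single_failure(4, 10, 40))   # False: one failure overwhelms the rest
# The same fleet with one replica of spare capacity:
print(survives_single_failure(5, 10, 40))   # True: degraded, not dropped
```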
Challenge 6: Prefill/Decode Disaggregation Latency
Inference has two distinct phases: prefill (processing the input prompt in a single forward pass) and decode (generating output tokens one at a time). These phases have different compute profiles. Prefill is compute-bound and processes many tokens in parallel. Decode is memory-bandwidth-bound and processes one token per step. Running both on the same GPU fleet means each phase interferes with the other.
When prefill and decode share the same GPUs, long prefill operations occupy the device and block decode steps for other in-flight requests. A 128,000-token prefill on an NVIDIA H100 takes approximately 10-15 seconds. Every decode step for every other request in the batch waits behind it. This produces a recognisable latency pattern: time-to-first-token is unpredictable, and perceived responsiveness degrades sharply under concurrent load even when per-request GPU utilisation looks acceptable.
Prefill/decode disaggregation routes the two phases to separate GPU pools. Prefill-optimised nodes handle prompt processing; decode-optimised nodes handle generation. This removes head-of-line blocking and makes time-to-first-token predictable. The trade-off is infrastructure complexity: disaggregation requires a high-speed network fabric between prefill and decode nodes to transfer KV cache entries without introducing transfer latency that erodes the gains. At 320 KB per token for a GQA-based 70B model, transferring a 128k-context KV cache between nodes moves approximately 40 GB of data per request, and that volume scales linearly with concurrent disaggregated requests, which is why the handoff only stays within acceptable latency bounds over a fabric such as InfiniBand or RoCE.
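As a rough estimate of the handoff cost, assuming the full line rate of a 400 Gb/s fabric is usable for the transfer (real transfers pay additional protocol and serialisation overhead):

```python
# Rough estimate of the KV cache transfer cost when handing a request from a
# prefill node to a decode node. The bandwidth figure is an assumed effective
# rate, not a guaranteed one.

def kv_transfer_seconds(
    context_tokens: int = 128_000,
    kv_bytes_per_token: int = 320 * 1024,      # ~320 KB for a GQA-based 70B model
    effective_bandwidth_gbps: float = 400.0,   # e.g. InfiniBand NDR, assumed fully usable
) -> float:
    total_bytes = context_tokens * kv_bytes_per_token           # ~40 GB at 128k context
    return total_bytes * 8 / (effective_bandwidth_gbps * 1e9)   # seconds on the wire

print(f"{kv_transfer_seconds():.2f} s per 128k-context handoff")   # ~0.84 s at 400 Gb/s
```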
Challenge 7: Cold-Start Latency for Large Model Loading
When a model replica starts after a crash recovery, a scale-out event or a maintenance window, it must load model weights from storage into GPU memory before it can serve requests. For a 70B model at FP16, that is 140 GB per replica. At typical NVMe-to-GPU transfer rates, this takes 60-120 seconds per replica before the first request can be served.
GPU HBM is volatile. Weights must be reloaded from persistent storage every time a replica initialises. The transfer rate is bounded by the storage I/O path: a single NVMe drive saturates at around 7 GB/s sequential read. Loading 140 GB sequentially takes 20 seconds at that rate, and real-world paths through drivers, PCIe bus and CUDA memory allocation add overhead. Across a 16-replica deployment recovering from a cluster restart, the cumulative time before full serving capacity is restored can exceed 20 minutes.
Two mechanisms reduce cold-start latency in practice. First, parallel weight loading across multiple NVMe devices, which is straightforward on high-density GPU servers where storage bandwidth scales with the number of drives. Second, keeping at least one warm replica available during maintenance windows rather than restarting all replicas simultaneously — this trades infrastructure costs for service continuity. At the storage architecture level, parallel filesystems with high aggregate bandwidth (deployed as part of a purpose-built inference environment) reduce per-replica load time from minutes to seconds.
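A simple estimate shows why aggregate storage bandwidth is the lever that matters, assuming the ~7 GB/s per-drive figure above and ignoring driver, PCIe and CUDA allocation overhead:

```python
# Cold-start estimate: time to stream model weights as a function of aggregate
# storage bandwidth. Drive count and per-drive throughput are assumptions;
# real load times add driver, PCIe and CUDA allocation overhead on top.

def weight_load_seconds(weights_gb: float = 140, drives: int = 1, gb_per_s_per_drive: float = 7.0) -> float:
    return weights_gb / (drives * gb_per_s_per_drive)

for n in (1, 4, 8):
    print(f"{n} NVMe drive(s): ~{weight_load_seconds(drives=n):.0f} s to stream 140 GB of weights")
```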
Fix Distributed Inference Issues at the Infrastructure Level
Most of the challenges of distributed inference described here are not software problems. They are infrastructure problems with software symptoms. All-reduce overhead disappears when the networking fabric matches the workload. KV cache fragmentation is manageable when memory allocation is purpose-built for inference access patterns. Fault tolerance is achievable when hardware reliability and operational monitoring are owned at the infrastructure level, not improvised by the application team.
The distributed inference drawbacks that show up in production, such as unpredictable latency, throughput that does not scale linearly with GPUs, and slow recovery from faults, are almost always symptoms of the same root cause: running demanding workloads on infrastructure that was not designed for them.
Hyperstack runs large-scale distributed inference on NVIDIA B300/H100 SXM clusters with:
- Up to 350 Gbps InfiniBand inter-node networking
- Managed Kubernetes and SLURM orchestration
- Reserved capacity with committed and predictable pricing
If your inference workload runs in a regulated environment or requires full data isolation, Hyperstack's Secure Private Cloud gives you a dedicated, single-tenant GPU deployment with no shared tenancy, cross-tenant exposure or noisy-neighbour variance on networking or memory.
Choose the operational model that fits how your team works: Metal Only for full stack ownership, Managed Metal for infrastructure offload, Managed Platform for Kubernetes or SLURM cluster management, or Dedicated Cloud for a fully managed inference environment with dynamic GPU allocation via portal and API.
The infrastructure is yours. The performance is predictable.
FAQs
Why does inference performance degrade when scaling from a single node to a multi-node cluster?
Single-node benchmarks do not expose inter-node communication overhead. In distributed inference, GPUs constantly exchange partial results through operations like all-reduce. As cluster size grows, network latency, synchronisation overhead and KV cache transfers become dominant bottlenecks, reducing throughput-per-GPU and increasing latency.
What causes all-reduce communication overhead in distributed inference?
All-reduce occurs when GPUs participating in tensor parallelism must exchange and aggregate outputs after each layer. At large GPU counts, this creates massive east-west traffic across the cluster. Without high-bandwidth, low-latency networking like InfiniBand or RoCE with RDMA, communication time can exceed actual compute time.
Why can adding more GPUs make inference slower instead of faster?
Tensor parallelism increases communication overhead between GPUs. At low batch sizes, GPUs spend more time synchronising through all-reduce operations than performing matrix computations. Beyond the minimum number of GPUs required to fit the model into memory, increasing tensor parallelism often reduces overall efficiency and throughput.
What is prefill/decode disaggregation and why does it matter?
Inference consists of two phases with different hardware requirements: prefill is compute-heavy, while decode is memory-bandwidth-heavy. Running both on the same GPUs creates contention and unpredictable latency. Prefill/decode disaggregation separates these phases onto dedicated GPU pools, improving responsiveness and stabilising time-to-first-token performance under load.
Why is infrastructure critical for reliable distributed inference?
Most large-scale inference failures are infrastructure-related rather than model-related. Network fabric, storage bandwidth, GPU reliability, memory allocation strategy and failover design directly determine latency consistency, scaling efficiency and fault tolerance. Infrastructure purpose-built for distributed inference reduces communication bottlenecks, improves recovery times and ensures predictable performance at scale.