Key Takeaways
- Adding GPUs doesn't guarantee better performance. Communication overhead between devices can increase latency before it improves throughput, making interconnect bandwidth the real bottleneck in multi-GPU inference setups.
- The parallelism strategy must match your workload. Tensor, pipeline, sequence, and expert parallelism each suit different latency profiles. Choosing the wrong one for your sequence lengths and SLA targets can silently destroy production performance.
- Shared infrastructure introduces unpredictable variance. Multi-tenant network fabrics cause p99 latency to shift day to day, making acceptance-test benchmarks unreliable and SLA commitments difficult to honour.
- Topology decides your performance ceiling. Eight GPUs on NVLink versus eight across Ethernet nodes are fundamentally different systems. Hardware topology must be validated against model size before any software-level tuning begins.
- Regulated workloads need auditable, single-tenant infrastructure. Data residency, network isolation, and access controls are not software configurations. They require dedicated infrastructure built to meet DORA, PRA SS2/21, and EU AI Act requirements.
You added more GPUs. The model loads. The latency is worse.
Nobody warned you that scaling inference horizontally is a coordination problem, not a procurement one. The hardware is the easy part. Getting GPUs to work together without eating each other’s gains is where most teams lose weeks they didn’t budget for.
Multi-GPU inference is now a requirement for any organisation running large language models in production. Models that barely fit on a single 80GB GPU yesterday are table stakes today. But the jump from single-GPU to multi-GPU inference surfaces a category of problems that don’t show up in benchmarks until you’re already past your go-live date.
Here’s what’s actually happening and the questions your infrastructure team needs to answer before you’re staring at a latency regression in production.
Why Multi-GPU Inference Fails to Scale Linearly
The assumption most teams bring to multi-GPU inference is additive: two GPUs should give roughly twice the throughput of one. In practice, the relationship is much messier.
The root cause is communication overhead. In single-GPU inference, all the computation happens in one place. The model weights sit in VRAM, attention is computed locally, and the result is returned. Add a second GPU and the model must now be partitioned across devices, and at every partition boundary the GPUs need to synchronise.
That synchronisation has a cost. Every forward pass generates inter-GPU communication: depending on how the model is partitioned, intermediate activations, partial attention outputs and, in some schemes, KV cache data need to be passed between devices. The latency of that communication depends on the interconnect topology between GPUs, the bandwidth available across it, and how well the partitioning strategy maps to the model's architecture. Collective operations like all-reduce, handled in practice by libraries such as NCCL, are where this cost surfaces most directly.
The key question to ask your infrastructure team: what is the actual bandwidth available between our GPUs, and does the model’s communication pattern fit it? Two GPUs on the same NVLink fabric behave very differently from two GPUs communicating over PCIe or across nodes. If the interconnect is the bottleneck, adding GPUs will make tail latency worse before it makes throughput better.
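To put rough numbers on that question, the sketch below estimates the time for a single ring all-reduce with the textbook latency-plus-bandwidth cost model. It is a simplification, not a description of what NCCL does internally (NCCL chooses between several algorithms at runtime), and the link bandwidths and per-hop latencies are illustrative assumptions, not measurements.

```python
def ring_allreduce_seconds(message_bytes, num_gpus, link_bytes_per_s, hop_latency_s=0.0):
    """Estimate one ring all-reduce: 2 * (N - 1) steps, each moving 1/N of the message."""
    if num_gpus < 2:
        return 0.0  # a single GPU has nothing to synchronise
    steps = 2 * (num_gpus - 1)  # reduce-scatter phase + all-gather phase
    bytes_per_step = message_bytes / num_gpus
    return steps * (hop_latency_s + bytes_per_step / link_bytes_per_s)


# Hypothetical (bandwidth in bytes/s, per-hop latency in s) pairs,
# chosen only to show the spread between fabric classes.
ASSUMED_LINKS = {
    "NVLink-class": (400e9, 2e-6),
    "PCIe-class": (50e9, 5e-6),
    "Ethernet-class (100 Gb/s)": (12.5e9, 20e-6),
}

# One activation tensor: 1,024 tokens x 8,192 hidden size x 2 bytes (16-bit values).
message_bytes = 1024 * 8192 * 2

for name, (bandwidth, latency) in ASSUMED_LINKS.items():
    seconds = ring_allreduce_seconds(message_bytes, 8, bandwidth, latency)
    print(f"{name:28s} {seconds * 1e6:10.1f} microseconds per all-reduce")
```

Multiply the per-call figure by the number of all-reduces in a forward pass and the gap between fabrics stops being a rounding error. Replace the assumed figures with bandwidth measured on your own cluster before drawing conclusions.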
Multi-GPU LLM Inference and the Parallelism Decision
Not all GPU parallelism for inference is the same. The dominant strategies — tensor parallelism, pipeline parallelism, sequence parallelism, and expert parallelism for MoE models — have different communication profiles, different latency characteristics, and different failure modes under load.
- Tensor parallelism splits individual matrix operations across GPUs. It requires extremely tight synchronisation at every layer, usually via an all-reduce operation, which means it only performs well when GPU-to-GPU bandwidth is very high. It’s the right choice when you have fast interconnects and need to minimise time-to-first-token. The synchronisation cost, however, means it rarely scales efficiently beyond a single eight-GPU node.
- Pipeline parallelism distributes entire layers across GPUs, so GPUs communicate less frequently but in larger chunks. The latency profile is different: pipeline parallelism increases time-to-first-token because the first GPU must finish its layers before the next one starts, but it can improve overall throughput on long sequences. The question isn’t which is better in the abstract. It’s what fits your latency budget and sequence length profile.
- Sequence and context parallelism partition along the sequence dimension for very long contexts, with their own coordination cost around attention.
- Expert parallelism applies specifically to Mixture-of-Experts models (Mixtral, DeepSeek-V3, Qwen MoE), where token routing creates all-to-all communication patterns that neither tensor nor pipeline parallelism accounts for. If you are serving an MoE model, the parallelism decision is materially different.
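The difference in communication profile between tensor and pipeline parallelism can be sketched with simple arithmetic. The example below assumes a Megatron-style tensor-parallel layout with two all-reduces per transformer layer in the forward pass, and a pipeline that hands one activation tensor across each stage boundary. The model shape and function names are illustrative, and the byte counts are tensor payloads, not the traffic each collective actually puts on the wire.

```python
def activation_bytes(batch_tokens, hidden_size, bytes_per_value=2):
    """Size of one activation tensor for the tokens currently in flight."""
    return batch_tokens * hidden_size * bytes_per_value


def tensor_parallel_traffic(num_layers, batch_tokens, hidden_size, allreduces_per_layer=2):
    """Return (sync_points, payload_bytes) for one forward pass under tensor parallelism."""
    sync_points = num_layers * allreduces_per_layer
    return sync_points, sync_points * activation_bytes(batch_tokens, hidden_size)


def pipeline_parallel_traffic(num_stages, batch_tokens, hidden_size):
    """Return (sync_points, payload_bytes) for one forward pass under pipeline parallelism."""
    sync_points = num_stages - 1  # one hand-off per stage boundary
    return sync_points, sync_points * activation_bytes(batch_tokens, hidden_size)


# Assumed model shape: 80 layers, hidden size 8,192 (a 70B-class dense transformer).
for label, tokens in (("decode step, 1 token", 1), ("prefill, 2,048 tokens", 2048)):
    tp = tensor_parallel_traffic(80, tokens, 8192)
    pp = pipeline_parallel_traffic(8, tokens, 8192)
    print(f"{label}: TP {tp[0]} syncs / {tp[1]:,} bytes, PP {pp[0]} syncs / {pp[1]:,} bytes")
```

Under these assumptions tensor parallelism synchronises 160 times per forward pass against 7 hand-offs for an eight-stage pipeline, which is why it is so sensitive to interconnect latency as well as bandwidth. The all-to-all traffic of expert parallelism is not modelled here.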
The Right Questions Here
- What is the latency profile of our actual workload?
- Are we optimising for time-to-first-token, throughput, tail latency at the 99th percentile, or tokens per second per GPU — the metric that ultimately drives cost per generated token?
- Which parallelism strategy does our interconnect topology support without becoming the bottleneck?
If your infrastructure team can't answer that last question in terms of specific interconnect specs, that's a signal the supporting hardware choice hasn't been validated for the workload.
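Tokens per second per GPU translates into cost per generated token with a few lines of arithmetic. The sketch below uses a hypothetical GPU-hour price and hypothetical throughput figures. None of them are quotes or benchmarks, and the function names are illustrative.

```python
def cost_per_million_tokens(gpu_hourly_price, num_gpus, tokens_per_second):
    """Infrastructure cost of generating one million tokens at a sustained rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    cost_per_second = gpu_hourly_price * num_gpus / 3600
    return cost_per_second / tokens_per_second * 1_000_000


def tokens_per_second_per_gpu(tokens_per_second, num_gpus):
    """Throughput normalised by GPU count."""
    return tokens_per_second / num_gpus


ASSUMED_HOURLY_PRICE = 3.60  # hypothetical price per GPU-hour, in any currency

# Hypothetical measurements: eight times the GPUs deliver only five times the throughput.
for num_gpus, throughput in ((1, 1000), (8, 5000)):
    per_gpu = tokens_per_second_per_gpu(throughput, num_gpus)
    cost = cost_per_million_tokens(ASSUMED_HOURLY_PRICE, num_gpus, throughput)
    print(f"{num_gpus} GPU(s): {per_gpu:.0f} tokens/s/GPU, {cost:.2f} per million tokens")
```

In this made-up case total throughput rises fivefold, yet each token costs 60% more, because per-GPU efficiency fell from 1,000 to 625 tokens per second.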
One distinction sits underneath all of this: prefill and decode behave like different workloads. Prefill (processing the input prompt) is compute-bound; decode (generating tokens one at a time) is memory-bandwidth-bound and largely sequential. Modern serving frameworks like vLLM, TensorRT-LLM and SGLang increasingly disaggregate the two onto different GPU pools. The time-to-first-token versus throughput trade-off is, fundamentally, this distinction.
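A back-of-envelope model makes the split concrete. The sketch below assumes roughly two floating-point operations per parameter per token for prefill, and one full read of the weights per generated token for decode at batch size one. It ignores attention cost, KV cache reads and inter-GPU communication. The hardware figures are round-number placeholders for the aggregate capability of a GPU group, and the efficiency factors are guesses, not measurements.

```python
def prefill_seconds(num_params, prompt_tokens, flops_per_s, efficiency=0.5):
    """Compute-bound estimate: about 2 FLOPs per parameter per prompt token."""
    total_flops = 2 * num_params * prompt_tokens
    return total_flops / (flops_per_s * efficiency)


def decode_seconds_per_token(num_params, bytes_per_param, mem_bytes_per_s, efficiency=0.7):
    """Memory-bound estimate: every weight is read once per generated token."""
    return num_params * bytes_per_param / (mem_bytes_per_s * efficiency)


ASSUMED_FLOPS = 1e15   # placeholder peak compute, FLOP/s
ASSUMED_MEM_BW = 3e12  # placeholder memory bandwidth, bytes/s
PARAMS = 70e9          # a 70B-class model held in 16-bit weights

ttft = prefill_seconds(PARAMS, 2048, ASSUMED_FLOPS)
per_token = decode_seconds_per_token(PARAMS, 2, ASSUMED_MEM_BW)
print(f"prefill of 2,048 tokens: ~{ttft * 1e3:.0f} ms (sets time-to-first-token)")
print(f"decode: ~{per_token * 1e3:.0f} ms per token (sets per-sequence generation speed)")
```

Faster arithmetic shortens prefill but leaves the decode estimate unchanged, and more memory bandwidth does the opposite, which is the reasoning behind serving the two phases from separate pools.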
The Hidden Problem: Performance Variance You Can’t Reproduce
Here’s something that rarely appears in infrastructure procurement discussions: multi-GPU inference on shared infrastructure introduces variance that makes it very hard to trust your own benchmarks.
On a multi-tenant cluster, your GPUs are physically dedicated but the network fabric typically isn’t. Other tenants’ traffic affects your all-reduce latency. The result is that the same model, the same query, the same batch size, on the same cluster, produces meaningfully different latency numbers on different days, sometimes within the same hour.
This matters more for inference than training. Training can tolerate some variance because you’re measuring aggregate throughput over long runs, and outliers average out. Inference SLAs can’t absorb that variance. If your p99 latency is 450ms one day and 900ms the next, you have an SLA problem that doesn’t have a software fix.
The right question: On this infrastructure, is our end-to-end latency deterministic across days and time of day? Can we reproduce our acceptance testing results after go-live? If the answer involves the words “approximately” or “typically”, the variance is already in.
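One way to make that question testable is to record latency samples for the same workload on different days and compare the p99 values directly. The sketch below uses a nearest-rank percentile and synthetic samples invented to mirror the 450ms versus 900ms case above. The function names and the data are illustrative only.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_drift(runs):
    """Return (worst p99 / best p99, p99 per run) for a dict of label -> latency samples."""
    p99s = {label: percentile(samples, 99) for label, samples in runs.items()}
    return max(p99s.values()) / min(p99s.values()), p99s


# Synthetic latencies in milliseconds: identical medians, very different tails.
runs = {
    "commissioning": [300] * 98 + [450, 460],
    "two weeks later": [300] * 98 + [900, 950],
}

drift, p99s = p99_drift(runs)
print(p99s)
print(f"p99 drift between runs: {drift:.1f}x")
```

A median-only comparison would report these two runs as identical. Tracking the drift ratio across days and times of day is what shows whether an acceptance-test result can be planned against.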
This is one of the core arguments for dedicated allocation in production multi-GPU inference environments. When your model weights, VRAM and network fabric are not shared with any other tenant, the performance floor doesn’t shift. Hyperstack’s Secure Private Cloud provides exactly that: fully reserved GPU, CPU, memory and networking resources per customer with no oversubscription across tenants. That means the benchmark you ran during commissioning is the performance your team can plan against, sprint by sprint, not just on a good day.
Multi-GPU Latency Optimisation Starts with the Right Questions
Most conversations about multi-GPU latency optimisation focus on the model level: quantisation, KV cache compression and continuous batching. These are important. They’re also downstream of infrastructure decisions that are much harder to change after the fact.
Before you get to model-level tuning, there are infrastructure-level questions that determine the ceiling:
- What is the topology of the GPU cluster? Eight H100s connected via NVLink/NVSwitch have GPU-to-GPU communication bandwidth on the order of hundreds of GB/s per GPU (NVIDIA specifies up to 900 GB/s of bidirectional NVLink bandwidth per H100 SXM GPU), enough to run tensor-parallel inference on a 70B-class model like Llama 3 70B with a viable time-to-first-token. Eight GPUs spread across four dual-GPU nodes connected via Ethernet behave like a different system entirely, even at the same nominal GPU count. The topology determines which parallelism strategies are viable and what the practical throughput ceiling is before model-level optimisation begins.
- What networking fabric is between nodes, and was it selected for this workload’s scale? RoCE (RDMA over Converged Ethernet) and InfiniBand have different performance characteristics and operational tradeoffs. RoCE makes sense when you want RDMA performance with Ethernet operational familiarity. InfiniBand is typically chosen when workload scale requires a dedicated low-latency fabric and the performance/cost tradeoff justifies it. If your infrastructure team selected the fabric without reference to your specific model size and parallelism strategy, that’s worth revisiting before you scale.
- What is the storage configuration, and can it support the checkpoint and KV cache access patterns of your inference workload? Inference pipelines that manage large KV caches or operate with frequent model-swapping need storage that can sustain high-throughput access without stalling the GPU pipeline. Local NVMe and persistent shared storage serve different parts of this problem.
- What does “high availability” mean for this deployment, specifically? For inference workloads, an unplanned GPU failure mid-request is not the same problem as an infrastructure outage. The operational model needs to account for both, with different response characteristics.
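The sizing side of these questions can be checked with arithmetic before any hardware is commissioned. The sketch below estimates the minimum number of GPUs needed just to hold the weights, and the KV cache footprint per token. The model shape (80 layers, 8 key-value heads, head dimension 128) is an assumption about a 70B-class model with grouped-query attention, and the 90% usable-memory fraction is a guess. Check both against your model's published configuration and your serving framework.

```python
import math


def weight_bytes(num_params, bytes_per_param=2):
    """Memory needed for the weights alone (16-bit values by default)."""
    return num_params * bytes_per_param


def min_gpus_for_weights(num_params, gpu_memory_bytes, bytes_per_param=2, usable_fraction=0.9):
    """Smallest GPU count whose combined usable memory holds the weights."""
    usable = gpu_memory_bytes * usable_fraction
    return math.ceil(weight_bytes(num_params, bytes_per_param) / usable)


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values, for every layer, for a single token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


NUM_PARAMS = 70_000_000_000
GPU_MEMORY = 80_000_000_000  # an 80GB GPU

gpus = min_gpus_for_weights(NUM_PARAMS, GPU_MEMORY)
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
batch_cache = per_token * 8192 * 32  # 32 concurrent sequences of 8,192 tokens each

print(f"weights: {weight_bytes(NUM_PARAMS) / 1e9:.0f} GB, at least {gpus} GPUs before any KV cache")
print(f"KV cache: {per_token:,} bytes per token, {batch_cache / 1e9:.1f} GB for the assumed batch")
```

Under these assumptions the KV cache for a modest batch of long sequences is of the same order as the weights themselves, which is why memory capacity, topology and storage have to be sized together.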
Getting Multi-GPU Inference Right in Regulated Environments
For engineering teams running AI in regulated industries (financial services, healthcare, public sector), multi-GPU inference surfaces a set of constraints that don’t appear in most infrastructure playbooks.
Data residency requirements can constrain which regions and facilities your GPU cluster can be deployed in. If the model handles regulated data at inference time, the inference infrastructure itself may be in scope for audit. Shared-tenancy networking, where your inference traffic shares a fabric with other customers, introduces questions about data isolation that InfoSec teams will raise and that have no clean answer on public infrastructure.
The Questions Worth Asking Before Deployment
- What is the tenancy model of the network fabric, not just the compute?
- Is our inference traffic logically or physically isolated from other customers?
- Can we produce audit evidence of access controls applied at the infrastructure level?
For organisations where these questions are gating factors, Hyperstack’s Secure Private Cloud is built to address them directly. It’s a single-tenant deployment with dedicated infrastructure, a segregated network fabric and controlled access defined as part of the build, delivered as a bespoke environment and not via a self-serve UI. For teams deploying under DORA, UK PRA SS2/21 or EU AI Act high-risk requirements, the deployment model aligns to how regulated environments are expected to be commissioned, governed and audited.
The Questions That Determine Whether You Get This Right
Multi-GPU inference done well is an engineering discipline, not a configuration exercise. The teams that get it right are the ones that ask infrastructure-level questions before they commit to a model serving architecture, not after they’ve discovered that the p99 latency profile doesn’t hold under production load.
The short version of those questions:
- What is the GPU-to-GPU interconnect bandwidth, and which parallelism strategies does it support at our model size?
- Is the performance we measure during commissioning reproducible under production conditions, at the same time of day, on different days?
- Is the network fabric between nodes dedicated to our workload, or shared?
- Does our storage configuration support the access patterns of inference at scale: KV cache, checkpointing, model artefacts?
- For regulated workloads, is the infrastructure's tenancy model auditable, and does it meet our InfoSec review criteria?
If your infrastructure team can answer all of these with specifics and not with “approximately” or “typically”, you’re in a position to get multi-GPU inference right. If not, the architecture review should happen before deployment, not during an incident.
Run Regulated AI Workloads on Purpose-Built Infrastructure
Hyperstack’s Secure Private Cloud is a single-tenant, dedicated GPU environment built for production AI in regulated industries, with deterministic performance, controlled network fabric and an operations model that answers the questions above with contractual commitments, not estimates.
FAQs
What is multi-GPU inference and why does it matter?
Multi-GPU inference is the practice of distributing a large language model across multiple GPUs to run inference at scale. It matters because modern LLMs often exceed the VRAM capacity of a single GPU, making multi-GPU setups a production necessity rather than an optimisation choice.
Why is my latency worse after adding more GPUs?
Adding GPUs introduces inter-device communication overhead. Every forward pass requires synchronisation between GPUs, exchanging activations, attention outputs and, depending on the parallelism strategy, KV cache data. If the interconnect bandwidth is insufficient for the model's communication pattern, latency increases rather than decreases.
What is tensor parallelism vs pipeline parallelism in LLM inference?
Tensor parallelism splits individual matrix operations across GPUs and requires high-frequency, high-bandwidth synchronisation, making it best for low time-to-first-token on fast interconnects. Pipeline parallelism distributes entire model layers across GPUs, communicating less often but in larger chunks, which suits throughput on long sequences but is slower to the first token.
What causes p99 latency spikes in multi-GPU inference?
On shared infrastructure, the network fabric between GPUs is typically multi-tenant. Other workloads competing for bandwidth cause all-reduce latency to vary, producing p99 spikes that cannot be fixed in software. Dedicated, single-tenant networking is the most reliable way to remove this source of variance and stabilise tail latency.
How do I run LLM inference in a regulated industry like finance or healthcare?
Regulated environments require data residency compliance, physically or logically isolated network fabrics, and auditable access controls at the infrastructure level. Public cloud shared-tenancy models often cannot satisfy these requirements. Single-tenant, dedicated GPU infrastructure built to standards such as DORA or the EU AI Act is the appropriate deployment model.
What is a single-tenant GPU cloud and when do I need one?
A single-tenant GPU cloud provides fully reserved compute, memory, and networking resources with no sharing across customers. You need one when your workload requires deterministic, reproducible latency, when regulated data is processed at inference time, or when your InfoSec team requires auditable network isolation as a condition of deployment.