Is inference slowing you down or costing more than it should?
As models grow larger, inference becomes harder to optimise, and it's where many teams hit their biggest bottlenecks. Whether you're deploying in production or running experiments in research, an inefficient inference stack quickly translates into high latency, rising costs and a poor user experience.
The right GPU can change that.
In this blog, we compare two of the most popular GPUs for LLM workloads: the NVIDIA A100 NVLink and the NVIDIA H100 SXM5. We ran benchmarks with vLLM, a high-performance inference engine built for high throughput and low latency, on Hyperstack's cloud GPU platform.
We ran an in-house benchmark on Hyperstack using vLLM’s official benchmarking suite, simulating real-world inference workloads. Here’s what the setup looked like:
Model: Meta Llama 3.1 70B
Batch Size: 64
Max Model Length: 4096 tokens
Deployment Environment: Hyperstack VMs (NVIDIA A100 NVLink and NVIDIA H100 SXM5)
The focus was on token throughput, which directly affects response time and user experience.
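For reference, the sketch below shows one way to reproduce a comparable throughput measurement with vLLM's offline Python API (the actual benchmark used vLLM's official benchmarking suite). The model ID, prompt set, output length and tensor_parallel_size are illustrative assumptions you would adjust to your own VM.

```python
import time
from vllm import LLM, SamplingParams

# Illustrative prompt batch of 64 requests; the official suite generates its own request mix.
prompts = ["Explain the benefits of GPU acceleration for LLM inference."] * 64

# max_tokens is an assumption; the setup above only fixes the maximum model length.
sampling_params = SamplingParams(temperature=0.8, max_tokens=512)

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed Hugging Face model ID
    max_model_len=4096,                         # matches the benchmark setup
    tensor_parallel_size=4,                     # assumption: set to the number of GPUs in your VM
)

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.perf_counter() - start

# Count only generated (output) tokens, since the results below report tokens generated per second.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Generated {generated} tokens in {elapsed:.1f}s ({generated / elapsed:.0f} tokens/s)")
```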
The NVIDIA H100 SXM5 outperformed the NVIDIA A100 NVLink by roughly 2.8x, generating 3,311 tokens per second compared to 1,148 tokens per second. While the NVIDIA H100 SXM5 is only about 1.7x more expensive, it delivers significantly higher cost-efficiency for inference tasks.
That means you get:
Lower latency for interactive applications
Higher throughput for batch inference pipelines
Better ROI for AI teams deploying LLMs at scale
If you’re looking to accelerate inference while keeping costs under control, the NVIDIA H100 SXM5 is the clear choice. With 2.8x the performance at only 1.7x the cost, it delivers more value per token than the NVIDIA A100 NVLink when deployed on our platform optimised for LLM workloads at scale.
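As a quick sanity check on the value-per-token claim, here is a back-of-the-envelope calculation using the throughput figures from this benchmark. The H100 SXM5 price is Hyperstack's $2.40/hr; the A100 NVLink price is not stated in this post, so it is approximated from the ~1.7x price ratio.

```python
# Rough cost per million generated tokens, based on the benchmark figures above.
H100_TOKENS_PER_SEC = 3311
A100_TOKENS_PER_SEC = 1148

H100_PRICE_PER_HR = 2.40        # Hyperstack on-demand price quoted in this post
A100_PRICE_PER_HR = 2.40 / 1.7  # assumption: derived from the stated ~1.7x price gap

def cost_per_million_tokens(price_per_hr: float, tokens_per_sec: float) -> float:
    """Dollars per one million generated tokens at sustained throughput."""
    return price_per_hr / (tokens_per_sec * 3600) * 1_000_000

print(f"H100 SXM5:   ${cost_per_million_tokens(H100_PRICE_PER_HR, H100_TOKENS_PER_SEC):.2f} / 1M tokens")
print(f"A100 NVLink: ${cost_per_million_tokens(A100_PRICE_PER_HR, A100_TOKENS_PER_SEC):.2f} / 1M tokens")
```

On these numbers the H100 SXM5 comes out at roughly $0.20 per million generated tokens against roughly $0.34 for the A100 NVLink, which is where the "more value per token" conclusion comes from.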
Accelerate LLM Inference on Hyperstack
Run your LLM workloads with NVIDIA H100 SXM GPUs on Hyperstack, starting at $2.40/hr.
FAQ
What setup was used for the benchmark?
The benchmark was run using the Meta Llama 3.1 70B model with a batch size of 64 and a maximum model length of 4096 tokens. The official vLLM benchmarking suite was used for consistency.
How much faster is the NVIDIA H100 SXM5 than the NVIDIA A100 NVLink for LLM inference?
The NVIDIA H100 SXM5 delivers approximately 2.8 times more inference throughput, generating 3,311 tokens per second compared to 1,148 tokens per second on the NVIDIA A100 NVLink.
Is the NVIDIA H100 SXM5 worth the higher cost?
Yes, the NVIDIA H100 SXM5 is only about 1.7 times more expensive but provides around 2.8 times the throughput, making it more cost-effective for LLM inference workloads.
How do I get started with the NVIDIA H100 SXM5 on Hyperstack?
You can easily deploy and run inference workloads on Hyperstack's platform using NVIDIA H100 SXM5 GPUs on-demand. Visit our console here to log in and get started with our high-performance cloud GPU platform.
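If you serve the model with vLLM's OpenAI-compatible server (for example via `vllm serve`) on your Hyperstack VM, you can query it from any OpenAI client. A minimal sketch is below; the VM address is a placeholder, and it assumes the server was started on the default port (8000) with the same Llama 3.1 70B model.

```python
from openai import OpenAI

# Placeholder address for the Hyperstack VM running the vLLM server; no real API key
# is required unless one was configured when starting the server.
client = OpenAI(base_url="http://<your-vm-ip>:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",  # must match the model the server loaded
    messages=[{"role": "user", "content": "Name three use cases for low-latency LLM inference."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```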
What is the cost of NVIDIA H100 SXM on Hyperstack?
You can deploy the powerful NVIDIA H100 SXM5 GPU on-demand for $2.40/hr on Hyperstack.