LLM Inference Benchmark: Comparing NVIDIA A100 NVLink vs NVIDIA H100 SXM

Written by Damanpreet Kaur Vohra | May 20, 2025 8:25:17 AM

Is inference slowing you down or costing more than it should?

As models grow larger, inference becomes harder to optimise. It’s where many teams hit their biggest bottlenecks. Whether you're deploying in production or fine-tuning in research, delays and inefficiencies can lead to high latency, rising costs and a poor user experience.

The right GPU can change that.

In this blog, we compare two of the most popular GPUs for LLM workloads: the NVIDIA A100 NVLink and the NVIDIA H100 SXM5. We ran benchmarks with vLLM, a high-performance inference engine built for high throughput and low latency, on Hyperstack's GPU cloud.

Benchmark Setup

We ran an in-house benchmark on Hyperstack using vLLM’s official benchmarking suite, simulating real-world inference workloads. Here’s what the setup looked like:

  • Model: Meta Llama 3.1 70B

  • Batch Size: 64

  • Max Model Length: 4096 tokens

  • Deployment Environment: Hyperstack VMs (NVIDIA A100 NVLink and NVIDIA H100 SXM5)

The focus was on token throughput, which directly affects response time and user experience.
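
For readers who want to reproduce a comparable measurement, below is a minimal sketch using vLLM's offline Python API. The model id, tensor parallelism and prompt set are illustrative assumptions, not the exact configuration or script used in our benchmark.

    import time
    from vllm import LLM, SamplingParams

    # Assumed Hugging Face model id and parallelism; Meta Llama 3.1 70B needs
    # multiple GPUs, so tensor_parallel_size is set to the GPUs in one VM.
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",
        tensor_parallel_size=8,
        max_model_len=4096,
    )

    # 64 identical prompts stand in for a batch of real requests.
    prompts = ["Summarise the benefits of GPU-accelerated inference."] * 64
    sampling = SamplingParams(temperature=0.7, max_tokens=256)

    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling)
    elapsed = time.perf_counter() - start

    # Count generated tokens across all requests to get output tokens/second.
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"Throughput: {generated / elapsed:.0f} output tokens/s")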

Inference Throughput Comparison

Which GPU Delivered Higher Throughput?

The NVIDIA H100 SXM5 outperformed the NVIDIA A100 NVLink by roughly 2.8x in tokens generated per second: 3311 tokens/s versus 1148 tokens/s. While the NVIDIA H100 SXM5 is only about 1.7x more expensive, it delivers significantly better cost-efficiency for inference tasks.

That means you get:

  • Lower latency for interactive applications

  • Higher throughput for batch inference pipelines

  • Better ROI for AI teams deploying LLMs at scale
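
To make the cost claim concrete, here is a quick back-of-the-envelope check in Python using the throughput figures from this benchmark and the on-demand H100 price quoted below. The A100 hourly rate is derived from the stated ~1.7x price ratio rather than quoted directly.

    # Tokens generated per dollar of GPU time, using the numbers in this post.
    h100_tokens_per_s = 3311
    a100_tokens_per_s = 1148

    h100_price_per_hr = 2.40          # Hyperstack on-demand NVIDIA H100 SXM rate
    a100_price_per_hr = 2.40 / 1.7    # implied by the ~1.7x price difference

    def tokens_per_dollar(tokens_per_s, price_per_hr):
        # Seconds per hour divided by the hourly price converts tokens/s to tokens/$.
        return tokens_per_s * 3600 / price_per_hr

    print(f"H100 SXM5:   {tokens_per_dollar(h100_tokens_per_s, h100_price_per_hr):,.0f} tokens per dollar")
    print(f"A100 NVLink: {tokens_per_dollar(a100_tokens_per_s, a100_price_per_hr):,.0f} tokens per dollar")
    # Under these assumptions the H100 SXM5 delivers roughly 1.7x more tokens per dollar.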

Conclusion

If you’re looking to accelerate inference while keeping costs under control, the NVIDIA H100 SXM5 is the clear choice. With 2.8x the performance at only 1.7x the cost, it delivers more value per token than the NVIDIA A100 NVLink when deployed on our platform optimised for LLM workloads at scale.

Accelerate LLM Inference on Hyperstack

Run your LLM workloads with NVIDIA H100 SXM GPUs on Hyperstack, starting at $2.40/hr.

FAQs

What model and settings were used for this benchmark?

The benchmark was run using the Meta Llama 3.1 70B model with a batch size of 64 and a maximum model length of 4096 tokens, using vLLM's official benchmarking suite for consistency.

How much faster is the NVIDIA H100 SXM5 compared to the NVIDIA A100 NVLink?

The NVIDIA H100 SXM5 delivers approximately 2.8 times more inference throughput, generating 3311 tokens per second compared to 1148 tokens per second on the NVIDIA A100 NVLink.

Is the performance of the NVIDIA H100 SXM5 worth the cost?

Yes, the NVIDIA H100 SXM5 is only about 1.7 times more expensive but provides 2.8 times the throughput, making it more cost-effective for LLM inference workloads.

How can I run my inference workloads on Hyperstack using the NVIDIA H100 SXM5?

You can easily deploy and run inference workloads on Hyperstack's platform, using NVIDIA H100 SXM5 GPUs on-demand. Log in to the Hyperstack console to get started with our high-performance cloud GPU platform.
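
As a hypothetical illustration of what serving looks like once a VM is up, the snippet below queries a model served by vLLM's OpenAI-compatible server running on an H100 SXM VM. The VM address and model id are placeholders for your own deployment.

    from openai import OpenAI

    # Point the standard OpenAI client at the vLLM server running on your VM.
    # Replace <your-vm-ip> with the VM's public address; 8000 is vLLM's default port.
    client = OpenAI(
        base_url="http://<your-vm-ip>:8000/v1",
        api_key="EMPTY",  # vLLM accepts any key unless an API key is configured
    )

    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",  # must match the model vLLM loaded
        messages=[{"role": "user", "content": "Give one tip for reducing LLM inference latency."}],
        max_tokens=128,
    )
    print(response.choices[0].message.content)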

What is the cost of NVIDIA H100 SXM on Hyperstack?

You can deploy the powerful NVIDIA H100 SXM5 GPU on-demand for $2.40/hr on Hyperstack.