Key Takeaways
- Picking the wrong GPU can waste 10× your budget.
- Quantisation (INT8/4) is the highest-impact, lowest-effort optimisation available: it typically delivers 2–4× throughput improvement with less than 1% accuracy loss.
- Dynamic and continuous batching can lift GPU utilisation from 20–30% to 70–85%, cutting cost per token by 2–5× with no hardware changes.
- KV-cache management is critical for long-context workloads - without it, VRAM fills fast, forcing extra GPUs and inflating cost.
- Moving from a static GPU pool to an elastic, autoscaled one cuts compute spend by 50–70%; Hyperstack supports VM provisioning in under one minute.
- Cost per million tokens is the only metric that matters. NVIDIA H200 GPUs cost more per hour than NVIDIA H100s but can deliver up to 35× more tokens per watt, making them significantly cheaper in practice.
Cost-efficient LLM inference on Hyperstack or any GPU cloud means maximising tokens per dollar by optimising every layer of the stack, from model and runtime to VM selection and orchestration.
In practice, this means choosing the right GPU size, applying quantisation, optimising KV cache usage, using modern runtimes like TensorRT-LLM, vLLM, and TGI, batching effectively, autoscaling dynamically, and tracking the right metrics. Together, these can deliver 2× to 10× cost savings in real deployments.
The table below shows the highest-impact techniques you can apply to run cost-effective inference workloads.
| Strategy | Impact (efficiency gain) | Effort / Complexity |
|---|---|---|
| Quantisation (INT8/4) | 2–4× throughput increase | Low–Medium (many libs available) |
| Distillation / Pruning | 2–10× model size reduction | High (requires retraining/tuning) |
| Dynamic Batching | 2–5× throughput (vs per-request) | Medium (configurable in Triton / vLLM) |
| Token-Level Scheduling | Up to ~9× throughput (Alibaba Aegaeon) | High (complex custom schedulers) |
| Speculative Decoding | 2–3× generation speedup | Medium (needs auxiliary models) |
| KV-Cache Offloading | ≈20% cost cut for long contexts | Medium (runtime support needed) |
| vLLM (PagedAttention) | 2–4× throughput vs TorchServe | Low (use open-source vLLM) |
| TensorRT-LLM (FP8) | ~3–4.6× speedup on H100 vs A100 | High (NVIDIA-specific setup) |
| Right-Sizing GPUs (H100/H200) | Up to 35× cost improvement | Low (benchmark + selection) |
| MIG / GPU Sharing | ~25–80% better utilisation (varies) | Medium (infrastructure support) |
| Autoscaling / Scale-to-Zero | ~50–70% cost reduction vs static | Medium (infra/config needed) |
1. Right-Size GPU & Virtual Machine Selection
Don't overprovision. Identify the smallest GPU/VM that meets your latency and throughput SLA. Benchmarks and profiling are key: measure your model's VRAM footprint, maximum batch size, and tokens/sec on candidate VMs before choosing. If your model fits with headroom on an NVIDIA A100 or NVIDIA H100, there is no need to jump to a larger, much pricier card. Analyses show that picking the wrong GPU can waste 10× your budget.
- Memory headroom: Make sure your VM has enough free VRAM for KV-cache and batching. If your context window or batch pushes one GPU to the limit, you will need more GPUs — or to upgrade from an H100 to an H200 — just to fit the model, and pay for that extra hardware. If an H100 handles your workload easily, using an H200 or multiple GPUs raises the cost per token significantly.
- Compute vs bandwidth: Many LLM tasks are memory-bandwidth-limited, especially decoding. If your workload is near memory capacity on an NVIDIA H100, a larger NVIDIA H200 (141 GB) may be worth it. For smaller models, a cheaper A100 or L40S can have a lower cost per token - Spheron found an NVIDIA A100 beat an NVIDIA H100 on cost per token for a 17B MoE model.
- VM pricing: Always factor actual VM pricing - on-demand versus reserved versus Spot - into the decision. A 2× cheaper VM that delivers 1.5× the throughput is often the best buy. You can use Hyperstack's live pricing APIs to compare tokens per dollar across VM types. Hyperstack owns its hardware in key locations while maintaining global availability, which is what enables this pricing transparency and flexibility — and avoids the margin stacking typical of reseller clouds.
- Benchmark your workload: The only way to know is to test your model on real traffic. Measure tokens/sec at target latencies on each candidate GPU, then compute $/token = (GPU $/hr) / (tokens/sec × 3600). Vendor FLOPS claims are meaningless in isolation - actual tokens output per dollar is what counts.
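That $/token formula is easy to turn into a comparison script. In the sketch below, the GPU labels, hourly rates, and throughput figures are illustrative placeholders, not real Hyperstack prices or benchmark results - substitute your own numbers:

```python
# Compare candidate GPUs by cost per million tokens, not by hourly rate.
# All prices and throughputs are made-up placeholders: plug in your own
# benchmark results and live VM pricing.

def cost_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    """$/Mtok = ($/hr) / (tokens/hr) * 1e6."""
    tokens_per_hour = tokens_per_sec * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000

candidates = {
    "gpu_a": {"usd_per_hour": 1.50, "tokens_per_sec": 1200},  # hypothetical
    "gpu_b": {"usd_per_hour": 3.00, "tokens_per_sec": 3000},  # hypothetical
}

for name, c in candidates.items():
    print(name, round(cost_per_million_tokens(**c), 3))
```

Note how the toy numbers make the article's point: the VM that costs 2× per hour but delivers 2.5× the throughput is the cheaper one per token.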
2. Aggressive Model Compression
Shrinking your model directly cuts inference cost. Modern techniques allow drastic compression with minimal accuracy loss:
- Quantisation (INT8/4, FP4): Convert weights and activations to lower precision. Moving from FP16 to INT8 or INT4 (AWQ/GPTQ methods) typically yields 2–4× throughput on the same GPU. NVIDIA H100 and NVIDIA H200 support FP8/FP4 natively. NVIDIA reports H100 (FP8) achieving 3–4.6× the throughput of NVIDIA A100 (FP16) on a 6B LLM. State-of-the-art 4-bit methods keep accuracy within ~1% of full precision.
- Pruning: Remove unneeded weights or neurons. Structured pruning can cut a model's size by 30–60% while preserving 98–99% of accuracy, directly lowering compute and memory needs.
- Distillation: Train a smaller student model to mimic a large teacher. The result can be 5–10× smaller while retaining ~95% of the original's performance. Distilled models are particularly well-suited for ultra-low latency or on-device inference.
Implementation & Tooling
Quantisation tools are widely accessible through industry-standard libraries like Hugging Face Transformers, bitsandbytes, and the Intel Neural Compressor.
For immediate results, you can source off-the-shelf 4-bit models directly from Hugging Face for instant deployment. While more advanced techniques like pruning and distillation require higher engineering effort, they provide significant long-term throughput benefits.
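To build intuition for why low-bit quantisation costs so little accuracy, here is a minimal, dependency-free sketch of symmetric per-tensor INT8 quantisation. Production toolchains (bitsandbytes, AWQ, GPTQ) are far more sophisticated, with per-channel scales, calibration data, and outlier handling, so treat this purely as an illustration of the core idea:

```python
# Symmetric per-tensor INT8 quantisation: w_q = round(w / scale), clipped
# to [-127, 127], with scale = max(|w|) / 127. Dequantise as w_q * scale.

def quantise_int8(weights):
    scale = max(abs(w) for w in weights) / 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantise(q, scale):
    return [x * scale for x in q]

weights = [0.42, -1.27, 0.003, 0.89, -0.51]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

# Round-trip error is bounded by half a quantisation step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.5f}  max round-trip error={max_err:.5f}")
```

The round-trip error never exceeds half a quantisation step, which is why well-scaled INT8 weights reproduce the original model so closely - the hard part in practice is choosing scales that keep outlier weights and activations inside that bound.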
3. KV-Cache and Memory Strategies
Long-context inference is often memory-bound. Every decoded token adds keys and values to the cache, and each step re-reads that KV cache from GPU memory. If your workload involves long contexts, unoptimised caches can waste gigabytes of VRAM - limiting your batch size and forcing extra GPUs. Here are the key tactics:
- KV-Cache Offloading: Move part of the KV cache to CPU RAM or fast SSD during decoding. vLLM, TensorRT-LLM, and HF TGI all support this. One study found enabling KV offloading saved ~20% of cost in offline batch workloads; for online workloads, it enabled the use of cheaper VMs, which in some cases reduced costs by up to ~74%. Hyperstack VMs have ample host RAM, so make sure the serving runtime is configured to spill cache to it when needed.
- PagedAttention / vLLM: The vLLM engine virtualises KV-cache usage so no memory is wasted on fragmentation, achieving near-zero KV duplication and flexible sharing between requests. In practice, vLLM boosts throughput 2–4× over naive servers while supporting much longer sequences.
- Sequence Chunking / Key Sharing: For multi-turn scenarios, reuse KV cache where possible or encode common prompt prefixes once. Some frameworks support prefix caching to reuse work across requests.
- Disaggregated Memory: Cutting-edge research (e.g. NVIDIA Dynamo) explores splitting prefill and decoding across nodes. Hyperstack's SXM-based GPUs (H100/H200 SXM) have NVLink (600 GB/s), which benefits multi-GPU caching compared to PCIe.
Insight: If your LLM's context often exceeds ~8K tokens or your GPU VRAM is fully consumed by KV cache at runtime, prioritise KV optimisation. Without it, you will require significantly more GPUs just to fit the working set—causing infrastructure costs to scale rapidly.
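That memory pressure can be estimated directly: per token, the cache stores two tensors (K and V) × layers × KV heads × head dimension × bytes per element. The model dimensions below are illustrative 70B-class values with grouped-query attention, not any specific model's published configuration:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV-cache size in GiB: 2 (K and V) * layers * kv_heads * head_dim * tokens."""
    bytes_total = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem
    return bytes_total / 2**30

# Illustrative 70B-class dimensions: 80 layers, 8 KV heads of dim 128,
# FP16 cache (2 bytes/element), 8K context, batch of 32 concurrent requests.
print(f"{kv_cache_gib(80, 8, 128, 8192, 32):.1f} GiB")
```

With these toy numbers the cache alone reaches 80 GiB - an entire H100's VRAM before counting the weights - which is exactly why offloading, paging, and prefix reuse matter for long contexts.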
4. Optimised Inference Runtimes
The serving framework you choose can drastically affect your cost. Modern LLM servers include batching and kernel optimisations out of the box, and picking the wrong one means leaving money on the table:
- TensorRT-LLM (with Triton): NVIDIA's open-source engine, tuned for NVIDIA H100/H200. Supports FP8, in-flight batching, prefix caching, and kernel fusions. H100 FP8 achieves ~3–4.6× the throughput of NVIDIA A100 FP16 on the same model. Use it when ultra-low latency on NVIDIA hardware is required - setup is more complex but yields maximum performance.
- Hugging Face Text Generation Inference (TGI): A robust, HF-maintained server with easy Docker deployment. TGI v3 adds multi-token caching and is 13× faster on long prompts. A solid default if you value ease of deployment and multi-model support.
- vLLM: Focused on continuous token-level batching and advanced memory paging. Delivers 2–4× throughput improvements for large models, integrates easily with Hugging Face models, and is particularly effective for high-concurrency or long-context workloads.
- TorchServe / Custom REST: For simpler use cases or vision models, Triton or TorchServe work. They support dynamic batching and containerisation but may not include the latest LLM optimisations. If using Triton, enable dynamic_batching in the config (e.g. max delay ~5 ms).
Key Point: A poorly chosen runtime wastes money. On NVIDIA, TensorRT-LLM generally wins for raw efficiency, while vLLM and TGI excel for flexibility and ease of deployment. Testing multiple engines and measuring performance is the only way to guarantee the best ROI.
5. Dynamic Batching & Token-Level Scheduling
Naive one-request-at-a-time serving leaves your GPUs mostly idle. Effective request batching and scheduling are among the most overlooked levers for cutting cost.
- Dynamic Batching (Request-level): Hold incoming requests up to a short timeout (a few ms) and batch them. This often moves GPU utilisation from ~20–30% to ~70–85%. Triton has built-in support - in config.pbtxt set dynamic_batching { preferred_batch_size: [4, 8] max_queue_delay_microseconds: 5000 }. Tune the max delay to match your SLA.
- Continuous (Token-Level) Batching: For autoregressive models, vLLM's continuous batching lets new requests join a running batch between decoding steps. NVIDIA's TensorRT-LLM uses in-flight batching to similar effect. Alibaba's Aegaeon used token-level scheduling across multiple LLMs and cut GPU usage by 82% while boosting throughput ~9×. Even at a smaller scale, token batching can improve throughput by 2–5×.
- Queue Management: Implement a queue or proxy in front of your GPUs to shape arrival rates and improve batch fill. Pick a strategy per workload: micro-batches (2–8) for chatbots, full batches (dozens) for async tasks.
- CUDA Streams and Overlap: Overlap data transfers with compute and enable TensorRT's CUDA Graphs (--useCudaGraph) to reduce kernel launch overhead. These performance gains come at no additional infrastructure cost.
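Request-level batching can be sketched as a toy collector that flushes when the batch is full or the oldest request has exceeded the queue delay, mirroring the preferred_batch_size / max_queue_delay knobs described above. Real servers like Triton do this asynchronously on the hot path; the class and parameter names here are hypothetical:

```python
import time
from collections import deque

class DynamicBatcher:
    """Toy request-level batcher: flush when the batch is full or the
    oldest queued request has waited past the deadline."""

    def __init__(self, max_batch=8, max_delay_s=0.005):
        self.max_batch = max_batch
        self.max_delay_s = max_delay_s
        self.queue = deque()  # (arrival_time, request) pairs

    def submit(self, request, now=None):
        self.queue.append((now if now is not None else time.monotonic(), request))

    def maybe_flush(self, now=None):
        """Return the next batch to run on the GPU, or None to keep waiting."""
        if not self.queue:
            return None
        now = now if now is not None else time.monotonic()
        oldest_arrival = self.queue[0][0]
        if len(self.queue) >= self.max_batch or now - oldest_arrival >= self.max_delay_s:
            n = min(len(self.queue), self.max_batch)
            return [self.queue.popleft()[1] for _ in range(n)]
        return None

b = DynamicBatcher(max_batch=4, max_delay_s=0.005)
for i in range(6):
    b.submit(f"req{i}", now=0.0)
print(b.maybe_flush(now=0.0))    # batch full -> first four requests
print(b.maybe_flush(now=0.0))    # two left, deadline not hit -> None
print(b.maybe_flush(now=0.006))  # deadline passed -> remaining two
```

The two thresholds encode the latency/throughput trade-off directly: a longer delay fills bigger batches (better utilisation), a shorter one bounds queueing latency for your SLA.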
6. Multi-GPU & Multi-Tenancy
Use these techniques when you need to scale beyond one card or share GPUs across workloads:
- Data Parallel (Multi-GPU): Run N copies of the model on N GPUs, each handling ~1/N of requests. In Triton, set instance_group { count: 2, kind: KIND_GPU, gpus: [0, 1] } to spin up two instances on two GPUs. Near-linear throughput scaling applies until another bottleneck appears.
- Tensor Parallel / Pipeline Parallel: If a model is too large for one GPU, shard it. vLLM and DeepSpeed/Megatron handle tensor parallelism across cards. NVLink (SXM GPUs) is very beneficial here; PCIe-only GPUs may bottleneck on all-reduce bandwidth.
- NVIDIA MIG (Multi-Instance GPU): For multiple smaller models or tenants, MIG slices a single NVIDIA A100/H100 into 2–7 independent GPUs, each with guaranteed compute and memory - ideal for isolating workloads and enforcing SLAs.
- Fractional or Time-Sliced GPUs: Advanced schedulers time-share GPU threads across models. Alibaba's research suggests dynamic pooling outperforms static MIG slices for highly variable, multi-model traffic.
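The economics of MIG for many small models comes down to packing arithmetic. The seven 1g.10gb slices per H100 80GB used below follow NVIDIA's published MIG profiles; the per-model memory footprints are made up for illustration:

```python
# Packing many small models: one model per whole 80 GB GPU vs MIG slices.
# Seven 1g.10gb instances per H100 80GB matches NVIDIA's MIG profiles;
# the model memory footprints are illustrative.
import math

models_gib = [6, 6, 6, 6, 6, 6, 6]   # seven ~6 GiB quantised models
slice_gib, slices_per_gpu = 10, 7    # 1g.10gb profile on an H100 80GB

whole_gpus_needed = len(models_gib)  # naive: a dedicated GPU per model
fits = all(m <= slice_gib for m in models_gib)  # each model fits a slice
mig_gpus_needed = math.ceil(len(models_gib) / slices_per_gpu) if fits else None

print(whole_gpus_needed, "GPUs naively vs", mig_gpus_needed, "with MIG")
```

Seven dedicated GPUs collapse to one MIG-sliced card here, with each tenant keeping guaranteed compute and memory. The trade-off is that MIG slices are static; as noted above, dynamic pooling can beat static slicing when traffic is highly variable.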
7. Auto-Scaling & Scale-to-Zero
Static, always-on GPUs waste your money when idle. Use cloud autoscaling to match your capacity to actual demand:
- Queue-Driven Scaling: Monitor request queue length or in-flight load, and spin GPU VMs up or down accordingly. Kubernetes tools (KEDA, Karpenter) autoscale pods or nodes on these metrics. Hyperstack supports fast VM provisioning (under one minute) and suspend/hibernate - idle VMs can be paused completely.
- Spot VMs: For non-critical batch workloads, Spot VMs on Hyperstack offer a 20% discount compared to on-demand pricing. Run nightly analytics or batch jobs on Spot VMs, then shut them down. Reserve on-demand VMs for the latency-critical hot path.
- Scale-to-Zero: For very intermittent traffic, consider a serverless model - wake GPU VMs only when queries arrive. Pre-warmed standby or hibernation can approximate scale-to-zero and avoid cold-start penalties.
- Elastic Workload Mix: Separate real-time from batch traffic. Keep a few GPUs hot for low-latency APIs, and use auto-scaled burst clusters for heavy off-peak inference. GMI Cloud handles ~10× peak spikes via elastic scaling.
Don't pay for idle GPUs. Moving from static to elastic GPU pools can cut costs ~50–70%. Invest effort in proper autoscaling policies and capacity-buffer thresholds.
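A queue-driven policy reduces to a small sizing function. In practice this logic lives inside KEDA or Karpenter rather than hand-rolled code, and the target and bounds below are illustrative, not recommended values:

```python
import math

def desired_replicas(queue_depth, target_per_replica=16,
                     min_replicas=1, max_replicas=8):
    """Size the GPU pool so each replica serves ~target_per_replica queued
    requests, clamped to [min_replicas, max_replicas]."""
    want = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, want))

print(desired_replicas(0))     # idle -> scale down to the floor (1)
print(desired_replicas(50))    # backlog of 50 -> ceil(50/16) = 4 replicas
print(desired_replicas(1000))  # traffic spike -> capped at max_replicas (8)
```

For true scale-to-zero you would set min_replicas to 0 and accept a cold start (or keep a pre-warmed standby, as described above); production policies also add cooldown windows so the pool does not thrash on bursty arrivals.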
8. Cost Metrics & Observability
Use tokens per dollar (or cost per million tokens) as your master metric. Instrument your system to expose GPU utilisation, tokens/sec, latencies (p50/p95/p99), and cost per 1K inferences or per million tokens.
- Monitoring Tools: Export metrics to Prometheus and Grafana - vLLM has Prometheus endpoints, Triton exposes Stats, and Hyperstack VM metrics are available. Key dashboards: GPU utilisation, queue depth, batch size, tokens/sec.
- Alerts & Dashboards: Set alerts on low utilisation (wasted resources) or rising cost per token. Track trends: is an optimisation actually lowering $/token?
- Token vs Hour: Cost per hour is misleading. NVIDIA showed that H200 GPUs were 2× the cost per hour of H100 but delivered ~35× more tokens per watt - yielding 17.5× lower cost per token. Plot cost per token, not rental rates. Hourly rates also hide billing extras — hyperscaler bills typically include ingress and egress charges that don't appear in the GPU rate, while Hyperstack charges none.
- Capacity Planning: Continuously benchmark new models and GPUs using MLPerf Inference or custom load tests (e.g. Triton Perf Analyser) to quantify throughput at scale.
- Cloud Billing Integration: In Hyperstack, enable detailed billing and cost management to attribute spend to inference jobs. This helps identify idle VMs or runaway jobs before costs compound.
Good observability closes the loop. Without it, you risk spending on optimisations that do not pay off. Cache hit-rate monitoring, for example, tells you whether a result cache is effective - or whether batching is actually happening as expected.
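As a concrete instance of hit-rate monitoring, this sketch wraps a stand-in for the model call in Python's functools.lru_cache and reads its built-in counters. A production result cache would sit in a shared store such as Redis, keyed on normalised prompts, with TTLs and privacy rules; the function here is hypothetical:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def answer(prompt: str) -> str:
    # Stand-in for an expensive model call; a real deployment would cache
    # in a shared store, key on normalised prompts, and respect TTLs.
    return prompt.upper()

for p in ["hi", "hi", "hello", "hi"]:
    answer(p)

info = answer.cache_info()                  # hits, misses, currsize
hit_rate = info.hits / (info.hits + info.misses)
print(f"hit rate = {hit_rate:.0%}")         # 2 hits in 4 calls -> 50%
```

Exporting this one ratio to your dashboards tells you immediately whether the caching layer is paying for itself - per the table above, it is usually worthwhile once the hit rate clears roughly 10%.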
Techniques vs. Impact / Cost
| Technique | When to Use | Impact | Complexity (Dev/Ops) |
|---|---|---|---|
| Quantisation (INT8/4) | Large LLMs where the model fits the GPU after compressing. | 2–4× throughput; ~50–80% cost drop | Low: tools available; test accuracy |
| Pruning / Distillation | When the accuracy trade-off is acceptable. | 2–10× smaller models | High: requires training data/effort |
| Dynamic Batching | Online services with moderate latency SLAs. | 2–5× throughput (GPU util ↑) | Medium: config Triton/vLLM |
| Token-Level Scheduling | Mixed or bursty multi-model inference. | Up to ~9× throughput | High: custom scheduler needed |
| Speculative Decoding | Generation-heavy workloads (long responses). | 2–3× speedup | Medium: needs secondary model |
| KV-Cache Offloading | Long contexts that exceed single-GPU memory. | ~20% cost savings (batch case) | Medium: runtime support |
| vLLM / PagedAttention | Long-context LLMs with high concurrency. | 2–4× throughput | Low: use library & set up |
| TensorRT-LLM (FP8) | NVIDIA GPUs, ultra-low latency needs. | ~4× throughput (H100 vs A100) | High: NVIDIA toolchain |
| Right-Sized GPU | Always (match model to GPU). | Up to 35× tokens/sec gain | Low: measure & pick |
| MIG / GPU Sharing | Many small models or tenants. | ≈20–50% better utilisation | Medium: infrastructure support |
| Auto-Scaling | Variable traffic workloads. | ~50–70% cost reduction vs static | Medium: infra setup |
| Cache Frequent Queries | Apps with repeated inputs (search, FAQ). | Large latency & cost win if >10% hit rate | Low: add caching layer |
A Quick Checklist
Here's a quick checklist that you can use to cut inference costs:
- Benchmark & Profile: Measure tokens/sec, VRAM use, and latency on candidate VMs. Compute cost per million tokens.
- Optimise Model: Try INT8/4 quantisation (e.g. AWQ) and measure accuracy. Prune or distil if latency is still too high.
- Tune Runtime: Deploy on Triton, vLLM, or TGI with dynamic batching enabled. Test batch sizes until latency degrades.
- Manage KV Cache: Monitor GPU memory during inference. Enable KV offload or use vLLM's PagedAttention for long prompts.
- Scale Smart: Hook up an autoscaler (KEDA/Karpenter) on queue length or GPU utilisation. Define scale-to-zero if feasible.
- Choose GPU Wisely: Use multi-GPU or MIG only if needed. Avoid oversized VMs that sit idle.
- Monitor & Iterate: Track GPU utilisation, latency (p99), and cost per token over time. Continually adjust batching, model, or scale rules.
- Document & Automate: Containerise the stack, script benchmarking, and encode best practices in CI pipelines.
By systematically applying these strategies - and measuring every step - you can minimise inference cost on Hyperstack GPUs while meeting SLA requirements. Teams regularly see 40–70% reductions in compute spend and 2–5× improvements in throughput by combining these techniques.
Start Running Cost-Efficient Inference on Hyperstack
Hyperstack provides your team with instant access to NVIDIA H100 and H200 GPU VMs. With ultra-fast provisioning, seamless autoscaling, and Spot VM discounts, our platform is engineered to drive your cost per token to the absolute minimum. Stop overpaying for idle GPU hours and start optimising.
Ready to optimise your infrastructure?
Run your inference workloads on Hyperstack today.
FAQs
What is the most important metric for measuring LLM inference cost?
Cost per million tokens (or tokens per dollar) is the master metric. GPU cost per hour is misleading in isolation - a more expensive VM can deliver a lower cost per token if it has higher throughput. Always benchmark your actual workload and compute $/token = (GPU $/hr) / (tokens/sec × 3600).
Which GPU should I choose for inference workloads on Hyperstack?
It depends on your model size and latency requirements. H100 and H200 VMs are best for large models (70B+) or latency-critical production workloads. For smaller models (7B–13B), an A100 or L40S can deliver lower cost per token. Always benchmark on real traffic before committing to a VM type.
How much can quantisation reduce my inference costs?
Quantisation (INT8 or INT4) typically delivers 2–4× throughput improvement on the same GPU, which translates directly to lower cost per token. State-of-the-art 4-bit methods like AWQ and GPTQ keep accuracy within approximately 1% of full precision. For many production workloads, this is the single highest-impact, lowest-effort optimisation available.
What is dynamic batching and why does it matter?
Dynamic batching groups multiple incoming requests together before sending them to the GPU, rather than processing each one individually. This can move GPU utilisation from 20–30% up to 70–85%, delivering 2–5× throughput improvement at the same hardware cost. Both Triton Inference Server and vLLM support dynamic batching out of the box.
What are Spot VMs on Hyperstack and when should I use them?
Spot VMs on Hyperstack offer a 20% discount compared to on-demand pricing. They are best suited for non-critical or interruptible workloads - such as batch inference jobs, nightly analytics, or model-validation runs - where brief interruptions are acceptable. Reserve on-demand VMs for latency-critical, real-time inference paths.
How does vLLM's PagedAttention reduce inference costs?
PagedAttention virtualises the KV cache similarly to how an operating system pages memory, eliminating fragmentation and minimising wasted VRAM. This allows vLLM to pack more concurrent requests into the same GPU memory, increasing throughput by 2–4× compared to naive serving approaches like TorchServe - with no additional hardware cost.
When should I use TensorRT-LLM instead of vLLM?
TensorRT-LLM is the better choice when raw throughput and minimal latency on NVIDIA hardware are the priority and you can invest in the more complex setup (model conversion to TRT engines, NVIDIA-specific toolchain). It achieves approximately 3–4.6× the throughput of A100 FP16 on an H100 using FP8 precision. vLLM is a better default if you prioritise deployment flexibility, faster setup, or multi-model support.
How does autoscaling reduce GPU costs in practice?
Static GPU VMs left running when idle are pure wasted spend. Autoscaling tools like KEDA or Karpenter monitor queue depth or GPU utilisation and spin VMs up or down accordingly. Moving from a static GPU pool to an elastic one typically cuts compute costs 50–70%. Hyperstack supports fast VM provisioning (under one minute), making autoscaling practical for most workloads.