<img alt="" src="https://secure.insightful-enterprise-intelligence.com/783141.png" style="display:none;">
Reserve here

NVIDIA H100 SXMs On-Demand at $2.40/hour - Reserve from just $1.90/hour. Reserve here

Reserve here

Deploy 8 to 16,384 NVIDIA H100 SXM GPUs on the AI Supercloud. Learn More

alert

We’ve been made aware of a fraudulent website impersonating Hyperstack at hyperstack.my.
This domain is not affiliated with Hyperstack or NexGen Cloud.

If you’ve been approached or interacted with this site, please contact our team immediately at support@hyperstack.cloud.

close
|

Updated on 21 Apr 2026

How to Run Cost-Efficient Inference Workloads: A Guide


Key Takeaways

  • Picking the wrong GPU can waste 10× your budget.

  • Quantisation (INT8/4) is the highest-impact, lowest-effort optimisation available: it typically delivers 2–4× throughput improvement with less than 1% accuracy loss.

  • Dynamic and continuous batching can lift GPU utilisation from 20–30% to 70–85%, cutting cost per token by 2–5× with no hardware changes.

  • KV-cache management is critical for long-context workloads - without it, VRAM fills fast, forcing extra GPUs and inflating cost.

  • Moving from a static GPU pool to an elastic, autoscaled one cuts compute spend by 50–70%; Hyperstack supports VM provisioning in under one minute.

  • Cost per million tokens is the only metric that matters. NVIDIA H200 GPUs cost more per hour than NVIDIA H100s but can deliver up to 35× more tokens per watt, making them significantly cheaper in practice.

 Cost-efficient LLM inference on Hyperstack or any GPU cloud means maximising tokens per dollar by optimising every layer of the stack, from model and runtime to VM selection and orchestration.

In practice, this means choosing the right GPU size, applying quantisation, optimising KV cache usage, using modern runtimes like TensorRT-LLM, vLLM, and TGI, batching effectively, autoscaling dynamically, and tracking the right metrics. Together, these can deliver 2× to 10× cost savings in real deployments.

The table below shows the highest-impact techniques you can apply to run cost-effective inference workloads.

| Strategy | Impact (efficiency gain) | Effort / Complexity |
|---|---|---|
| Quantisation (INT8/4) | 2–4× throughput increase | Low–Medium (many libs available) |
| Distillation / Pruning | 2–10× model size reduction | High (requires retraining/tuning) |
| Dynamic Batching | 2–5× throughput (vs per-request) | Medium (configurable in Triton / vLLM) |
| Token-Level Scheduling | Up to ~9× throughput (Alibaba Aegaeon) | High (complex custom schedulers) |
| Speculative Decoding | 2–3× generation speedup | Medium (needs auxiliary models) |
| KV-Cache Offloading | ≈20% cost cut for long contexts | Medium (runtime support needed) |
| vLLM (PagedAttention) | 2–4× throughput vs TorchServe | Low (use open-source vLLM) |
| TensorRT-LLM (FP8) | ~3–4.6× speedup on H100 vs A100 | High (NVIDIA-specific setup) |
| Right-Sizing GPUs (H100/H200) | Up to 35× cost improvement | Low (benchmark + selection) |
| MIG / GPU Sharing | ~25–80% better utilisation (varies) | Medium (infrastructure support) |
| Autoscaling / Scale-to-Zero | ~50–70% cost reduction vs static | Medium (infra/config needed) |

1. Right-Size GPU & Virtual Machine Selection

Don't overprovision. Identify the smallest GPU/VM that meets your latency and throughput SLA. Benchmarks and profiling are key: measure your model's VRAM footprint, maximum batch size, and tokens/sec on candidate VMs before choosing. If your model fits with headroom on an NVIDIA A100 or NVIDIA H100, there is no need to jump to a larger, much pricier card. Analyses show that picking the wrong GPU can waste 10× your budget.

  • Memory headroom: Make sure your VM has enough free VRAM for KV-cache and batching. If your context window or batch pushes one GPU to the limit, you will need more GPUs — or to upgrade from an H100 to an H200 — just to fit the model, and pay for that extra hardware. If an H100 handles your workload easily, using an H200 or multiple GPUs raises the cost per token significantly.

  • Compute vs bandwidth: Many LLM tasks are memory-bandwidth-limited, especially decoding. If your workload is near memory capacity on an NVIDIA H100, a larger NVIDIA H200 (141 GB) may be worth it. For smaller models, a cheaper A100 or L40S can have a lower cost per token - Spheron found an NVIDIA A100 beat an NVIDIA H100 on cost per token for a 17B MoE model.

  • VM pricing: Always factor actual VM pricing - on-demand versus reserved versus Spot - into the decision. A 2× cheaper VM that delivers 1.5× the throughput is often the best buy. You can use Hyperstack's live pricing APIs to compare tokens per dollar across VM types. Hyperstack owns its hardware in key locations while maintaining global availability, which is what enables this pricing transparency and flexibility — and avoids the margin stacking typical of reseller clouds.

  • Benchmark your workload: The only way to know is to test your model on real traffic. Measure tokens/sec at target latencies on each candidate GPU, then compute $/token = (GPU $/hr) / (tokens/sec × 3600). Vendor FLOPS claims are meaningless in isolation - actual tokens output per dollar is what counts.
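
As a minimal sketch of that arithmetic (the prices and tokens/sec figures below are hypothetical placeholders, not benchmark results; substitute your own measurements):

```python
def cost_per_million_tokens(gpu_price_per_hour: float, tokens_per_sec: float) -> float:
    """$/1M tokens = hourly GPU price divided by tokens generated per hour."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000

# Hypothetical benchmark numbers for two candidate VMs.
candidates = {
    "A100": {"price": 1.35, "tps": 1800},
    "H100": {"price": 2.40, "tps": 5200},
}
for name, c in candidates.items():
    print(f"{name}: ${cost_per_million_tokens(c['price'], c['tps']):.3f} per 1M tokens")
```

In this made-up example, the pricier H100 wins on cost per token because its throughput more than covers the rate difference, which is exactly the trade-off the formula captures.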

2. Aggressive Model Compression

Shrinking your model directly cuts inference cost. Modern techniques allow drastic compression with minimal accuracy loss:

  • Quantisation (INT8/4, FP4): Convert weights and activations to lower precision. Moving from FP16 to INT8 or INT4 (AWQ/GPTQ methods) typically yields 2–4× throughput on the same GPU. NVIDIA H100 and NVIDIA H200 support FP8/FP4 natively. NVIDIA reports H100 (FP8) achieving 3–4.6× the throughput of NVIDIA A100 (FP16) on a 6B LLM. State-of-the-art 4-bit methods keep accuracy within ~1% of full precision.

  • Pruning: Remove unneeded weights or neurons. Structured pruning can cut a model's size by 30-60% while preserving 98-99% of accuracy, directly lowering compute and memory needs.

  • Distillation: Train a smaller student model to mimic a large teacher. The result can be 5-10× smaller while retaining ~95% of the original's performance. Distilled models are particularly well-suited for ultra-low latency or on-device inference.

Implementation & Tooling

Quantisation tools are widely accessible through industry-standard libraries like Hugging Face Transformers, bitsandbytes, and the Intel Neural Compressor.

For immediate results, you can source off-the-shelf 4-bit models directly from Hugging Face for instant deployment. While more advanced techniques like pruning and distillation require higher engineering effort, they provide significant long-term throughput benefits.

💡 Key Requirement: Always validate each optimisation on a held-out test set to ensure model accuracy remains within production tolerances.
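
To build intuition for why low-precision weights lose so little accuracy, here is a toy sketch of symmetric per-tensor INT8 quantisation in plain Python. This only illustrates the rounding involved; production workloads should use the libraries above (bitsandbytes, AWQ/GPTQ), which quantise per-channel or per-group and handle activations too.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantisation: map floats into [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.05, 0.89, -0.33]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)            # integer codes stored in 1 byte each instead of 2-4
print(max_err < scale)  # worst-case error is bounded by one quantisation step
```

Each weight now occupies one byte instead of two (FP16) or four (FP32), which is where the memory and bandwidth savings come from.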

3. KV-Cache and Memory Strategies

Long-context inference is often memory-bound. Every decoded token adds keys and values to the cache, and each step re-reads that KV cache from GPU memory. If your workload involves long contexts, unoptimised caches can waste gigabytes of VRAM - limiting your batch size and forcing extra GPUs. Here are the key tactics:

  • KV-Cache Offloading: Move part of the KV cache to CPU RAM or fast SSD during decoding. vLLM, TensorRT-LLM, and HF TGI all support this. One study found enabling KV offloading saved ~20% of cost in offline batch workloads; for online workloads, it enabled the use of cheaper VMs, which in some cases reduced costs by up to ~74%. Hyperstack VMs have ample host RAM, so configure your serving runtime to spill cache to it when needed.

  • PagedAttention / vLLM: The vLLM engine virtualises KV-cache usage so no memory is wasted on fragmentation, achieving near-zero KV duplication and flexible sharing between requests. In practice, vLLM boosts throughput 2–4× over naive servers while supporting much longer sequences.

  • Sequence Chunking / Key Sharing: For multi-turn scenarios, reuse KV cache where possible or encode common prompt prefixes once. Some frameworks support prefix caching to reuse work across requests.

  • Disaggregated Memory: Cutting-edge research (e.g. NVIDIA Dynamo) explores splitting prefill and decoding across nodes. Hyperstack's SXM-based GPUs (H100/H200 SXM) offer NVLink with 900 GB/s of GPU-to-GPU bandwidth, which benefits multi-GPU caching compared to PCIe.

💡 Insight: If your LLM's context often exceeds ~8K tokens or your GPU VRAM is fully consumed by KV cache at runtime, prioritise KV optimisation. Without it, you will require significantly more GPUs just to fit the working set, causing infrastructure costs to scale rapidly.
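
The VRAM maths behind this is easy to estimate: every token stores one key and one value vector per layer. The sketch below assumes a Llama-3-70B-like shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and FP16 cache entries; adjust the parameters for your model.

```python
def kv_cache_gib(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV-cache size: 2 tensors (K and V) per layer per token."""
    total_bytes = 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return total_bytes / (1024 ** 3)

# One 8K-token sequence on a Llama-3-70B-like model, FP16 cache:
per_seq = kv_cache_gib(batch=1, seq_len=8192, n_layers=80, n_kv_heads=8, head_dim=128)
print(per_seq)  # 2.5 GiB per sequence
```

At 2.5 GiB per 8K-token sequence, a few dozen concurrent long-context requests can consume an entire H100's 80 GB on cache alone, which is why offloading and paging matter so much.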

4. Optimised Inference Runtimes

The serving framework you choose can drastically affect your cost. Modern LLM servers include batching and kernel optimisations out of the box and picking the wrong one means leaving money on the table:

  • TensorRT-LLM (with Triton): NVIDIA's open-source engine, tuned for NVIDIA H100/H200. Supports FP8, in-flight batching, prefix caching, and kernel fusions. H100 FP8 achieves ~3–4.6× the throughput of NVIDIA A100 FP16 on the same model. Use it when ultra-low latency on NVIDIA hardware is required - setup is more complex but yields maximum performance.

  • Hugging Face Text Generation Inference (TGI): A robust, HF-maintained server with easy Docker deployment. TGI v3 adds multi-token caching and is up to 13× faster on long prompts. A solid default if you value ease of deployment and multi-model support.

  • vLLM: Focused on continuous token-level batching and advanced memory paging. Delivers 2–4× throughput improvements for large models, integrates easily with Hugging Face models, and is particularly effective for high-concurrency or long-context workloads.

  • TorchServe / Custom REST: For simpler use cases or vision models, Triton or TorchServe work. They support dynamic batching and containerisation but may not include the latest LLM optimisations. If using Triton, enable dynamic_batching in the config (e.g. max delay ~5 ms).

Key Point: A poorly chosen runtime wastes money. On NVIDIA, TensorRT-LLM generally wins for raw efficiency, while vLLM and TGI excel for flexibility and ease of deployment. Testing multiple engines and measuring performance is the only way to guarantee the best ROI.

5. Dynamic Batching & Token-Level Scheduling

Naive one-request-at-a-time serving leaves your GPUs mostly idle. Effective request batching and scheduling is one of the most overlooked levers for cutting cost.

  • Dynamic Batching (Request-level): Hold incoming requests up to a short timeout (a few ms) and batch them. This often moves GPU utilisation from ~20–30% to ~70–85%. Triton has built-in support - in config.pbtxt set dynamic_batching { preferred_batch_size: [4, 8] max_queue_delay_microseconds: 5000 }. Tune the max delay to match your SLA.

  • Continuous (Token-Level) Batching: For autoregressive models, vLLM's continuous batching lets new requests join a running batch between decoding steps. NVIDIA's TensorRT-LLM uses in-flight batching to similar effect. Alibaba's Aegaeon used token-level scheduling across multiple LLMs and cut GPU usage by 82% while boosting throughput ~9×. Even at a smaller scale, token batching can improve throughput by 2–5×.

  • Queue Management: Implement a queue or proxy in front of your GPUs to shape arrival rates and improve batch fill. Pick a strategy per workload: micro-batches (2–8) for chatbots, full batches (dozens) for async tasks.

  • CUDA Streams and Overlap: Overlap data transfers with compute and enable TensorRT's CUDA Graphs (--useCudaGraph) to reduce kernel launch overhead. These performance gains come at no additional infrastructure cost.
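
The request-level grouping logic can be sketched in a few lines. The function below is a simplified offline approximation (flush decisions happen only at arrival times) of the two knobs Triton exposes as preferred_batch_size and max_queue_delay_microseconds:

```python
def dynamic_batch(arrivals_ms, max_batch=8, max_delay_ms=5.0):
    """Group sorted request arrival times (ms) into batches, dispatching when
    the batch is full or the oldest queued request exceeds its delay budget."""
    batches, current = [], []
    for t in arrivals_ms:
        # Flush first if admitting this request would make the oldest wait too long.
        if current and t - current[0] > max_delay_ms:
            batches.append(current)
            current = []
        current.append(t)
        if len(current) == max_batch:   # batch full: dispatch immediately
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches

print(dynamic_batch([0, 1, 2, 3, 9, 10, 30], max_batch=4, max_delay_ms=5.0))
# [[0, 1, 2, 3], [9, 10], [30]]
```

Note the trade-off the two parameters encode: a larger delay budget improves batch fill (throughput) at the cost of added tail latency, which is why the delay should be tuned against your SLA.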

6. Multi-GPU & Multi-Tenancy

Use these techniques when you need to scale beyond one card or share GPUs across workloads:

  • Data Parallel (Multi-GPU): Run N copies of the model on N GPUs, each handling ~1/N of requests. In Triton, set instance_group { count: 2, kind: KIND_GPU, gpus: [0, 1] } to spin up two instances on two GPUs. Near-linear throughput scaling applies until another bottleneck appears.

  • Tensor Parallel / Pipeline Parallel: If a model is too large for one GPU, shard it. vLLM and DeepSpeed/Megatron handle tensor parallelism across cards. NVLink (SXM GPUs) is very beneficial here; PCIe-only GPUs may bottleneck on all-reduce bandwidth.

  • NVIDIA MIG (Multi-Instance GPU): For multiple smaller models or tenants, MIG slices a single NVIDIA A100/H100 into 2–7 independent GPUs, each with guaranteed compute and memory - ideal for isolating workloads and enforcing SLAs.

  • Fractional or Time-Sliced GPUs: Advanced schedulers time-share GPU threads across models. Alibaba's research suggests dynamic pooling outperforms static MIG slices for highly variable, multi-model traffic.
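
At its simplest, data parallelism is just request routing across identical replicas. A toy round-robin router (the class and replica IDs are invented for illustration; in practice a load balancer or Triton's instance_group does this for you):

```python
from itertools import cycle

class RoundRobinRouter:
    """Spread requests evenly across N identical model replicas."""

    def __init__(self, replica_ids):
        self._next = cycle(replica_ids)

    def route(self, request):
        # Return (replica, request); the caller forwards the request there.
        return next(self._next), request

router = RoundRobinRouter(["gpu0", "gpu1"])
print([router.route(f"req{i}")[0] for i in range(4)])  # ['gpu0', 'gpu1', 'gpu0', 'gpu1']
```

Round-robin assumes roughly uniform request cost; for LLMs with highly variable output lengths, least-outstanding-requests routing usually balances load better.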

7. Auto-Scaling & Scale-to-Zero

Static, always-on GPUs waste your money when idle. Use cloud autoscaling to match your capacity to actual demand:

  • Queue-Driven Scaling: Monitor request queue length or in-flight load, and spin GPU VMs up or down accordingly. Kubernetes tools (KEDA, Karpenter) autoscale pods or nodes on these metrics. Hyperstack supports fast VM provisioning (under one minute) and suspend/hibernate - idle VMs can be paused completely.

  • Spot VMs: For non-critical batch workloads, Spot VMs on Hyperstack offer a 20% discount compared to on-demand pricing. Run nightly analytics or batch jobs on Spot VMs, then shut them down. Reserve on-demand VMs for the latency-critical hot path.

  • Scale-to-Zero: For very intermittent traffic, consider a serverless model - wake GPU VMs only when queries arrive. Pre-warmed standby or hibernation can approximate scale-to-zero and avoid cold-start penalties.

  • Elastic Workload Mix: Separate real-time from batch traffic. Keep a few GPUs hot for low-latency APIs, and use auto-scaled burst clusters for heavy off-peak inference. GMI Cloud handles ~10× peak spikes via elastic scaling.

Don't pay for idle GPUs. Moving from static to elastic GPU pools can cut costs ~50–70%. Invest effort in proper autoscaling policies and capacity-buffer thresholds.
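
A queue-driven scaling policy can be expressed in a few lines. The function below is a hypothetical sketch (KEDA and Karpenter implement far richer versions): size the pool so the queue drains within a target wait, clamped to a range whose minimum of zero gives scale-to-zero.

```python
import math

def desired_replicas(queue_depth, per_replica_rps, target_wait_s,
                     min_replicas=0, max_replicas=16):
    """Replicas needed to drain the queue within target_wait_s, clamped."""
    needed = math.ceil(queue_depth / (per_replica_rps * target_wait_s))
    return max(min_replicas, min(max_replicas, needed))

# Each replica serves 10 req/s; we want the queue drained within 2 s.
print(desired_replicas(0, 10, 2))     # 0  -> scale to zero when idle
print(desired_replicas(120, 10, 2))   # 6
print(desired_replicas(1000, 10, 2))  # 16 -> clamped at the pool maximum
```

In production you would also add a cooldown between scale-down steps and a small capacity buffer above the computed value, so provisioning latency (under a minute on Hyperstack) does not translate into queue spikes.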

8. Cost Metrics & Observability

Use tokens per dollar (or cost per million tokens) as your master metric. Instrument your system to expose GPU utilisation, tokens/sec, latencies (p50/p95/p99), and cost per 1K inferences or per million tokens.

  • Monitoring Tools: Export metrics to Prometheus and Grafana - vLLM has Prometheus endpoints, Triton exposes Stats, and Hyperstack VM metrics are available. Key dashboards: GPU utilisation, queue depth, batch size, tokens/sec.

  • Alerts & Dashboards: Set alerts on low utilisation (wasted resources) or rising cost per token. Track trends: is an optimisation actually lowering $/token?

  • Token vs Hour: Cost per hour is misleading. NVIDIA showed that H200 GPUs were 2× the cost per hour of H100 but delivered ~35× more tokens per watt - yielding up to 17.5× lower cost per token. Plot cost per token, not rental rates. Hourly rates also hide billing extras: hyperscaler bills typically include ingress and egress charges that don't appear in the GPU rate, while Hyperstack charges none.

  • Capacity Planning: Continuously benchmark new models and GPUs using MLPerf Inference or custom load tests (e.g. Triton Perf Analyser) to quantify throughput at scale.

  • Cloud Billing Integration: In Hyperstack, enable detailed billing and cost management to attribute spend to inference jobs. This helps identify idle VMs or runaway jobs before costs compound.

Good observability closes the loop. Without it, you risk spending on optimisations that do not pay off. Cache hit-rate monitoring, for example, tells you whether a result cache is effective - or whether batching is actually happening as expected.
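
Cost-per-token observability ultimately needs only two counters: tokens served and GPU-seconds consumed. A minimal meter sketch (the class name and interface are invented for illustration; in practice you would export these as Prometheus counters and compute the ratio in Grafana):

```python
class TokenCostMeter:
    """Accumulate tokens and GPU time to report spend per million tokens."""

    def __init__(self, gpu_price_per_hour):
        self.price = gpu_price_per_hour
        self.tokens = 0
        self.gpu_seconds = 0.0

    def record(self, tokens, elapsed_s):
        # Call once per completed request (or per scrape interval).
        self.tokens += tokens
        self.gpu_seconds += elapsed_s

    def cost_per_million_tokens(self):
        spend = self.price * self.gpu_seconds / 3600
        return spend / self.tokens * 1_000_000

meter = TokenCostMeter(gpu_price_per_hour=2.40)
meter.record(tokens=520_000, elapsed_s=100)
print(round(meter.cost_per_million_tokens(), 3))  # 0.128
```

Tracking this number over time is what tells you whether a quantisation change, batching tweak, or VM swap actually moved $/token in the right direction.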

Techniques vs. Impact / Cost

| Technique | When to Use | Impact | Complexity (Dev/Ops) |
|---|---|---|---|
| Quantisation (INT8/4) | Large LLMs where model fits GPU after compressing. | 2–4× throughput; ~50–80% cost drop | Low: tools available; test accuracy |
| Pruning / Distillation | When accuracy trade-off is acceptable. | 2–10× smaller models | High: requires training data/effort |
| Dynamic Batching | Online services with moderate latency SLAs. | 2–5× throughput (GPU util ↑) | Medium: config Triton/vLLM |
| Token-Level Scheduling | Mixed or bursty multi-model inference. | Up to ~9× throughput | High: custom scheduler needed |
| Speculative Decoding | Generation-heavy workloads (long responses). | 2–3× speedup | Medium: need secondary model |
| KV Cache Offloading | Long contexts that exceed single-GPU memory. | ~20% cost savings (batch case) | Medium: runtime support |
| vLLM / PagedAttention | Long-context LLMs with high concurrency. | 2–4× throughput | Low: use library & set up |
| TensorRT-LLM (FP8) | NVIDIA GPUs, ultra-low latency needs. | ~4× throughput (H100 vs A100) | High: NVIDIA toolchain |
| Right-sized GPU | Always (match model to GPU). | Up to 35× tokens/sec gain | Low: measure & pick |
| MIG / GPU Sharing | Many small models or tenants. | ≈+20–50% utilisation | Medium: infrastructure support |
| Auto-Scaling | Variable traffic workloads. | ~50–70% cost reduction vs static | Medium: infra setup |
| Cache Frequent Queries | Apps with repeated inputs (search, FAQ). | Large latency & cost win if >10% hit rate. | Low: add caching layer |

A Quick Checklist

Here's a quick checklist that you can use to cut inference costs:

  • Benchmark & Profile: Measure tokens/sec, VRAM use, and latency on candidate VMs. Compute cost per million tokens.

  • Optimise Model: Try INT8/4 quantisation (e.g. AWQ) and measure accuracy. Prune or distil if latency is still too high.

  • Tune Runtime: Deploy on Triton, vLLM, or TGI with dynamic batching enabled. Test batch sizes until latency degrades.

  • Manage KV Cache: Monitor GPU memory during inference. Enable KV offload or use vLLM's PagedAttention for long prompts.

  • Scale Smart: Hook up an autoscaler (KEDA/Karpenter) on queue length or GPU utilisation. Define scale-to-zero if feasible.

  • Choose GPU Wisely: Use multi-GPU or MIG only if needed. Avoid oversized VMs that sit idle.

  • Monitor & Iterate: Track GPU utilisation, latency (p99), and cost per token over time. Continually adjust batching, model, or scale rules.

  • Document & Automate: Containerise the stack, script benchmarking, and encode best practices in CI pipelines.

By systematically applying these strategies - and measuring every step - you can minimise inference cost on Hyperstack GPUs while meeting SLA requirements. Teams regularly see 40–70% reductions in compute spend and 2–5× improvements in throughput by combining these techniques.

Start Running Cost-Efficient Inference on Hyperstack

Hyperstack provides your team with instant access to NVIDIA H100 and H200 GPU VMs. With ultra-fast provisioning, seamless autoscaling, and Spot VM discounts, our platform is engineered to drive your cost per token to the absolute minimum. Stop overpaying for idle GPU hours and start optimising.

Ready to optimise your infrastructure?
Run your inference workloads on Hyperstack today.

FAQs

What is the most important metric for measuring LLM inference cost?

Cost per million tokens (or tokens per dollar) is the master metric. GPU cost per hour is misleading in isolation - a more expensive VM can deliver a lower cost per token if it has higher throughput. Always benchmark your actual workload and compute $/token = (GPU $/hr) / (tokens/sec × 3600).

Which GPU should I choose for inference workloads on Hyperstack?

It depends on your model size and latency requirements. H100 and H200 VMs are best for large models (70B+) or latency-critical production workloads. For smaller models (7B–13B), an A100 or L40S can deliver lower cost per token. Always benchmark on real traffic before committing to a VM type.

How much can quantisation reduce my inference costs?

Quantisation (INT8 or INT4) typically delivers 2–4× throughput improvement on the same GPU, which translates directly to lower cost per token. State-of-the-art 4-bit methods like AWQ and GPTQ keep accuracy within approximately 1% of full precision. For many production workloads, this is the single highest-impact, lowest-effort optimisation available.

What is dynamic batching and why does it matter?

Dynamic batching groups multiple incoming requests together before sending them to the GPU, rather than processing each one individually. This can move GPU utilisation from 20–30% up to 70–85%, delivering 2–5× throughput improvement at the same hardware cost. Both Triton Inference Server and vLLM support dynamic batching out of the box.

What are Spot VMs on Hyperstack and when should I use them?

Spot VMs on Hyperstack offer a 20% discount compared to on-demand pricing. They are best suited for non-critical or interruptible workloads - such as batch inference jobs, nightly analytics, or model-validation runs - where brief interruptions are acceptable. Reserve on-demand VMs for latency-critical, real-time inference paths.

How does vLLM's PagedAttention reduce inference costs?

PagedAttention virtualises the KV cache similarly to how an operating system pages memory, eliminating fragmentation and minimising wasted VRAM. This allows vLLM to pack more concurrent requests into the same GPU memory, increasing throughput by 2–4× compared to naive serving approaches like TorchServe - with no additional hardware cost.

When should I use TensorRT-LLM instead of vLLM?

TensorRT-LLM is the better choice when raw throughput and minimal latency on NVIDIA hardware are the priority and you can invest in the more complex setup (model conversion to TRT engines, NVIDIA-specific toolchain). It achieves approximately 3–4.6× the throughput of A100 FP16 on an H100 using FP8 precision. vLLM is a better default if you prioritise deployment flexibility, faster setup, or multi-model support.

How does autoscaling reduce GPU costs in practice?

Static GPU VMs left running when idle are pure wasted spend. Autoscaling tools like KEDA or Karpenter monitor queue depth or GPU utilisation and spin VMs up or down accordingly. Moving from a static GPU pool to an elastic one typically cuts compute costs 50–70%. Hyperstack supports fast VM provisioning (under one minute), making autoscaling practical for most workloads.
