<img alt="" src="https://secure.insightful-enterprise-intelligence.com/783141.png" style="display:none;">
Reserve here

NVIDIA H100 SXMs On-Demand at $2.40/hour - Reserve from just $1.90/hour. Reserve here

Reserve here

Deploy 8 to 16,384 NVIDIA H100 SXM GPUs on the AI Supercloud. Learn More

alert

We’ve been made aware of a fraudulent website impersonating Hyperstack at hyperstack.my.
This domain is not affiliated with Hyperstack or NexGen Cloud.

If you’ve been approached or interacted with this site, please contact our team immediately at support@hyperstack.cloud.

close
|

Updated on 11 May 2026

KV Cache in LLMs Explained: The Key to Low-Latency LLM Inference


Key Takeaways

  • KV cache eliminates recomputation but shifts bottleneck to GPU memory, making long-context inference constrained by capacity rather than compute performance.

  • Memory per token scales with layers, heads and head dimension, causing mid-sized models to consume massive VRAM at extended context lengths.

  • Batch size and sequence length compete directly for memory, forcing trade-offs between throughput efficiency and maximum context supported per request.

  • Eviction strategies like sliding window, H2O and StreamingLLM determine what context survives, making workload-specific tuning critical for maintaining output quality.

  • Latency increases with context length due to memory bandwidth and access overhead, not compute limits, making efficient cache management essential.

You kick off a long-context generation job. The model is solid, the prompt is well-formed and the first few tokens come back fast. Then, beyond the 30,000-token mark, latency increases. By 80,000 tokens, you are waiting. By 128k, each new token feels like a round-trip to somewhere expensive.

The instinct is to look at compute. You check GPU utilisation and it is not saturated. You check the model itself and nothing has changed. The problem is not the model. It is what the model has to do with every single token it generates and how much memory it needs to do it.

KV cache in LLMs prevents transformers from recomputing everything from scratch on every forward pass. Understanding how it works, what it costs in GPU memory and where it breaks under long-context loads is imperative for anyone running inference at scale.

Why Transformers Have a Repetition Problem

Transformer attention works by computing relationships between every token in the sequence and every other token that came before it. For each token, the model produces three vectors: a Query (Q), a Key (K) and a Value (V). The attention mechanism uses Q to find which K vectors are relevant, then aggregates the corresponding V vectors to produce the output.

The problem is the K and V vectors. For token 5,000 in a sequence, the model still needs the K and V vectors from tokens 1 through 4,999 to compute attention correctly. Without caching, the model recomputes all of them on every forward pass. That means the cost of each forward pass grows quadratically with sequence length, because every new token forces a full recompute of the entire history.

KV cache solves this by storing the K and V vectors from previous tokens in GPU memory. On each new forward pass, the model computes Q, K and V only for the current token, retrieves the cached K and V vectors for the rest of the sequence and runs attention over the combined set. No recomputation. Just a memory read.
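To make this concrete, here is a minimal single-head decode step in PyTorch. The dimensions, weight names and tensor layout are made up for illustration; the point is that only the newest token is projected, its K and V are appended to the cache and attention runs over the full cached sequence.

```python
import torch

d_model, head_dim = 512, 128  # illustrative sizes, not a specific model

def decode_step(x_new, k_cache, v_cache, W_q, W_k, W_v):
    """One decode step with a KV cache (single head, no batching).

    x_new:   (1, d_model) hidden state of the newest token
    k_cache: (seq_len, head_dim) cached keys for all earlier tokens
    v_cache: (seq_len, head_dim) cached values for all earlier tokens
    """
    # Project only the new token; earlier tokens are never recomputed
    q, k, v = x_new @ W_q, x_new @ W_k, x_new @ W_v

    # Append the new K and V to the cache (append-only during decode)
    k_cache = torch.cat([k_cache, k], dim=0)
    v_cache = torch.cat([v_cache, v], dim=0)

    # The new token's query attends over the entire cached sequence
    scores = (q @ k_cache.T) / head_dim ** 0.5
    out = torch.softmax(scores, dim=-1) @ v_cache
    return out, k_cache, v_cache

# Usage: start with an empty cache and feed one token at a time
W_q, W_k, W_v = (torch.randn(d_model, head_dim) for _ in range(3))
k_cache, v_cache = torch.zeros(0, head_dim), torch.zeros(0, head_dim)
out, k_cache, v_cache = decode_step(torch.randn(1, d_model), k_cache, v_cache, W_q, W_k, W_v)
```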

How KV Cache Works

Each transformer layer has its own attention heads and each attention head maintains its own K and V vectors. KV cache stores all of them.

During the prefill phase, where the model processes the input prompt, the full K and V tensors are computed for every token in the prompt and written to the cache. During the decode phase, where the model generates tokens one at a time, only the new token's K and V are computed and appended. Attention then runs over the full cached sequence.

The cache grows with every generated token. It is append-only during decode. Reading from it is fast. The cost is not compute, it is memory.
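The two phases can be sketched in terms of shapes alone. The configuration below is made up, and placeholder zero tensors stand in for the real K and V projections; what matters is that prefill writes the whole prompt at once while decode appends exactly one entry per layer per generated token.

```python
import torch

# Illustrative configuration, not a specific model
num_layers, num_kv_heads, head_dim = 32, 8, 128
prompt_len, new_tokens = 4096, 256

# Prefill: K and V for every prompt token, computed in one pass and
# written to the cache for each layer (placeholder zeros here).
cache = [
    {"k": torch.zeros(num_kv_heads, prompt_len, head_dim, dtype=torch.float16),
     "v": torch.zeros(num_kv_heads, prompt_len, head_dim, dtype=torch.float16)}
    for _ in range(num_layers)
]

# Decode: each generated token appends one K/V entry per layer.
for _ in range(new_tokens):
    for layer in cache:
        k_new = torch.zeros(num_kv_heads, 1, head_dim, dtype=torch.float16)
        v_new = torch.zeros(num_kv_heads, 1, head_dim, dtype=torch.float16)
        layer["k"] = torch.cat([layer["k"], k_new], dim=1)
        layer["v"] = torch.cat([layer["v"], v_new], dim=1)

print(cache[0]["k"].shape)  # torch.Size([8, 4352, 128]): prompt_len + new_tokens
```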

KV Cache GPU Memory

This is where the real constraint lives. The memory footprint of KV cache per token is determined by three things: the number of transformer layers, the number of attention heads and the head dimension (the size of each K or V vector per head).

The formula: Memory per token = 2 (K and V) x num_layers x num_kv_heads x head_dim x bytes_per_element

In standard Multi-Head Attention (MHA), num_kv_heads equals the total number of attention heads. Modern large models use Grouped Query Attention (GQA) or Multi-Query Attention (MQA), where K and V are shared across query head groups. Llama 3 70B, Mistral 7B, Qwen2-72B, and DeepSeek 67B all use GQA with num_kv_heads of 8. This shrinks real KV cache by 8x compared to the MHA worst case shown below.

For an MHA worst case, take a 70B model with 80 layers, 64 attention heads, head dimension 128, running in FP16 (2 bytes per element):

Per token: 2 x 80 x 64 x 128 x 2 bytes = 2,621,440 bytes = ~2.5 MB per token

At 32,768 tokens (a 32k context window), that is roughly 80 GB of KV cache alone, which fills the entire VRAM of a single NVIDIA H100 SXM before model weights or activations enter the picture. At 128k tokens, you are looking at over 300 GB just for the cache.

With GQA at num_kv_heads of 8 (the actual config for Llama 3 70B), the same model needs roughly 10 GB at 32k and 40 GB at 128k. The formula does not change. The 8x reduction comes entirely from the smaller num_kv_heads value. This is why modern 70B inference fits on a single H100 or H200 at long contexts when older MHA architectures could not. 

Smaller models are cheaper but the relationship is linear and merciless. A 7B model with 32 layers, 32 heads and head dim 128 in FP16 costs approximately 0.5 MB per token. At 128k context, that is still 64 GB which is most of a single GPU's VRAM budget before you account for model weights or activation memory.
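A quick way to sanity-check these figures is to plug the formula into a few lines of Python. This is a rough sketch: the configurations are the illustrative ones above, and the totals exclude model weights and activation memory.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # The leading 2 accounts for storing both K and V
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

GIB = 1024 ** 3

# 70B-class model: MHA worst case vs. GQA with 8 KV heads
mha = kv_bytes_per_token(num_layers=80, num_kv_heads=64, head_dim=128)
gqa = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

print(mha / 1024 ** 2)         # ~2.5 MB per token
print(mha * 32_768 / GIB)      # ~80 GB at a 32k context
print(gqa * 131_072 / GIB)     # ~40 GB at a 128k context

# 7B-class model: 32 layers, 32 heads, head dim 128, FP16
small = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(small * 131_072 / GIB)   # ~64 GB at a 128k context
```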

This is why KV cache GPU memory is the binding constraint for long-context inference, not FLOPS.

KV Cache Size vs. Batch Size

Serving inference at scale means serving multiple requests concurrently. Batch size is how you make that economical: more requests per GPU means more throughput and a lower cost per token.

KV cache memory scales with two things simultaneously: sequence length and batch size. Every request in the batch needs its own KV cache, sized to its own context length. The total KV cache memory is roughly:

Total KV cache = batch_size x max_sequence_length x memory_per_token

 At a batch size of 8 with 32k context on the MHA worst case 70B, you need 640 GB of KV cache. The same workload on a GQA-based 70B drops to 80 GB. Either way, this is a multi-node memory allocation problem at the larger end, and it has to be solved before the model weights even land in memory.
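The same arithmetic with the batch dimension included, again as a sketch using the illustrative 70B configurations in FP16 and ignoring weights and activations:

```python
GIB = 1024 ** 3
batch_size, seq_len = 8, 32_768

# Per-token KV cache: 2 (K and V) x layers x kv_heads x head_dim x 2 bytes
per_token_mha = 2 * 80 * 64 * 128 * 2   # MHA worst case
per_token_gqa = 2 * 80 * 8 * 128 * 2    # GQA with 8 KV heads

# Every request in the batch carries its own cache
print(batch_size * seq_len * per_token_mha / GIB)  # ~640 GB
print(batch_size * seq_len * per_token_gqa / GIB)  # ~80 GB
```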

The tension is direct: a larger KV cache lets you serve longer sequences, but it leaves less room to increase batch size. A larger batch size improves throughput, but it compresses the maximum sequence length each request can use before you run out of memory. You are always trading one against the other.

This tension shapes real decisions. For instance, how many GPUs you allocate per replica, whether you use tensor parallelism or pipeline parallelism, whether you pre-allocate cache pages or manage them dynamically. Getting this wrong means either memory errors mid-generation or throughput numbers that look good in benchmarks but collapse under realistic traffic.

Long-Context Loads and KV Cache Eviction

Once the KV cache fills available GPU memory, something has to give. Eviction is the process of deciding which cached K and V vectors to remove to make room for new ones. The strategy you choose determines what your model knows and what it forgets, and the trade-offs are not symmetric.

  • Sliding window eviction is the simplest approach. It keeps only the most recent N tokens in the cache and drops everything older. This works well when generation is locally coherent; a model writing the next paragraph does not need to attend to the first paragraph of a 100k-token document. The problem surfaces when it does: any generation task that requires long-range dependencies (summarisation, multi-document QA, code generation with references to distant function definitions) will degrade visibly when the relevant context has been evicted.

  • H2O (Heavy-Hitter Oracle) takes a different approach. It tracks which K and V vectors have historically received high attention scores (the tokens the model has been attending to most) and preferentially keeps them. Less-attended tokens are evicted first. This preserves semantic anchors better than the sliding window and degrades more gracefully on tasks with sparse but critical long-range dependencies. The cost is the overhead of tracking attention scores per token, which adds bookkeeping complexity to the inference loop.

  • StreamingLLM combines a fixed set of initial tokens (the attention sink tokens, which the model attends to by default due to training dynamics) with a sliding window of recent tokens. By keeping the earliest tokens permanently in cache, it avoids the failure mode where the model loses orientation to the start of a sequence. This is relevant for multi-turn chat applications, where the system prompt and early context anchor everything that follows. The limitation is that it still drops mid-sequence tokens, which means tasks requiring recall of specific content from the middle of a long document remain vulnerable.

No eviction strategy recovers what was dropped. Once a K or V vector is evicted, that token's contribution to attention is gone for the remainder of the generation. The implication for inference architecture is that eviction policy is not a tuning parameter; it is a workload-specific design decision. The right strategy depends on the access pattern of the task, not the defaults of the serving framework.
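For a sense of what such a policy looks like in code, here is a rough StreamingLLM-style sketch applied to one layer's cache: a fixed number of initial sink tokens is kept permanently, a sliding window keeps the most recent tokens and everything in between is dropped. It is an illustration of the idea, not the published implementation; the tensor layout, parameter names and defaults are assumptions.

```python
import torch

def evict_streaming(k_cache, v_cache, num_sink=4, window=4096):
    """Keep the first `num_sink` tokens plus the most recent `window` tokens.

    k_cache, v_cache: (num_kv_heads, seq_len, head_dim) tensors for one layer.
    Evicted entries are gone for good: the model can no longer attend to them.
    """
    seq_len = k_cache.shape[1]
    if seq_len <= num_sink + window:
        return k_cache, v_cache  # under budget, nothing to evict yet

    keep = torch.cat([
        torch.arange(num_sink),                   # attention-sink tokens
        torch.arange(seq_len - window, seq_len),  # sliding window of recent tokens
    ])
    return k_cache[:, keep, :], v_cache[:, keep, :]

# Usage: call once per layer whenever the cache exceeds its budget
k = torch.randn(8, 10_000, 128, dtype=torch.float16)
v = torch.randn(8, 10_000, 128, dtype=torch.float16)
k, v = evict_streaming(k, v)
print(k.shape)  # torch.Size([8, 4100, 128])
```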

Run Inference on Infrastructure That Gives You Confidence to Make These Decisions

KV cache management only becomes a meaningful optimisation problem when the supporting infrastructure can actually hold the cache at the scale your workloads demand. Some shared, oversubscribed GPU environments impose their own ceiling before your eviction strategy ever kicks in.

Hyperstack's dedicated GPU infrastructure gives you the memory headroom, networking performance, and single-tenant resource allocation to run long-context inference without fighting the platform. If you are scaling production LLM workloads and need an environment to match distributed inference demands, explore what Hyperstack's Secure Private Cloud can do for your team.

FAQs

What is KV cache in LLMs and why is it important?

KV cache stores key and value tensors from previous tokens, avoiding recomputation during generation. It reduces latency and compute cost, enabling efficient long-context inference in production environments.

Why does latency increase with longer context lengths?

As context grows, the model attends to more cached tokens. Memory bandwidth and access overhead increase, slowing token generation even when GPU compute utilisation is not fully saturated.

How much GPU memory does KV cache use per token?

KV cache memory depends on layers, heads, head dimension and precision. Large models can consume megabytes per token, causing memory usage to scale linearly with sequence length.

Why is KV cache a bigger constraint than compute?

GPU compute is abundant, but memory is limited. KV cache grows with every token, quickly exhausting VRAM and restricting sequence length, batch size and overall inference scalability.

How does batch size impact KV cache memory usage?

Each request requires its own KV cache. Total memory usage scales with batch size and sequence length, forcing trade-offs between higher throughput and maximum supported context length.
