

December 2, 2025

5 min read

Updated on 4 Feb 2026

Run DeepSeek OCR on Hyperstack with your Own UI

Written by Hitesh Kumar


Key Takeaways

  • DeepSeek-OCR is a multimodal OCR model designed to extract both text and document structure from images and PDFs.

  • The setup uses a Hyperstack GPU virtual machine to run DeepSeek-OCR in a private, high-performance environment.

  • The model combines a vision encoder and a language decoder to handle complex layouts such as tables and multi-column documents.

  • Deployment involves cloning the DeepSeek-OCR repository, installing Python dependencies, and configuring the runtime environment.

  • A Gradio-based web interface allows users to upload documents and view OCR results in structured Markdown output.

  • The deployed OCR service can be extended into APIs or integrated into document processing and RAG workflows.

Take Control of Your Own OCR Workflow with DeepSeek-OCR and Hyperstack

Optical Character Recognition (OCR) is the process of recognising and extracting text from a source such as an image or PDF using only visual information - it's what we do every time we read!

Methods for performing OCR have existed for a while, but in the past few years (or even months), transformer-based models have become incredibly competent at it. DeepSeek, one of the world's leading AI foundation model labs, has released DeepSeek-OCR, a 3B-parameter model for quickly and easily creating your own OCR workflows.


Why is it harder to run than other DeepSeek models?

You might be used to running other AI models, like DeepSeek's LLMs, which are often available via a simple API call or a straightforward Python library like transformers. We've even made tutorials in the past that you can follow to get DeepSeek V3 running. DeepSeek-OCR is a bit more hands-on because it's not just a language model; it's a specialised multi-modal system.

It essentially has two parts: a sophisticated vision encoder that sees and understands the layout of a page (just like our eyes), and a 3-billion-parameter language decoder that reads and interprets the text from that visual information. This two-stage process is what makes it so powerful, but it also requires a more complex stack of software to run efficiently.

The setup in this guide uses vLLM, a high-throughput serving engine, to get the best possible performance. This is what adds most of the setup steps - we need to install a particular version of it along with dependencies like flash-attn. It's this requirement for a high-performance, GPU-accelerated serving environment that makes it more complex than a simple pip install package, but the payoff in speed and accuracy is well worth it.

How good is DeepSeek-OCR? 

In short: it's exceptionally good. It represents the current state-of-the-art for open-source OCR in its size group, especially when it comes to understanding real-world, complex documents.

Where traditional OCR tools might just extract a "wall of text" that loses all formatting, DeepSeek-OCR understands the structure of the document. This is its key advantage. It excels at:

  • Complex Layouts: Accurately reading multi-column articles, magazine pages, and scientific papers.

  • Tables: It doesn't just see text in a table; it understands the table's rows and columns and formats the output (as markdown) to match.

  • Mixed Content: It's highly adept at handling pages with a mix of text, code blocks, and even mathematical equations.

Because it outputs structured markdown, you're not just getting the raw text; you're getting the document's semantic structure. This makes its output immediately useful for feeding into other systems, like a RAG pipeline or a summarisation model. For its 3B-parameter size, it hits a perfect sweet spot of being incredibly accurate while still being fast enough to interpret huge documents on a single H100 GPU.
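Because the output is plain markdown, downstream chunking is straightforward. As an illustrative sketch (this heading-based splitter is our own example, not part of DeepSeek-OCR), you could break an OCR result into sections before feeding it to a RAG pipeline:

```python
def split_markdown_sections(markdown: str) -> list[dict]:
    """Split a markdown document into sections keyed by their nearest heading.

    A deliberately simple chunker: real RAG pipelines usually add token
    limits and overlap on top of this kind of structural splitting.
    """
    sections = []
    current = {"heading": "", "body": []}
    for line in markdown.splitlines():
        if line.startswith("#"):  # a markdown heading starts a new section
            if current["body"] or current["heading"]:
                sections.append(current)
            current = {"heading": line.lstrip("#").strip(), "body": []}
        else:
            current["body"].append(line)
    sections.append(current)
    return [
        {"heading": s["heading"], "text": "\n".join(s["body"]).strip()}
        for s in sections
        if s["heading"] or "".join(s["body"]).strip()
    ]

doc = "# Intro\nSome text.\n\n## Results\n| a | b |\n| 1 | 2 |"
chunks = split_markdown_sections(doc)
```

Each chunk keeps its heading as context, which is exactly the kind of semantic structure a plain "wall of text" OCR tool would throw away.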

How to set up DeepSeek-OCR on your own Hyperstack VM, step-by-step

We'll take you through the whole process from start to finish to get a simple, basic OCR workflow running on your own Hyperstack VM.

Step 0: Getting a Hyperstack VM

This guide assumes you've just spun up a new Linux VM on our platform and can access it via SSH. If you haven't done this before, please see our getting started guide in our documentation.

Step 1: Clone the DeepSeek-OCR repo 

# Clone the DeepSeek-OCR repository
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git

Step 2: Install UV (the package manager):

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

Step 3: Create a Python virtual environment:

uv venv deepseek-ocr --python 3.12.9
source deepseek-ocr/bin/activate

Step 4: Install vLLM and other requirements

cd DeepSeek-OCR

# Get vllm whl
wget https://github.com/vllm-project/vllm/releases/download/v0.8.5/vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
unzip vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl -d vllm-0.8.5+cu118-whl

# Install requirements
uv pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
uv pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
uv pip install -r requirements.txt
uv pip install flash-attn==2.7.3 --no-build-isolation
uv pip install uvicorn fastapi gradio --upgrade
uv pip install transformers==4.57.1 --upgrade

This step may take a while - there are a lot of dependencies!

Step 5: Download the Python code

main.py 

This is a standalone Python file that sets up the web server and hosts it on your VM. We recommend you have a quick read through it before you run it, just to familiarise yourself with what it does (more on this later).

Step 6: Get the code into your VM:

# Create the "web" dir and put main.py in there
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
mkdir -p web

cat <<EOF > web/main.py
<paste the contents of main.py here>
EOF

Alternatively, you can use an editor such as nano or vim, or connect to the VM from a more interactive environment such as VS Code, to make this part easier.

Step 7: Start the server and access via your browser

# Start the server
uvicorn web.main:app --host 0.0.0.0 --port 3000

You should now be able to open http://<your-VMs-ip>:3000 in your browser and interact with the UI!

NOTE: Remember to open port 3000 for inbound TCP traffic via your VM's firewall on Hyperstack! For more info on this, see our documentation here 

Once loaded, it should look something like this:

[Screenshot of the DeepSeek-OCR web UI]

In this simple, barebones UI, you can upload PDFs or images and DeepSeek-OCR will automatically run on them.

The results will be visible in the lower box, with the option to see (and download) the labelled input and the extracted text in markdown format. 

To re-run, simply delete the existing input and upload something new!

Here's an example of DeepSeek-OCR's output for a PDF article:

[Example DeepSeek-OCR output for a PDF article]

Troubleshooting

As stated, this is a very minimal, quickly-put-together UI; it is not maintained or updated by Hyperstack, and it is certainly not bug-free! However, feel free to modify the code in main.py to fix any issues or add any features you like.

One bug we are aware of from our early testing is that the UI does not replace old inputs when new ones are uploaded. In this case, simply press Ctrl+C to terminate the server, re-run the same uvicorn command, and reload the web page - this will start a fresh instance of the UI without the issue.

What's Next?

Congratulations! You've now got your own private, high-performance OCR server running. This Gradio UI is a fantastic sandbox for testing, but the real power comes from what you can build on top of it.

The most logical next step is to adapt the web/main.py file. Instead of launching a Gradio UI, you could modify it to create a simple, robust REST API endpoint using FastAPI. Imagine an endpoint where you can POST an image or PDF file and get a clean JSON response containing the extracted markdown.

Once you have that API, the possibilities are endless:

  • Build a RAG Pipeline: This is the big one. You can now programmatically feed your entire library of PDFs and documents through this API, storing the clean markdown output in a vector database.

  • Create a "Chat with your Docs" App: Combine your new OCR API with a conversational LLM (like DeepSeek-LLM) to build a powerful application that lets you ask questions about your documents.

  • Automate Data Entry: Create a workflow that watches a specific folder or email inbox, runs any new attachments through your OCR API, and then parses the structured output to populate a database or spreadsheet.
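The watch-a-folder idea from the last bullet reduces to a small polling loop. In this sketch, `process` is a hypothetical callback (in practice it would POST the file to your OCR API and store the returned markdown), and the extension list is an arbitrary example:

```python
import time
from pathlib import Path

def find_new_files(folder: Path, seen: set, exts=(".pdf", ".png", ".jpg")) -> list:
    """Return files in `folder` with matching extensions not yet in `seen`."""
    new = [p for p in sorted(folder.iterdir())
           if p.suffix.lower() in exts and p not in seen]
    seen.update(new)
    return new

def watch(folder: str, process, interval: float = 5.0) -> None:
    """Poll `folder` forever, calling `process(path)` on each new document.

    `process` is a stand-in for your integration step, e.g. POSTing the
    file to the OCR endpoint and writing the markdown to a database.
    """
    seen: set = set()
    root = Path(folder)
    while True:
        for path in find_new_files(root, seen):
            process(path)
        time.sleep(interval)
```

For production use you would likely swap the polling loop for filesystem events (e.g. the watchdog library), but the shape of the workflow is the same.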

You've done the hard part by setting up the core engine. Now you can use your Hyperstack VM as a stable, private microservice to power all kinds of intelligent document-processing workflows.

Launch Your VM today and Get Started with Hyperstack!

FAQs

What type of model is DeepSeek-OCR?

DeepSeek-OCR is a multimodal model combining vision and language understanding, designed to extract text and structure from documents efficiently.

What format does DeepSeek-OCR output?

It outputs structured markdown that preserves tables, layout, and semantic information, making it ready for downstream processing or RAG pipelines.

Which engine is used for high-throughput serving?

vLLM is used as a high-throughput serving engine, optimised for GPU acceleration to deliver fast, efficient OCR performance.

Which package manager is required for setup?

The setup requires UV, a modern package manager, to create virtual environments and install all dependencies reliably on Hyperstack.

Step-by-Step Guide to Deploying Qwen3.5 on Hyperstack

What is Qwen3.5?

Qwen3.5 is a powerful, open-weight AI model built to act as a highly capable digital assistant that understands text, code, images, and video. It uses a highly efficient "Mixture-of-Experts" design, meaning it holds a massive 397 billion parameters but only activates 17 billion at a time to answer a prompt, making it incredibly fast without losing its trillion-parameter-level smarts. It can also process up to 1 million tokens at once, easily handling massive codebases, two-hour videos, and long, multi-step tasks in a single go.

A major reason Qwen3.5 is so smart is its advanced training system, which was built to train AI agents across millions of complex, real-world scenarios at once:

  • Separate Practice and Learning: The system splits the workload. Some graphics cards (GPUs) are dedicated purely to letting the AI practice tasks and generate responses, while others focus solely on updating the model's "brain" based on those experiences.
  • Smart Data Management: A built-in scheduler organises the AI's learning experiences, making sure the training system always receives fresh, balanced data without causing bottlenecks or delays.
  • Continuous Updates: As the AI learns and improves, its updated knowledge is seamlessly synced back to the practice servers in real-time, without ever needing to pause the system.
  • Built-in Tool Use: The training system is directly wired into Qwen-Agent, allowing the model to naturally practice using external tools (like web search or code execution) and remember context over long, back-and-forth workflows.

Qwen3.5 Features

Qwen3.5 goes beyond chat: it introduces major upgrades focused on getting complex, real-world tasks done efficiently:

  • Lightning-Fast and Cost-Effective: By only using a small fraction of its "brain" at a time (activating 17B out of 397B parameters), Qwen3.5 generates answers 8 to 19 times faster than previous versions, especially when reading long documents or code.
  • Advanced Vision and Video Skills: Qwen3.5 naturally understands images, computer screens, and videos. It can perform complex visual tasks, like looking at video game footage to write the code behind it, or clicking through a computer interface on its own.
  • Built to be an Independent Agent: Qwen3.5 operates in a default "Thinking" mode, meaning it pauses to reason through hard problems step-by-step before answering. It is highly skilled at using web search and seamlessly working with AI coding tools (like Qwen Code or Claude Code) to build software autonomously.
  • Massive Memory (Up to 1 Million Tokens): Out of the box, it can remember hundreds of thousands of words, and can be pushed to handle over 1 million tokens. This means you can drop in entire books, massive software projects, or long conversation histories without it forgetting the details.
  • Speaks 201 Languages: The model has been trained on 201 different languages and regional dialects. It also processes non-English text 10% to 60% faster than before, giving it a deep understanding of different cultures worldwide.

How to Deploy Qwen3.5 on Hyperstack

Now, let's walk through the step-by-step process of deploying the necessary infrastructure.

Step 1: Accessing Hyperstack

First, you'll need an account on Hyperstack.

  • Go to the Hyperstack website and log in.
  • If you are new, create an account and set up your billing information. Our documentation can guide you through the initial setup.

Step 2: Deploying a New Virtual Machine

From the Hyperstack dashboard, we will launch a new GPU-powered VM.

  • Initiate Deployment: Look for the "Deploy New Virtual Machine" button on the dashboard and click it.

  • Select Hardware Configuration: For efficient inference, tensor parallelism is key. Choose the "8xH100-80G-PCIe" flavour to ensure sufficient VRAM and memory bandwidth.

  • Choose the Operating System: Select the "Ubuntu Server 22.04 LTS R535 CUDA 12.2 with Docker" image. This provides a ready-to-use environment with all necessary drivers.

  • Select a Keypair: Choose an existing SSH keypair from your account to securely access the VM.
  • Network Configuration: Ensure you assign a Public IP to your Virtual Machine. This is crucial for remote management and connecting your local development tools.
  • Review and Deploy: Double-check your settings and click the "Deploy" button.

Step 3: Accessing Your VM

Once your VM is running, you can connect to it.

  1. Locate SSH Details: In the Hyperstack dashboard, find your VM's details and copy its Public IP address.

  2. Connect via SSH: Open a terminal on your local machine and use the following command, replacing the placeholders with your information.

    # Connect to your VM using your private key and the VM's public IP
    ssh -i [path_to_your_ssh_key] ubuntu@[your_vm_public_ip]

Here you will replace [path_to_your_ssh_key] with the path to your private SSH key file and [your_vm_public_ip] with the actual IP address of your VM.

Once connected, you should see a welcome message indicating you're logged into your Hyperstack VM.

Now that we are inside the VM, we will use Docker to launch the vLLM server.

Step 4: Create a Model Cache Directory

We'll create a directory on the VM's high-speed ephemeral disk. Storing the model here ensures faster loading times on startup.

# Create a directory for the Hugging Face model cache
sudo mkdir -p /ephemeral/hug

# Grant full read/write permissions to the directory
sudo chmod -R 0777 /ephemeral/hug

This command creates a folder named hug inside the /ephemeral disk and sets its permissions so that the Docker container can read and write the model files.

Step 5: Launch the vLLM Server

We will use the nightly vllm-openai Docker image. vLLM also provides model-specific images such as vllm/vllm-openai:qwen3_5 for Qwen 3.5, but note that we are using specific flags like --tool-call-parser to enable the advanced agentic features of Qwen3.5.

# Pull the latest vLLM OpenAI image from Docker Hub
docker pull vllm/vllm-openai:nightly

# Run the container with the specified configuration
docker run -d \
--gpus all \
--ipc=host \
--network host \
--name vllm_qwen35 \
-e VLLM_ALLREDUCE_USE_SYMM_MEM=0 \
-v /ephemeral/hug:/root/.cache/huggingface \
vllm/vllm-openai:nightly \
Qwen/Qwen3.5-397B-A17B-FP8 \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--enforce-eager \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--gpu-memory-utilization 0.90 \
--host 0.0.0.0 \
--port 8000

This command instructs Docker to:

  • --gpus all: Use all available NVIDIA GPUs on the host machine.
  • --ipc=host: Share the host’s IPC namespace to improve multi-GPU communication performance.
  • --network host: Expose the container directly on the host network for simpler API access.
  • -v /ephemeral/hug:/root/.cache/huggingface: Mount the Hugging Face cache directory to persist downloaded model weights and avoid re-downloading.
  • Qwen/Qwen3.5-397B-A17B-FP8: Load the Qwen 3.5 397B FP8 model from Hugging Face.
  • --tensor-parallel-size 8: Split the model across 8 GPUs for large-scale tensor parallelism.
  • --max-model-len 262144: Set the maximum supported context length to 262,144 tokens.
  • --reasoning-parser qwen3: Enable the Qwen3 reasoning parser for structured reasoning outputs.
  • --enable-auto-tool-choice: Allow the model to automatically decide when to invoke tools.
  • --tool-call-parser qwen3_coder: Use the Qwen3 coder-specific tool-call parser for agent-style tool interactions.
  • --gpu-memory-utilization 0.90: Allocate up to 90% of available GPU memory for model weights and KV cache.

Step 6: Verify the Deployment

First, check the container logs to monitor the model loading process. This may take several minutes.

docker logs -f vllm_qwen35

The process is complete when you see the line: INFO: Uvicorn running on http://0.0.0.0:8000.

Next, add a firewall rule in your Hyperstack dashboard to allow inbound TCP traffic on port 8000. This is essential for external access.

Finally, test the API from your local machine (not the VM) by replacing the IP address with your VM's IP address.

# Test the API endpoint from your local terminal
curl http://<YOUR_VM_PUBLIC_IP>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "Qwen/Qwen3.5-397B-A17B-FP8",
"messages": [
{"role": "user", "content": "Type \"I love Qwen3.5\" backwards"}
],
"max_tokens": 200,
"temperature": 0.6,
"top_p": 0.95,
"extra_body": {
"top_k": 20
}
}'

You can see a successful response as a JSON object containing the model's reply:

{
"id": "chatcmpl-b290028506a93865",
"object": "chat.completion",
"created": 1771864485,
"model": "Qwen/Qwen3.5-397B-A17B-FP8",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Thinking Process:\n\n1. **Analyze the Request:**\n * Input: \"Type \"I love Qwen3.5\" backwards\"\n * Task: Reverse the string \"I love Qwen3.5\".\n\n2. **Perform the Reversal:**\n * Original string: `I love Qwen3.5`\n * Reversed: `5.3newQ evol I`\n\n3.",
...
},
"finish_reason": "stop"
}
],
...
}
💡 Note that the Qwen team recommends the following sampling parameters for generation:

# Thinking mode
temperature=0.6, top_p=0.95, top_k=20, min_p=0.0,
presence_penalty=0.0, repetition_penalty=1.0

# Instruct (or non-thinking) mode
temperature=0.7, top_p=0.8, top_k=20, min_p=0.0,
presence_penalty=1.5, repetition_penalty=1.0

You can see that our model is responding correctly to our query which means Qwen/Qwen3.5-397B-A17B-FP8 is successfully deployed on Hyperstack.

Step 7: Hibernating Your VM (OPTIONAL)

When you are finished with your current workload, you can hibernate your VM to avoid incurring unnecessary costs:

  • In the Hyperstack dashboard, locate your Virtual machine.
  • Look for a "Hibernate" option.
  • Click to hibernate the VM, which will stop billing for compute resources while preserving your setup.

Disabling "Thinking" Style for Concise Responses

Now that we have successfully deployed the vLLM server with the Qwen 3.5 model, we can interact with it using the OpenAI API format. First, we need to install the OpenAI Python client library to send requests to our local vLLM server.

# Install the OpenAI Python client library to interact with the vLLM server
pip3 install openai

We can now instantiate an OpenAI-compatible client in Python that points to our local vLLM server. Since vLLM typically does not enforce API keys, we can use a placeholder value for the api_key parameter.

from openai import OpenAI

# Create an OpenAI-compatible client that points to a local vLLM server.
client = OpenAI(
base_url="http://localhost:8000/v1", # Local API endpoint exposing OpenAI-style routes
api_key="EMPTY", # Placeholder key; vLLM typically does not enforce API keys
)

Qwen 3.5 is a thinking model with advanced reasoning capabilities, but thinking requires more tokens and may not suit every use case, so we can disable the "thinking" style at inference time to get more concise responses.

This can be useful when tasks are pretty straightforward and don't require the model to show its internal reasoning process, such as simple code generation or direct question answering.

# Define the conversation payload sent to the model.
# Here, the user asks for a short Python script that reverses a string.
messages = [
{"role": "user", "content": "Write a quick Python script to reverse a string."}
]

# Send a chat completion request to the local vLLM server via the OpenAI-compatible client.
chat_response = client.chat.completions.create(
model="Qwen/Qwen3.5-397B-A17B-FP8", # Model to use for generation
messages=messages, # Chat history / prompt messages
max_tokens=500, # Maximum number of tokens in the model response
temperature=0.7, # Sampling randomness (higher = more creative)
top_p=0.8, # Nucleus sampling threshold
presence_penalty=1.5, # Penalize repeated topics to encourage novelty
extra_body={
"top_k": 20, # Restrict sampling to top-k candidates
"chat_template_kwargs": {
"enable_thinking": False # Disable internal "thinking" style output
},
},
)

Here we ask the model to generate a Python script that reverses a string. By setting enable_thinking to False, we instruct the model to skip the detailed reasoning process and directly provide the final answer as concise Python code.

Finally, we can print the generated response from the model, which should contain a Python script that reverses a string.

# Print the generated text from the first returned choice.
print("Chat response:", chat_response.choices[0].message.content)

This is what we are getting:

Chat response: Here is a quick and efficient Python script to reverse a string using slicing:

```python
def reverse_string(text):
return text[::-1]

# Example usage
if __name__ == "__main__":
user_input = input("Enter a ...

Our Qwen 3.5 model successfully generated a Python script that reverses a string, and it did so without including the internal "thinking" process in the output, resulting in a concise and direct answer.

Multimodal Capabilities with Qwen 3.5

Qwen 3.5 is also a multimodal model, which means it can process and understand both text and images. This allows us to create prompts that include images along with text questions, and the model can analyze the image to provide relevant answers.

For example, we can build a multimodal chat prompt that includes an image URL and a text question about the image.

# Build a multimodal chat prompt with one user message:
# - an image URL
# - a text question about the image
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    # Public image to analyze
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/RealWorld/RealWorld-04.png"
                }
            },
            {
                "type": "text",
                # Question for the model based on the provided image
                "text": "Where is this?"
            }
        ]
    }
]

In our messages payload, we have a single user message that contains two parts: an image URL and a text question. The model will process the image at the provided URL and attempt to answer the question "Where is this?" based on the visual content of the image.

# Send the request to the local vLLM server via OpenAI-compatible client
chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B-FP8",  # Model identifier
    messages=messages,                    # Multimodal user prompt
    max_tokens=600,                       # Max tokens in generated response
    temperature=0.6,                      # Sampling randomness
    top_p=0.95,                           # Nucleus sampling threshold
    extra_body={
        "top_k": 20,                      # Restrict sampling to top-k candidates
    },
)

# Print the first completion text returned by the model
print("Chat response:", chat_response.choices[0].message.content)

We send the multimodal prompt to our local vLLM server using the same OpenAI-compatible client. This is what we get back from the model:

Chat response: The user wants to know the location of the image.

1.  **Analyze the image:**
    *   **Foreground:** There's a large statue of a person (looks like
an indigenous figure) with a golden headband.
Below it, there's a sign that says "@rigen" in a cursive font.
There's also a colorful floor or platform. ...

You can see that the model is able to analyze the image and provide a detailed response about its content, demonstrating its multimodal understanding capabilities.

We can also process video inputs in a similar way by providing a video URL in the prompt. The model can analyze the video frames and answer questions about the video content.

# Build a multimodal prompt:
# - one video input (URL)
# - one text question about the video content
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video_url",
                "video_url": {
                    # Public video to analyze
                    "url": "https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3.5/demo/video/N1cdUjctpG8.mp4"
                }
            },
            {
                "type": "text",
                # Question based on the video
                "text": "How many porcelain jars were discovered in the niches located in the primary chamber of the tomb?"
            }
        ]
    }
]

In our messages payload, we have a user message that includes a video URL and a text question about the video content.

# Send the chat completion request to the local vLLM server
chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B-FP8",  # Model identifier
    messages=messages,                   # Multimodal conversation payload
    max_tokens=600,                      # Maximum tokens in response
    temperature=0.6,                     # Sampling randomness
    top_p=0.95,                          # Nucleus sampling threshold
    extra_body={
        "top_k": 20,                     # Restrict token sampling to top-k candidates
        # Video frame sampling config: sample frames at 2 FPS
        "mm_processor_kwargs": {"fps": 2, "do_sample_frames": True},
    },
)

# Print the generated answer from the first completion choice
print("Chat response:", chat_response.choices[0].message.content)

Here we specify additional parameters in extra_body to configure how the model processes the video input. Setting do_sample_frames to True with fps: 2 instructs the model to sample frames from the video at a rate of 2 frames per second for analysis.

This is what we get back from the model:

Chat response: The user is asking about the number of porcelain jars
discovered in the niches located in the primary chamber of a tomb, based
on the ...

You can see that the model is able to analyze the video content and provide a relevant response to the user's question, demonstrating its ability to understand and process video inputs in a multimodal context.

Agentic Use Case with Qwen 3.5

One of the most powerful features of Qwen/Qwen3.5-397B-A17B-FP8 is its advanced agentic tool-calling capability.

Unlike a standard chat interaction where the model simply generates text, an agentic workflow allows the model to:

  • Decide when external tools are needed
  • Call tools automatically
  • Receive tool outputs
  • Continue reasoning using those outputs
  • Complete multi-step tasks autonomously
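Before reaching for a full framework, it can help to see what tool calling looks like at the raw API level. The sketch below assumes the vLLM server from Step 5 (launched with --enable-auto-tool-choice) is reachable on localhost; `get_file_size` and its dispatcher are hypothetical examples we made up for illustration:

```python
import json
import os

# Hypothetical example tool: the schema tells the model what it may call.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_file_size",
        "description": "Return the size of a file in bytes.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def dispatch(name: str, arguments: str) -> str:
    """Route a tool call emitted by the model to a local Python function."""
    args = json.loads(arguments)
    if name == "get_file_size":
        return str(os.path.getsize(args["path"]))
    raise ValueError(f"unknown tool: {name}")

def ask_with_tools(prompt: str) -> str:
    # Imported here so the schema and dispatcher can be used standalone.
    from openai import OpenAI
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="Qwen/Qwen3.5-397B-A17B-FP8",
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
    )
    msg = resp.choices[0].message
    if msg.tool_calls:  # the model decided a tool call is needed
        call = msg.tool_calls[0]
        return dispatch(call.function.name, call.function.arguments)
    return msg.content
```

A full agent loop would feed the tool result back as a "tool" message and let the model continue reasoning; that bookkeeping is exactly what Qwen-Agent handles for you.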

The Qwen team recommends using Qwen-Agent, a Python framework for building agent applications, to fully leverage these capabilities. First, install Qwen-Agent in your local Python environment:

# Install Qwen-Agent for building agent applications
pip3 install qwen-agent

We will configure Qwen-Agent to use our locally deployed vLLM server instead of external APIs.

import os
from qwen_agent.agents import Assistant

# Define LLM configuration pointing to our local vLLM server
llm_cfg = {
    # Use our OpenAI-compatible vLLM endpoint
    'model': 'Qwen/Qwen3.5-397B-A17B-FP8',
    'model_type': 'qwenvl_oai',
    'model_server': 'http://localhost:8000/v1',  # Local API endpoint
    'api_key': 'EMPTY',  # Placeholder key (vLLM does not enforce API keys)

    'generate_cfg': {
        'use_raw_api': True,
        # When using vLLM OpenAI-compatible API,
        # enable or disable thinking mode using chat_template_kwargs
        'extra_body': {
            'chat_template_kwargs': {'enable_thinking': True}
        },
    },
}

In this configuration, we are doing the following:

  • model_server points to our local vLLM deployment.
  • enable_thinking is set to True to allow structured reasoning.
  • use_raw_api ensures Qwen-Agent sends requests in OpenAI-compatible format.

Now we define a tool using the Model Context Protocol (MCP). This example uses the official MCP filesystem server.

# Define available tools for the agent
tools = [
    {
        'mcpServers': {
            # Filesystem MCP server configuration
            "filesystem": {
                "command": "npx",
                "args": [
                    "-y",
                    "@modelcontextprotocol/server-filesystem",
                    "/ephemeral/agent_workspace"  # Directory accessible to the agent
                ]
            }
        }
    }
]

This configuration:

  • Launches an MCP filesystem server using npx
  • Grants the model access to /ephemeral/agent_workspace
  • Allows the model to read, write, edit, and organize files within that directory

For security purposes, it is recommended to expose only a dedicated workspace directory rather than the entire system.

Now we can initialize the agent with the specified LLM configuration and tools.

# Initialize the agent
bot = Assistant(llm=llm_cfg, function_list=tools)

At this point, the model is capable of:

  • Understanding user instructions

  • Deciding when to use filesystem tools

  • Executing file operations

  • Continuing reasoning after tool execution

Example 1: Organizing the Desktop

We now provide a user instruction that requires filesystem interaction.

# Streaming generation example
messages = [{'role': 'user', 'content': 'Help me organize my /ephemeral/agent_workspace desktop. There are many files and folders all over the place.'}]

# Run the agent with the provided messages and stream responses
for responses in bot.run(messages=messages):
    pass

# Print the final responses from the agent after processing the instruction
print(responses)

We are asking the agent to help organize the /ephemeral/agent_workspace desktop. The model will autonomously decide to use the filesystem tool to analyze the desktop contents, create folders, and move files accordingly.

This is what happens internally:

  1. The model analyzes the request.
  2. It decides that filesystem access is required.
  3. It calls the MCP filesystem tool.
  4. The tool returns file listings.
  5. The model generates a plan to organize files.
  6. It may create folders and move files accordingly.
  7. It returns a summary of actions performed.

I have included a couple of different files in the /ephemeral/agent_workspace directory for testing. After running the above code, we get the following output from the agent:

<think>
Checking the contents of the desktop...
The desktop contains multiple files including documents, images, and scripts.

I created the following folders:
- Documents
- Images
...

You can see that the model is able to analyze the desktop contents, decide on an organizational structure, and perform file operations autonomously using the MCP filesystem tool.

Example 2: Develop a Website and Save It to the Desktop

Now we provide a more advanced instruction:

# Streaming generation example
messages = [{'role': 'user', 'content': 'Develop a dog website and save it on the /ephemeral/agent_workspace desktop.'}]

# Run the agent with the provided messages and stream responses
for responses in bot.run(messages=messages):
    pass

# Print the final responses from the agent after processing the instruction
print(responses)

Here, we are asking the agent to develop a dog-themed website and save it on the /ephemeral/agent_workspace desktop.

This is what happens internally:

  1. The model interprets the request.
  2. It generates HTML content for a dog-themed website.
  3. It calls the filesystem tool.
  4. It creates index.html in the specified directory.
  5. It writes the generated HTML code into the file.
  6. It confirms completion.

The agent returns the following confirmation:

I have created a file named "index.html" on the desktop.

The website includes:
- A header section
- A description of dogs
- An image placeholder
- Basic styling with CSS

You can open the file in your browser to view the website.

The actual directory now contains:

/ephemeral/agent_workspace/index.html

This file can be opened directly in a browser. This is what our simple website looks like:

Perfect, it includes a header, description, image placeholder, and basic styling, all generated autonomously by the agent using the Qwen3.5 model and the MCP filesystem tool.

Why Deploy Qwen3.5 on Hyperstack?

Hyperstack is a cloud platform designed to accelerate AI and machine learning workloads. Here's why it's an excellent choice for deploying Qwen3.5:

  • Availability: Hyperstack provides access to the latest and most powerful GPUs such as the NVIDIA H100 on-demand, specifically designed to handle large language models. 
  • Ease of Deployment: With pre-configured environments and one-click deployments, setting up complex AI models becomes significantly simpler on our platform. 
  • Scalability: You can easily scale your resources up or down based on your computational needs.
  • Cost-Effectiveness: You pay only for the resources you use with our cost-effective cloud GPU pricing
  • Integration Capabilities: Hyperstack provides easy integration with popular AI frameworks and tools.

FAQs

What is Qwen3.5?

Qwen3.5 is an open-weight, native vision-language model built by the Qwen Team. It uses a highly efficient Mixture-of-Experts (MoE) architecture (397B total parameters, but only 17B active at a time) to power advanced, multimodal digital agents without slowing down.

What is the context window of Qwen3.5?

The model natively supports a massive 262,144-token context window. With special scaling techniques (like YaRN), it can even be extended to process over 1 million tokens, allowing you to feed it massive codebases or up to two hours of video.

Does Qwen3.5 support "thinking" mode?

Yes! In fact, Qwen3.5 operates in "thinking" mode by default. It naturally generates <think>...</think> blocks to reason through complex problems step-by-step before giving a final answer. (You can turn this off via API settings if you just want a direct response).
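If you keep thinking mode on, downstream code often needs to separate the reasoning from the final answer. A minimal sketch for splitting out the `<think>...</think>` blocks described above (the helper name is our own):

```python
import re

def split_thinking(text):
    """Separate <think>...</think> reasoning blocks from the final answer."""
    # Collect the contents of every <think> block
    thoughts = "\n".join(re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL))
    # Remove the blocks to leave only the user-facing answer
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thoughts.strip(), answer

thoughts, answer = split_thinking("<think>2+2 is 4</think>The answer is 4.")
```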

What hardware is required for Qwen3.5?

Even though it is highly efficient and only activates 17B parameters during generation, you still need to load all 397B parameters into memory. This requires significant VRAM, typically needing a setup of 8 high-end GPUs (like 8x 80GB H100s or A100s) to run smoothly.

What are the main use cases for this model?

Qwen3.5 is perfectly suited for building universal AI agents. It excels at complex visual reasoning, automating computer and smartphone interfaces (GUI automation), deeply analysing long videos, and autonomous "vibe coding" alongside tools like Qwen Code and OpenClaw.

Fareed Khan

24 Feb 2026

Optimising Long-Context LLMs with KVPress Compression on ...

🚀 We love keeping up to date with the latest techniques, so we decided to put NVIDIA KVPress to the test. Spoiler alert: it provides massive memory savings and faster inference.

As Large Language Models scale to massive context windows, the Key-Value (KV) cache has become the primary bottleneck for inference speed and memory. While libraries like KVPress offer compression techniques to shrink this cache, a critical question remains: Can you really delete 80% of a model's memory without degrading its reasoning capabilities?

In this guide, we conduct a head-to-head benchmark on Hyperstack’s H100 infrastructure. We compare a standard training-free approach (KnormPress) against NVIDIA’s state-of-the-art retrofitted model (DMS) to see which approach gives the highest performance and efficiency for the Qwen 3-8B model under demanding workloads.

Understanding The End-to-End KVPress Workflow

Understanding how kvpress achieves these results without requiring you to retrain your model or change its architecture is key to using it effectively. The library is designed as a wrapper for Hugging Face Transformers. Here is the end-to-end flow of how a "Press" actually works during inference:

  1. Hook Registration: When you initialise a press (either through the pipeline or via a context manager), the library registers "forward hooks" on every attention layer of the transformer. Think of these as checkpoints that intercept the data flowing through the network.
  2. The Prefilling Phase: As the model begins the "prefill" phase (processing your prompt), the standard transformer logic runs. It generates Key and Value matrices for every token in your input. However, immediately after each attention layer calculates these matrices, the KVPress hook intercepts them before they are stored in the cache.
  3. Importance Scoring: Inside the hook, the compression algorithm (e.g., KnormPress) analyses these Key and Value tensors. It assigns an "importance score" to every token. The logic is simple: if a token has a low score, it likely won't be attended to by future tokens, so it's safe to discard.
  4. Cache Pruning: Based on your compression_ratio, KVPress identifies the lowest-scoring tokens. For example, a 0.8 ratio means the bottom 80% of tokens are identified and removed from the tensors.
  5. In-Place Cache Update: The library updates the past_key_values object in place. Crucially, the transformer doesn't know this happened, it continues as if it has the full context, but its memory footprint is now significantly smaller.
  6. Optimised Decoding: For all subsequent generated tokens (the decoding phase), the model attends only to this compressed cache. Because the cache is smaller, the GPU reads less data from VRAM at every step, directly boosting throughput and reducing latency.
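The scoring-and-pruning steps above can be sketched in a few lines. This is a toy illustration of the mechanics (score tokens, drop the lowest-scoring fraction given a `compression_ratio`), not kvpress's actual hook code, and the scores below are arbitrary placeholders:

```python
def prune_by_ratio(tokens, scores, compression_ratio):
    """Keep only the top (1 - compression_ratio) fraction of tokens by score."""
    assert len(tokens) == len(scores)
    n_keep = max(1, round(len(tokens) * (1 - compression_ratio)))
    # Indices of the highest-scoring tokens, with original order preserved
    keep = sorted(sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n_keep])
    return [tokens[i] for i in keep]

# With a 0.8 ratio, only 2 of these 10 tokens survive
kept = prune_by_ratio(list("abcdefghij"), [9, 1, 8, 2, 7, 3, 6, 4, 5, 0], 0.8)
```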

Compression Techniques: KnormPress vs. DMS

While kvpress offers many compression algorithms, ranging from attention-based methods like SnapKVPress to dimension pruning like ThinKPress, our analysis focuses on two distinct techniques to answer one question: is training necessary for extreme compression?

To test this, we pit a standard training-free method (KnormPress) against NVIDIA's state-of-the-art retrofitted model (DMS). Here is how they work under the hood:

1. The Training-Free Approach: KnormPress

KnormPress is an easy-to-use, training-free method. It uses the L2 norm of each key vector as a proxy for how important the corresponding token is to the attention mechanism.

  • The Heuristic: It calculates the L2 norm (magnitude) of every key vector in the cache.

  • The Observation: Keys with a small L2 norm tend to attract the highest attention scores, so low-norm tokens are the ones worth keeping.

  • The Selection: It ranks all tokens by key norm and marks the highest-norm fraction (e.g., 80%) as redundant.

  • The Action: These tokens are physically sliced out of the Key and Value tensors in VRAM.

  • The Result: A significantly smaller memory footprint achieved instantly, with zero modifications to the model weights.

2. The Trained Model: Dynamic Memory Sparsification (DMS)

DMS represents the "retrofitted" approach. Instead of guessing which tokens matter, the model was explicitly trained to manage a sparse memory state.

  • Learned Eviction Policy: Unlike other compression techniques, DMS uses a lightweight, trained module to predict an "eviction probability" for every incoming token based on its hidden state.

  • Hybrid Memory Management: It combines a strict "Sliding Window" (protecting the most recent ~512 tokens) with a learned sparse retention policy for the older context.

  • Paged Block Allocation: To achieve massive throughput, DMS pre-allocates memory blocks (PagedAttention style) and routes important tokens into them, rather than constantly resizing tensors.

  • Sparse Reasoning: The model underwent a "retrofitting" training phase (approx. 1,000 steps), teaching it to reason effectively even when 87.5% of its history is missing.

  • The Result: Extreme 8x compression that preserves reasoning capabilities, albeit with higher initial VRAM allocation due to its memory pooling strategy.

By providing this variety, kvpress allows developers to find the perfect balance of memory savings and generation quality for their specific use case, without needing to modify the underlying model architecture or retrain.
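The hybrid policy described above, a protected sliding window plus sparse retention of older context, can be illustrated with a toy eviction function. This is a sketch only: in real DMS the retention scores come from a trained eviction module, so the scores below are placeholders:

```python
def hybrid_retain(seq_len, window, keep_scores, n_sparse):
    """Return the token indices kept: the last `window` tokens (protected
    sliding window) plus the `n_sparse` highest-scoring older tokens."""
    recent = list(range(max(0, seq_len - window), seq_len))  # protected window
    older = list(range(0, max(0, seq_len - window)))
    # Learned retention policy, stubbed with fixed scores here
    older_sorted = sorted(older, key=lambda i: keep_scores[i], reverse=True)
    sparse = sorted(older_sorted[:n_sparse])
    return sparse + recent

scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.4, 0.5, 0.6]
kept = hybrid_retain(seq_len=8, window=3, keep_scores=scores, n_sparse=2)
```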

How to Install KVPress on Hyperstack

Now, let's walk through the step-by-step process of installing the necessary modules.

Step 1: Accessing Hyperstack


First, you will need an account on Hyperstack.

  • Go to the Hyperstack website and log in.
  • If you are new, create an account and set up your billing information. Our documentation can guide you through the initial setup.

Step 2: Deploying a New Virtual Machine


From the Hyperstack dashboard, we will launch a new GPU-powered VM.

  • Initiate Deployment: Look for the "Deploy New Virtual Machine" button on the dashboard and click it.

Since we are going to stress test with batch processing, it's better to use a 1x H100 80GB GPU for our experiments, so that we have enough memory to test higher batch sizes and compression ratios without running into out-of-memory errors.

  • Choose the Operating System: Select the "Ubuntu Server 22.04 LTS R535 CUDA 12.2 with Docker" image. This provides a ready-to-use environment with all necessary drivers.

  • Select a Keypair: Choose an existing SSH keypair from your account to securely access the VM.
  • Network Configuration: Ensure you assign a Public IP to your Virtual Machine. This is crucial for remote management and connecting your local development tools.
  • Review and Deploy: Double-check your settings and click the "Deploy" button.

Step 3: Accessing Your VM


Once your VM is running, you can connect to it.

  1. Locate SSH Details: In the Hyperstack dashboard, find your VM details and copy its Public IP address.

  2. Connect via SSH: Open a terminal on your local machine and use the following command, replacing the placeholders with your information.

    # Connect to your VM using your private key and the VM's public IP
    ssh -i [path_to_your_ssh_key] ubuntu@[your_vm_public_ip]

Here you will replace [path_to_your_ssh_key] with the path to your private SSH key file and [your_vm_public_ip] with the actual IP address of your VM.

Once connected, you should see a welcome message indicating you're logged into your Hyperstack VM.


Step 4: Installing KVPress


Our benchmark relies on torch.cuda utilities to measure GPU memory usage and timing during the prefilling and generation phases of language model inference. It's best to create a virtual environment to avoid dependency conflicts.

So let's create a new environment and install the required libraries:

# Create a new virtual environment (you can name it anything you like)
python3 -m venv kvpress_env

We are naming our environment kvpress_env, but you can choose any name you like. Let's activate the environment:

# Activate the virtual environment
source kvpress_env/bin/activate

We can now install the required libraries using pip:

# Install required libraries
pip install kvpress matplotlib

We are installing kvpress for the KV cache compression and matplotlib for plotting the results. kvpress will automatically handle the installation of its dependencies, including transformers and torch.

💡

Note that we will be using an 8B parameter model for our experiments. If you want to use larger models, you can use Flash Attention and quantisation techniques to reduce memory usage:

# Install Flash Attention (Highly recommended)
pip install -U flash-attn --no-build-isolation

Once the installation is complete, we can start coding the benchmarking approach using kvpress.

Benchmarking Extreme Compression: Baseline vs. Knorm vs. DMS

To see how useful trained sparse attention is, we run an experiment on the Qwen 3-8B model with three setups:

  • Baseline (0%): The standard dense model with no compression.
  • KnormPress (80%): A training-free heuristic that aggressively drops 80% of tokens on the fly.
  • DMS (8x): NVIDIA's retrofitted model, specifically trained to manage an 8x compressed cache.
💡

Note: For the DMS configuration, we are using the official pre-trained checkpoints provided by NVIDIA. We did not perform the "retrofitting" training phase ourselves. Instead, we are using NVIDIA open-source weights to evaluate the performance of an already-optimized sparse model.

We can now import the necessary libraries for our performance analysis. This includes libraries for data manipulation, model loading, GPU memory management, and plotting:

# Data manipulation and analysis
import pandas as pd

# Hugging Face utilities for downloading datasets
from huggingface_hub import hf_hub_download

# Progress bar for loops
from tqdm.auto import tqdm

# Serialization and file operations
import pickle
import os

# Warning suppression
import warnings

# No-op context manager (used when no compression press is applied)
from contextlib import nullcontext

# Timing utilities
from time import time

# Plotting libraries
import matplotlib.pylab as plt
import matplotlib.ticker as ticker
from matplotlib.colors import LinearSegmentedColormap
import numpy as np

# PyTorch for tensor operations and GPU management
import torch

# Transformers library for LLMs
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, pipeline
from transformers.utils.logging import disable_progress_bar
import transformers
from transformers.cache_utils import DynamicCache

# Garbage collection for memory management
import gc

# KV cache compression library
from kvpress import KnormPress

Most of the libraries are standard for working with language models and GPU performance analysis. Note that we also import AutoConfig and nullcontext, which the measurement functions below rely on. KnormPress is the specific compression method we will evaluate: it scores tokens by the L2 norm of their key vectors and prunes the least important entries from the KV cache, reducing memory usage while trying to preserve generation quality.

Before we start our experiments, we will suppress warnings and disable the progress bar from the transformers library to keep our output clean and focused on the results. This is especially useful when running multiple iterations of model inference, as it prevents cluttering the console with unnecessary logs:

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")
transformers.logging.set_verbosity_error()
disable_progress_bar()

To properly evaluate the performance of KV cache compression, we need a dataset of prompts that we can feed into the model.

Step 1: Preprocessing Our Eval Dataset

The ShareGPT dataset is a collection of conversations between users and an assistant, which provides a rich source of real-world prompts for our experiments.

It also contains lengthy conversations that allow us to have a high context length, which is important for testing the effectiveness of KV cache compression. We will download the dataset from the Hugging Face Hub, load it into a pandas DataFrame, and extract prompts that are suitable for our performance analysis.

# Download and load the ShareGPT dataset from Hugging Face Hub
DEFAULT_DATASET_URL = "anon8231489123/ShareGPT_Vicuna_unfiltered"

# The dataset file contains conversations between users and the assistant, which we will use to extract prompts for our performance analysis.
DEFAULT_DATASET_FILE = "ShareGPT_V3_unfiltered_cleaned_split.json"

# Download the dataset file and load it into a pandas DataFrame for processing

dataset_path = hf_hub_download(repo_id=DEFAULT_DATASET_URL, filename=DEFAULT_DATASET_FILE, repo_type="dataset")

# Load the dataset into a DataFrame for easier manipulation and filtering

df = pd.read_json(dataset_path)

Once we have the dataset loaded, let's take a look at the structure of the conversations.

# Examine the structure of the conversations in the dataset
sample_conv = df.iloc[0]["conversations"]

# Print the type of the conversation and the first few turns to understand the format of the data
print("Example Conversation Structure:")
print(type(sample_conv))
print("\nFirst 2 turns:")

# Look at the first 2 turns of the conversation to see how the prompts and responses are structured

for turn in sample_conv[:2]:
    print(turn)

This is what we are getting when we print the structure of the conversations:

Example Conversation Structure:
<class 'list'>
First 2 turns:
[
{
"from": "human",
"value": "Summarize the main ideas of Jeff ..."
},
{
"from": "gpt",
"value": "Here are the main ideas of Jeff Walker's Product ... and improve efficiency."
}
]

We need to filter the conversations to ensure that we have valid prompts and responses for our performance analysis. We will only keep conversations that have at least 2 messages (a user prompt and an assistant response).

# Filter conversations with at least 2 messages (prompt and completion)
valid_convs = df[df["conversations"].apply(lambda x: len(x) >= 2)]

We will sort the conversations by their ID for reproducibility and then shuffle them with a fixed random seed to ensure that we have a random selection of prompts for our experiments while still being able to reproduce the results in future runs.

# Sort by ID for reproducibility, then shuffle with fixed random seed
sorted_convs = valid_convs.sort_values(by="id")
shuffled_convs = sorted_convs.sample(frac=1, random_state=4387).reset_index(drop=True)

After that we will extract the prompts from the first message of each conversation, ensuring that both the prompt and the completion have at least 10 characters.

# Extract prompts from the first message of each conversation
# Only include conversations where both prompt and completion have at least 10 characters
all_prompts = []

for _, data in shuffled_convs.iterrows():
    prompt = data["conversations"][0]["value"]      # First message is the user prompt
    completion = data["conversations"][1]["value"]  # Second message is the assistant response

    if len(prompt) >= 10 and len(completion) >= 10:
        all_prompts.append(prompt)

print(f"Prepared {len(all_prompts)} prompts from ShareGPT.")

When running the above code, we should see an output like this:

Prepared 88797 prompts from ShareGPT.

So there are 88K valid prompts that we can use for our performance analysis. This is more than enough for our experiments, especially since we will be testing with batch sizes up to 64.

Although we have a large number of prompts, we still need to specify the GPU device to use for our experiments, to ensure we are utilising the correct hardware for model inference and performance measurements.

# Specify the GPU device to use for model inference
device = "cuda:0"

We also need to define the model checkpoints for our baseline and DMS runs. The baseline will use the standard Qwen3-8B model, while the DMS run will use a specific open-source checkpoint from NVIDIA that has been trained to manage an 8x compressed cache.

# Model checkpoints for baseline and DMS runs
base_ckpt = "Qwen/Qwen3-8B"

# DMS checkpoint from Hugging Face Hub, specifically trained for 8x compression
dms_ckpt = "nvidia/Qwen3-8B-DMS-8x"

Our data preprocessing is now complete, and we have a list of prompts ready for our performance analysis.
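The preprocessing steps above can be condensed into one helper. This sketch operates on plain lists of conversations (each a list of {'from', 'value'} dicts, matching the ShareGPT structure shown earlier) rather than a pandas DataFrame, but applies the same filters: at least two turns, and a prompt and completion of at least 10 characters each.

```python
def extract_prompts(conversations, min_len=10):
    """Filter conversations and return the first-turn user prompts."""
    prompts = []
    for conv in conversations:
        if len(conv) < 2:
            continue  # need at least a prompt and a completion
        prompt = conv[0]["value"]
        completion = conv[1]["value"]
        if len(prompt) >= min_len and len(completion) >= min_len:
            prompts.append(prompt)
    return prompts

convs = [
    [{"from": "human", "value": "Summarize this long article"},
     {"from": "gpt", "value": "Here is a summary of the article."}],
    [{"from": "human", "value": "Hi"}],  # too few turns: filtered out
]
prompts = extract_prompts(convs)
```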

Step 2: Creating the Cache Size Calculation Function

We will be using the Qwen3-8B model for our experiments. It is well suited to testing KV cache compression techniques because its reasoning/thinking behaviour produces long contexts.

The next function we need to build is get_size_of_cache, which calculates the memory size of a KV cache in bytes. Let's do that now.

def get_size_of_cache(cache, seen=None):
    if seen is None:
        seen = set()

    obj_id = id(cache)
    if obj_id in seen:
        return 0
    seen.add(obj_id)

    size = 0

    # CASE 1: Custom DMSCache (Paged Attention style)
    if hasattr(cache, "layers") and len(cache.layers) > 0 and hasattr(cache.layers[0], "key_blocks"):
        for layer in cache.layers:
            # We want the logical size of the compressed tokens, not the giant empty pool.
            if hasattr(layer, "cache_seq_lengths") and layer.cache_seq_lengths is not None:
                used_tokens = layer.cache_seq_lengths.sum().item()
                head_dim = layer.head_dim
                element_size = layer.dtype.itemsize if hasattr(layer, "dtype") else 2

                size += 2 * used_tokens * head_dim * element_size
            else:
                if getattr(layer, "key_blocks", None) is not None:
                    size += layer.key_blocks.element_size() * layer.key_blocks.nelement()
                if getattr(layer, "value_blocks", None) is not None:
                    size += layer.value_blocks.element_size() * layer.value_blocks.nelement()
        return size

    # CASE 2: Standard HF DynamicCache
    elif hasattr(cache, "key_cache") and hasattr(cache, "value_cache"):
        for k in cache.key_cache:
            if k is not None and torch.is_tensor(k):
                size += k.element_size() * k.nelement()
        for v in cache.value_cache:
            if v is not None and torch.is_tensor(v):
                size += v.element_size() * v.nelement()
        return size

    # CASE 3: QuantizedCache or newly structured caches
    elif hasattr(cache, "layers"):
        for layer in cache.layers:
            if hasattr(layer, "keys") and layer.keys is not None:
                size += layer.keys.element_size() * layer.keys.nelement()
            if hasattr(layer, "values") and layer.values is not None:
                size += layer.values.element_size() * layer.values.nelement()
        return size

    # CASE 4: Legacy tuple/list structure
    elif isinstance(cache, (list, tuple)):
        for item in cache:
            if isinstance(item, (list, tuple)):
                for tensor in item:
                    if tensor is not None and torch.is_tensor(tensor):
                        size += tensor.element_size() * tensor.nelement()
            elif item is not None and torch.is_tensor(item):
                size += item.element_size() * item.nelement()
        return size

    return 0

This helper function calculates the memory size of a KV cache in bytes. Here's how it works:

  • The function uses a seen set to track already-processed objects and avoid double-counting shared references, enabling safe recursive size calculation.

  • CASE 1 (DMSCache): For NVIDIA's paged attention-style cache, it extracts the number of retained tokens from cache_seq_lengths, then multiplies by head dimension and element size (2 bytes for fp16/bf16) to get the precise logical cache size, avoiding over-counting from pre-allocated but unused memory pools.

  • CASE 2 (DynamicCache): For Hugging Face's standard dynamic cache, it iterates through key_cache and value_cache lists, summing the memory of all non-null tensors using PyTorch's element_size() and nelement() methods.

  • CASE 3 (QuantizedCache): For newer cache implementations that organize tensors by layer, it accesses keys and values attributes within each layer and accumulates their memory footprint.

  • CASE 4 (Legacy Format): For older tuple or list-based caches, it goes through all the nested items, finds every tensor, and adds up their sizes in bytes.

  • The function returns 0 if the cache object doesn't match any known structure, ensuring robustness across different Transformers library versions.
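As a sanity check on these measurements, the expected size of an uncompressed fp16/bf16 KV cache can be computed directly: 2 tensors (keys and values) × layers × KV heads × sequence length × head dimension × 2 bytes per element. A sketch, using illustrative Qwen3-8B-like shape parameters (these are assumptions; check the actual model config before relying on the numbers):

```python
def kv_cache_bytes(num_layers, num_kv_heads, seq_len, head_dim, batch=1, bytes_per_elem=2):
    """Expected dense KV cache size in bytes: keys + values across all layers."""
    return 2 * num_layers * num_kv_heads * seq_len * head_dim * batch * bytes_per_elem

# Illustrative shape (assumed, not read from the real config):
# 36 layers, 8 KV heads, 1024-token prompt, head_dim 128, fp16
size_gb = kv_cache_bytes(num_layers=36, num_kv_heads=8, seq_len=1024, head_dim=128) / 1024**3
```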

When it comes to cache implementations, we need to separate the logic for measuring the prefilling phase (when the model processes the input prompt and builds the KV cache) and the generation phase (when the model produces new tokens autoregressively using the KV cache).

Step 3: Creating the Prefilling Stats Function

This way we can get detailed statistics for both phases, including memory usage and time taken, which will help us understand the trade-offs of different compression ratios in a more granular way.

First, let's implement the function to measure the prefilling phase with batch processing.

def get_prefilling_stats(model_ckpt, prompts_batch, press=None):
    """
    Measure prefilling time and KV cache size for batch processing.

    The prefilling phase is when the model processes the full input
    sequence and builds the KV cache.

    This version supports optional KV compression via a context manager.
    """

    # Clean up GPU memory and reset memory tracking
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    # Load model configuration
    config = AutoConfig.from_pretrained(model_ckpt, trust_remote_code=True)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_ckpt)

    # Set padding token ID in config to EOS token
    config.pad_token_id = tokenizer.eos_token_id

    # Prepare model loading arguments
    model_kwargs = {"dtype": "auto", "device_map": "auto", "config": config}

    # Some checkpoints require remote code execution
    if "DMS" in model_ckpt:
        model_kwargs["trust_remote_code"] = True

    # Load the language model
    model = AutoModelForCausalLM.from_pretrained(model_ckpt, **model_kwargs)

    # Set padding token to EOS
    tokenizer.pad_token = tokenizer.eos_token

    # Tokenize batch with padding and truncation
    inputs = tokenizer(
        prompts_batch,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=1024
    ).to(device)

    # Warmup: Run a small forward pass
    with torch.no_grad():
        model(inputs.input_ids[:, :10])

    torch.cuda.synchronize()

    # Apply compression context if provided; otherwise use no-op context
    context_manager = press(model) if press else nullcontext()

    # Main measurement: Process full batch and build KV cache
    with torch.no_grad(), context_manager:
        # Initialize dynamic KV cache
        cache = DynamicCache()

        # Measure prefilling time
        start = time()

        # Forward pass with caching enabled
        outputs = model(
            inputs.input_ids,
            past_key_values=cache,
            use_cache=True
        )

        torch.cuda.synchronize()
        prefill_time = time() - start

    # Retrieve the actual cache returned by the model
    actual_cache = (
        outputs.past_key_values
        if hasattr(outputs, "past_key_values")
        else cache
    )

    # Measure total memory footprint of the KV cache
    cache_size = get_size_of_cache(actual_cache)

    # Cleanup
    del model, tokenizer, inputs, cache, outputs

    # Return statistics (cache size converted from bytes to GB)
    return {
        "Prefilling time": prefill_time,
        "Cache Size": cache_size / 1024**3,
    }

In our prefilling stats function, the logic is as follows:

    • We start by cleaning up GPU memory and resetting memory tracking to ensure accurate measurements unaffected by previous runs.
    • Load the model configuration, tokenizer, and prepare model arguments with appropriate settings like dtype="auto" and device_map="auto".
    • Tokenize the input batch with padding and truncation to 1024 tokens, then run a warmup forward pass to initialize CUDA kernels without measuring startup overhead.
    • Setting up an optional compression context manager. If a press is provided, we apply it during the forward pass, otherwise, we use a no-op context manager.
    • We then execute the main prefilling measurement with caching enabled, measure timing, retrieve the actual KV cache, calculate its memory footprint using get_size_of_cache, then clean up and return results in GB.

Step 4: Creating the Generation Stats Function

In a similar way, we need to implement the function to measure the generation phase with batch processing. The generation phase is when the model produces new tokens autoregressively using the KV cache built during prefilling.

def get_generation_stats(model_ckpt, prompts_batch, press=None, max_new_tokens=100):
    """
    Measure generation time and peak memory usage for batched decoding.

    The generation phase includes both prefilling (processing the input prompt)
    and autoregressive decoding of new tokens.

    This function supports optional KV compression via a context manager.
    """

    # Clean up GPU memory and reset memory tracking
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    # Load model configuration
    config = AutoConfig.from_pretrained(model_ckpt, trust_remote_code=True)

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(base_ckpt)

    # Set padding token ID in config to EOS token
    config.pad_token_id = tokenizer.eos_token_id

    # Prepare model loading arguments
    model_kwargs = {"dtype": "auto", "device_map": "auto", "config": config}

    # Some checkpoints require remote code execution
    if "DMS" in model_ckpt:
        model_kwargs["trust_remote_code"] = True

    # Load the language model
    model = AutoModelForCausalLM.from_pretrained(model_ckpt, **model_kwargs)

    # Set padding token
    tokenizer.pad_token = tokenizer.eos_token

    # Configure deterministic generation settings
    model.generation_config.pad_token_id = tokenizer.pad_token_id
    model.generation_config.eos_token_id = None
    model.generation_config.do_sample = False

    # Tokenize batch with padding and truncation
    inputs = tokenizer(
        prompts_batch,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=1024
    ).to(device)

    # Apply compression context if provided; otherwise use no-op context
    context_manager = press(model) if press else nullcontext()

    # Main measurement: Generate new tokens autoregressively
    with torch.no_grad(), context_manager:
        # Measure total generation time (includes prefilling + decoding)
        start = time()

        model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            generation_config=model.generation_config
        )

        torch.cuda.synchronize()
        total_time = time() - start

    # Record peak GPU memory usage during generation
    peak_memory = torch.cuda.max_memory_allocated()

    # Cleanup
    del model, tokenizer, inputs

    # Return statistics (memory converted from bytes to GB)
    return {
        "Total time": total_time,
        "Peak memory usage": peak_memory / 1024**3,
    }

In our generation stats function, the logic is as follows:

    • Similar to the prefilling function, we start by cleaning up GPU memory and resetting memory tracking for accurate measurements.
    • Load the model configuration and tokenizer, ensuring that the padding token is set correctly for batch processing.
    • Prepare model loading arguments with optimal settings and load the model.
    • Configure the generation settings to ensure deterministic output and proper handling of padding.
    • We tokenize the input batch with padding and truncation, then apply the optional compression context manager during generation.
    • Measure the total time taken for generation (which includes both prefilling and decoding) and record the peak GPU memory usage during the process.
    • Finally, clean up and return the results in GB for memory usage.

Step 5: Combining Stats Logic

It is not possible to directly compare the prefilling and generation stats because they measure different phases of the inference process. However, we can combine them into a single results dictionary that includes both sets of statistics, so let's create a function to do that.

def combine_stats(prefill, gen, batch_size, max_new_tokens=100):
    """
    Combine prefilling and generation statistics into unified metrics.

    This function derives decoding-only time, computes throughput,
    and aggregates key benchmarking metrics.

    Args:
        prefill: Dictionary returned by get_prefilling_stats
        gen: Dictionary returned by get_generation_stats
        batch_size: Number of sequences processed simultaneously
        max_new_tokens: Number of tokens generated per sequence

    Returns:
        dict: Combined statistics including peak memory usage,
        cache size, time-to-first-token (TTFT), and throughput
    """

    # Compute decoding time by subtracting prefilling time
    # Total generation time = prefilling + autoregressive decoding
    gen_time = gen['Total time'] - prefill['Prefilling time']

    # Compute throughput (tokens per second)
    # Formula: (Batch Size × Tokens Generated) / Decoding Time
    throughput = (
        (batch_size * max_new_tokens) / gen_time
        if gen_time > 0 else 0
    )

    # Return combined benchmarking metrics
    return {
        # Peak GPU memory measured during generation phase
        'Peak memory usage': gen['Peak memory usage'],

        # KV cache size measured during prefilling phase
        'Cache Size': prefill['Cache Size'],

        # Time to first token (equals prefilling time)
        'TTFT': prefill['Prefilling time'],

        # Decoding throughput in tokens per second
        'Throughput': throughput
    }

Our combine_stats function takes the prefilling and generation statistics, along with the batch size and the number of new tokens generated, and computes additional metrics:

    • We calculate the decoding time by subtracting the prefilling time from the total generation time, giving us the time spent on autoregressive decoding alone.
    • Compute the throughput in tokens per second using the formula: (Batch Size × Tokens Generated) / Decoding Time. We also include a check to prevent division by zero in case the decoding time is extremely short.
    • Finally, return a dictionary that includes the peak memory usage, cache size, time to first token (which is the prefilling time), and the computed throughput, providing a comprehensive view of the performance characteristics for each compression ratio and batch size.
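To make the arithmetic concrete, here is a quick sanity check of the throughput formula using hypothetical timings (the numbers below are illustrative, not from our benchmark):

```python
# Hypothetical timings (seconds) to sanity-check the throughput formula
prefill_time = 2.0        # TTFT: time to process the prompt batch
total_gen_time = 6.0      # prefilling + autoregressive decoding combined
batch_size, max_new_tokens = 64, 100

# Decoding-only time, as computed in combine_stats
decode_time = total_gen_time - prefill_time  # 4.0 s

# Throughput = (Batch Size × Tokens Generated) / Decoding Time
throughput = (batch_size * max_new_tokens) / decode_time
print(throughput)  # 1600.0 tokens/sec
```

With 64 sequences each producing 100 tokens in 4 seconds of decoding, the batch as a whole sustains 1,600 tokens per second, even though each individual sequence only generates 25 tokens per second.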

Step 6: Running Benchmarking Loop

We can now define the compression_ratios and batch_sizes that we want to evaluate, and then iterate through each batch size to measure the prefilling and generation statistics for each compression ratio along with the baseline and DMS runs.

# Initialize results dictionary
stats = {}

# Focused on a single aggressive compression ratio for this benchmark
compression_ratios = 0.8

# Currently focused on a single batch size for controlled comparison
batch_sizes = [64]

max_prompts_needed = max(batch_sizes)

if len(all_prompts) < max_prompts_needed:
    raise ValueError(f"Need at least {max_prompts_needed} prompts, but only have {len(all_prompts)}")

With the compression ratio, batch sizes and checkpoints defined, we iterate over each batch size and measure the prefilling and generation statistics for the baseline, KnormPress and DMS runs.

print("Starting Final Benchmark...")

# Iterate over batch sizes
for batch_size in tqdm(batch_sizes, desc="Batch Sizes"):

    # Select the first `batch_size` prompts for evaluation
    # Assumes all_prompts contains sufficiently many prompts
    prompts = all_prompts[:batch_size]

    # Initialize nested dictionary for this batch size
    stats[batch_size] = {}

    # ----------------------------------------------------------
    # 1. Baseline Run (No Compression)
    # ----------------------------------------------------------
    print(f"Running Baseline for BS={batch_size}...")

    # Measure prefilling statistics
    prefill_base = get_prefilling_stats(base_ckpt, prompts)

    # Measure full generation statistics
    gen_base = get_generation_stats(base_ckpt, prompts)

    # ----------------------------------------------------------
    # 2. Knorm Compression Run (80% compression)
    # ----------------------------------------------------------
    print(f"Running Knorm (80%) for BS={batch_size}...")

    # Apply KnormPress with an 80% compression ratio
    prefill_knorm = get_prefilling_stats(
        base_ckpt,
        prompts,
        press=KnormPress(compression_ratios)
    )

    gen_knorm = get_generation_stats(
        base_ckpt,
        prompts,
        press=KnormPress(compression_ratios)
    )

    # ----------------------------------------------------------
    # 3. DMS Model Run (8× compression variant)
    # ----------------------------------------------------------
    print(f"Running DMS (8x) for BS={batch_size}...")

    # Evaluate DMS checkpoint without external compression
    prefill_dms = get_prefilling_stats(dms_ckpt, prompts)
    gen_dms = get_generation_stats(dms_ckpt, prompts)

    # ----------------------------------------------------------
    # Combine Results
    # ----------------------------------------------------------
    # Each entry aggregates TTFT, throughput, cache size, and peak memory
    stats[batch_size] = {
        'Baseline (0%)': combine_stats(prefill_base, gen_base, batch_size),
        'Knorm (80%)': combine_stats(prefill_knorm, gen_knorm, batch_size),
        'DMS (8x)': combine_stats(prefill_dms, gen_dms, batch_size)
    }

print("Benchmark Complete.")

This will start the benchmarking process:

Loading ShareGPT data...

Prepared 88797 prompts from ShareGPT.

Starting Final Benchmark...
Batch Sizes:   0%|          | 0/1 [00:00<?, ?it/s]

Running Baseline for BS=64...
Running Knorm (80%) for BS=64...

Running DMS (8x) for BS=64...
Batch Sizes: 100%|██████████| 1/1 [01:17<00:00, 77.45s/it]

Benchmark Complete.

Now that our benchmarking is complete, let's look into the results.

Performance Analysis: Throughput vs. Memory Tradeoffs

Let's print the stats variable to see the results for each batch size and compression ratio:

# printing the stats results
print(stats)

These are our results:

{64: {'Baseline (0%)': {'Peak memory usage': 30.883902072906494,
   'Cache Size': 9.0,
   'TTFT': 3.491295337677002,
   'Throughput': 564.4670750984385},
  'Knorm (80%)': {'Peak memory usage': 23.683218479156494,
   'Cache Size': 1.79296875,
   'TTFT': 3.5725390911102295,
   'Throughput': 1271.0371405858396},
  'DMS (8x)': {'Peak memory usage': 50.46891498565674,
   'Cache Size': 5.957980155944824,
   'TTFT': 8.323242664337158,
   'Throughput': 2687.9628813719537}}}
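The nested dictionary is a bit hard to scan. As an optional step, you can flatten it into a table with pandas; this is a small sketch using the (rounded) numbers copied from the output above:

```python
import pandas as pd

# Results for batch size 64, copied (rounded) from the printed `stats` dictionary
results = {
    'Baseline (0%)': {'Peak memory usage': 30.88, 'Cache Size': 9.00, 'TTFT': 3.49, 'Throughput': 564.47},
    'Knorm (80%)':   {'Peak memory usage': 23.68, 'Cache Size': 1.79, 'TTFT': 3.57, 'Throughput': 1271.04},
    'DMS (8x)':      {'Peak memory usage': 50.47, 'Cache Size': 5.96, 'TTFT': 8.32, 'Throughput': 2687.96},
}

# One row per method, one column per metric
df = pd.DataFrame(results).T
print(df)
```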

To better understand our results, let's visualise the key metrics (Peak Memory Usage, Cache Size, Throughput, and Time to First Token) across different compression ratios and batch sizes. This will help us see the trade-offs between memory savings and performance as we apply different levels of compression to the KV cache.

Let's create a function to plot these metrics in a clear and informative way, using a 1x4 grid to show all four metrics side by side for easy comparison.

# --- Plotting Function ---
def plot_bar_comparison(stats, batch_size=64, title_suffix=''):
    """
    Plot bar charts comparing key performance metrics across methods.

    Displays Peak Memory, KV Cache Size, Throughput, and TTFT
    for a given batch size.
    """

    # Use clean whitegrid style for readability
    plt.style.use('seaborn-v0_8-whitegrid')

    # Extract data for selected batch size
    data = stats[batch_size]

    # Labels correspond to benchmark variants (Baseline, Knorm, DMS)
    labels = list(data.keys())

    # Metrics to visualize: (dictionary_key, unit, display_title)
    metrics = [
        ('Peak memory usage', 'GB', 'Peak VRAM'),
        ('Cache Size', 'GB', 'KV Cache Size'),
        ('Throughput', 'tok/s', 'Throughput'),
        ('TTFT', 's', 'Time to 1st Token')
    ]

    # Create 4 side-by-side subplots
    fig, axes = plt.subplots(1, 4, figsize=(24, 6))

    # Fixed color scheme: Baseline (gray), Knorm (green), DMS (red)
    colors = ['#cccccc', '#4daf4a', '#e41a1c']

    # Iterate over metrics and generate bar plots
    for idx, (key, unit, title) in enumerate(metrics):

        ax = axes[idx]

        # Extract metric values in label order
        vals = [data[label][key] for label in labels]

        # Create bar chart
        bars = ax.bar(labels, vals, color=colors, edgecolor='black', alpha=0.8)

        # Annotate bars with exact numeric values
        ax.bar_label(bars, fmt=f'%.2f {unit}', padding=3, fontsize=10)

        # Baseline value used for percentage comparison
        baseline_val = vals[0]

        # Throughput: higher is better → show positive improvement
        if key == 'Throughput':

            imp_knorm = (
                ((vals[1] - baseline_val) / baseline_val) * 100
                if baseline_val > 0 else 0
            )
            imp_dms = (
                ((vals[2] - baseline_val) / baseline_val) * 100
                if baseline_val > 0 else 0
            )

            # Place improvement percentages inside bars
            if vals[1] > 0:
                ax.text(1, vals[1]*0.5, f"+{imp_knorm:.0f}%",
                        ha='center', color='white', fontweight='bold')

            if vals[2] > 0:
                ax.text(2, vals[2]*0.5, f"+{imp_dms:.0f}%",
                        ha='center', color='white', fontweight='bold')

        # Memory metrics: lower is better → show reduction percentage
        elif key != 'TTFT':

            imp_knorm = (
                ((baseline_val - vals[1]) / baseline_val) * 100
                if baseline_val > 0 else 0
            )
            imp_dms = (
                ((baseline_val - vals[2]) / baseline_val) * 100
                if baseline_val > 0 else 0
            )

            if vals[1] > 0:
                ax.text(1, vals[1]*0.5, f"-{imp_knorm:.0f}%",
                        ha='center', color='white', fontweight='bold')

            if vals[2] > 0:
                ax.text(2, vals[2]*0.5, f"-{imp_dms:.0f}%",
                        ha='center', color='white', fontweight='bold')

        # Set subplot title and formatting
        ax.set_title(title, fontsize=14, fontweight='bold')
        ax.set_ylabel(unit)
        ax.grid(axis='y', linestyle='--', alpha=0.7)
        ax.set_ylim(bottom=0)

    # Overall figure title
    plt.suptitle(
        f"Performance Comparison: Batch Size {batch_size}{title_suffix}",
        fontsize=16,
        y=1.02
    )

    plt.show()

In our plotting function, we are doing the following:

  • We set up a 1x4 grid of subplots to display the four key metrics side by side for easy comparison.

  • We use a consistent color scheme to differentiate between the Baseline (gray), Knorm (green), and DMS (red) bars.

  • For each metric, we extract the values for the three methods and create a bar chart.

  • We annotate each bar with its exact value for clarity.

  • For throughput, we calculate and display the percentage improvement over the baseline directly on the bars.

  • For memory metrics (Peak VRAM and KV Cache Size), we calculate and display the percentage reduction compared to the baseline.

  • Finally, we set titles, labels, and gridlines for better readability and add the overall figure title.
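One practical note: on a remote Hyperstack VM accessed over SSH, plt.show() has no display to render to. A small sketch of the usual workaround is to switch to a non-interactive backend and save the figure to disk instead (the filename below is arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; renders without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(["Baseline", "Knorm (80%)", "DMS (8x)"], [564, 1271, 2688])
ax.set_ylabel("tok/s")
ax.set_title("Decoding Throughput")

# Write the chart to a PNG you can copy back with scp
fig.savefig("throughput_comparison.png", bbox_inches="tight")
```

You can then pull the image to your local machine with scp to view it.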

Let's execute the plotting function to visualize our results:

# Execute plotting
plot_bar_comparison(stats)

This will generate a comprehensive visualisation of our benchmarking results:

  • Throughput: This is the main metric. By slashing the amount of data the GPU has to pull from memory during decoding, the DMS model achieves a 376% increase in tokens per second compared to the baseline. KnormPress also helps, giving a 125% speedup.
  • KV Cache Size: As expected, both methods shrink the KV cache. Knorm explicitly drops 80% of the cached tokens, cutting the cache from 9.00 GB to 1.79 GB. DMS shows a reduction from 9.00 GB to 5.96 GB (a ~34% drop on these shorter prompts). Why isn't DMS at 1/8th the size? Because DMS is explicitly trained to maintain a strict "sliding window" of the most recent 512 tokens to guarantee contextual accuracy. On shorter prompts, this safety window limits the maximum possible compression ratio, but it ensures the model never loses its immediate train of thought.
  • Peak VRAM (The DMS Trade-off): Look at the first panel. While Knorm reduces peak VRAM usage by roughly 23%, DMS actually spikes peak VRAM by 63% (up to 50.47 GB). This is a trade-off. To achieve that 376% decoding speedup, the DMS implementation uses PagedAttention techniques, pre-allocating large, empty memory pools upfront so it can route and evict tokens instantly without expensive tensor-resizing operations.
  • Time to First Token (TTFT): In the final panel, you will notice DMS takes roughly 8.32 seconds to process the initial prompt, more than double the baseline. Because DMS has to mathematically score every token, track sliding windows, and execute dynamic discarding via chunked processing, it loses the pure parallel processing speed of a standard dense FlashAttention pass during prefilling.
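The percentages quoted above follow directly from the raw numbers in the stats dictionary; a quick check:

```python
# Raw values (rounded) from the benchmark output at batch size 64
base_tps, knorm_tps, dms_tps = 564.47, 1271.04, 2687.96
base_vram, knorm_vram, dms_vram = 30.88, 23.68, 50.47

# Throughput: higher is better
print(round((knorm_tps - base_tps) / base_tps * 100))  # 125 (% speedup)
print(round((dms_tps - base_tps) / base_tps * 100))    # 376 (% speedup)

# Peak VRAM: Knorm reduces it, DMS increases it
print(round((base_vram - knorm_vram) / base_vram * 100))  # 23 (% reduction)
print(round((dms_vram - base_vram) / base_vram * 100))    # 63 (% increase)
```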

Is retraining really necessary?

The data proves that both methods are incredibly fast at generating text. But it creates an obvious question: if KnormPress is training-free, lowers VRAM, and boosts our speed... why go through the trouble of using NVIDIA's complex, trained DMS model?

The answer lies in what actually happens to the text when you suddenly delete 80% of an AI's memory. For that we need to perform a qualitative analysis of the generated outputs, comparing the baseline, KnormPress, and DMS outputs side by side to see if there are any differences in coherence, relevance, or factual accuracy.

We are going to take a small sample of 10 prompts from our dataset and generate outputs using all three methods for a direct comparison.

# Take a small sample of 10 prompts
sanity_prompts = all_prompts[:10]

This will allow us to see if the aggressive compression from KnormPress has any noticeable impact on the quality of the generated text compared to the baseline and DMS outputs, which will help us understand if retraining is necessary to maintain output quality at high compression ratios.

def generate_text_samples(model_ckpt, prompts, press=None, max_new_tokens=100):
    """
    Generate and decode text outputs for a batch of prompts.

    This function is intended for qualitative comparison between
    baseline and compressed models.
    """

    # Clean up GPU memory before loading model
    gc.collect()
    torch.cuda.empty_cache()

    # Load model configuration and tokenizer
    config = AutoConfig.from_pretrained(model_ckpt, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(base_ckpt)

    # CRITICAL: Left-padding is required for batched autoregressive generation
    tokenizer.padding_side = "left"
    tokenizer.pad_token = tokenizer.eos_token

    # Ensure model config uses correct padding token
    config.pad_token_id = tokenizer.eos_token_id

    # Prepare model loading arguments
    model_kwargs = {"dtype": "auto", "device_map": "auto", "config": config}

    if "DMS" in model_ckpt:
        model_kwargs["trust_remote_code"] = True

    # Load the language model
    model = AutoModelForCausalLM.from_pretrained(model_ckpt, **model_kwargs)

    # Tokenize prompts with padding and truncation
    inputs = tokenizer(
        prompts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=1024
    ).to(device)

    # Apply compression context if provided
    context_manager = press(model) if press else nullcontext()

    # Generate text autoregressively
    with torch.no_grad(), context_manager:
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
            do_sample=False
        )

    # Slice off prompt tokens
    input_length = inputs.input_ids.shape[1]
    generated_ids = outputs[:, input_length:]

    # Decode generated token IDs to text
    generated_texts = tokenizer.batch_decode(
        generated_ids,
        skip_special_tokens=True
    )

    # Cleanup
    del model, tokenizer, inputs, outputs
    gc.collect()
    torch.cuda.empty_cache()

    return generated_texts

In this function, we are doing the following:

  • We load the model and tokenizer, ensuring that the tokenizer is set to left-padding mode, which is crucial for correct attention behaviour during autoregressive generation in a batched setting.
  • We tokenize the input prompts with padding and truncation, then apply the optional compression context manager during generation.
  • We generate new tokens autoregressively, ensuring deterministic output by setting do_sample=False.
  • After generation, we slice off the input prompt tokens to isolate only the newly generated text, which we then decode back into human-readable strings.
  • Finally, we clean up GPU memory to avoid interference with subsequent runs.
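The left-padding point deserves a toy illustration (no model involved): generation appends tokens to the end of each row, so with right padding the pad tokens would sit between a short prompt and its new tokens, while left padding keeps the prompt directly adjacent to the generated text.

```python
# Toy illustration of padding sides, using '_' as the pad symbol
prompts = ["hello world", "hi"]
width = max(len(p) for p in prompts)

right_padded = [p + "_" * (width - len(p)) for p in prompts]  # pads after prompt
left_padded  = ["_" * (width - len(p)) + p for p in prompts]  # pads before prompt

# Simulate appending a newly generated token "X" to the end of each row
print(right_padded[1] + "X")  # pads split the prompt from the new token
print(left_padded[1] + "X")   # prompt directly precedes the new token
```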

Let's run this function for the baseline, KnormPress, and DMS models to compare their outputs side by side.

# --- Run the Sanity Check ---
print("Generating Baseline texts...")
baseline_texts = generate_text_samples(base_ckpt, sanity_prompts, press=None)

print("Generating Knorm (80%) texts...")
knorm_texts = generate_text_samples(base_ckpt, sanity_prompts, press=KnormPress(0.8))

print("Generating DMS (8x) texts...")
dms_texts = generate_text_samples(dms_ckpt, sanity_prompts, press=None)

# --- Combine into a DataFrame ---
sanity_df = pd.DataFrame({
    "Prompt": sanity_prompts,
    "Baseline_Output": baseline_texts,
    "Knorm_80_Output": knorm_texts,
    "DMS_8x_Output": dms_texts
})

# print the DataFrame to compare outputs
print(sanity_df)

This is what the output looks like:

Prompt Baseline_Output Knorm_80_Output DMS_8x_Output
To convert the Terraform... depend on the method... dependUserCode. ’’ ‘haiesan_realodyn... depend on the method...
Traceback (most recent call... The error message you're... 爰录音. ’’ ‘haiesan_realodyn... The error message you're...
Write a pitch for... The movie should be... The Quart Equationsto clubhouseGT... The movie should be...
node-js get one mongodb... In Node.js, to retrieve... _APBAINED{yd zro 爰录音... In Node.js, to retrieve...
i need 50 blog... for 2024 Here are... for.logic)UILayout KamiAEA.EdKHKeyId没收吉UPI... for 2024 Here are...
Sure, here is the... This, in turn, reduces... ThisGPC ‫/head ,[edException... This, in turn, reduces...
| Fund Name | Return... Based on the table... cycl ýboot. ’’ ‘haiesan_realodyn... Based on the table...
Chapter 2: Creating Belief... Now, let's move on... [Whitespace/Empty] Now, let's move on...
I want you to... or maybe a payment... or)-(aN Equationsto clubhouseGT... or maybe a payment...
I want to send... Best regards, [Your Name]... cycl ýboot. ’’ ‘haiesan_realodyn... Best regards, [Your Name]...

Let's understand the outputs:

  • Speed Gains from Training-Free Methods: Looking at the performance graphs, KnormPress seems like a big win. It drops 80% of its KV cache and increases throughput by 125%. But the text outputs show that this speed comes at a serious cost to the model’s quality.
  • Total Context Collapse at Extreme Ratios: Removing 80% of a model’s memory all at once causes it to fail. The Knorm_80_Output column shows repetitive loops, random Chinese characters (爰录音), and meaningless symbols. The model cannot answer the prompts.
  • Consistent Output with Retraining (DMS): Even under an 8x memory compression, NVIDIA’s DMS model produces text almost identical to the dense Baseline. It keeps reasoning and formatting intact. 
  • Why DMS Works: DMS succeeds where Knorm fails because of the retraining, or "retrofitting," phase. The Qwen 3-8B DMS model went through a brief training process that taught it how to navigate and reason effectively even with a heavily sparsified memory state. Unlike Knorm, which drops tokens on the fly based on a simple heuristic, DMS learned how to use the limited memory intelligently, preserving context and maintaining coherent outputs under extreme compression.
  • High-Speed Performance: Because DMS maintains the quality of its outputs, its impressive 376% increase in throughput (reaching 2,687 tokens/sec at batch size 64) is practical for real-world use. By preserving reasoning and coherence even under extreme compression, DMS turns memory compression into a true performance improvement, rather than just an experimental optimization.
  • Final Conclusion: Retraining is necessary. Training-free methods like KnormPress work well for light compression (20–40%), but for extreme compression of 80% or more, a trained sparse model like DMS is needed to keep the model’s outputs coherent.

Conclusion

In this guide, we tested how far we can compress KV caches using NVIDIA’s KVPress library on Hyperstack’s H100 infrastructure. The main question was: Can we speed up inference a lot without hurting the model’s intelligence?

Our benchmarks showed a clear trade-off. Training-free methods like KnormPress can boost throughput by 125%, but pushing them to 80% compression causes a complete collapse in reasoning. The outputs become unusable.

On the other hand, NVIDIA’s DMS (Dynamic Memory Sparsification) shows that real compression without losing accuracy needs a different approach. By retraining the Qwen 3-8B model to work with sparse memory, DMS achieves a 376% increase in throughput and reduces the logical cache by 8x, while keeping outputs nearly as coherent as the dense baseline.

This performance comes with a cost. DMS increases peak VRAM by 63% because of pre-allocated memory pools and doubles the Time-to-First-Token (TTFT) latency. It’s best suited for high-throughput, long-context scenarios where raw speed matters more than initial response time or VRAM limits.

In the end, scaling LLMs is about balancing trade-offs. For moderate optimization (20–40% compression), KVPress heuristics give a safe performance boost. But for extreme, production-grade compression, only retrofitted models like DMS can maintain coherence and performance.

Beyond Benchmark Performance: Should This Be Used in Production?

The numbers we have seen are impressive: a 376% increase in decoding throughput translates directly into lower inference costs. But does this mean every developer should immediately implement 8x KV cache compression in their production environments?

The short answer is: No, not everyone.

Like all extreme optimization techniques, Dynamic Memory Sparsification (DMS) and KVPress come with specific trade-offs. Here is what you need to consider before integrating this into your stack:

General Limitations of the Technique

  1. The TTFT Penalty: If your application requires instantaneous response times (like real-time voice bots), DMS might not be suitable. As our benchmarks showed, the Time-to-First-Token (TTFT) more than doubled because the model has to mathematically score and sort every token during the prefill phase.
  2. Peak VRAM Actually Increases: If you are running on consumer GPUs or are severely VRAM-constrained, DMS is not the answer. To achieve lightning-fast decoding, DMS relies on pre-allocated memory pools (PagedAttention). While the logical cache is tiny, the physical VRAM footprint reserved on the GPU spikes significantly.
  3. The Retrofitting Barrier: To get extreme compression (80%+) without output collapse, you cannot just plug-and-play any open-source model. You must use specifically retrofitted checkpoints (like the NVIDIA DMS weights) or invest the compute to train the DMS eviction policy into your own fine-tuned models.

Limitations of Our Experiment

While our benchmarking on Hyperstack clearly demonstrates the architectural trade-offs between dense, training-free, and retrofitted compression, it is important to acknowledge the boundaries of our test:

  • This was a qualitative sanity check, not a detailed test: We relied on a 10-prompt qualitative review to see if the model's reasoning collapsed. We did not run the models through full academic benchmarks (like MATH-500 or GPQA) to measure the exact percentage drop in factual accuracy at 8x compression.
  • Narrow hardware/workload scope: We tested a single extreme compression ratio (80%) at a single, heavy batch size (64) on an 8B model. Results will scale differently depending on your specific hardware, smaller batch sizes, or massive 70B+ models.

The Bottom Line: Who Should Use This?

You should use training-free KVPress (Knorm, SnapKV) if: You want a free 10% to 30% performance boost on standard Hugging Face models without changing your weights, and your context windows are moderately sized.

You should invest in DMS if: You are operating at enterprise scale, handling massive parallel batches, processing massive context windows (like RAG over entire books), and care far more about raw decoding throughput (tokens per second) than you do about initial prompt latency. For these heavy-duty workloads, running DMS on high-bandwidth infrastructure like Hyperstack’s H100s will yield a massive return on investment.

FAQ

Does extreme compression affect model accuracy?

It depends entirely on the method. As our sanity check shows, aggressive training-free compression (like KnormPress at 80%) destroys model accuracy, leading to hallucinations and gibberish. However, retrofitted models like DMS are specifically trained to handle sparse memory, allowing them to maintain baseline-level accuracy even at 8x compression ratios.

Why does the DMS model consume more VRAM despite compressing the cache?

This is an architectural trade-off. To reach a 376% throughput increase, DMS uses a PagedAttention-style memory pool. It pre-allocates large VRAM blocks so tokens can be routed and removed very quickly without slow dynamic memory operations. Even though the logical cache shrinks (5.96 GB versus the 9.00 GB baseline in our run), the physical VRAM reserved stays high (over 50 GB).

When should I use KVPress heuristics vs. DMS?

  • Use KVPress (Knorm/SnapKV) for easy optimization on any standard model. It is ideal for moderate compression (20-40%) where you want free speedups without changing your model weights.
  • Use DMS for dedicated, high-load production environments where you need maximum possible throughput on long-context tasks and are willing to use a specific, retrofitted model checkpoint.

Can I train my own model with DMS?

Yes. While we used the pre-trained Qwen3-8B-DMS-8x checkpoint, the methodology is open. You can apply the DMS retrofitting process to your own custom LLMs using the training recipes provided in NVIDIA's research, allowing you to create sparse variants of Llama 3, Mistral, or other architectures.

Does this work on consumer GPUs?

While kvpress works on consumer hardware, the DMS architecture is memory-hungry due to its pre-allocation strategy. As seen in our benchmarks, the 8B model spiked to over 50GB of VRAM. For testing DMS specifically, we recommend enterprise-grade GPUs like the NVIDIA A100 or H100 (80GB) available on Hyperstack to avoid Out-Of-Memory (OOM) errors.

Fareed Khan

23 Feb 2026

How to Deploy and Use Qwen3 (Complete Setup Guide)


What is Qwen3-Coder-Next?

Qwen3-Coder-Next is the latest open-weight language model from the Qwen Team, built specifically for coding agents and local development. Unlike traditional dense models, it utilizes a Mixture-of-Experts (MoE) architecture. While it boasts a massive 80B total parameter count, it only activates 3B parameters per token. This unique design allows it to deliver performance comparable to models with 10–20x more active parameters while remaining highly efficient for inference and agent deployment.

Qwen3-Coder-Next Features

The latest version of Qwen3-Coder-Next comes with significant enhancements, including:

  • Super Efficient MoE Architecture: With only 3B activated parameters (out of 80B total), it offers exceptional speed and cost-effectiveness without sacrificing deep reasoning capabilities.
  • Advanced Agentic Capabilities: Excelling at long-horizon reasoning and complex tool usage, it is designed to recover from execution failures and handle dynamic coding tasks robustly.
  • Massive Context Integration: Features a 256k context window and adaptability to scaffolding frameworks like Claude Code, Cline, and Trae, ensuring seamless integration into modern IDEs.
  • High-Performance Tool Use: Optimized for tool calling with specific parsers, making it an ideal engine for autonomous software engineering agents.

How to Deploy Qwen3-Coder-Next on Hyperstack

Now, let's walk through the step-by-step process of deploying the necessary infrastructure.

Step 1: Accessing Hyperstack

First, you'll need an account on Hyperstack.

  • Go to the Hyperstack website and log in.
  • If you are new, create an account and set up your billing information. Our documentation can guide you through the initial setup.

Step 2: Deploying a New Virtual Machine

From the Hyperstack dashboard, we will launch a new GPU-powered VM.

  • Initiate Deployment: Look for the "Deploy New Virtual Machine" button on the dashboard and click it.

  • Select Hardware Configuration: For Qwen3-Coder-Next, efficient inference with tensor parallelism is key. Choose the "4xH100-80G-PCIe" flavour to ensure sufficient VRAM and memory bandwidth.

  • Choose the Operating System: Select the "Ubuntu Server 22.04 LTS R535 CUDA 12.2 with Docker" image. This provides a ready-to-use environment with all necessary drivers.

  • Select a Keypair: Choose an existing SSH keypair from your account to securely access the VM.
  • Network Configuration: Ensure you assign a Public IP to your Virtual Machine. This is crucial for remote management and connecting your local development tools.
  • Review and Deploy: Double-check your settings and click the "Deploy" button.

Step 3: Accessing Your VM

Once your VM is running, you can connect to it.

  1. Locate SSH Details: In the Hyperstack dashboard, find your VM's details and copy its Public IP address.

  2. Connect via SSH: Open a terminal on your local machine and use the following command, replacing the placeholders with your information.

    # Connect to your VM using your private key and the VM's public IP
    ssh -i [path_to_your_ssh_key] ubuntu@[your_vm_public_ip]

Here you will replace [path_to_your_ssh_key] with the path to your private SSH key file and [your_vm_public_ip] with the actual IP address of your VM.
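Optionally, you can save these details in your SSH config so you don't have to retype them; a sketch (the host alias and key path below are placeholders you would adjust):

```
# ~/.ssh/config
Host hyperstack-vm
    HostName <your_vm_public_ip>
    User ubuntu
    IdentityFile ~/.ssh/your_private_key
```

After that, `ssh hyperstack-vm` is enough to connect.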

Once connected, you should see a welcome message indicating you're logged into your Hyperstack VM.

Now that we are inside the VM, we will use Docker to launch the vLLM server.

Step 4: Create a Model Cache Directory

We'll create a directory on the VM's high-speed ephemeral disk. Storing the model here ensures faster loading times on startup.

# Create a directory for the Hugging Face model cache
sudo mkdir -p /ephemeral/hug

# Grant full read/write permissions to the directory
sudo chmod -R 0777 /ephemeral/hug

This command creates a folder named hug inside the /ephemeral disk and sets its permissions so that the Docker container can read and write the model files.

Step 5: Launch the vLLM Server

We will use the vllm-openai Docker image. Note that we are using specific flags like --tool-call-parser to enable the advanced agentic features of Qwen3.

# Pull the latest vLLM image
docker pull vllm/vllm-openai:latest

# Launch the vLLM container for Qwen3-Coder-Next
docker run -d \
  --gpus all \
  --ipc=host \
  --network host \
  --name vllm_qwen3 \
  --restart always \
  -v /ephemeral/hug:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-Coder-Next \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000

This command instructs Docker to:

  • --gpus all: Use all available NVIDIA GPUs.
  • --network host: Expose the container's ports directly on the VM's network for simpler access.
  • -v /ephemeral/hug:/root/.cache/huggingface: Mount our cache directory to persist the downloaded model.
  • --model ...: Specifies the Qwen/Qwen3-Coder-Next model.
  • --tensor-parallel-size 4: Splits the model across 4 GPUs for optimal distribution.
  • --enable-auto-tool-choice: Allows the model to decide when to call tools.
  • --tool-call-parser qwen3_coder: Uses the specific parser required for Qwen3's agentic functions.
  • --gpu-memory-utilization 0.90: Allocates 90% of VRAM to the model weights and KV cache.
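To see why --tensor-parallel-size 4 is a sensible choice, here is a back-of-envelope VRAM estimate. This is a sketch under stated assumptions: bf16 weights (2 bytes per parameter) for the model's 80B total parameters, ignoring the KV cache and activations that --gpu-memory-utilization also reserves room for.

```python
# Rough VRAM estimate for Qwen3-Coder-Next (80B total parameters).
# Assumes bf16 weights (2 bytes/param); KV cache and activations need extra headroom.
total_params = 80e9
bytes_per_param = 2  # bf16

weights_gb = total_params * bytes_per_param / 1e9  # total weight memory in GB
tensor_parallel = 4
per_gpu_gb = weights_gb / tensor_parallel  # weight memory per GPU

print(f"{weights_gb:.0f} GB total weights, {per_gpu_gb:.0f} GB per GPU")
```

At roughly 40 GB of weights per GPU, each 80 GB H100 retains ample space for the KV cache, which is why the model is split across 4 GPUs rather than squeezed onto fewer.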

Step 6: Verify the Deployment

First, check the container logs to monitor the model loading process. This may take several minutes.

docker logs -f vllm_qwen3

The process is complete when you see the line: INFO: Uvicorn running on http://0.0.0.0:8000.

Next, add a firewall rule in your Hyperstack dashboard to allow inbound TCP traffic on port 8000. This is essential for external access.

Finally, test the API from your local machine (not the VM) by replacing the IP address with your VM's IP address.

# Test the API endpoint from your local terminal
curl http://<YOUR_VM_PUBLIC_IP>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "Qwen/Qwen3-Coder-Next",
"messages": [
{"role": "user", "content": "Hello!"}
],
"max_tokens": 200
}'

A successful request returns a JSON object containing the model's reply:

{
"id": "chatcmpl-98f4b9baaeb1613d",
"object": "chat.completion",
"created": 1770201723,
"model": "Qwen/Qwen3-Coder-Next",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hi there! 😊 How can I help you today?",
...
},
"finish_reason": "stop"
}
],
...
}
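The same request can be issued from Python. Below is a minimal sketch using only the standard library; the live call is commented out because it requires your running endpoint, and the IP placeholder must be replaced just as in the curl example.

```python
import json
import urllib.request

VM_IP = "<YOUR_VM_PUBLIC_IP>"  # placeholder: replace with your VM's public IP

# Same payload as the curl test above
payload = {
    "model": "Qwen/Qwen3-Coder-Next",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 200,
}

req = urllib.request.Request(
    f"http://{VM_IP}:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json", "Authorization": "Bearer EMPTY"},
)

# Uncomment once the server is reachable from your machine:
# with urllib.request.urlopen(req) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```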
💡

Note that to achieve optimal performance, the Qwen team recommends the following sampling parameters:

# Recommended sampling configuration
temperature = 1.0
top_p = 0.95
top_k = 40
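These parameters can be forwarded through the OpenAI-compatible API. A sketch of the call keyword arguments follows; note that top_k is not part of the standard OpenAI request schema, so vLLM accepts it via the client's extra_body field.

```python
# Recommended Qwen3-Coder-Next sampling parameters, shaped for the OpenAI client.
# top_k is a vLLM extension, so it travels in extra_body rather than as a
# top-level argument.
sampling_kwargs = {
    "temperature": 1.0,
    "top_p": 0.95,
    "extra_body": {"top_k": 40},
}

# Hypothetical usage with a configured `client` and `messages`:
# client.chat.completions.create(model="Qwen/Qwen3-Coder-Next",
#                                messages=messages, **sampling_kwargs)
```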

The model responds correctly to our query, which confirms that Qwen/Qwen3-Coder-Next is successfully deployed on Hyperstack.

Step 7: Hibernating Your VM (OPTIONAL)

When you are finished with your current workload, you can hibernate your VM to avoid incurring unnecessary costs:

  • In the Hyperstack dashboard, locate your Virtual machine.
  • Look for a "Hibernate" option.
  • Click to hibernate the VM, which will stop billing for compute resources while preserving your setup.

Agentic Coding with Qwen3-Coder-Next

Qwen3 supports tool use, allowing it to call user-defined functions during chat interactions. We will be working with OpenAI's Python client to demonstrate how to set up and use tool calls with the Qwen3-Coder-Next model. Let's install the OpenAI Python client first if you haven't done so already.

pip3 install openai

Next, we have to define a function and describe it to the model. In this example, we define a simple function that squares a number.

# Define a function that takes a number as input and returns its square
def square_the_number(num: float) -> float:
    return num ** 2  # Return the square of the input number

This function is simply squaring the input number. Next, we describe this function to the model so it can use it when needed.

# Define a list of tools for the LLM to use
tools = [
    {
        "type": "function",  # Specify the tool type as a function
        "function": {
            "name": "square_the_number",  # Name of the function
            "description": "Output the square of the number.",  # Description for the LLM
            "parameters": {
                "type": "object",  # Parameters are defined as an object
                "required": ["input_num"],  # 'input_num' is required
                "properties": {
                    "input_num": {
                        "type": "number",  # Type of the parameter is number
                        "description": "input_num is a number that will be squared"
                    }
                },
            }
        }
    }
]

The tool definition is a list containing a single dictionary that describes the function, its parameters, and their types. Let's understand what each part means:

    • type: function indicates that this tool is a function.
    • name: The name of the function that the model can call.
    • description: Provides a brief explanation of what the function does, as it will be shown to the model.
    • parameters: Defines the expected input for the function, including required fields and their types.
    • properties: Describes the individual parameters, including their types and descriptions.
    • input_num: Defined as a number that the model will provide when calling the function.
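To make the contract between the schema and the model's output concrete, here is a minimal sketch of validating a tool call's JSON arguments against the parameter spec above. The check_args helper is hypothetical, not part of any library; it only verifies that required fields are present.

```python
import json

# The parameter spec from the tool definition above (assumed)
params_spec = {
    "type": "object",
    "required": ["input_num"],
    "properties": {
        "input_num": {"type": "number",
                      "description": "input_num is a number that will be squared"},
    },
}

def check_args(spec: dict, raw_arguments: str) -> dict:
    """Minimal check that a tool call's JSON arguments satisfy the spec."""
    args = json.loads(raw_arguments)
    missing = [k for k in spec["required"] if k not in args]
    if missing:
        raise ValueError(f"missing required argument(s): {missing}")
    return args

# Arguments shaped like the JSON string a model tool call returns
print(check_args(params_spec, '{"input_num": 1024}'))
```

This kind of guard is useful in production, since the model's arguments arrive as a raw JSON string and are not validated by the server.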

Now, we can set up the OpenAI client to interact with the local endpoint and use the defined tool.

from openai import OpenAI  # Import the OpenAI client

# Initialize the OpenAI client with a custom endpoint and API key
client = OpenAI(
    base_url="http://localhost:8000/v1",  # Set the base URL for the API
    api_key="EMPTY"  # Use an empty API key (for local endpoints)
)

We are setting the base_url to point to our local server where Qwen3 is running. The api_key is set to "EMPTY" since we are using a local instance that does not require authentication.

Let's prepare a message to send to the model, asking it to square the number 1024.

# Define the messages to send to the LLM
messages = [{'role': 'user', 'content': 'square the number 1024'}]

# Create a chat completion request to the LLM
completion = client.chat.completions.create(
    messages=messages,  # Pass the user messages
    model="Qwen/Qwen3-Coder-Next",  # Specify the model to use
    tools=tools,  # Pass the defined tools
)

# Print the first choice from the completion response
print(completion.choices[0])

We get the following output:

Choice(
    finish_reason='tool_calls',
    index=0,
    logprobs=None,
    message=ChatCompletionMessage(
        content=None,
        refusal=None,
        role='assistant',
        annotations=None,
        audio=None,
        function_call=None,
        tool_calls=[
            ChatCompletionMessageFunctionToolCall(
                id='chatcmpl-tool-9ef4c45bc6375d0b',
                function=Function(
                    arguments='{"input_num": 1024}',
                    name='square_the_number'
                ),
                type='function'
            )
        ],
        reasoning=None,
        reasoning_content=None
    ),
    stop_reason=None,
    token_ids=None
)

You can see that after sending the request, the model decides to call the function. We can then extract the tool call and execute the function:

import json

# Extract the tool call from the completion response
tool_call = completion.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)

# Execute the function with the provided argument
result = square_the_number(num=args['input_num'])

# Print the result of the function execution
print(f"Result of squaring the number: {result}")

# It prints -> Result of squaring the number: 1048576

This confirms that our model successfully used the defined function to square the number 1024, resulting in 1048576.
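In a full agentic loop, the tool result is then fed back to the model so it can produce a final natural-language answer. A sketch of how the chat history is extended follows; the append_tool_result helper and the "call_1" id are illustrative, while the message shapes match the OpenAI chat format (the assistant's tool-call message followed by a "tool" role message carrying the result).

```python
def append_tool_result(messages, assistant_message, tool_call_id, result):
    """Extend the chat history with the assistant's tool call and its result."""
    return messages + [
        assistant_message,  # the assistant message containing the tool call
        {"role": "tool", "tool_call_id": tool_call_id, "content": str(result)},
    ]

# Illustrative history, shaped like the request/response from the steps above
history = append_tool_result(
    [{"role": "user", "content": "square the number 1024"}],
    {"role": "assistant", "tool_calls": [{
        "id": "call_1",  # hypothetical tool-call id
        "type": "function",
        "function": {"name": "square_the_number",
                     "arguments": '{"input_num": 1024}'},
    }]},
    "call_1",
    1048576,
)
print(history[-1])
```

Passing history back to client.chat.completions.create would let the model summarise the result for the user, completing the tool-use round trip.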

Why Deploy Qwen3-Coder-Next on Hyperstack?

Hyperstack is a cloud platform designed to accelerate AI and machine learning workloads. Here's why it's an excellent choice for deploying Qwen3-Coder-Next:

  • Availability: Hyperstack provides access to the latest and most powerful GPUs such as the NVIDIA H100 on-demand, specifically designed to handle large language models. 
  • Ease of Deployment: With pre-configured environments and one-click deployments, setting up complex AI models becomes significantly simpler on our platform. 
  • Scalability: You can easily scale your resources up or down based on your computational needs.
  • Cost-Effectiveness: You pay only for the resources you use with our cost-effective cloud GPU pricing.
  • Integration Capabilities: Hyperstack provides easy integration with popular AI frameworks and tools.

FAQs

What is Qwen3-Coder-Next?

Qwen3-Coder-Next is an open-weight model from the Qwen Team, designed for agentic coding and local development. It utilizes a Mixture-of-Experts (MoE) architecture to achieve high performance with high efficiency.

What is the context window of Qwen3-Coder-Next?

The model supports a massive 256k context window natively, allowing it to process entire repositories, long files, and complex multi-file edits without losing context.

Does Qwen3-Coder-Next support "thinking" mode?

No, Qwen3-Coder-Next supports only non-thinking mode and does not generate <think></think> blocks. It is optimized for direct instruction following and tool use.

What hardware is required for Qwen3-Coder-Next?

While efficient (3B active params), the model has 80B total parameters. It requires significant VRAM to load the full weights, making the 4xH100 configuration on Hyperstack an ideal choice.

What are the main use cases for this model?

Qwen3-Coder-Next is perfectly suited for building autonomous coding agents, integrating with IDEs like Claude Code or Cline, and performing complex software engineering tasks locally or in the cloud.

Fareed Khan

4 Feb 2026