
Published: December 2, 2025

5 min read

Updated on 17 Mar 2026

Run DeepSeek OCR on Hyperstack with your Own UI

Written by

Hitesh Kumar


Key Takeaways

  • DeepSeek-OCR is a multimodal OCR model designed to extract both text and document structure from images and PDFs.

  • The setup uses a Hyperstack GPU virtual machine to run DeepSeek-OCR in a private, high-performance environment.

  • The model combines a vision encoder and a language decoder to handle complex layouts such as tables and multi-column documents.

  • Deployment involves cloning the DeepSeek-OCR repository, installing Python dependencies, and configuring the runtime environment.

  • A Gradio-based web interface allows users to upload documents and view OCR results in structured Markdown output.

  • The deployed OCR service can be extended into APIs or integrated into document processing and RAG workflows.

Take Control of Your Own OCR Workflow with DeepSeek-OCR and Hyperstack

Optical Character Recognition (OCR) is the process of recognising and extracting text from a source such as an image or PDF using only visual information. It's what we do when we read!

Methods for performing OCR have existed for a while, but in the past few years (or even months), transformer-based models have become remarkably competent at it. DeepSeek, one of the world's leading AI foundation model labs, has released DeepSeek-OCR, a 3B-parameter model for quickly and easily creating your own OCR workflows.


Why is it harder to run than other DeepSeek models?

You might be used to running other AI models, like DeepSeek's LLMs, which are often available via a simple API call or a straightforward Python library like transformers. We've even made tutorials in the past that you can follow to get DeepSeek V3 running. DeepSeek-OCR is a bit more hands-on because it's not just a language model; it's a specialised multi-modal system.

It essentially has two parts: a sophisticated vision encoder that sees and understands the layout of a page (just like our eyes), and a 3-billion-parameter language decoder that reads and interprets the text from that visual information. This two-stage process is what makes it so powerful, but it also requires a more complex stack of software to run efficiently.

The setup in this guide uses vLLM, a high-throughput serving engine, to get the best possible performance. This is what adds most of the setup steps - we need to install a particular version of it along with dependencies like flash-attn. It's this requirement for a high-performance, GPU-accelerated serving environment that makes it more complex than a simple pip install package, but the payoff in speed and accuracy is well worth it.

How good is DeepSeek-OCR? 

In short: it's exceptionally good. It represents the current state-of-the-art for open-source OCR in its size group, especially when it comes to understanding real-world, complex documents.

Where traditional OCR tools might just extract a "wall of text" that loses all formatting, DeepSeek-OCR understands the structure of the document. This is its key advantage. It excels at:

  • Complex Layouts: Accurately reading multi-column articles, magazine pages, and scientific papers.

  • Tables: It doesn't just see text in a table; it understands the table's rows and columns and formats the output (as markdown) to match.

  • Mixed Content: It's highly adept at handling pages with a mix of text, code blocks, and even mathematical equations.

Because it outputs structured markdown, you're not just getting the raw text; you're getting the document's semantic structure. This makes its output immediately useful for feeding into other systems, like a RAG pipeline or a summarisation model. For its 3B-parameter size, it hits a perfect sweet spot of being incredibly accurate while still being fast enough to interpret huge documents on a single H100 GPU.
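To make the "immediately useful for other systems" point concrete, here is a minimal sketch of turning a table from the OCR output into rows a program can work with. It assumes the table arrives as a pipe-delimited markdown table; if your output contains HTML tables instead, you would need a different parser. The function name parse_markdown_table and the sample text are illustrative, not part of DeepSeek-OCR.

```python
def _is_separator(cells: list[str]) -> bool:
    """True for the |---|:---:| row that separates header and body."""
    return all(cell and set(cell) <= set("-:") for cell in cells)


def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse the first pipe-delimited markdown table found in `markdown`."""
    rows: list[list[str]] = []
    for line in markdown.splitlines():
        line = line.strip()
        if line.startswith("|") and line.endswith("|"):
            rows.append([cell.strip() for cell in line.strip("|").split("|")])
        elif rows:
            break  # first non-table line after the table ends it

    rows = [cells for cells in rows if not _is_separator(cells)]
    if len(rows) < 2:
        return []  # no header + body found
    header, body = rows[0], rows[1:]
    return [dict(zip(header, cells)) for cells in body]


# Hypothetical OCR output, used only to illustrate the shape of the data.
sample = """Invoice summary

| Item | Qty |
| --- | --- |
| GPU | 8 |
| VM | 1 |

Thank you."""

records = parse_markdown_table(sample)
print(records)
```

Each row becomes a dictionary keyed by the table header, which is a convenient shape for inserting into a database or spreadsheet.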

How to set up DeepSeek-OCR on your own Hyperstack VM, step-by-step

We'll take you through the whole process from start to end to get a really simple and basic OCR workflow running on your own Hyperstack VM. 

Step 0: Getting a Hyperstack VM

This guide assumes you've just spun up a new Linux VM on our platform and can access it via SSH. If you haven't done this before, please see our getting started guide in our documentation.

Step 1: Clone the DeepSeek-OCR repo 

# Clone the DeepSeek-OCR repository
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git

Step 2: Install UV (the package manager):

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

Step 3: Create a Python virtual environment:

uv venv deepseek-ocr --python 3.12.9
source deepseek-ocr/bin/activate

Step 4: Install vLLM and other requirements

cd DeepSeek-OCR

# Get vllm whl
wget https://github.com/vllm-project/vllm/releases/download/v0.8.5/vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
unzip vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl -d vllm-0.8.5+cu118-whl

# Install requirements
uv pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
uv pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
uv pip install -r requirements.txt
uv pip install flash-attn==2.7.3 --no-build-isolation
uv pip install uvicorn fastapi gradio --upgrade
uv pip install transformers==4.57.1 --upgrade

This step may take a while; there are a lot of dependencies!
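Because the install pins several exact versions, it can be worth confirming what actually landed in the virtual environment before moving on. The sketch below is one way to do that with the standard library; PINS and check_pins are illustrative names, and the pins simply mirror the versions in the commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions taken from the install commands in this step.
PINS = {
    "torch": "2.6.0",
    "torchvision": "0.21.0",
    "torchaudio": "2.6.0",
    "vllm": "0.8.5",
    "flash-attn": "2.7.3",
    "transformers": "4.57.1",
}


def check_pins(pins: dict[str, str], lookup=version) -> dict[str, str]:
    """Return {package: problem} for every pin that is not satisfied."""
    problems: dict[str, str] = {}
    for name, wanted in pins.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        # Ignore local build tags such as "+cu118" when comparing.
        if found.split("+")[0] != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems


# Inside the activated environment you could run:
#   print(check_pins(PINS) or "all pinned packages match")
```

An empty result means every pinned package is present at the expected version.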

Step 5: Download the Python code

main.py 

This is a standalone Python file that sets up the web server and hosts it on your VM. We recommend you have a quick read through before you attempt to run it, just to familiarise yourself with what it does (more on this later).

Step 6: Get the code into your VM:

# Create the "web" dir and put main.py in there
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
mkdir -p web

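# Quote the delimiter ('EOF') so the shell does not expand $variables or backticks in the Python code
cat <<'EOF' > web/main.py
<paste the contents of main.py here>
EOF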

Alternatively, you can use an editor like nano or vim, or connect to the VM over SSH from a more interactive tool such as VS Code to make this part easier.

Step 7: Start the server and access via your browser

# Start the server
uvicorn web.main:app --host 0.0.0.0 --port 3000

You should now be able to navigate to the UI by going to http://<your-VMs-ip>:3000, and interact with the UI! 

NOTE: Remember to open port 3000 for inbound TCP traffic via your VM's firewall on Hyperstack! For more info on this, see our documentation.
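If the page does not load, a quick way to tell a firewall problem from a server problem is to test whether the port accepts TCP connections at all. The following is a small sketch using only the standard library; port_is_open is an illustrative helper name, and you would pass your own VM's IP address.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# From your local machine, with your VM's public IP filled in:
#   print(port_is_open("<your-VMs-ip>", 3000))
```

If this returns False from your local machine while the same check succeeds on the VM itself (using 127.0.0.1), the firewall rule is the likely culprit.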

Once loaded, it should look something like this:

[Screenshot: the DeepSeek-OCR web UI after starting the server]

In this simple, barebones UI, you can upload PDFs or images and DeepSeek-OCR will automatically run on them.

The results will be visible in the lower box, with the option to see (and download) the labelled input and the extracted text in markdown format. 

To re-run, simply delete the existing input and upload something new!

Here's an example of the output DeepSeek-OCR produces for a sample PDF article:

[Screenshot: DeepSeek-OCR output for a sample PDF article]

Troubleshooting

As stated, this is a very minimal, quickly-put-together UI. It is not maintained or updated by Hyperstack, and it is certainly not bug-free! However, feel free to modify the code in main.py to solve any issues or add any features you like.

One bug we are aware of from our early testing is the UI's inability to replace old inputs when new ones are uploaded. If this happens, press Ctrl+C to terminate the server and re-run the same uvicorn command. After you reload the web page, you will get a fresh instance of the UI without the issue.

What's Next?

Congratulations! You've now got your own private, high-performance OCR server running. This Gradio UI is a fantastic sandbox for testing, but the real power comes from what you can build on top of it.

The most logical next step is to adapt the web/main.py file. Instead of launching a Gradio UI, you could modify it to create a simple, robust REST API endpoint using FastAPI. Imagine an endpoint where you can POST an image or PDF file and get a clean JSON response containing the extracted markdown.
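To picture what calling such an endpoint could look like from the client side, here is a hedged sketch using only the Python standard library. Everything about the API is an assumption for illustration: the /ocr path, the "file" form field, and a JSON response with a "markdown" key are hypothetical choices you would define yourself when adapting web/main.py.

```python
import json
import urllib.request
import uuid


def build_multipart(field: str, filename: str, payload: bytes,
                    content_type: str = "application/octet-stream") -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; returns (body, Content-Type)."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        payload,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return body, f"multipart/form-data; boundary={boundary}"


def ocr_request(base_url: str, filename: str, payload: bytes) -> urllib.request.Request:
    """Build a POST request for the hypothetical /ocr endpoint."""
    body, content_type = build_multipart("file", filename, payload)
    return urllib.request.Request(
        f"{base_url}/ocr",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


def extract_markdown(raw_response: bytes) -> str:
    """Pull the markdown out of the assumed {"markdown": "..."} response."""
    return json.loads(raw_response)["markdown"]


# Sending it (once your endpoint exists) would look like:
#   with urllib.request.urlopen(ocr_request(base_url, "scan.pdf", data)) as resp:
#       print(extract_markdown(resp.read()))
```

Keeping request construction separate from sending makes the client easy to test without a running server.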

Once you have that API, the possibilities are endless:

  • Build a RAG Pipeline: This is the big one. You can now programmatically feed your entire library of PDFs and documents through this API, storing the clean markdown output in a vector database.

  • Create a "Chat with your Docs" App: Combine your new OCR API with a conversational LLM (like DeepSeek-LLM) to build a powerful application that lets you ask questions about your documents.

  • Automate Data Entry: Create a workflow that watches a specific folder or email inbox, runs any new attachments through your OCR API, and then parses the structured output to populate a database or spreadsheet.
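As a rough illustration of the RAG idea above, markdown output lends itself to splitting on headings before embedding. The sketch below is one simple chunking strategy, not a recommendation from DeepSeek or Hyperstack; chunk_by_heading and the max_chars limit are illustrative, and the embedding and vector-database steps are left out.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)")


def _emit(chunks: list[dict[str, str]], title: str, lines: list[str], max_chars: int) -> None:
    """Append the buffered section to `chunks`, splitting it if it is too long."""
    text = "\n".join(lines).strip()
    for start in range(0, len(text), max_chars):
        chunks.append({"title": title, "text": text[start:start + max_chars]})


def chunk_by_heading(markdown: str, max_chars: int = 2000) -> list[dict[str, str]]:
    """Split markdown into chunks, one per heading section (further split by size)."""
    chunks: list[dict[str, str]] = []
    title, buffer = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            _emit(chunks, title, buffer, max_chars)
            title, buffer = match.group(1).strip(), []
        else:
            buffer.append(line)
    _emit(chunks, title, buffer, max_chars)
    return chunks


# Hypothetical OCR output, used only to show the resulting structure.
doc = "# Intro\nHello world\n\n## Results\n| a | b |\n"
print(chunk_by_heading(doc))
```

Each chunk keeps its section title, which can be stored as metadata next to the embedding so retrieved passages retain their context.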

You've done the hard part by setting up the core engine. Now you can use your Hyperstack VM as a stable, private microservice to power all kinds of intelligent document-processing workflows.

Launch Your VM today and Get Started with Hyperstack!

FAQs

What type of model is DeepSeek-OCR?

DeepSeek-OCR is a multimodal model combining vision and language understanding, designed to extract text and structure from documents efficiently.

What format does DeepSeek-OCR output?

It outputs structured markdown that preserves tables, layout, and semantic information, making it ready for downstream processing or RAG pipelines.

Which engine is used for high-throughput serving?

vLLM is used as a high-throughput serving engine, optimised for GPU acceleration to deliver fast, efficient OCR performance.

Which package manager is required for setup?

The setup requires UV, a modern package manager, to create virtual environments and install all dependencies reliably on Hyperstack.


Deploy the World's First Open-Source 100B Medical LLM on GPU ...


What is AntAngelMed?

AntAngelMed is the world's first open-source 100B-parameter medical large language model, jointly developed by the Health Information Center of Zhejiang Province, Ant Healthcare, and Zhejiang Anzhen'er Medical AI. Built on the Ling-flash-2.0 Mixture-of-Experts architecture, it houses 100B total parameters while activating only 6.1B parameters per token, allowing it to match the performance of dense models several times its active size while delivering inference speeds exceeding 200 tokens per second on H20-class hardware. With a 128K context window, clinical-grade safety alignment via GRPO reinforcement learning, and #1 rankings on HealthBench (open-source category) and the MedBench leaderboard, AntAngelMed sets a new bar for what an openly available medical AI can do.

A major reason AntAngelMed performs so strongly on clinical benchmarks is its rigorous three-stage training pipeline combined with a highly efficient sparse MoE architecture. Here is how the design works:

  1. Sparse MoE Architecture: Built on Ling-flash-2.0's 1/32 activation-ratio Mixture-of-Experts design. Only ~6.1B of the model's 100B parameters activate per token, delivering large-model intelligence at small-model speeds.
  2. Three-Stage Medical Training: Continual pre-training on large-scale medical corpora (encyclopaedias, peer-reviewed research, clinical text), followed by supervised fine-tuning on multi-source medical instructions, and finally GRPO-based reinforcement learning that shapes empathy, safety boundaries, and evidence-based reasoning.
  3. Advanced MoE Optimisations: Expert granularity tuning, shared-expert ratios, no auxiliary loss with sigmoid routing, Multi-Token Prediction (MTP) layers, QK-Norm, and Partial-RoPE all combine to push small-activation MoE up to 7x more efficient than a dense model of comparable size.
  4. Extended Context via YaRN: Native context support extends to 128K tokens through YaRN extrapolation — essential for ingesting long patient histories, multi-document literature reviews, or extended diagnostic conversations.
  5. Clinical Safety Alignment: GRPO reinforcement learning with task-specific reward models explicitly optimises for empathy, structural clarity, safety boundaries, and reduced hallucinations on complex medical cases.
  6. FP8 + EAGLE3 Inference Acceleration: Optional FP8 quantisation paired with EAGLE3 speculative decoding delivers throughput gains of 45–94% over standard FP8 across reasoning workloads, with no measurable loss in stability.

AntAngelMed Features

AntAngelMed is more than a general chat model fine-tuned on medical data — it has been engineered end-to-end for clinical reasoning. Key features include:

Production Throughput Gains with FP8 + EAGLE3

Inference throughput improvement of FP8 + EAGLE3 speculative decoding over standard FP8 alone, measured at a concurrency of 32 on reasoning workloads.

  • HumanEval (code reasoning): +71%
  • GSM8K (math word problems): +45%
  • Math-500 (advanced maths): +94%

Throughput improvement of the FP8 + EAGLE3 configuration relative to the FP8-only baseline, as reported by the AntAngelMed authors on the Hugging Face model card under a concurrency of 32.

  • Leading Open-Source Medical Performance: Ranks first among open-source models on OpenAI's HealthBench (with a particularly strong lead on HealthBench-Hard) and #1 overall on the MedBench leaderboard across knowledge, reasoning, language, and safety.
  • 6.1B Active / 100B Total Parameters: Matches the performance of ~40B dense models while running roughly 3x faster, thanks to its sparse-activation MoE design.
  • Clinical-Grade Safety and Ethics: Explicitly trained to favour evidence-based reasoning, structured responses, and safety disclaimers — reducing hallucination on complex diagnostic queries.
  • 128K Context Window: Handles long-form clinical documentation, multi-report synthesis, and extended multi-turn conversations in a single context.
  • OpenAI-Compatible API: Deploys cleanly via vLLM and exposes a standard /v1/chat/completions endpoint, making integration with existing healthcare applications straightforward.
  • Bilingual Medical Fluency: Strong performance across English and Chinese medical text, including medical knowledge Q&A, language understanding, and complex clinical reasoning.

Important Considerations Before Deploying a Medical LLM

Before walking through deployment, it is worth pausing on a point that matters more for medical AI than almost any other workload: where your model runs is as important as how well it performs. Medical data is among the most tightly regulated categories of information globally, governed by frameworks such as the EU GDPR, the UK Data Protection Act, the EU AI Act (which classifies clinical-decision-support and diagnostic AI as high-risk when used in a medical-device capacity), and HIPAA in the United States.

If you intend to use AntAngelMed in any setting that touches real patient data or protected health information, the on-demand deployment shown in this tutorial is a starting point for evaluation, not a production blueprint. For clinical workloads, Hyperstack offers a deployment pattern specifically engineered for regulated industries:

  • Secure Private Cloud: Single-tenant, dedicated GPU infrastructure with isolated networking, commissioned per customer and able to be deployed in specific regions and jurisdictions — including EU/EEA jurisdictions for organisations operating under GDPR. No shared GPU memory, no shared VPCs, no noisy neighbours, and clear audit evidence (tenancy models, access logs, control mappings, data residency confirmation) that compliance and legal teams can verify. This is the recommended path for any clinical workload on a 100B-class model like AntAngelMed.

Important Notice for This Tutorial

The walkthrough below uses a standard on-demand GPU VM for demonstration purposes only. All prompts shown are synthetic, generic medical questions — no real patient data, identifiable health information, or protected records are processed at any point. The model and all generated outputs remain within the dedicated GPU VM and are not exposed externally. For any workload involving real clinical data — particularly where EU data residency or audit-grade isolation is required — please contact our team to provision a Secure Private Cloud environment configured for your compliance posture.

How to Deploy AntAngelMed on Hyperstack

With those considerations in place, let's walk through the deployment process step by step.

Step 1: Accessing Hyperstack

First, you'll need an account on Hyperstack.

  • Go to the Hyperstack website and log in.
  • If you are new, create an account and set up your billing information. Our documentation can guide you through the initial setup.

Step 2: Deploying a New Virtual Machine

From the Hyperstack dashboard, we will launch a new GPU-powered VM sized appropriately for a 100B-parameter MoE model.

  • Initiate Deployment: Click the "Deploy New Virtual Machine" button on the dashboard.


  • Select Hardware Configuration: AntAngelMed is approximately 206 GB at BF16, and with KV cache, activations, and headroom for long-context inference, an 8-GPU configuration is recommended. Choose the "8x H100-80G-PCIe" flavour. This gives 640 GB of total GPU memory — comfortable margin for tensor-parallel-size 8 inference at production speeds.


  • Select a Region: The 8x H100-80G-PCIe flavour is available in the CANADA-1 region on Hyperstack's on-demand cloud. Choose CANADA-1 for this tutorial. For production clinical workloads that require EU data residency, the appropriate path is a Secure Private Cloud deployment, which Hyperstack provisions in specific regions and jurisdictions including the EU/EEA.
  • Choose the Operating System: Select the "Ubuntu Server 22.04 LTS R535 CUDA 12.2 with Docker" image. This provides a ready-to-use environment with all NVIDIA drivers and Docker pre-installed.


  • Select a Keypair: Choose an existing SSH keypair from your account to securely access the VM.
  • Network Configuration: Ensure you assign a Public IP to your Virtual Machine for remote management.
  • Review and Deploy: Double-check your settings and click "Deploy".
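The sizing reasoning in the hardware step above can be sanity-checked with some quick arithmetic. This is a back-of-the-envelope sketch only: it uses the ~206 GB weight size and 8x 80 GB configuration from this guide, assumes the 0.90 GPU memory utilisation setting used when launching vLLM later on, and ignores activations and other runtime overheads. The function name gpu_memory_budget is illustrative.

```python
def gpu_memory_budget(weights_gb: float, num_gpus: int, gpu_gb: float,
                      utilization: float) -> dict[str, float]:
    """Rough memory budget for tensor-parallel inference (all values in GB)."""
    total = num_gpus * gpu_gb
    usable = total * utilization  # what the server is allowed to claim
    return {
        "total_gb": total,
        "usable_gb": usable,
        "weights_per_gpu_gb": weights_gb / num_gpus,  # even split under tensor parallelism
        "left_for_kv_cache_gb": usable - weights_gb,  # upper bound; overheads not subtracted
    }


budget = gpu_memory_budget(weights_gb=206, num_gpus=8, gpu_gb=80, utilization=0.90)
for key, value in budget.items():
    print(f"{key}: {value:.2f}")
```

With these inputs the weights take roughly 26 GB per GPU, leaving a few hundred GB across the node for KV cache and other overheads, which is why the 8-GPU flavour offers a comfortable margin.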

Step 3: Accessing Your VM

Once your VM is running, connect to it via SSH.

  1. Locate SSH Details: In the Hyperstack dashboard, find your VM's details and copy its Public IP address.

  2. Connect via SSH: Open a terminal on your local machine and run:

    # Connect to your VM using your private key and the VM's public IP
    ssh -i [path_to_your_ssh_key] ubuntu@[your_vm_public_ip]

Replace [path_to_your_ssh_key] with the path to your private SSH key and [your_vm_public_ip] with the IP address of your VM. Once connected, you should see a welcome message confirming you are logged in.

Step 4: Create a Model Cache Directory

AntAngelMed weighs around 206 GB. We'll cache it on the VM's high-speed ephemeral disk so that subsequent container restarts do not re-download the weights.

# Create a directory for the Hugging Face model cache
sudo mkdir -p /ephemeral/hug

# Grant full read/write permissions to the directory
sudo chmod -R 0777 /ephemeral/hug

This creates a folder named hug inside the /ephemeral disk and sets permissions so the Docker container can read and write model files.
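Before starting a download of this size, it can help to confirm that the target disk has enough free space. The sketch below uses the standard library's shutil.disk_usage; the helper name has_room and the 20% safety margin are assumptions for illustration.

```python
import shutil


def has_room(path: str, required_gb: float, margin: float = 1.2) -> bool:
    """Return True if `path` has at least required_gb * margin of free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb * margin


# On the VM, checking the cache directory against the ~206 GB of weights:
#   print(has_room("/ephemeral/hug", 206))
```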

Step 5: Launch the vLLM Server

AntAngelMed uses the custom bailing_moe architecture from Ling-flash-2.0, so we need vLLM v0.11.0 or later and the --trust-remote-code flag. We'll pull the official vLLM image and run the model with tensor parallelism across all eight H100s.

# Pull the vLLM OpenAI-compatible image (v0.11.0 recommended by the model authors)
docker pull vllm/vllm-openai:v0.11.0

# Run the container with the AntAngelMed model
docker run -d \
--gpus all \
--ipc=host \
--network host \
--name vllm_antangelmed \
-v /ephemeral/hug:/root/.cache/huggingface \
vllm/vllm-openai:v0.11.0 \
MedAIBase/AntAngelMed \
--tensor-parallel-size 8 \
--dtype bfloat16 \
--trust-remote-code \
--max-model-len 32768 \
--gpu-memory-utilization 0.90 \
--served-model-name AntAngelMed \
--host 0.0.0.0 \
--port 8000

Breakdown of the key flags:

  • --gpus all: Use all eight NVIDIA H100 GPUs on the host.
  • --ipc=host: Share the host's IPC namespace for efficient multi-GPU communication.
  • --network host: Expose the container directly on the host network for simpler API access.
  • -v /ephemeral/hug:/root/.cache/huggingface: Mount the cache directory so weights persist across container restarts.
  • MedAIBase/AntAngelMed: Load the model directly from Hugging Face.
  • --tensor-parallel-size 8: Shard the model weights across all eight GPUs.
  • --dtype bfloat16: Use BF16 precision, as recommended by the model authors for NVIDIA hardware.
  • --trust-remote-code: Required to load the custom bailing_moe modelling code.
  • --max-model-len 32768: Sets the maximum context length. Can be raised toward 128K with YaRN if your use case requires it.
  • --gpu-memory-utilization 0.90: Allocate up to 90% of GPU memory for weights and KV cache.
  • --served-model-name AntAngelMed: A clean alias used in API requests.

Step 6: Verify the Deployment

Check the container logs to monitor model loading. The first run will download ~206 GB from Hugging Face, which can take several minutes.

docker logs -f vllm_antangelmed

The server is ready once you see: INFO: Uvicorn running on http://0.0.0.0:8000.
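If you would rather not watch the logs, you can poll the server until it answers. The sketch below assumes the OpenAI-compatible /v1/models route responds with HTTP 200 once the model has loaded; http_ok and wait_until_ready are illustrative helper names, and the timeout values are arbitrary.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 1800.0,
                     interval_s: float = 10.0) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while not probe():
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
    return True


# On the VM, while the container is loading the model:
#   ready = wait_until_ready(lambda: http_ok("http://localhost:8000/v1/models"))
```

Passing the probe as a callable keeps the waiting logic independent of the HTTP check, so the same loop can be reused for other readiness conditions.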

Next, add a firewall rule in your Hyperstack dashboard to allow inbound TCP traffic on port 8000.


Now test the API from your local machine, replacing the placeholder IP with your VM's public IP.

# Test the API endpoint from your local terminal
curl http://<YOUR_VM_PUBLIC_IP>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "AntAngelMed",
"messages": [
{"role": "system", "content": "You are AntAngelMed, a helpful medical assistant."},
{"role": "user", "content": "What should I do if I have a headache?"}
],
"max_tokens": 800,
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"repetition_penalty": 1.05
}'

A successful response returns a JSON object containing the model's structured medical reply:

{
"id": "chatcmpl-7e8a3b2c1d4f5e6a",
"object": "chat.completion",
"created": 1771954823,
"model": "AntAngelMed",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "A headache is a common symptom with many possible causes, ranging from minor (such as tiredness, dehydration, or eye strain) to more serious. Here is some general guidance.\n\n**Self-care measures to try first:**\n- Rest in a quiet, dimly lit room\n- Drink water steadily, as mild dehydration is a frequent trigger\n- Apply a cool compress to the forehead or a warm one to the neck and shoulders\n- Consider an over-the-counter analgesic such as paracetamol or ibuprofen, taken according to the package instructions\n\n**When to seek medical attention:**\nContact a healthcare professional promptly if your headache...",
...
},
"finish_reason": "stop"
}
],
...
}

The response demonstrates exactly the behaviour AntAngelMed was trained for: a structured answer that opens with context, separates self-care from red-flag symptoms, and points the user toward a clinician where appropriate — the kind of safety-aware framing the GRPO stage is designed to reinforce. With this output returning cleanly, MedAIBase/AntAngelMed is successfully deployed on Hyperstack.

Recommended Sampling Parameters

The AntAngelMed authors recommend the following sampling configuration for general medical Q&A:

# Recommended sampling for medical reasoning
temperature=0.6, top_p=0.95, top_k=20,
repetition_penalty=1.05, max_tokens=16384
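One detail worth spelling out: extra_body is an argument of the OpenAI Python client, which merges its contents into the request. When you build the JSON body yourself, vLLM-specific sampling fields such as top_k and repetition_penalty sit at the top level of the payload. The sketch below builds such a payload with the recommended values; chat_payload is an illustrative helper, and max_tokens defaults to the shorter 800 used in this guide's examples rather than the authors' 16384.

```python
import json

# Sampling values recommended by the model authors (see above).
RECOMMENDED_SAMPLING = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "repetition_penalty": 1.05,
}


def chat_payload(question: str,
                 system: str = "You are AntAngelMed, a helpful medical assistant.",
                 max_tokens: int = 800,
                 **overrides) -> dict:
    """Build a /v1/chat/completions request body for raw HTTP calls."""
    payload = {
        "model": "AntAngelMed",  # served model name from the launch command
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        "max_tokens": max_tokens,
        **RECOMMENDED_SAMPLING,  # top-level fields, no extra_body wrapper
    }
    payload.update(overrides)
    return payload


payload = chat_payload("What should I do if I have a headache?")
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be serialised with json.dumps and sent with any HTTP client.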

Step 7: Hibernating Your VM (Optional)

When you are finished with your workload, hibernate the VM to avoid incurring unnecessary costs:

  • In the Hyperstack dashboard, locate your Virtual Machine.
  • Click the "Hibernate" option.
  • This stops billing for compute resources while preserving your setup, so you can resume quickly later.

Using AntAngelMed via the OpenAI-Compatible API

Now that the vLLM server is running, we can interact with AntAngelMed using the standard OpenAI Python client. First, install the library locally:

# Install the OpenAI Python client to interact with the vLLM server
pip3 install openai

Then instantiate an OpenAI-compatible client pointing at our vLLM endpoint. vLLM does not enforce an API key by default, so we pass a placeholder.

from openai import OpenAI

# Create an OpenAI-compatible client that points to the vLLM server.
# When running from your own machine, use http://<YOUR_VM_PUBLIC_IP>:8000/v1 instead of localhost.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # API endpoint exposing OpenAI-style routes
    api_key="EMPTY",  # Placeholder key; vLLM does not enforce API keys by default
)

All of the prompts in the examples below are synthetic and general — standard medical-education queries, not real patient interactions. This matches the demonstration posture set out at the beginning of this guide.

Example 1: General Medical Q&A

We'll start with the same prompt used in the model authors' Hugging Face example — a generic question about headache management. This is exactly the kind of self-care guidance use case where AntAngelMed's training shines.

# Define the conversation payload with a system role and a generic user question.
messages = [
    {"role": "system", "content": "You are AntAngelMed, a helpful medical assistant."},
    {"role": "user", "content": "What should I do if I have a headache?"},
]

# Send the chat completion request using the recommended sampling parameters.
chat_response = client.chat.completions.create(
    model="AntAngelMed",  # Served model name from the vLLM container
    messages=messages,
    max_tokens=800,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
        "repetition_penalty": 1.05,
    },
)

print("AntAngelMed response:", chat_response.choices[0].message.content)

This is the kind of response we get back from the model:

AntAngelMed response:
A headache is a common symptom that can stem from many different causes,
most of which are minor and resolve on their own. The right next step
depends on how severe the pain is, how long it has lasted, and what other
symptoms (if any) are present.

**General self-care steps to try first:**
- Rest in a quiet, dimly lit room and try to relax your shoulders and jaw
- Drink water steadily — mild dehydration is one of the most frequent triggers
- Apply a cool compress to the forehead, or a warm compress to the neck and shoulders if the pain feels tension-related
- Consider an over-the-counter analgesic such as paracetamol or ibuprofen,
  taken strictly according to the package instructions
- Note possible triggers (skipped meals, screen time, poor sleep, caffeine
  withdrawal, stress) so you can address them

**When to seek medical attention:**
You should contact a healthcare professional promptly if your headache:
- Comes on suddenly and severely ("thunderclap" headache)
- Is accompanied by fever, neck stiffness, confusion, vision changes,
  weakness, slurred speech, or seizures
- Follows a head injury
- Worsens despite rest and over-the-counter treatment, or recurs frequently
- Disrupts your daily activities or sleep

**Important note:**
This information is for general educational purposes only and does not
replace professional medical advice. If you are unsure or concerned about
your symptoms, please consult a qualified healthcare provider.

Notice how the response is structured cleanly into self-care guidance, escalation criteria, and a safety disclaimer. That structure is not accidental — it is a direct consequence of the GRPO stage of training, where the model was rewarded for empathetic, structurally clear, evidence-based answers that respect safety boundaries. This is the behaviour you want from a medical assistant: useful first-line information, paired with clear signals about when human clinical judgement is required.

Example 2: Medical Education and Concept Explanation

AntAngelMed is equally strong on medical-knowledge questions, which is the kind of workload that powers patient-education portals, internal training tools, and clinical documentation helpers.

# A general medical-education question, no patient data involved.
messages = [
    {"role": "system", "content": "You are AntAngelMed, a helpful medical assistant."},
    {"role": "user", "content": "Explain the difference between Type 1 and Type 2 diabetes in simple terms."},
]

chat_response = client.chat.completions.create(
    model="AntAngelMed",
    messages=messages,
    max_tokens=800,
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20, "repetition_penalty": 1.05},
)

print("AntAngelMed response:", chat_response.choices[0].message.content)

The model returns a clear, well-organised explanation:

AntAngelMed response:
Type 1 and Type 2 diabetes are both long-term conditions in which the body
has difficulty managing blood sugar (glucose), but their underlying causes
and typical treatments are quite different.

**Type 1 Diabetes**
- An autoimmune condition: the body's immune system mistakenly attacks the
  insulin-producing cells in the pancreas
- As a result, the pancreas produces little or no insulin
- Usually diagnosed in childhood or adolescence, though it can occur at any age
- Requires lifelong insulin therapy (injections or a pump)
- Not caused by diet or lifestyle

**Type 2 Diabetes**
- The body either does not produce enough insulin or its cells become
  resistant to insulin's effects
- Strongly associated with factors such as excess body weight, physical
  inactivity, family history, and increasing age
- Typically develops in adulthood, although it is increasingly diagnosed
  in younger people
- Often managed initially with lifestyle changes (diet, exercise, weight
  management) and oral medications, with insulin added later if needed

**Key distinction:**
Type 1 is primarily an autoimmune condition affecting insulin production,
whereas Type 2 is largely a metabolic condition affecting how the body
uses insulin. Both require ongoing medical care to prevent complications.

If you or someone you know has been diagnosed with either type, a qualified
healthcare provider can build a personalised management plan.

Again, the structure is consistent: clear categorisation, plain-language framing, balanced coverage, and a closing pointer toward professional care. For applications like patient-facing FAQ tools, internal medical reference, or onboarding content for non-clinical staff, this is precisely the tone and format you want at scale.

Example 3: Disabling Verbose Reasoning for Lower-Latency Responses

For lighter-weight queries where you don't need the model to lay out its reasoning in detail, you can ask for a more concise output style through the system prompt and cap the response length with a lower max_tokens. This reduces output tokens and improves end-to-end latency.

# A short, factual medical-knowledge question.
messages = [
    {"role": "system", "content": "You are AntAngelMed. Answer concisely and factually."},
    {"role": "user", "content": "What is the normal resting heart rate range for a healthy adult?"},
]

chat_response = client.chat.completions.create(
    model="AntAngelMed",
    messages=messages,
    max_tokens=200,
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20, "repetition_penalty": 1.05},
)

print("AntAngelMed response:", chat_response.choices[0].message.content)

And the response:

AntAngelMed response:
The normal resting heart rate for a healthy adult is typically between
60 and 100 beats per minute. Well-trained athletes may have resting rates
as low as 40–60 bpm, which is generally considered healthy for them.
Rates persistently above or below the normal range should be discussed
with a healthcare professional.

Short, accurate, and still framed with the appropriate caveat — useful for chatbot interfaces or low-latency API tiers serving high request volumes.

A Final Word on Data Handling

Everything shown above runs entirely inside your Hyperstack VM. The model weights are pulled to local storage, inference happens on-instance, and API responses are served back over your VM's network — no inference traffic is routed through Hyperstack-managed inference services, and no data leaves your instance unless your own application sends it elsewhere. For workloads that require stronger guarantees — EU/EEA data residency, dedicated single-tenant infrastructure, audit-ready isolation evidence — please talk to our team about a Secure Private Cloud deployment configured for your compliance posture.

Why Deploy AntAngelMed on Hyperstack?

Hyperstack is a cloud platform purpose-built to accelerate AI and machine learning workloads — and it is particularly well suited to medical AI:

EU Data Residency via Secure Private Cloud
Dedicated, single-tenant GPU infrastructure commissioned in EU/EEA jurisdictions for organisations operating under GDPR and the EU AI Act. The right path for any clinical workload involving real patient data, with audit-ready tenancy and data-residency evidence built in.
Latest-Generation GPUs On-Demand
On-demand access to NVIDIA H100 and other top-tier accelerators — the right hardware for serving 100B-class medical models at production throughput, with billing accurate to the minute.
Tier 3 Data Centres, SOC 2 Type II
All Hyperstack regions run in Tier 3 certified data centres with concurrent maintainability and 99.982% annual uptime, independently audited and SOC 2 Type II certified for security, availability, and data integrity.
Simple Deployment
Pre-configured CUDA and Docker images remove most of the setup overhead, so you spend time on the model and not on the platform.
Transparent, Competitive Pricing
Pay only for what you use with our GPU pricing, plus hibernation to pause billing on idle workloads.

FAQs

What is AntAngelMed?

AntAngelMed is the world's first open-source 100B-parameter medical language model, built on the Ling-flash-2.0 MoE architecture with ~6.1B active parameters per token. It ranks first among open-source models on OpenAI's HealthBench and tops the MedBench leaderboard for Chinese medical AI.

What hardware is required to deploy AntAngelMed?

The BF16 model is approximately 206 GB. We recommend 8x NVIDIA H100-80G-PCIe on Hyperstack (available in the CANADA-1 region), which provides comfortable headroom for tensor-parallel-size 8 inference and long-context KV cache. An INT4 quantised version can run on smaller setups if memory is constrained.

What is AntAngelMed's context window?

AntAngelMed natively supports up to 128K tokens via YaRN extrapolation, making it suitable for long-form clinical documentation, multi-document literature review, and extended diagnostic conversations.

Is it safe to deploy a medical LLM on a public cloud?

For real clinical workloads, deployment design matters as much as the model itself. We recommend running AntAngelMed in a Secure Private Cloud deployment, which provides single-tenant dedicated infrastructure and can be commissioned in EU/EEA jurisdictions for GDPR-aligned data residency. The model and its outputs stay within your dedicated environment and are not exposed externally. For evaluation and synthetic-prompt experimentation, the on-demand tutorial above is a suitable starting point.

What are the main use cases for AntAngelMed?

Strong fits include medical knowledge Q&A, patient-education content generation, clinical documentation assistance, internal training tools for non-clinical staff, and research support — with the important caveat that any clinical-decision-support use must be designed in line with applicable regulations such as the EU AI Act.

How fast is inference on H100 hardware?

The model authors report over 200 tokens per second on H20-class hardware, with the sparse MoE design delivering roughly 3x the throughput of a comparable 36B dense model. H100 deployments with tensor parallelism deliver strong real-world throughput for production medical applications.

Fareed Khan

14 May 2026

NVIDIA Nemotron 3 Nano Omni: Process Video, Audio, and Documents ...

What is NVIDIA Nemotron 3 Nano Omni?

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 is an open-weight multimodal foundation model engineered to unify video, audio, image, and text understanding in a single, highly efficient reasoning loop. It replaces the traditional fragmented stack of separate vision, speech, and language models with one production-ready model that can transcribe an hour of audio, summarise a two-minute video, or extract structured data from a complex 100-page document — all from the same endpoint. Built on a 30B-A3B hybrid Mixture-of-Experts architecture and supporting context windows up to 256K tokens, Nemotron 3 Nano Omni delivers state-of-the-art accuracy on document intelligence (OCRBenchV2, MMLongBench-Doc), video understanding (Video-MME, WorldSense), and speech benchmarks (VoiceBench), while remaining lightweight enough to run on a single H100 GPU.

A major reason Nemotron 3 Nano Omni achieves such strong multimodal accuracy at this efficiency level is its carefully engineered training pipeline and specialised hybrid architecture. Here is how it works under the hood:

  1. Hybrid Mamba2-Transformer MoE Backbone: Combines Mamba2 layers (efficient long-sequence state-space modelling) with transformer layers (precise reasoning) inside a Mixture-of-Experts decoder, delivering up to 4× memory and compute efficiency over equivalent dense models.
  2. C-RADIOv4-H Vision Encoder: A robust foundation vision encoder that processes high-resolution images and video frames, capable of focusing on specific patches within a full image to maintain OCR-level precision on dense documents.
  3. Parakeet Speech Encoder: A specialised audio encoder built on NVIDIA Granary and Music Flamingo data that goes far beyond simple transcription to capture tone, intent, and acoustic events.
  4. 3D Convolutional Spatiotemporal Processing: Captures motion and temporal context between video frames natively rather than treating frames as independent images, enabling true video understanding.
  5. Efficient Video Sampling (EVS): An inference-time layer that compresses dense visual tokens from many frames into a concise set the LLM can process without overwhelming the context window — halving video prefill VRAM and TTFT.
  6. Native Reasoning Mode: Chain-of-thought reasoning is enabled by default with budget-controlled thinking, making it ideal for complex multi-step document and video analysis tasks.

Nemotron 3 Nano Omni Features

Nemotron 3 Nano Omni is purpose-built to consolidate enterprise multimodal pipelines into a single deployable model. Its standout capabilities include:

  • True Omnimodal I/O: Accepts video (mp4, up to 2 minutes), audio (wav/mp3, up to 1 hour), images (jpeg/png), and text in a single request — no orchestration between separate models required.
  • Sparse MoE Efficiency: Only ~3B of 31B parameters activate per token, giving large-model intelligence at a fraction of the compute cost.
  • 256K Context Window: Reason across long documents, hour-long meeting recordings, or extended video transcripts in a single pass.
  • Best-in-Class Document Intelligence: Leads OCRBenchV2 (67.04) and MMLongBench-Doc (57.5), making it ideal for contracts, financial filings, scientific papers, and scanned forms.
  • Word-Level Timestamped ASR: Native speech-to-text with word-level timing, plus speech instruction following at 89.39 on VoiceBench.
  • Video Understanding at Scale: Up to ~9.2× greater effective system capacity vs. alternative open omni models at the same per-user interactivity threshold, with 72.2 on Video-MME.
  • Flexible Quantisation: Available in BF16 (62 GB), FP8 (33 GB), and NVFP4 (21 GB) — all staying within ~1 point of BF16 accuracy across 9 multimodal benchmarks.
  • Open by Design: Full weights, datasets, and training recipes released under the NVIDIA Open Model Agreement for unrestricted commercial use.
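As a rough check on the quantisation sizes listed above, the sketch below divides each checkpoint size by the 31B total parameter count from the Sparse MoE bullet. The helper name `bytes_per_param` is ours, and treating checkpoint size as pure weight storage is a simplifying assumption.

```python
# Total parameters (billions) and checkpoint sizes (GB), taken from the list above.
PARAMS_BILLION = 31
CHECKPOINT_GB = {"BF16": 62, "FP8": 33, "NVFP4": 21}


def bytes_per_param(size_gb: float, params_billion: float = PARAMS_BILLION) -> float:
    """Effective storage per parameter; the GB and billions cancel out."""
    return size_gb / params_billion


for name, size_gb in CHECKPOINT_GB.items():
    saving = 1 - size_gb / CHECKPOINT_GB["BF16"]
    print(f"{name}: {bytes_per_param(size_gb):.2f} bytes/param, "
          f"{saving:.0%} smaller than BF16")
```

BF16 comes out at exactly 2 bytes per parameter. FP8 and NVFP4 land slightly above the nominal 1 and 0.5 bytes, possibly because some tensors stay in higher precision, though the source does not say.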

Multimodal Benchmark Performance

Nemotron Nano VL V2 vs. Nemotron 3 Nano Omni — across 11 industry-standard benchmarks

| Benchmark | Nemotron Nano VL V2 | Nemotron 3 Nano Omni | Change |
| --- | --- | --- | --- |
| CVBench2D | 78.3 | 83.95 | +7.2% |
| OCRBenchV2 (EN) | 54.8 | 67.04 | +22.3% |
| OSWorld | 11.1 | 47.4 | +327% |
| CharXiv Reasoning | 41.3 | 63.6 | +54.0% |
| MMLongBench-Doc | 38.0 | 57.5 | +51.3% |
| MathVista MINI | 75.5 | 82.8 | +9.7% |
| OCR Reasoning | 33.9 | 54.14 | +59.7% |
| Video-MME | Not supported | 72.2 | NEW |
| WorldSense | Not supported | 55.4 | NEW |
| DailyOmni | Not supported | 74.52 | NEW |
| VoiceBench | Not supported | 89.39 | NEW |
Reading the chart: All scores reported on a 0–100 scale (higher is better). NEW badges mark capabilities introduced with Nemotron 3 Nano Omni — video, joint video+audio, and speech understanding — that the prior VL-only model could not perform.
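The percentage figures in the comparison are relative gains over the VL V2 score, not differences in points. The short sketch below recomputes them from the published scores; the function name `relative_gain` is ours.

```python
# (VL V2 score, Omni score) pairs copied from the comparison above.
SCORES = {
    "CVBench2D": (78.3, 83.95),
    "OCRBenchV2 (EN)": (54.8, 67.04),
    "OSWorld": (11.1, 47.4),
    "CharXiv Reasoning": (41.3, 63.6),
    "MMLongBench-Doc": (38.0, 57.5),
    "MathVista MINI": (75.5, 82.8),
    "OCR Reasoning": (33.9, 54.14),
}


def relative_gain(old: float, new: float) -> float:
    """Percentage change from old to new, relative to old."""
    return (new - old) / old * 100


for name, (old, new) in SCORES.items():
    print(f"{name}: +{new - old:.2f} points, {relative_gain(old, new):+.1f}%")
```

This makes the scale of each change easier to judge: the +327% on OSWorld corresponds to a 36.3-point jump from a low baseline, while the +7.2% on CVBench2D is a 5.65-point gain on an already high score.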

How to Deploy Nemotron 3 Nano Omni on Hyperstack

Now, let's walk through the step-by-step process of deploying the necessary infrastructure on Hyperstack to serve Nemotron 3 Nano Omni for production multimodal workloads.

Step 1: Accessing Hyperstack

First, you'll need an account on Hyperstack.

  • Go to the Hyperstack website and log in.
  • If you are new, create an account and set up your billing information. Our documentation can guide you through the initial setup.

Step 2: Deploying a New Virtual Machine

From the Hyperstack dashboard, we will launch a new GPU-powered VM sized for the BF16 variant of the model.

  • Initiate Deployment: Look for the "Deploy New Virtual Machine" button on the dashboard and click it.

  • Select Hardware Configuration: Nemotron 3 Nano Omni in BF16 is ~62 GB on disk and fits comfortably on a single H100 80GB. Choose the "1xH100-80G-PCIe" flavour. Because the whole checkpoint fits in one GPU's memory and the hybrid Mamba2-Transformer MoE activates only ~3B parameters per token, tensor parallelism is unnecessary at this size.

  • Choose the Operating System: Select an "Ubuntu Server 22.04 LTS R570 CUDA 12.8 with Docker" image (or newer).
  • Select a Keypair: Choose an existing SSH keypair from your account to securely access the VM.
  • Network Configuration: Ensure you assign a Public IP to your Virtual Machine. This is crucial for remote management and connecting your local development tools.
  • Attach an Ephemeral Disk: The BF16 weights are ~62 GB. Attach an ephemeral NVMe volume — we'll use it as the Hugging Face cache so model loading is bottlenecked by NVMe, not network.
  • Review and Deploy: Double-check your settings and click the "Deploy" button.

Step 3: Accessing Your VM

Once your VM is running, you can connect to it.

  1. Locate SSH Details: In the Hyperstack dashboard, find your VM's details and copy its Public IP address.

  2. Connect via SSH: Open a terminal on your local machine and use the following command, replacing the placeholders with your information.

    # Connect to your VM using your private key and the VM's public IP
    ssh -i [path_to_your_ssh_key] ubuntu@[your_vm_public_ip]

Replace [path_to_your_ssh_key] with the path to your private SSH key file and [your_vm_public_ip] with the actual IP address of your VM.

Step 4: Create a Model Cache Directory

We'll create a directory on the VM's high-speed ephemeral disk. Storing the 62 GB BF16 checkpoint here ensures fast cold-starts on subsequent restarts.

# Create a directory for the Hugging Face model cache
sudo mkdir -p /ephemeral/hug

# Grant full read/write permissions so the Docker container can use it
sudo chmod -R 0777 /ephemeral/hug

# Also create a media directory for sample inputs (PDFs, audio, video)
sudo mkdir -p /ephemeral/media
sudo chmod -R 0777 /ephemeral/media
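Before pulling the weights, it can be worth confirming that the volume has room for them. The sketch below uses only the Python standard library; the helper name `has_room` and the 10 GB safety margin are our own assumptions, not Hyperstack recommendations.

```python
import shutil


def has_room(path: str, needed_gb: float, margin_gb: float = 10.0) -> bool:
    """True if the filesystem holding `path` has needed_gb + margin_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb + margin_gb


# On the VM, for the 62 GB BF16 checkpoint:
#   has_room("/ephemeral/hug", 62)
```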

Step 5: Authenticate with Hugging Face

Nemotron 3 Nano Omni is governed by the NVIDIA Open Model Agreement, so the first download requires a Hugging Face token. Generate a read token at huggingface.co/settings/tokens, then export it on the VM:

# Export your HF token so the vLLM container can pull the weights
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Step 6: Launch the vLLM Server

We will use the official vllm/vllm-openai:v0.20.0-cu129 image. vLLM 0.20.0 is the minimum version that ships the nemotron_v3 reasoning parser and the 3D-conv video kernels required by this model. Audio packages are not preinstalled in the upstream image, so we install vllm[audio] at container startup before invoking vllm serve.

# Pull the required vLLM 0.20.0 image (CUDA 12.9 build for Hyperstack's R570 driver)
docker pull vllm/vllm-openai:v0.20.0-cu129

# Launch the multimodal server with audio + video + reasoning enabled
docker run -d \
--gpus all \
--ipc=host \
--network host \
--name vllm_nemotron_omni \
--shm-size 16g \
-e HF_TOKEN="$HF_TOKEN" \
-v /ephemeral/hug:/root/.cache/huggingface \
-v /ephemeral/media:/media:ro \
--entrypoint /bin/bash \
 vllm/vllm-openai:v0.20.0-cu129 -c "
pip install vllm[audio] && \
vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--max-model-len 131072 \
--max-num-seqs 16 \
--gpu-memory-utilization 0.92 \
--trust-remote-code \
--allowed-local-media-path / \
--video-pruning-rate 0.5 \
--media-io-kwargs '{\"video\": {\"fps\": 2, \"num_frames\": 256}}' \
--reasoning-parser nemotron_v3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
"

This command instructs Docker to:

  • --gpus all: Expose the H100 to the container.
  • --ipc=host & --shm-size 16g: Required for vLLM's multiprocessing and large multimodal tensors during prefill.
  • -v /ephemeral/hug:/root/.cache/huggingface: Persist the 62 GB BF16 checkpoint on the NVMe ephemeral disk between restarts.
  • -v /ephemeral/media:/media:ro: Read-only mount for input PDFs, audio, and video files we'll use later.
  • pip install vllm[audio]: Adds librosa and audio decoders the upstream image omits — required before any audio request.
  • --max-model-len 131072: 128K-token context, more than enough for hour-long audio transcripts and 100-page PDFs.
  • --max-num-seqs 16: Conservative concurrency cap for BF16 on a single 80 GB H100 (62 GB weights leave ~18 GB for KV cache + activations). Increase this value for FP8/NVFP4 deployments.
  • --video-pruning-rate 0.5: Enables Efficient Video Sampling — drops 50% of redundant video tokens to halve video-prefill VRAM and TTFT.
  • --media-io-kwargs: Sets video sampling to 2 FPS and 256 frames per clip — the recommended setting for 720p video on 80 GB GPUs.
  • --reasoning-parser nemotron_v3: Routes chain-of-thought traces into the proper response field for the Nemotron-3 chat template.
  • --tool-call-parser qwen3_coder: The Nemotron 3 Nano Omni release re-uses the qwen3_coder parser for structured tool/function calls.
  • --allowed-local-media-path /: Allows the API to load files from local paths (e.g., the /media mount), avoiding base64 round-trips for large inputs.
⚠️

Version Warning: Only vllm/vllm-openai:v0.20.0-cu129 (the CUDA 12.9 build) is compatible with Hyperstack's current Ubuntu images, which ship with the R570 driver. Do not use the default :v0.20.0 tag — it is built against CUDA 13.0 and requires the R580+ driver, which is not yet available on Hyperstack. Do not use :latest either. 
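The ~18 GB mentioned in the --max-num-seqs note is the raw difference between card capacity and weights. Because --gpu-memory-utilization 0.92 caps vLLM at 92% of the card, the space actually budgeted for KV cache and activations is smaller. The sketch below is back-of-the-envelope arithmetic only: it assumes resident weights equal the checkpoint size and ignores encoder and CUDA context overheads.

```python
def kv_budget_gb(gpu_gb: float, utilization: float, weights_gb: float) -> float:
    """Memory left for KV cache and activations inside vLLM's budget."""
    return gpu_gb * utilization - weights_gb


# Values from this guide: 80 GB H100, --gpu-memory-utilization 0.92.
for name, weights_gb in {"BF16": 62, "FP8": 33, "NVFP4": 21}.items():
    budget = kv_budget_gb(80, 0.92, weights_gb)
    print(f"{name}: ~{budget:.1f} GB for KV cache + activations")
```

Under these assumptions BF16 leaves roughly 11.6 GB, which is why the concurrency cap is conservative and why the FP8 and NVFP4 variants can take a higher --max-num-seqs.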

Step 7: Verify the Deployment

First, follow the container logs to monitor model loading. The 62 GB BF16 checkpoint takes 6–10 minutes to download on first run, then 60–90 seconds to load from the NVMe cache on subsequent starts.

docker logs -f vllm_nemotron_omni

The server is ready when you see: INFO: Uvicorn running on http://0.0.0.0:8000.
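For scripted deployments, readiness can also be polled instead of watching the logs. The sketch below relies on the /health route that vLLM's OpenAI-compatible server exposes; the function names, the 10-second interval, and the 120-attempt limit are our own choices.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    url: str = "http://localhost:8000/health",
    attempts: int = 120,
    delay_s: float = 10.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `url` until `probe` succeeds or `attempts` run out."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay_s)
    return False


# On the VM: wait_until_ready() polls for up to 20 minutes.
```

The defaults cover the first-run download window described above. The `probe` argument is injectable so the loop can be exercised without a running server.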

Next, add a firewall rule in your Hyperstack dashboard to allow inbound TCP traffic on port 8000. Then run a quick text-only smoke test from your local machine:

# Smoke test from your local terminal
curl http://<YOUR_VM_PUBLIC_IP>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16",
"messages": [
{"role": "user", "content": "In one sentence, what modalities can you process?"}
],
"max_tokens": 256,
"temperature": 0.2,
"top_k": 1,
"chat_template_kwargs": {"enable_thinking": false}
}'

If the response comes back with a coherent answer mentioning video, audio, image, and text, your endpoint is live and ready for real workloads.

💡

NVIDIA recommends the following sampling parameters for Nemotron 3 Nano Omni:

# Thinking mode (long-doc analysis, multimodal reasoning, video summarisation)
temperature=0.6, top_p=0.95, max_tokens=20480, reasoning_budget=16384, grace_period=1024

# Instruct (non-thinking) mode for general short-form tasks
temperature=0.2, top_k=1, max_tokens=1024

# ASR / transcription tasks
temperature=1.0, top_k=1
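To keep these settings in one place, they can be wrapped in a small preset table. This is a sketch rather than an official mapping: the split between SDK arguments and extra_body, and the use of thinking_token_budget = reasoning_budget + grace_period, simply follow the pattern used in the scripts later in this guide. The names `PRESETS` and `sampling_kwargs` are ours.

```python
import copy

# Presets transcribed from the recommendations above.
PRESETS = {
    "thinking": {
        "temperature": 0.6, "top_p": 0.95, "max_tokens": 20480,
        "extra_body": {
            "thinking_token_budget": 16384 + 1024,  # reasoning_budget + grace_period
            "chat_template_kwargs": {"enable_thinking": True, "reasoning_budget": 16384},
        },
    },
    "instruct": {
        "temperature": 0.2, "max_tokens": 1024,
        "extra_body": {"top_k": 1, "chat_template_kwargs": {"enable_thinking": False}},
    },
    "asr": {
        "temperature": 1.0,
        "extra_body": {"top_k": 1, "chat_template_kwargs": {"enable_thinking": False}},
    },
}


def sampling_kwargs(mode: str, **overrides) -> dict:
    """Return a fresh kwargs dict for client.chat.completions.create()."""
    kwargs = copy.deepcopy(PRESETS[mode])
    kwargs.update(overrides)
    return kwargs
```

A call then looks like client.chat.completions.create(model=MODEL, messages=messages, **sampling_kwargs("asr", max_tokens=8192)). The deep copy keeps per-request overrides from leaking back into the shared presets.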

Step 8: Hibernating Your VM (Optional)

When you're done with a workload, you can hibernate the VM to pause compute billing while preserving the entire setup — including the 62 GB cached weights on the ephemeral volume:

  • In the Hyperstack dashboard, locate your VM.
  • Click the "Hibernate" option.
  • Resume any time without redownloading the model.

Use Case 1: Document Intelligence — PDF Extraction at Scale

Document intelligence is where Nemotron 3 Nano Omni really pulls away from older VLMs. With 67.04 on OCRBenchV2 and 57.5 on MMLongBench-Doc, it can read scanned contracts, parse complex financial tables, and extract structured data from multi-column scientific PDFs. The OpenAI-compatible API does not accept raw PDF uploads, so the standard pattern is to render each page to PNG with PyMuPDF and send pages as base64 images.

First, install the local dependencies on your client machine (not the VM):

pip3 install openai pymupdf pillow

Now we'll build a reusable PDF-to-structured-data extractor. The script below renders each page at 200 DPI, sends it to Nemotron 3 Nano Omni with a structured-extraction prompt, and aggregates the page-level outputs into a single JSON document. We use thinking mode here because long documents benefit significantly from the model's chain-of-thought reasoning.

# pdf_extract.py — page-by-page structured extraction with Nemotron 3 Nano Omni
import base64
import json
import fitz  # PyMuPDF
from openai import OpenAI

# Point at your Hyperstack VM
client = OpenAI(
    base_url="http://<YOUR_VM_PUBLIC_IP>:8000/v1",
    api_key="EMPTY",
)
MODEL = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16"

def render_page_to_data_url(pdf_path: str, page_num: int, dpi: int = 200) -> str:
    """Render a single PDF page to a PNG data URL."""
    # Open per call and close promptly so file handles are not leaked
    with fitz.open(pdf_path) as doc:
        pix = doc[page_num].get_pixmap(dpi=dpi)
        png_bytes = pix.tobytes("png")
    b64 = base64.b64encode(png_bytes).decode("utf-8")
    return f"data:image/png;base64,{b64}"

def extract_page(pdf_path: str, page_num: int) -> dict:
    image_url = render_page_to_data_url(pdf_path, page_num)

    prompt = (
        "You are a document intelligence system. Extract the following from "
        "this page and return a single valid JSON object with these keys:\n"
        "  - page_title (string or null)\n"
        "  - section_headers (list of strings)\n"
        "  - key_facts (list of one-sentence factual statements)\n"
        "  - tables (list of objects with 'caption' and 'rows' as 2D arrays)\n"
        "  - footnotes (list of strings)\n"
        "Respond with ONLY the JSON, no prose."
    )

    response = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=20480,
        temperature=0.6,
        top_p=0.95,
        extra_body={
            "thinking_token_budget": 16384 + 1024,
            "chat_template_kwargs": {
                "enable_thinking": True,
                "reasoning_budget": 16384,
            },
        },
    )

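    raw = response.choices[0].message.content.strip()
    # Strip markdown fences if the model wrapped JSON in them.
    # removeprefix drops only a literal "json" tag, whereas lstrip("json")
    # would strip any leading run of the characters j, s, o, n.
    if raw.startswith("```"):
        raw = raw.split("```")[1].removeprefix("json").strip()
    return json.loads(raw)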

if __name__ == "__main__":
    pdf_path = "q4_earnings_report.pdf"
    doc = fitz.open(pdf_path)
    full_extraction = []
    for i in range(len(doc)):
        print(f"Processing page {i+1}/{len(doc)}...")
        full_extraction.append({"page": i + 1, **extract_page(pdf_path, i)})

    with open("extracted.json", "w") as f:
        json.dump(full_extraction, f, indent=2)
    print(f"Extracted {len(full_extraction)} pages → extracted.json")

Running this on a sample Q4 earnings report (page 3, which contains a revenue-by-segment table) produces output like:

{
  "page": 3,
  "page_title": "Revenue by Operating Segment",
  "section_headers": [
    "Segment Performance",
    "Year-over-Year Comparison"
  ],
  "key_facts": [
    "Cloud Services revenue grew 34% year-over-year to $4.82B.",
    "Hardware revenue declined 6% YoY to $1.91B due to supply normalization.",
    "Total Q4 revenue reached $8.14B, a 19% increase versus Q4 prior year."
  ],
  "tables": [
    {
      "caption": "Q4 Revenue by Segment ($ millions)",
      "rows": [
        ["Segment", "Q4 Current", "Q4 Prior", "YoY %"],
        ["Cloud Services", "4,820", "3,597", "+34.0%"],
        ["Hardware", "1,910", "2,032", "-6.0%"],
        ["Software Licensing", "1,410", "1,213", "+16.2%"]
      ]
    }
  ],
  "footnotes": [
    "All figures unaudited. Constant-currency growth shown in Appendix B."
  ]
}

Notice that the model not only OCR'd the table accurately but also reasoned about the numbers — computing year-over-year deltas and pulling them into key_facts rather than just regurgitating cells. This is exactly what MMLongBench-Doc measures, and where Nemotron 3 Nano Omni is currently best-in-class among open omni models.
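Since extracted figures often feed downstream systems, a cheap consistency check on the returned tables is worthwhile. The sketch below recomputes the YoY column from the current and prior values in the sample output. The helper names and the 0.1-point tolerance are our own assumptions, and the check only fits tables with this four-column layout.

```python
def parse_number(cell: str) -> float:
    """Turn '4,820' or '+34.0%' into a float."""
    return float(cell.replace(",", "").replace("%", ""))


def check_yoy(rows: list[list[str]], tolerance: float = 0.1) -> list[str]:
    """Return the segments whose reported YoY % disagrees with current/prior."""
    mismatches = []
    for segment, current, prior, reported in rows[1:]:  # skip the header row
        cur, prev = parse_number(current), parse_number(prior)
        computed = (cur - prev) / prev * 100
        if abs(computed - parse_number(reported)) > tolerance:
            mismatches.append(segment)
    return mismatches


rows = [
    ["Segment", "Q4 Current", "Q4 Prior", "YoY %"],
    ["Cloud Services", "4,820", "3,597", "+34.0%"],
    ["Hardware", "1,910", "2,032", "-6.0%"],
    ["Software Licensing", "1,410", "1,213", "+16.2%"],
]
print(check_yoy(rows))  # an empty list means every row is self-consistent
```

A non-empty result flags pages to re-run or route to manual review.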

💡

Throughput Tip: For multi-page batches, render all pages first, then send them concurrently using asyncio + AsyncOpenAI. With --max-num-seqs 16 on a single H100, you can comfortably process 1–2 pages/sec with reasoning enabled, or 4–6 pages/sec with reasoning disabled for simpler extraction tasks.
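The concurrency half of that tip can be sketched with a semaphore that caps in-flight requests. The helper `gather_bounded` is our own name. The `worker` argument stands in for a coroutine that sends one pre-rendered page through AsyncOpenAI (for example a hypothetical `extract_page_async`), which is not shown here.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def gather_bounded(
    items: Sequence[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int = 16,
) -> list[R]:
    """Run worker(item) for every item, at most `limit` at a time, keeping input order."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: T) -> R:
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))
```

Setting limit to 16 mirrors the --max-num-seqs value from Step 6, so the client does not queue more requests than the server will schedule at once. Results come back in page order because asyncio.gather preserves the order of its inputs.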

Use Case 2: Audio Transcription & Understanding

Nemotron 3 Nano Omni's audio stack is built on the NVIDIA Parakeet encoder and supports word-level timestamped ASR, speech instruction following (89.39 on VoiceBench), and full audio-content reasoning. It accepts wav and mp3 at 8 kHz or higher, with single files up to one hour long. Unlike a pure ASR model, you can ask it questions about the audio rather than just transcribing it.

The example below shows two patterns: first, a clean transcription with timestamps; second, a content-understanding query against the same audio. The script reads a meeting recording from the working directory and sends it inline as a base64 data URL.

# audio_pipeline.py — transcribe + understand a meeting recording
import base64
from openai import OpenAI

client = OpenAI(base_url="http://<YOUR_VM_PUBLIC_IP>:8000/v1", api_key="EMPTY")
MODEL = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16"

def audio_to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:audio/wav;base64,{b64}"

audio_url = audio_to_data_url("meeting_q4_planning.wav")

# --- Pattern 1: Word-level timestamped transcription (ASR mode) ---
# NVIDIA recommends temperature=1.0, top_k=1 for ASR tasks.
transcript_resp = client.chat.completions.create(
    model=MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": audio_url}},
            {"type": "text",
             "text": "Transcribe this recording verbatim, with a timestamp "
                     "in [hh:mm:ss] format at the start of each utterance."},
        ],
    }],
    max_tokens=8192,
    temperature=1.0,
    extra_body={
        "top_k": 1,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print("=== TRANSCRIPT ===")
print(transcript_resp.choices[0].message.content)

# --- Pattern 2: Content-understanding Q&A on the same audio ---
qa_resp = client.chat.completions.create(
    model=MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": audio_url}},
            {"type": "text",
             "text": "List every action item assigned in this meeting along with "
                     "the owner and any mentioned deadline. Use bullet points."},
        ],
    }],
    max_tokens=2048,
    temperature=0.2,
    extra_body={
        "top_k": 1,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print("\n=== ACTION ITEMS ===")
print(qa_resp.choices[0].message.content)

For a 6-minute Q4 planning recording, the transcription pattern produces:

=== TRANSCRIPT ===
[00:00:02] Sarah: Alright, let's get started. Thanks everyone for joining
the Q4 planning sync.
[00:00:09] Sarah: First on the agenda is the inference cost review. Marcus,
do you have the latest numbers?
[00:00:17] Marcus: Yeah, so we're trending about 18% over budget on GPU
spend, mostly driven by the new vision pipeline.
[00:00:26] Sarah: Got it. Can you put together a cost-reduction proposal
by next Friday?
[00:00:31] Marcus: Yep, I'll have it ready by the 15th.
...

And the same audio passed to the Q&A pattern returns a structured action-item list:

=== ACTION ITEMS ===
- Marcus: Prepare a GPU-cost reduction proposal — due Friday the 15th.
- Priya: Benchmark the FP8 checkpoint against BF16 on the eval set —
  due end of next week.
- Sarah: Schedule a follow-up with the platform team to review KV-cache
  utilization — by EOD Monday.
- Marcus + Priya: Co-author a one-page summary for the leadership review
  — due before the next planning sync.

This is the consolidation Omni is built for: one model, one endpoint, both raw transcription and semantic understanding. Previously this would have required a separate Whisper-class ASR model plus an LLM running diarisation-aware summarisation on top of its output.
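If the transcript needs to feed a database or search index, the "[hh:mm:ss] Speaker: text" layout in the sample output is straightforward to parse. The sketch below assumes that exact layout, which the model is not guaranteed to follow on every input. The names `LINE_RE` and `parse_transcript` are ours.

```python
import re

# Matches lines such as "[00:00:17] Marcus: Yeah, so we're trending..."
LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s*([^:]+):\s*(.*)$")


def parse_transcript(text: str) -> list[dict]:
    """Parse timestamped lines; unmatched lines continue the previous utterance."""
    utterances = []
    for line in text.splitlines():
        stripped = line.strip()
        match = LINE_RE.match(stripped)
        if match:
            h, m, s, speaker, words = match.groups()
            utterances.append({
                "start_s": int(h) * 3600 + int(m) * 60 + int(s),
                "speaker": speaker.strip(),
                "text": words.strip(),
            })
        elif utterances and stripped:
            utterances[-1]["text"] += " " + stripped
    return utterances


sample = """[00:00:02] Sarah: Alright, let's get started. Thanks everyone for joining
the Q4 planning sync.
[00:00:17] Marcus: Yeah, so we're trending about 18% over budget on GPU
spend, mostly driven by the new vision pipeline."""

for utterance in parse_transcript(sample):
    print(utterance["start_s"], utterance["speaker"], utterance["text"])
```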

💡

Long-Audio Tip: For recordings over 30 minutes, mount the file into the container at /media and pass it as a file:// URL instead of base64. The --allowed-local-media-path / flag we set in Step 6 enables this and avoids inflating the request payload by ~33% from base64 encoding.
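The ~33% figure follows from base64 turning every 3 bytes into 4 characters. The sketch below measures that overhead and builds the file:// URL for a file under the container's /media mount. The helper names are ours, and the file itself still has to be placed in /ephemeral/media on the VM.

```python
import base64
from pathlib import PurePosixPath


def base64_overhead(n_bytes: int) -> float:
    """Fractional size increase from base64-encoding n_bytes of data."""
    encoded_len = len(base64.b64encode(bytes(n_bytes)))
    return encoded_len / n_bytes - 1


def container_file_url(filename: str, mount: str = "/media") -> str:
    """file:// URL for a file under the container's media mount."""
    return (PurePosixPath(mount) / filename).as_uri()


print(f"base64 overhead: {base64_overhead(3_000_000):.1%}")
print(container_file_url("meeting_q4_planning.wav"))
```

The resulting URL goes in the same audio_url field used in the script above, in place of the data URL.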

Use Case 3: Video Summarisation & Temporal Q&A

Video is where Nemotron 3 Nano Omni's hybrid architecture shows the most dramatic efficiency gains — up to ~9.2× greater effective system capacity versus alternative open omni models at the same per-user interactivity threshold, thanks to 3D-conv spatiotemporal processing and Efficient Video Sampling. The model accepts mp4 files up to two minutes long; for 1080p content, sample at 1 FPS / 128 frames, and for 720p or lower, you can push to 2 FPS / 256 frames (which is what we configured at server launch).

For video tasks, NVIDIA recommends thinking mode — the chain-of-thought helps the model integrate temporal information across frames before answering. Below is a complete pipeline that produces both a dense summary and answers specific temporal questions about a product demo video.

# video_pipeline.py — dense summary + temporal Q&A on a product demo
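from pathlib import PurePosixPath
from openai import OpenAI

client = OpenAI(base_url="http://<YOUR_VM_PUBLIC_IP>:8000/v1", api_key="EMPTY")
MODEL = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16"

# Path as seen inside the vLLM container: the /media mount from Step 6,
# backed by /ephemeral/media on the VM. PurePosixPath builds the file:// URL
# without touching the client's filesystem, so it also works from Windows.
video_url = PurePosixPath("/media/product_demo.mp4").as_uri()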

REASONING_BUDGET = 16384
GRACE_PERIOD = 1024

def ask_video(question: str, use_audio: bool = True) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url}},
                {"type": "text", "text": question},
            ],
        }],
        max_tokens=20480,
        temperature=0.6,
        top_p=0.95,
        extra_body={
            "thinking_token_budget": REASONING_BUDGET + GRACE_PERIOD,
            "chat_template_kwargs": {
                "enable_thinking": True,
                "reasoning_budget": REASONING_BUDGET,
            },
            # Set to True for video+audio joint reasoning (e.g. tutorials, lectures)
            "mm_processor_kwargs": {"use_audio_in_video": use_audio},
        },
    )
    return response.choices[0].message.content

# 1. Dense scene-by-scene summary
summary = ask_video(
    "Provide a dense, scene-by-scene summary of this product demo. "
    "For each scene, include the visible UI elements, what action the "
    "presenter takes, and any on-screen text."
)
print("=== DENSE SUMMARY ===")
print(summary)

# 2. Specific temporal Q&A — the model must integrate across frames
answer = ask_video(
    "At what point in the demo does the presenter switch from the "
    "dashboard view to the settings panel, and what configuration "
    "change do they make there?"
)
print("\n=== TEMPORAL Q&A ===")
print(answer)

For a 90-second product demo of a hypothetical analytics dashboard, the dense summary returns:

=== DENSE SUMMARY ===
Scene 1 (0:00–0:12) — The video opens on a dark-mode dashboard titled
"Revenue Analytics — Q4 2025." Visible UI elements include a top
navigation bar with tabs labeled Overview, Segments, Forecasts, and
Settings. A line chart in the centre shows weekly revenue trending
upward. The presenter's cursor hovers over the "Segments" tab.

Scene 2 (0:12–0:34) — The presenter clicks "Segments." The view
transitions to a stacked bar chart showing four product lines. On-screen
text reads "Cloud Services now contributes 59% of total revenue." The
presenter narrates the year-over-year growth rates while highlighting
each bar in turn.

Scene 3 (0:34–0:51) — A tooltip appears showing the exact dollar value
for the Cloud Services segment ($4,820M). The presenter clicks a small
gear icon in the top right, and the Settings panel slides in from the
right edge of the screen.

Scene 4 (0:51–1:18) — In the Settings panel, the presenter toggles
"Constant-currency adjustment" from Off to On. The chart in the
background re-renders, and the Cloud Services bar shrinks slightly...

And the temporal-reasoning question returns:

=== TEMPORAL Q&A ===
The presenter switches from the dashboard view to the settings panel at
approximately 0:34, by clicking a small gear icon in the top-right
corner of the dashboard. In the settings panel, the only configuration
change made is toggling "Constant-currency adjustment" from Off to On —
this causes the underlying revenue chart to re-render with adjusted
figures, after which the presenter returns to the dashboard view at
roughly 1:18.

Both responses required the model to track UI state across frames, not just describe individual screenshots. This is precisely what 3D convolutions and EVS are built for, and it's why a single Nemotron 3 Nano Omni call replaces what would otherwise be a video-frame-extractor + per-frame VLM + temporal-reasoning LLM pipeline.

💡

Frame-Sampling Tip: If your video has fast motion (sports, gameplay) or rapidly changing UI, push toward 256 frames at 2 FPS. For static talking-head content, 64–128 frames at 1 FPS gives the same accuracy at half the prefill VRAM. The --video-pruning-rate 0.5 EVS flag we set at launch automatically discards redundant tokens regardless of which sampling rate you choose.
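A quick way to reason about these settings is to count how many frames a clip actually yields. The sketch below assumes the fps value sets the sampling rate and num_frames acts as an upper cap. That reading of the server's media options is our assumption, so check it against your vLLM version.

```python
import math


def frames_sampled(duration_s: float, fps: float, num_frames: int) -> int:
    """Frames taken from a clip: duration * fps, capped at num_frames."""
    return min(num_frames, math.floor(duration_s * fps))


# Settings mentioned in this guide, applied to clips of different lengths.
print(frames_sampled(120, fps=2, num_frames=256))  # 240: two-minute clip at 2 FPS
print(frames_sampled(120, fps=1, num_frames=128))  # 120: two-minute clip at 1 FPS
print(frames_sampled(90, fps=2, num_frames=256))   # 180: the 90-second demo above
```

Under that assumption, both recommended pairs cover a full two-minute clip without hitting the frame cap.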

Disabling "Thinking" Mode for Latency-Sensitive Tasks

Reasoning mode is on by default and is the right choice for long documents, video summarisation, and any multi-step analysis. But for short, latency-sensitive requests — image classification, simple ASR, single-fact extraction — the chain-of-thought adds tokens you don't need. Disable it per-request with the chat_template_kwargs override:

from openai import OpenAI
client = OpenAI(base_url="http://<YOUR_VM_PUBLIC_IP>:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16",
    messages=[{"role": "user",
               "content": "Classify this invoice as 'paid', 'overdue', or 'pending' in one word."}],
    max_tokens=16,
    temperature=0.2,
    extra_body={
        "top_k": 1,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print(response.choices[0].message.content)
# → "overdue"

For text and image inputs, NVIDIA recommends keeping reasoning mode on. For pure ASR, video, and audio Q&A, try both modes against your eval set — the right answer depends on how much temporal/contextual integration the question requires.

Why Deploy Nemotron 3 Nano Omni on Hyperstack?

Hyperstack is a cloud platform purpose-built to accelerate AI and machine learning workloads, and Nemotron 3 Nano Omni's multimodal pipeline is exactly the kind of GPU-bound, memory-sensitive deployment Hyperstack is engineered for:

01
On-Demand H100 Access
Spin up a single NVIDIA H100 80GB in minutes, exactly the minimum-spec GPU NVIDIA lists for the BF16 checkpoint.
02
High-Speed Ephemeral NVMe
62 GB BF16 weights load from local NVMe in 60 to 90 seconds on warm restarts, with no slow re-downloads from Hugging Face.
03
Seamless Scale-Up Path
When you outgrow a single H100, move to multi-GPU H100, H200, or B200 flavours without rewriting your serving stack.
04
Hibernate to Reduce Costs
Pause compute billing between batches while keeping the cached model and your full setup intact. See our GPU pricing for details.
05
Native vLLM and Docker Workflow
Pre-configured CUDA images mean the docker run command above runs without driver wrangling or kernel patches.

FAQs

What is NVIDIA Nemotron 3 Nano Omni?

Nemotron 3 Nano Omni is NVIDIA's open-weight multimodal model that unifies video, audio, image, and text understanding in a single 30B-parameter Mamba2-Transformer hybrid MoE (with ~3B active parameters per token), released under the NVIDIA Open Model Agreement for commercial use.

Can Nemotron 3 Nano Omni run on a single H100?

Yes. The BF16 checkpoint is 62 GB and fits on a single NVIDIA H100 80GB, which is the minimum spec NVIDIA lists. FP8 (33 GB) and NVFP4 (21 GB) variants run on smaller GPUs with negligible accuracy loss.

What is the maximum context window?

Up to 256K tokens. On a single H100 with BF16, 128K is a comfortable practical setting that leaves enough VRAM for the KV cache plus a reasonable batch size.

What input formats does the model accept?

Video as mp4 up to 2 minutes, audio as wav or mp3 up to 1 hour, images as jpeg/png, and text. PDFs must be rendered to images first — the API does not accept raw PDF uploads.

Why do I need to install vllm[audio] manually?

The upstream vllm/vllm-openai:v0.20.0 Docker image ships without audio decoders to keep the image size down. Audio support requires librosa and related packages, which are pulled in by the [audio] extra. Running the install as part of the container's startup command (as shown in Step 6) is the recommended pattern.

Is reasoning mode worth the extra latency?

For long documents, video summarisation, and any task requiring temporal or cross-modal integration — yes, the accuracy gains are substantial and reasoning is on by default. For short, single-fact tasks (classification, simple ASR), disable it via chat_template_kwargs.enable_thinking: false for a cleaner, faster response.

What's the easiest way to scale beyond a single H100?

Switch to the FP8 or NVFP4 quantised checkpoints to free up VRAM for higher concurrency on the same GPU, or move to a multi-GPU Hyperstack flavour and add --tensor-parallel-size N to the vllm serve command. Both paths are drop-in changes that don't require modifying client code.

Fareed Khan

4 May 2026

Deploy Kimi K2.6 on Hyperstack: A Step-by-Step Guide for Coders

Kimi K2.6 is an open-weight, native multimodal agentic model from Moonshot AI, engineered for state-of-the-art coding and autonomous agent workflows. It uses a sparse Mixture-of-Experts architecture with 1 trillion total parameters, of which only 32 billion are activated per forward pass, allowing it to match or beat closed models like GPT-5.4 and Claude Opus 4.6 on coding and agentic benchmarks while staying efficient enough to self-host. With a native 256K-token context window, native INT4 quantisation, and the ability to coordinate up to 300 sub-agents across 4,000 steps, it handles repository-level engineering, hour-long visual analysis, and persistent 24/7 background agents in a single autonomous run.

Kimi K2.6 is built on the same architectural foundation as K2.5 with substantial training and post-training upgrades. Here is how the architecture works:

  1. Sparse MoE Routing: 384 experts in total, with 8 selected per token plus 1 shared expert, so only ~32B of the 1T parameters are active at any moment.
  2. Multi-Head Latent Attention (MLA): Compresses the key-value cache into a low-rank latent space, drastically reducing memory pressure for long contexts compared with vanilla multi-head attention.
  3. Native INT4 Quantisation: Weights are released directly in INT4, the same scheme used in K2-Thinking, which is what brings the on-disk size down to ~595 GB and lets the model fit on a single 8-GPU node.
  4. Preserve Thinking: Retains full reasoning traces across multi-turn interactions, which materially improves performance in coding agent scenarios where context across iterations matters.
  5. Interleaved Thinking + Multi-Step Tool Calls: The model can reason between tool calls instead of producing one monolithic plan upfront, which is what makes it effective on workflows that span thousands of steps.
  6. Native Multimodal Fusion (MoonViT): A 400M-parameter vision encoder is integrated directly, so the model accepts images and video frames natively alongside text.
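The routing in item 1 can be sketched in a few lines of plain Python. This is a generic top-k router under stated assumptions, not Moonshot's implementation: the softmax over the selected scores and the random router logits are illustrative choices, and only the expert counts (384 routed, 8 selected, 1 shared) come from the list above.

```python
import math
import random

NUM_EXPERTS = 384  # routed experts, from the list above
TOP_K = 8          # routed experts selected per token

def route(logits: list[float], k: int = TOP_K) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalise their scores."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# Stand-in router scores for one token (assumption: random, for illustration).
rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

weights = route(router_logits)
active_experts = len(weights) + 1  # 8 routed experts plus the always-on shared expert
print(f"{active_experts} of {NUM_EXPERTS + 1} experts run for this token")
```

Only the selected experts' feed-forward weights are used for the token, which is why the active parameter count stays near 32B even though the full model holds 1T.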

Kimi K2.6 Features

Kimi K2.6 goes well beyond chat. Its design targets the failure modes that show up in real, long-running engineering work:

  • Long-Horizon Coding: Generalises across Rust, Go, and Python, and across front-end, DevOps, and performance optimisation. In Moonshot's case studies, it sustained autonomous runs of 12 hours or more with 4,000+ tool calls without losing the plot.
  • State-of-the-Art Open Coding: Posts 58.6 on SWE-Bench Pro, 80.2 on SWE-Bench Verified, 76.7 on SWE-Bench Multilingual, and 89.6 on LiveCodeBench v6 — competitive with or ahead of leading closed models.
  • Coding-Driven Design: Turns plain prompts and visual inputs into production-ready interfaces with structured layouts, scroll-triggered animations, and even basic full-stack flows including authentication and database operations.
  • Elevated Agent Swarm: Scales to 300 sub-agents and 4,000 coordinated steps in a single run, decomposing large tasks into parallel domain-specialised subtasks (compared with 100 sub-agents and 1,500 steps in K2.5).
  • Proactive 24/7 Agents: Demonstrated 5-day autonomous engineering worklogs powering monitoring, incident response, and system operations as a persistent background agent.
  • Multimodal & Long Context: 256K-token context with native image and video understanding through the integrated MoonViT encoder.

[Chart: Kimi K2.6 vs leading models on coding & agentic benchmarks. Series: Kimi K2.6, GPT-5.4 (xhigh), Claude Opus 4.6, Gemini 3.1 Pro. Higher is better. Source: Moonshot AI Kimi K2.6 model card.]

How to Deploy Kimi K2.6 on Hyperstack

Now, let's walk through the step-by-step process of deploying the necessary infrastructure.

Step 1: Accessing Hyperstack

First, you'll need an account on Hyperstack.

  • Go to the Hyperstack website and log in.
  • If you are new, create an account and set up your billing information. Our documentation can guide you through the initial setup.

Step 2: Deploying a New Virtual Machine

From the Hyperstack dashboard, we will launch a new GPU-powered VM.

  • Initiate Deployment: Look for the "Deploy New Virtual Machine" button on the dashboard and click it.
      
  • Select Hardware Configuration: Kimi K2.6 weights are roughly 595 GB on disk after native INT4 quantisation, so you need at least 640 GB of total VRAM. Choose the "8xH100-80G-PCIe" flavour. This is the minimum practical configuration; for production-grade context lengths and concurrency, 8x H200-141G-SXM5 is the official reference setup.

  • Choose the Operating System: Select the "Ubuntu Server 22.04 LTS R535 CUDA 12.2 with Docker" image. This provides a ready-to-use environment with all necessary drivers and Docker pre-installed.

  • Select a Keypair: Choose an existing SSH keypair from your account to securely access the VM.
  • Network Configuration: Ensure you assign a Public IP to your Virtual Machine. This is crucial for remote management and connecting your local development tools.
  • Review and Deploy: Double-check your settings and click the "Deploy" button.

Step 3: Accessing Your VM

Once your VM is running, you can connect to it.

  1. Locate SSH Details: In the Hyperstack dashboard, find your VM's details and copy its Public IP address.

  2. Connect via SSH: Open a terminal on your local machine and use the following command, replacing the placeholders with your information.

    # Connect to your VM using your private key and the VM's public IP
    ssh -i [path_to_your_ssh_key] ubuntu@[your_vm_public_ip]

Replace [path_to_your_ssh_key] with the path to your private SSH key file and [your_vm_public_ip] with the actual IP address of your VM.

Once connected, you should see a welcome message indicating you're logged into your Hyperstack VM.

Now that we are inside the VM, we will prepare storage, download the model, and use Docker to launch the vLLM server.

Step 4: Create a Model Cache Directory

Kimi K2.6's INT4 weights are roughly 595 GB on disk, which is too large for the root partition on most VM images. We'll place them on the high-speed ephemeral disk, which on this Hyperstack flavour exposes around 6 TB of NVMe-backed storage at /ephemeral.

# Create a directory for the Hugging Face model cache
sudo mkdir -p /ephemeral/hug

# Grant full read/write permissions to the directory
sudo chmod -R 0777 /ephemeral/hug

# Confirm there is enough free space (you should see ~6 TB available on /ephemeral)
df -h /ephemeral

These commands create a folder named hug on the /ephemeral disk, set its permissions so that the Docker container can read and write the model files, and confirm the free space. Storing weights on the ephemeral NVMe disk also dramatically reduces model load time.

Step 5: Pre-Download the Model Weights

The 595 GB download is large enough that it's worth running it once outside Docker so we can monitor progress and recover from interruptions cleanly. We'll use the official Hugging Face CLI.

# Install the Hugging Face Hub CLI with xet support for fast parallel downloads
pip3 install -U "huggingface_hub[hf_xet]"

# Make sure the CLI binary is on PATH (pip installs to ~/.local/bin)
export PATH="$HOME/.local/bin:$PATH"
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc

# Run the download inside tmux so it survives any SSH disconnect
sudo apt install -y tmux
tmux new -s download

# Inside the tmux session, download the model directly to the ephemeral disk
hf download moonshotai/Kimi-K2.6 \
  --local-dir /ephemeral/hug/kimi-k2.6 \
  --max-workers 8

You can detach from the tmux session at any time with Ctrl+b then d, and reattach later with tmux attach -t download. On a Hyperstack VM with a healthy network path to Hugging Face, the full 595 GB completes in roughly 15 minutes.

💡

Important: Always pass --local-dir /ephemeral/hug/kimi-k2.6. Without it, the CLI defaults to ~/.cache/huggingface on the small root disk and will fail with "No space left on device" partway through the download.
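Before starting a download of this size, it can help to confirm the target disk has room. Below is a minimal sketch using only the Python standard library. The `has_room` helper and its 10% safety margin are assumptions for illustration, not part of the Hugging Face CLI.

```python
import shutil

GB = 1000 ** 3  # decimal gigabytes, matching how the weights size is quoted

def free_gb(path: str) -> float:
    """Free space on the filesystem that contains `path`, in GB."""
    return shutil.disk_usage(path).free / GB

def has_room(path: str, required_gb: float, margin: float = 1.1) -> bool:
    """True if `path` has at least `required_gb` free, plus a safety margin."""
    return free_gb(path) >= required_gb * margin
```

On the VM, `has_room("/ephemeral", 595)` should return True before you run the download, while the same check against the root disk is likely to return False.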

Step 6: Launch the vLLM Server

We will use the latest vllm-openai Docker image. The flags below are deliberately tuned to fit Kimi K2.6 onto an 8x H100-80G node: the model occupies about 71.4 GiB per GPU once sharded, which leaves little headroom for the KV cache, so we keep max-model-len small and max-num-seqs low for the initial bring-up.

# Pull the latest vLLM OpenAI image from Docker Hub
docker pull vllm/vllm-openai:latest

# Run the container with the specified configuration
docker run -d \
  --gpus all \
  --ipc=host \
  --network host \
  --name vllm_kimi_k26 \
  -v /ephemeral/hug/kimi-k2.6:/models/kimi-k2.6 \
  vllm/vllm-openai:latest \
  /models/kimi-k2.6 \
  --served-model-name Kimi-K2.6 \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --reasoning-parser kimi_k2 \
  --tool-call-parser kimi_k2 \
  --mm-encoder-tp-mode data \
  --max-model-len 2048 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.93 \
  --enforce-eager \
  --host 0.0.0.0 \
  --port 8000

The key Docker and vLLM flags in this command are:

  • --gpus all: Use all 8 NVIDIA H100 GPUs on the host machine.
  • --ipc=host: Share the host's IPC namespace, which is required for high-bandwidth multi-GPU communication.
  • --network host: Expose the container directly on the host network for simpler API access.
  • -v /ephemeral/hug/kimi-k2.6:/models/kimi-k2.6: Mount the pre-downloaded weights from the ephemeral disk into the container.
  • --served-model-name Kimi-K2.6: The friendly name clients will use in the "model" field of API requests.
  • --tensor-parallel-size 8: Shard the model across all 8 GPUs. This is mandatory; the model does not fit otherwise.
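  • --mm-encoder-tp-mode data: Runs the MoonViT vision encoder in data-parallel mode, keeping a full copy of the encoder on each GPU and splitting the input batch across them rather than sharding the encoder weights. The flag applies because K2.6 is natively multimodal.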
  • --max-model-len 2048: Caps the context window at 2048 tokens for the initial smoke test. The model supports 256K natively, but each extra token of context consumes scarce KV cache memory on 80 GB GPUs. Increase this gradually once the server is stable.
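To see why the context cap matters, here is the rough per-GPU memory arithmetic behind these flags. It is a back-of-the-envelope sketch: the 80 GiB capacity is a nominal assumption (usable memory is slightly lower), while the 0.93 utilisation and the 71.4 GiB weight footprint come from this guide.

```python
GPU_MEMORY_GIB = 80.0          # assumption: nominal capacity of one H100-80G
GPU_MEMORY_UTILIZATION = 0.93  # --gpu-memory-utilization from the launch command
WEIGHTS_PER_GPU_GIB = 71.4     # per-GPU weight footprint quoted in this guide

# vLLM may use at most this much memory on each GPU
budget_gib = GPU_MEMORY_GIB * GPU_MEMORY_UTILIZATION

# Whatever the weights do not occupy is left for KV cache and activations
headroom_gib = budget_gib - WEIGHTS_PER_GPU_GIB

print(f"vLLM budget per GPU: {budget_gib:.1f} GiB")
print(f"Left for KV cache and activations: {headroom_gib:.1f} GiB")
```

Under these assumptions only about 3 GiB per GPU remains after the weights load, which is why the initial launch keeps both the context length and the number of concurrent sequences small.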

Step 7: Verify the Deployment

First, follow the container logs to monitor the model loading process. Loading 595 GB of weights and sharding across 8 GPUs takes around 5–10 minutes on PCIe-connected H100s.

docker logs -f vllm_kimi_k26

The process is complete when you see: INFO: Uvicorn running on http://0.0.0.0:8000. On our test deployment, the load step reported "Model loading took 71.37 GiB memory" per GPU and a total init time of about six minutes.
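If you would rather script the wait than watch the logs, a small polling loop works. This is a minimal sketch, assuming the server answers GET /v1/models once it is up; the helper names, the 10-second interval, and the 60-attempt limit are illustrative choices.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

def http_probe(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet
        return False

def wait_until_ready(probe: Callable[[], bool], attempts: int = 60,
                     interval_s: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `probe` until it succeeds or `attempts` run out."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(interval_s)
    return False
```

From the VM, `wait_until_ready(lambda: http_probe("http://localhost:8000"))` returns True once the server responds, or False after roughly ten minutes of failed attempts.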

Next, add a firewall rule in your Hyperstack dashboard to allow inbound TCP traffic on port 8000. This is essential for external access. 

Finally, test the API. Run this from your local machine (not the VM), replacing the IP with your VM's public IP. We pass "thinking": false via chat_template_kwargs to use Instant mode, so the model returns a direct answer without reasoning traces — perfect for a "hello world" health check.

# Test the API endpoint from your local terminal
curl http://<YOUR_VM_PUBLIC_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "Kimi-K2.6",
    "messages": [{"role": "user", "content": "hi"}],
    "max_tokens": 50,
    "chat_template_kwargs": {"thinking": false}
  }'

A successful response looks like this:

{
  "id": "chatcmpl-9ab83c6a4e1ab454",
  "object": "chat.completion",
  "model": "Kimi-K2.6",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " Hi! How can I help you today?",
        "reasoning": null
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 10, "completion_tokens": 10, "total_tokens": 20}
}

The "reasoning" field is null because we used Instant mode. The Kimi K2.6 model is now successfully deployed on Hyperstack.

💡

The Moonshot team recommends the following sampling parameters for Kimi K2.6:

# Thinking mode (default)
temperature=1.0, top_p=0.95

# Instant mode (thinking disabled)
temperature=0.6, top_p=0.95
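One way to avoid mixing up the two presets is to derive them from a single flag. The helper below is a hypothetical convenience, not part of any Moonshot or vLLM API. It simply packages the values quoted above together with the chat_template_kwargs switch used in this guide.

```python
def sampling_kwargs(thinking: bool) -> dict:
    """Request kwargs using the recommended sampling values for each mode."""
    return {
        "temperature": 1.0 if thinking else 0.6,
        "top_p": 0.95,
        # extra_body is forwarded as-is by the OpenAI Python client
        "extra_body": {"chat_template_kwargs": {"thinking": thinking}},
    }
```

With an OpenAI-compatible client, the result can be unpacked straight into a request, for example `client.chat.completions.create(model="Kimi-K2.6", messages=messages, **sampling_kwargs(False))`.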

Step 8: Hibernating Your VM (OPTIONAL)

When you are finished with your current workload, you can hibernate your VM to avoid incurring unnecessary costs:

  • In the Hyperstack dashboard, locate your virtual machine.
  • Look for a "Hibernate" option.
  • Click to hibernate the VM, which will stop billing for compute resources while preserving your setup.

Switching Between Thinking and Instant Modes

Now that the vLLM server is running, we can interact with it from Python using the standard OpenAI client. First, install the library:

# Install the OpenAI Python client library to interact with the vLLM server
pip3 install openai

Instantiate an OpenAI-compatible client pointed at the local vLLM server. The api_key can be any non-empty placeholder because vLLM does not enforce authentication by default.

from openai import OpenAI

# Create an OpenAI-compatible client that points to a local vLLM server.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # Local API endpoint exposing OpenAI-style routes
    api_key="EMPTY",  # Placeholder key; vLLM does not enforce API keys
)

Kimi K2.6 ships with thinking mode enabled by default, which is what gives it strong reasoning behaviour but also burns extra tokens. For straightforward tasks like simple code generation or direct factual questions, you can disable thinking via chat_template_kwargs.thinking:

# Define the conversation payload sent to the model.
# Here, the user asks for a short Python script that reverses a string.
messages = [
    {"role": "user", "content": "Write a quick Python script to reverse a string."}
]

# Send a chat completion request — Instant mode (thinking disabled)
chat_response = client.chat.completions.create(
    model="Kimi-K2.6",
    messages=messages,
    max_tokens=500,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "chat_template_kwargs": {"thinking": False},
    },
)

print("Response:", chat_response.choices[0].message.content)

With "thinking": False set, the model skips internal reasoning and returns a direct, concise answer:

Response: Here is a quick and efficient Python script to reverse a string using slicing:

```python
def reverse_string(text):
    reversed_text = ""
    for char in text:
        reversed_text = char + reversed_text
    return reversed_text

# Example usage
if __name__ == "__main__":
    user_input = input("Enter a string: ")
    print(reverse_string(user_input))
```
...

To re-enable thinking mode for tasks that benefit from explicit reasoning (debugging, planning, math), simply pass "thinking": True (or omit the flag entirely — thinking is the default). The reasoning trace will appear in the message.reasoning field and the final answer in message.content:

# Thinking mode — same call, just enable thinking
chat_response = client.chat.completions.create(
    model="Kimi-K2.6",
    messages=[{"role": "user", "content": "Which is bigger: 9.11 or 9.9? Think carefully."}],
    max_tokens=500,
    temperature=1.0,
    top_p=0.95,
    extra_body={"chat_template_kwargs": {"thinking": True}},
)

print("Reasoning:", chat_response.choices[0].message.reasoning)
print("Answer: ", chat_response.choices[0].message.content)

Multimodal Capabilities with Kimi K2.6

Kimi K2.6 is natively multimodal through its integrated MoonViT encoder, so it can analyse images and video frames alongside text in the same request. The image input format is identical to the OpenAI vision API:

# Build a multimodal chat prompt with one user message:
# - an image URL
# - a text question about the image
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://huggingface.co/moonshotai/Kimi-K2.6/resolve/main/figures/kimi-logo.png"
                }
            },
            {
                "type": "text",
                "text": "Describe this image in detail."
            }
        ]
    }
]

chat_response = client.chat.completions.create(
    model="Kimi-K2.6",
    messages=messages,
    max_tokens=600,
    temperature=0.6,
    top_p=0.95,
)

print("Response:", chat_response.choices[0].message.content)

The model returns a structured description of the image, demonstrating that the MoonViT vision encoder is fully wired through to the language model. You can pass either a public URL or a base64 data URI in the "url" field.
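For local files, the base64 route looks roughly like the sketch below. The helper names are hypothetical; the only assumptions are that the file extension identifies the image type and that the server accepts standard data URIs in the "url" field, as described above.

```python
import base64
import mimetypes
from pathlib import Path

def to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data URI."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"

def image_part(path: str) -> dict:
    """Build the image content part used in the messages list above."""
    return {"type": "image_url", "image_url": {"url": to_data_uri(path)}}
```

Replacing the image entry in the content list with `image_part("invoice.png")` (a placeholder filename) sends the file inline instead of asking the server to fetch a URL, which is useful when the VM cannot reach the image host.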

⚠️

Note on video input: Per the Moonshot model card, chat with video content is currently an experimental feature and is officially supported only on Moonshot's hosted API. Self-hosted vLLM and SGLang deployments accept image inputs reliably; video support varies by inference engine version, so test with short clips first if you depend on it.

Preserve Thinking for Multi-Turn Coding Agents

One of the more underrated additions in Kimi K2.6 is preserve_thinking mode. By default, the assistant's previous reasoning traces are dropped between turns. With preserve_thinking enabled, the full reasoning content is retained across turns, which Moonshot reports materially improves coding-agent performance because the model can continue from its earlier thought process instead of re-deriving it.

Because it is off by default, you enable it per request. The example below shows how to do that on a self-hosted vLLM endpoint:

messages = [
    {"role": "user", "content": "Tell me three random numbers."},
    {
        "role": "assistant",
        "reasoning_content": "I'll start by listing five numbers: 473, 921, 235, 215, 222, and I'll tell you the first three.",
        "content": "473, 921, 235",
    },
    {"role": "user", "content": "What are the other two numbers you have in mind?"},
]

chat_response = client.chat.completions.create(
    model="Kimi-K2.6",
    messages=messages,
    max_tokens=500,
    extra_body={
        "chat_template_kwargs": {
            "thinking": True,
            "preserve_thinking": True,
        }
    },
)

print(chat_response.choices[0].message.content)

Because the prior reasoning trace mentioned 215 and 222, the model's follow-up answer correctly references those exact numbers — something it could not do reliably without preserve_thinking. Moonshot recommends enabling preserve_thinking only when thinking mode itself is enabled.

Agentic Use Case with Kimi K2.6

One of the most powerful features of moonshotai/Kimi-K2.6 is its long-horizon agentic tool calling. Unlike standard chat where the model just generates text, an agentic workflow lets it:

  • Decide when external tools are needed
  • Call tools automatically
  • Receive tool outputs and continue reasoning
  • Complete multi-step tasks autonomously across hundreds or thousands of steps

Because we launched vLLM with --tool-call-parser kimi_k2, the server emits structured tool calls in the format Moonshot's agent frameworks expect. The Moonshot team's flagship reference framework is the Kimi Code CLI, but any OpenAI-compatible agent framework works here, including Qwen-Agent, Letta, and the Anthropic-compatible CLIs we cover later.

For self-contained examples, we'll use a small Python harness with a single filesystem tool. It is the simplest way to demonstrate K2.6's agentic loop end-to-end without pulling in a heavy framework.

import json, os, subprocess
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
WORKSPACE = "/ephemeral/agent_workspace"
os.makedirs(WORKSPACE, exist_ok=True)

# Define a single filesystem tool the agent can call
tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command inside the workspace and return stdout/stderr.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

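def run_shell(command):
    # Run the command in the workspace and return combined stdout/stderr as text
    try:
        proc = subprocess.run(command, shell=True, cwd=WORKSPACE,
                              capture_output=True, text=True, timeout=30)
    except subprocess.TimeoutExpired:
        return "error: command timed out after 30 seconds"
    return proc.stdout + proc.stderr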

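With the tool defined, the agent loop is simple: send the user message, execute any tool calls the model returns, append the tool results, and call the model again until it stops issuing tool calls. Each step requests up to 2,048 completion tokens, so relaunch vLLM with a --max-model-len well above the 2048 used for the Step 6 smoke test before running it; otherwise the prompt plus the requested completion will not fit in the context window.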

def agent_loop(user_prompt, max_steps=20):
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="Kimi-K2.6",
            messages=messages,
            tools=tools,
            tool_choice="auto",
            max_tokens=2048,
            temperature=1.0,
            top_p=0.95,
        )
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content
        for tc in msg.tool_calls:
            args = json.loads(tc.function.arguments)
            output = run_shell(args["command"])
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": output})
    return "step limit reached"
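Tool output can easily outgrow a small context window, since every result is appended to the conversation. One option is to trim long output before appending it. The helper below is a hypothetical addition to the harness, and the 4,000-character default is an arbitrary assumption (characters are only a rough proxy for tokens).

```python
def truncate_output(text: str, max_chars: int = 4000) -> str:
    """Keep the head and tail of long tool output, dropping the middle."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    omitted = len(text) - 2 * half
    head = text[:half]
    tail = text[len(text) - half:]
    return f"{head}\n... [{omitted} characters omitted] ...\n{tail}"
```

In the loop above, this would wrap the tool result, for example `output = truncate_output(run_shell(args["command"]))`. Keeping both ends preserves the command's first lines and its final error or summary, which are usually what the model needs.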

Example 1: Long-Horizon Refactor

This is the kind of task K2.6 is genuinely good at — sustained multi-step engineering work where it has to read existing code, plan a change, and apply it without losing track. We seed the workspace with a small file and ask the model to refactor it:

# Seed the workspace with a deliberately ugly file
with open(f"{WORKSPACE}/calc.py", "w") as f:
    f.write("""
def calc(a, b, op):
    if op == "add": return a + b
    if op == "sub": return a - b
    if op == "mul": return a * b
    if op == "div": return a / b
""")

print(agent_loop(
    "Refactor calc.py to use a dict-based dispatch table instead of if/elif chains, "
    "add type hints, handle division-by-zero, and add three pytest tests in test_calc.py."
))

What happens internally over the next ~10–15 tool calls:

  1. The model interprets the request and decides it needs to read calc.py first.
  2. It calls run_shell with cat calc.py, sees the existing structure.
  3. It writes a refactored version using a heredoc redirection.
  4. It writes test_calc.py with three test cases (add, mul, div-by-zero).
  5. It calls pytest test_calc.py -q and reads the output.
  6. If a test fails, it reads the failure, edits the file, and re-runs — this is the part where K2.6's preserve-thinking behaviour matters most.
  7. It returns a summary of changes.

This is a deliberately small case. In Moonshot's published case studies, the same loop scales up to 13-hour autonomous runs that modify 4,000+ lines across 1,000+ tool calls, including a complete overhaul of an 8-year-old open-source matching engine that achieved a 185% throughput gain.

Example 2: Coding-Driven Design (Build a Website)

Kimi K2.6's "coding-driven design" capability turns a single prompt into a self-contained website. Using the same agent harness:

print(agent_loop(
    "Build a single-page cat-themed website at index.html. Include a hero section, "
    "a description of three cat breeds in cards, a scroll-triggered fade-in animation, "
    "and inline CSS so the file works standalone. Then verify the file exists."
))

The model writes a complete index.html in /ephemeral/agent_workspace with structured layout, embedded CSS, IntersectionObserver-based scroll animations, and a polished hero section — the kind of output that Moonshot benchmarks against Google AI Studio on their internal Kimi Design Bench. Open the file in a browser and you'll see a working, styled page generated end-to-end without manual edits.

Integration with Third-Party Coding Assistants

Because vLLM exposes the OpenAI-compatible /v1/chat/completions endpoint and we enabled the Kimi K2 tool-call parser, our locally-hosted Kimi K2.6 plugs into almost every modern coding assistant. Below are three of the most useful integrations.

Integrating with Kimi Code CLI

Moonshot's official agent framework for K2.6 is the Kimi Code CLI, available at kimi.com/code. It is the framework Moonshot uses to produce the long-horizon coding case studies referenced earlier, so it gets the most out of K2.6's interleaved thinking and multi-step tool-calling design.

By default Kimi Code talks to Moonshot's hosted API, but it can be redirected to your local vLLM endpoint by setting an OpenAI-compatible base URL in its config. The exact config keys vary slightly between Kimi Code releases, so consult the version notes shipped with your install — but conceptually you point its API base at http://<YOUR_VM_PUBLIC_IP>:8000/v1 and use Kimi-K2.6 as the model name.

Integration with Claude Code

Because vLLM also exposes Anthropic-compatible routes (/v1/messages) when configured correctly, you can point Claude Code directly at our local Kimi K2.6 server. Moonshot specifically advertises an Anthropic-compatible API for K2.6, which makes this integration first-class.

Set the following environment variables before launching Claude Code:

# Point Claude Code to the local vLLM server
export ANTHROPIC_BASE_URL="http://localhost:8000"
export ANTHROPIC_API_KEY="dummy"
export ANTHROPIC_AUTH_TOKEN="dummy"

# Use Kimi K2.6 for every Claude Code model tier
export ANTHROPIC_DEFAULT_OPUS_MODEL="Kimi-K2.6"
export ANTHROPIC_DEFAULT_SONNET_MODEL="Kimi-K2.6"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="Kimi-K2.6"

# Launch Claude Code
claude

Environment Variables Overview

Variable | Description
ANTHROPIC_BASE_URL | Points to your vLLM server (here, http://localhost:8000).
ANTHROPIC_API_KEY | Required by the client; can be any dummy value for local vLLM.
ANTHROPIC_AUTH_TOKEN | Required placeholder for the Claude authentication layer.
ANTHROPIC_DEFAULT_OPUS_MODEL | The model identifier (matches --served-model-name from docker run).
💡

Efficiency Tip: Add these environment variables to your shell profile (e.g., .bashrc or .zshrc) or define them inside ~/.claude/settings.json for a persistent setup.

⚠️

Performance Warning: Claude Code injects a per-request hash in the system prompt, which can break prefix caching. While fixed in recent vLLM releases, users on older versions should add "CLAUDE_CODE_ATTRIBUTION_HEADER": "0" to the "env" section of their settings.

Once Claude Code launches, verify the connection with a simple prompt. If the model responds correctly, your local agentic coding environment is fully operational and you can leverage Kimi K2.6's long-horizon reasoning capabilities directly for complex development tasks.

Deployment with OpenClaw

Kimi K2.6's reasoning-heavy architecture makes it a strong engine for OpenClaw, the self-hosted open-source agent featured in Moonshot's own enterprise testimonials (Ollama, Fireworks, and others have noted that K2.6 powers all of the "claws" reliably). Pointing OpenClaw at your local vLLM server gives you a fully autonomous coding environment in your terminal without any external API latency.

# Install OpenClaw (Requires Node.js 22+)
curl -fsSL https://molt.bot/install.sh | bash

# Set a dummy API key for the local vLLM endpoint
export OPENCLAW_API_KEY="EMPTY"

# Launch the OpenClaw Dashboard
openclaw dashboard

Configuring the Local Provider

To bridge OpenClaw to your vLLM container, modify ~/.openclaw/openclaw.json. Merge the following provider block into your existing settings:

{
  "models": {
    "providers": {
      "local-vllm": {
        "baseUrl": "http://localhost:8000/v1",
        "apiKey": "EMPTY",
        "api": "openai-completions",
        "models": [
          {
            "id": "Kimi-K2.6",
            "name": "Kimi-K2.6-Local",
            "reasoning": true,
            "contextWindow": 262144
          }
        ]
      }
    }
  }
}
💡

Configuration Tip: The id field must match your --served-model-name from the docker run command (in our case, Kimi-K2.6). The contextWindow here is the model's native maximum (262,144 tokens); the actual usable window is whatever you set --max-model-len to when launching vLLM.
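If you already have other providers configured, merging by hand is error-prone. Below is a minimal sketch of a recursive merge that adds the provider block without discarding existing keys. The `deep_merge` helper and the `existing` settings are hypothetical; in practice you would load and save ~/.openclaw/openclaw.json with the json module.

```python
import json

def deep_merge(base: dict, patch: dict) -> dict:
    """Return a copy of `base` with `patch` merged in; nested dicts merge recursively."""
    merged = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# The provider block from this guide
provider_patch = {"models": {"providers": {"local-vllm": {
    "baseUrl": "http://localhost:8000/v1",
    "apiKey": "EMPTY",
    "api": "openai-completions",
    "models": [{"id": "Kimi-K2.6", "name": "Kimi-K2.6-Local",
                "reasoning": True, "contextWindow": 262144}],
}}}}

# Hypothetical existing settings, for illustration only
existing = {"models": {"providers": {"other": {"baseUrl": "https://example.invalid/v1"}}}}

merged = deep_merge(existing, provider_patch)
print(json.dumps(merged, indent=2))
```

The merged result keeps the pre-existing "other" provider alongside "local-vllm", and the input dictionaries are left unmodified.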

Once configured, launch the OpenClaw TUI with openclaw tui. From here you can issue high-level coding instructions and the K2.6-backed agent will autonomously plan, edit, and verify changes against your local 8x H100 infrastructure.

Why Deploy Kimi K2.6 on Hyperstack?

Hyperstack is a cloud platform engineered specifically to accelerate AI and machine learning workloads. Here is why it's the right choice for Kimi K2.6:

🚀
Multi-GPU Availability for Trillion-Parameter Models
Kimi K2.6 needs 8x H100-80G at minimum. Hyperstack provides on-demand access to NVIDIA H100 nodes purpose-built for trillion-parameter MoE workloads.
💾
High-Speed Ephemeral Storage
Kimi K2.6's 595 GB of weights downloaded to the 6 TB ephemeral NVMe disk in roughly 15 minutes, fast enough that you can iterate on multiple model versions in a single afternoon.
Pre-Configured CUDA + Docker Images
The Ubuntu 22.04 + CUDA 12.2 + Docker image removes the entire driver-and-runtime setup phase, so the deployment goes straight from ssh to docker run.
📈
Seamless Scalability
Move from an 8x H100-80G smoke-test node to an 8x H200-141G production node with the same Docker image and the same launch flags.
💰
Hibernation for Cost Control
Pay only when actively serving. Hibernate the VM between batches and your GPU billing stops while the entire setup is preserved.
🔗
Native Compatibility with the AI Stack
vLLM, SGLang, KTransformers, Hugging Face — all of the inference engines Moonshot recommends for K2.6 run on Hyperstack with no extra configuration.

FAQs

What is Kimi K2.6?

Kimi K2.6 is Moonshot AI's open-weight, native multimodal agentic model. It uses a Mixture-of-Experts architecture with 1T total parameters and 32B activated, achieving state-of-the-art open-weight performance on long-horizon coding (58.6 on SWE-Bench Pro, 80.2 on SWE-Bench Verified) and agentic benchmarks (54.0 on HLE-Full with tools, 92.5 on DeepSearchQA F1).

What is Kimi K2.6's context window?

Kimi K2.6 supports a native 256K-token context window. The amount you can actually use at inference time depends on the --max-model-len flag and your available GPU memory; on 8x H100-80G with no other tuning, you'll typically run much shorter contexts (a few thousand tokens) to leave memory headroom.

Does Kimi K2.6 support thinking mode?

Yes. Thinking mode is enabled by default and produces explicit reasoning traces in the message.reasoning field. You can disable it for short responses by passing chat_template_kwargs.thinking = False. K2.6 also supports preserve_thinking mode for retaining reasoning across multi-turn coding agent interactions.
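A short Python sketch of both sides of this: building a request body with thinking switched off, and separating the reasoning trace from the final answer in a reply. It assumes the vLLM server accepts chat_template_kwargs as a top-level field in the request body and returns the trace under message.reasoning, as described above. The helper names and the sample reply are hypothetical.

```python
def build_request(prompt: str, thinking: bool = True) -> dict:
    """Build a chat-completions body, optionally disabling thinking mode."""
    body = {
        "model": "Kimi-K2.6",  # assumed: value of --served-model-name
        "messages": [{"role": "user", "content": prompt}],
    }
    if not thinking:
        # Thinking is on by default, so the field is only sent to turn it off.
        body["chat_template_kwargs"] = {"thinking": False}
    return body


def split_reply(response: dict) -> tuple:
    """Return (reasoning, content) from the first choice of a reply."""
    message = response["choices"][0]["message"]
    return message.get("reasoning"), message.get("content")


# Hypothetical reply, shaped like a chat-completions response with a reasoning trace.
sample_reply = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning": "2 + 2 is basic arithmetic.",
                "content": "4",
            }
        }
    ]
}

reasoning, content = split_reply(sample_reply)
print(build_request("What is 2 + 2?", thinking=False))
print(reasoning, "|", content)
```

With thinking disabled, `split_reply` returns `None` for the reasoning part whenever the field is absent, so callers can handle both modes the same way.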

What hardware is needed for Kimi K2.6?

The native INT4 weights are roughly 595 GB on disk, so you need at least 640 GB of total GPU memory. The minimum practical configuration is 8x NVIDIA H100-80G; the official reference setup is 8x H200-141G-SXM5 (1128 GB total), which gives the best performance at full context length and concurrent serving.
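The arithmetic behind these figures can be worked through in a few lines of Python. This is a back-of-the-envelope sketch only: it subtracts the 595 GB of weights from the nominal GPU memory and ignores activations, runtime overhead, and the difference between GB and GiB, so the real headroom for the KV cache will be smaller.

```python
WEIGHTS_GB = 595  # approximate size of the native INT4 weights


def total_memory_gb(gpu_count: int, gb_per_gpu: int) -> int:
    """Nominal GPU memory across the whole node."""
    return gpu_count * gb_per_gpu


def headroom_gb(gpu_count: int, gb_per_gpu: int, weights_gb: int = WEIGHTS_GB) -> int:
    """Nominal memory left after loading the weights."""
    return total_memory_gb(gpu_count, gb_per_gpu) - weights_gb


h100_total = total_memory_gb(8, 80)    # 640
h200_total = total_memory_gb(8, 141)   # 1128
h100_headroom = headroom_gb(8, 80)     # 45
h200_headroom = headroom_gb(8, 141)    # 533

print(f"8x H100-80G:  {h100_total} GB total, {h100_headroom} GB headroom")
print(f"8x H200-141G: {h200_total} GB total, {h200_headroom} GB headroom")
```

On these nominal numbers the H100 node has about 45 GB to spare across all eight GPUs, against about 533 GB on the H200 node, which is why the H100 configuration is limited to short contexts.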

What are the main use cases of Kimi K2.6?

Kimi K2.6 is best for long-horizon software development, coding-driven design (turning prompts into production-ready front-ends and lightweight full-stack apps), agent swarms with up to 300 sub-agents and 4,000 coordinated steps, and persistent 24/7 background agents for monitoring, incident response, and cross-platform automation.

Which inference engines support Kimi K2.6?

Moonshot officially recommends vLLM (≥ 0.19.1), SGLang (≥ 0.5.10), and KTransformers for CPU+GPU heterogeneous inference. The architecture is identical to K2.5, so any deployment recipe that works for K2.5 works directly for K2.6 with the same flags.
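Version minimums like these are easy to get wrong with plain string comparison, since "0.5.9" sorts after "0.5.10" as text. The sketch below compares versions numerically using only the standard library. The helper names are hypothetical, and suffixes such as "rc1" or ".post1" are ignored, which is an assumption that may not suit every release scheme.

```python
import re

# Minimum versions quoted above.
MINIMUMS = {"vllm": "0.19.1", "sglang": "0.5.10"}


def parse_version(text: str) -> tuple:
    """Turn the leading dotted number of a version string into a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(engine: str, installed: str) -> bool:
    """Check an installed version against the quoted minimum for that engine."""
    return parse_version(installed) >= parse_version(MINIMUMS[engine])


print(meets_minimum("vllm", "0.19.1"))
print(meets_minimum("sglang", "0.5.9"))

# To check a real environment, the installed version can be read with
# importlib.metadata.version("vllm") and passed to meets_minimum.
```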

Fareed Khan

27 Apr 2026