


Published: December 2, 2025

5 min read

Updated on 17 Mar 2026

Run DeepSeek OCR on Hyperstack with your Own UI

Written by Hitesh Kumar


Key Takeaways

  • DeepSeek-OCR is a multimodal OCR model designed to extract both text and document structure from images and PDFs.

  • The setup uses a Hyperstack GPU virtual machine to run DeepSeek-OCR in a private, high-performance environment.

  • The model combines a vision encoder and a language decoder to handle complex layouts such as tables and multi-column documents.

  • Deployment involves cloning the DeepSeek-OCR repository, installing Python dependencies, and configuring the runtime environment.

  • A Gradio-based web interface allows users to upload documents and view OCR results in structured Markdown output.

  • The deployed OCR service can be extended into APIs or integrated into document processing and RAG workflows.

Take Control of Your Own OCR Workflow with DeepSeek-OCR and Hyperstack

Optical Character Recognition (OCR) is the process of recognising and extracting text from a source like images or PDFs using just the visual field - it's what we do when we read!

Methods for performing OCR have existed for a while, but in the past few years (or even months), transformer-based models have become remarkably competent at it. DeepSeek, one of the world's leading AI foundation model labs, have released DeepSeek-OCR, a 3B-parameter model for quickly and easily creating your own OCR workflows.


Why is it harder to run than other DeepSeek models?

You might be used to running other AI models, like DeepSeek's LLMs, which are often available via a simple API call or a straightforward Python library like transformers. We've even made tutorials in the past that you can follow to get DeepSeek V3 running. DeepSeek-OCR is a bit more hands-on because it's not just a language model; it's a specialised multi-modal system.

It essentially has two parts: a sophisticated vision encoder that sees and understands the layout of a page (just like our eyes), and a 3-billion-parameter language decoder that reads and interprets the text from that visual information. This two-stage process is what makes it so powerful, but it also requires a more complex stack of software to run efficiently.

The setup in this guide uses vLLM, a high-throughput serving engine, to get the best possible performance. This is what adds most of the setup steps - we need to install a particular version of it along with dependencies like flash-attn. It's this requirement for a high-performance, GPU-accelerated serving environment that makes it more complex than a simple pip install package, but the payoff in speed and accuracy is well worth it.

How good is DeepSeek-OCR? 

In short: it's exceptionally good. It represents the current state-of-the-art for open-source OCR in its size group, especially when it comes to understanding real-world, complex documents.

Where traditional OCR tools might just extract a "wall of text" that loses all formatting, DeepSeek-OCR understands the structure of the document. This is its key advantage. It excels at:

  • Complex Layouts: Accurately reading multi-column articles, magazine pages, and scientific papers.

  • Tables: It doesn't just see text in a table; it understands the table's rows and columns and formats the output (as markdown) to match.

  • Mixed Content: It's highly adept at handling pages with a mix of text, code blocks, and even mathematical equations.

Because it outputs structured markdown, you're not just getting the raw text; you're getting the document's semantic structure. This makes its output immediately useful for feeding into other systems, like a RAG pipeline or a summarisation model. For its 3B-parameter size, it hits a perfect sweet spot of being incredibly accurate while still being fast enough to interpret huge documents on a single H100 GPU.
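To make "immediately useful" a little more concrete, here is a minimal sketch of turning a table from the OCR output into Python dictionaries. It assumes the table arrives as a single pipe-delimited Markdown table; if your output represents tables with HTML tags instead, you would need a different parser. The function name `parse_markdown_table` and the sample data are purely illustrative.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse the pipe-delimited table lines in `markdown` into row dicts.

    Assumes the text contains a single table whose first row is the header.
    """
    def split_row(line: str) -> list[str]:
        return [cell.strip() for cell in line.strip().strip("|").split("|")]

    lines = [ln for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []

    header = split_row(lines[0])
    rows = []
    for line in lines[1:]:
        cells = split_row(line)
        # Skip the separator row, e.g. |---|:---:|
        if all(c and set(c) <= set("-: ") for c in cells):
            continue
        rows.append(dict(zip(header, cells)))
    return rows


# Illustrative sample, not real model output.
sample = """
| Item   | Qty | Price |
|--------|-----|-------|
| Widget | 2   | 9.99  |
| Gadget | 1   | 24.50 |
"""
print(parse_markdown_table(sample))
```

Every value comes back as a string, so a real pipeline would still need to convert numbers and validate rows before loading them anywhere.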

How to set up DeepSeek-OCR on your own Hyperstack VM, step-by-step

We'll take you through the whole process from start to end to get a really simple and basic OCR workflow running on your own Hyperstack VM. 

Step 0: Getting a Hyperstack VM

This guide assumes you've just spun up a new Linux VM on our platform and can access it via SSH. If you haven't done this before, please see our getting started guide in our documentation.

Step 1: Clone the DeepSeek-OCR repo 

# Clone the DeepSeek-OCR repository
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git

Step 2: Install UV (the package manager):

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

Step 3: Create a Python virtual environment:

uv venv deepseek-ocr --python 3.12.9
source deepseek-ocr/bin/activate

Step 4: Install vLLM and other requirements

cd DeepSeek-OCR

# Get vllm whl
wget https://github.com/vllm-project/vllm/releases/download/v0.8.5/vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
unzip vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl -d vllm-0.8.5+cu118-whl

# Install requirements
uv pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
uv pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
uv pip install -r requirements.txt
uv pip install flash-attn==2.7.3 --no-build-isolation
uv pip install uvicorn fastapi gradio --upgrade
uv pip install transformers==4.57.1 --upgrade

This step may take a while, there are a lot of dependencies!
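Once the installs finish, it can be worth confirming that the pinned versions actually landed in the virtual environment. The following is a small sketch using Python's standard `importlib.metadata`; the `EXPECTED` dictionary simply mirrors the pins in the commands above, and the helper name `check_versions` is our own. Local build tags such as `+cu118` are ignored when comparing.

```python
from importlib import metadata

# Pins taken from the install commands above.
EXPECTED = {
    "torch": "2.6.0",
    "vllm": "0.8.5",
    "flash-attn": "2.7.3",
    "transformers": "4.57.1",
}


def check_versions(expected: dict[str, str]) -> dict[str, str]:
    """Return {package: problem} for every package that is missing or off-pin."""
    problems = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        # Ignore local build tags such as "+cu118" when comparing.
        if installed.split("+")[0] != wanted:
            problems[name] = f"found {installed}, expected {wanted}"
    return problems


if __name__ == "__main__":
    issues = check_versions(EXPECTED)
    print("All pinned packages match." if not issues else issues)
```

Run it with the `deepseek-ocr` environment activated; an empty result means all four packages match their pins.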

Step 5: Download the Python code

main.py 

This is a standalone python file that sets up the webserver and hosts it on your VM. We recommend you have a quick read through before you attempt to run it, just to familiarise yourself with what it does (more on this later).

Step 6: Get the code into your VM:

# Create the "web" dir and put main.py in there
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
mkdir -p web

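# Quote the delimiter ('EOF') so the shell does not expand $variables or backticks in the pasted Python
cat <<'EOF' > web/main.py
<paste the contents of main.py here>
EOF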

Alternatively, you can use an editor like nano or vim, or connect to the VM over SSH from an IDE such as VS Code, to make this part easier.

Step 7: Start the server and access via your browser

# Start the server
uvicorn web.main:app --host 0.0.0.0 --port 3000

You should now be able to navigate to the UI by going to http://<your-VMs-ip>:3000, and interact with the UI! 

NOTE: Remember to open port 3000 for inbound TCP traffic via your VM's firewall on Hyperstack! For more info on this, see our documentation here 
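If the page does not load, a quick way to tell a firewall problem from a server problem is to test the port from your local machine. Here is a minimal sketch using only the standard library; `is_port_open` is an illustrative helper, and you would substitute your VM's public IP for the placeholder in the comment.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example (replace the placeholder with your VM's public IP):
# print(is_port_open("<your-VMs-ip>", 3000))
```

A result of `False` while uvicorn is running on the VM usually points at the firewall rule rather than the server itself.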

Once loaded, it should look something like this:

[Screenshot: the DeepSeek-OCR web UI after starting the server]

In this simple, barebones UI, you can upload PDFs or images and DeepSeek-OCR will automatically run on them.

The results will be visible in the lower box, with the option to see (and download) the labelled input and the extracted text in markdown format. 

To re-run, simply delete the existing input and upload something new!

Here's an example of the output DeepSeek-OCR produces for a PDF article:

[Image: example DeepSeek-OCR output]

Troubleshooting

As stated, this is a very minimal, quickly-put-together UI, and hence is not maintained and updated by Hyperstack, and is certainly not bug-free! However, feel free to modify the code in the main.py file to solve any issues or add any features you like.

One bug we are aware of in our early testing is the UI's inability to replace old inputs when new ones are uploaded. In this case, simply Ctrl+C to terminate the server and re-run the same uvicorn command - this and a reload of the web page will then start a fresh instance of the UI with the issue no longer being present. 

What's Next?

Congratulations! You've now got your own private, high-performance OCR server running. This Gradio UI is a fantastic sandbox for testing, but the real power comes from what you can build on top of it.

The most logical next step is to adapt the web/main.py file. Instead of launching a Gradio UI, you could modify it to create a simple, robust REST API endpoint using FastAPI. Imagine an endpoint where you can POST an image or PDF file and get a clean JSON response containing the extracted markdown.
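As a rough starting point, such an endpoint might look like the sketch below. It is not taken from main.py: `run_ocr` is a hypothetical hook that you would wire to whatever function in your own code calls DeepSeek-OCR, and the `/ocr` route and response fields are our own choices. File uploads in FastAPI also need the `python-multipart` package.

```python
from typing import Callable


def build_ocr_response(filename: str, markdown: str) -> dict:
    """Shape the JSON payload returned for one processed document."""
    return {
        "filename": filename,
        "markdown": markdown,
        "num_chars": len(markdown),
    }


def create_app(run_ocr: Callable[[bytes, str], str]):
    """Build a FastAPI app exposing POST /ocr.

    `run_ocr(file_bytes, filename) -> markdown` is a hypothetical hook;
    connect it to your own DeepSeek-OCR inference function.
    """
    # Imported lazily so build_ocr_response can be reused without FastAPI installed.
    from fastapi import FastAPI, File, UploadFile

    app = FastAPI(title="DeepSeek-OCR API (sketch)")

    @app.post("/ocr")
    async def ocr(file: UploadFile = File(...)) -> dict:
        data = await file.read()
        name = file.filename or "upload"
        return build_ocr_response(name, run_ocr(data, name))

    return app
```

In your own main.py you could then write something like `app = create_app(my_ocr_function)` (where `my_ocr_function` is yours to define) and start it with the same uvicorn command as before.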

Once you have that API, the possibilities are endless:

  • Build a RAG Pipeline: This is the big one. You can now programmatically feed your entire library of PDFs and documents through this API, storing the clean markdown output in a vector database.

  • Create a "Chat with your Docs" App: Combine your new OCR API with a conversational LLM (like DeepSeek-LLM) to build a powerful application that lets you ask questions about your documents.

  • Automate Data Entry: Create a workflow that watches a specific folder or email inbox, runs any new attachments through your OCR API, and then parses the structured output to populate a database or spreadsheet.
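For the RAG idea in particular, the Markdown usually needs to be split into chunks before it goes into a vector database. Below is a minimal sketch of one way to do that, splitting at headings and capping chunk size. The function name `chunk_markdown`, the `max_chars` default, and the chunk format are all assumptions for illustration; embedding and storage are left out entirely.

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[dict[str, str]]:
    """Split Markdown into chunks, starting a new chunk at each heading
    and whenever adding a line would push the chunk past `max_chars`."""
    chunks: list[dict[str, str]] = []
    heading = ""
    buffer: list[str] = []

    def flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"heading": heading, "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            flush()  # Close the previous section under its own heading.
            heading = match.group(1).strip()
            continue
        if sum(len(b) + 1 for b in buffer) + len(line) > max_chars:
            flush()
        buffer.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    doc = "# Intro\nSome text.\n\n## Details\nMore text here.\n"
    for chunk in chunk_markdown(doc):
        print(chunk)
```

Keeping the heading alongside each chunk gives the retriever a bit of context about where in the document the text came from.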

You've done the hard part by setting up the core engine. Now you can use your Hyperstack VM as a stable, private microservice to power all kinds of intelligent document-processing workflows.

Launch Your VM today and Get Started with Hyperstack!

FAQs

What type of model is DeepSeek-OCR?

DeepSeek-OCR is a multimodal model combining vision and language understanding, designed to extract text and structure from documents efficiently.

What format does DeepSeek-OCR output?

It outputs structured markdown that preserves tables, layout, and semantic information, making it ready for downstream processing or RAG pipelines.

Which engine is used for high-throughput serving?

vLLM is used as a high-throughput serving engine, optimised for GPU acceleration to deliver fast, efficient OCR performance.

Which package manager is required for setup?

The setup requires UV, a modern package manager, to create virtual environments and install all dependencies reliably on Hyperstack.


Related content


Deploy DiffusionGemma on a Cloud GPU for Fast, High-Throughput ...


What is DiffusionGemma?

DiffusionGemma is an open-weights, diffusion-based language model built by Google DeepMind on the 26B-A4B Mixture-of-Experts Gemma 4 architecture. Instead of generating text one token at a time, DiffusionGemma generates a whole block of tokens in parallel using discrete diffusion. It carries 25.2B total parameters while activating only 3.8B parameters during inference. It accepts interleaved text, image, and video input to produce text output, and it ships under the Apache 2.0 license. The headline result is speed. By denoising a 256-token canvas in parallel, DiffusionGemma reaches over 1,000 tokens per second on a single NVIDIA H100, which is roughly 4x the throughput of a comparable autoregressive model.

In this tutorial, we will deploy DiffusionGemma 26B-A4B-it on a single Hyperstack NVIDIA H100 GPU using vLLM, expose an OpenAI-compatible API, and then measure its throughput so you can see the diffusion speed advantage on real hardware. Every step includes the exact command and the output you should expect. By the end, you will have a high-throughput text generation endpoint running on your own VM.

How DiffusionGemma Works: Discrete Diffusion

A standard causal language model is autoregressive. It produces text one token at a time, and every new token has to wait for the previous one to finish. That sequential dependency is the reason most large language models are bottlenecked by memory bandwidth rather than raw compute, because each step reads the entire model and KV cache just to emit a single token.

DiffusionGemma takes a different route. It uses an encoder-decoder architecture with block-autoregressive multi-canvas sampling. The encoder runs in a prefill capacity, processing the prompt and building the KV cache. The decoder then applies bidirectional attention over a block of tokens called a canvas (256 tokens wide), accessing the cached prompt context through cross-attention. Rather than emitting one token, the model starts the canvas as noise and iteratively denoises it in parallel, committing roughly 15 to 20 tokens per forward pass. Once a canvas is fully denoised, it is appended to the KV cache and the model moves on to the next canvas. This block-by-block denoising is what unlocks high decoding speed.

Several architectural choices come together to make this work efficiently:

  1. Discrete Text Diffusion: Generation shifts from token-by-token autoregression to block-autoregressive multi-canvas sampling. The model denoises blocks of tokens in parallel, which significantly increases decoding speed.
  2. Encoder-Decoder Design: An autoregressive encoder processes and caches the prompt context, paired with a decoder that applies bidirectional attention over the generation canvas.
  3. Sparse Mixture-of-Experts: DiffusionGemma activates 8 experts out of 128 total, plus 1 shared expert, so it delivers strong reasoning while keeping a low memory footprint suitable for a single accelerator.
  4. Multimodal Input Processing: It handles interleaved text, image (at variable aspect ratio and resolution), and video inputs, returning text output.
  5. Thinking Mode: A built-in reasoning mode lets the model think step by step before answering, with configurable thinking control tokens.
  6. Optimised for Small-Batch Inference: The model is engineered for low-latency, high-speed generation on a single capable GPU, which is exactly the deployment we build in this guide.
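The effect of committing several tokens per forward pass is easy to see with some quick arithmetic. This sketch uses only the figures quoted above (a 256-token canvas and roughly 15 to 20 tokens committed per pass) and treats the per-pass figure as constant, which is a simplification. It also counts passes only: a pass over a whole canvas does more work than a single-token autoregressive step, so pass counts are not a direct speed ratio.

```python
import math


def forward_passes(num_tokens: int, tokens_per_pass: float) -> int:
    """Number of decoder forward passes needed to commit `num_tokens` tokens."""
    return math.ceil(num_tokens / tokens_per_pass)


canvas = 256  # canvas width quoted above
scenarios = [
    ("autoregressive", 1),
    ("diffusion, low end", 15),
    ("diffusion, high end", 20),
]
for label, per_pass in scenarios:
    passes = forward_passes(canvas, per_pass)
    print(f"{label:>20}: {passes} passes per {canvas} tokens")
```

Under these assumptions one canvas takes 13 to 18 passes instead of 256.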

The Throughput Advantage

This is what DiffusionGemma is famous for. Because the model denoises 256 tokens in parallel rather than walking through them one by one, it moves the bottleneck from memory bandwidth to compute. A modern GPU like the NVIDIA H100 has an enormous amount of compute that sits idle during sequential decoding, and diffusion sampling puts that compute to work across the whole canvas at once.

The numbers are striking. In low batch size settings on a single NVIDIA H100 at FP8, DiffusionGemma exceeds 1,100 tokens per second. Google reports up to 4x faster token generation than autoregressive decoding, with 700+ tokens per second even on a consumer NVIDIA RTX 5090 at NVFP4. The model also performs adaptive inference-time computation, so simpler prompts and structured tasks like code require fewer denoising steps and run even faster.

Single NVIDIA H100 Generation Speed

Sustained text generation throughput on one NVIDIA H100 in a low batch size setting, comparing DiffusionGemma's parallel canvas denoising against a comparable autoregressive model decoding one token at a time.

[Bar chart: DiffusionGemma (discrete diffusion) at 1,100+ tok/s versus an autoregressive model (token by token) at ~275 tok/s; axis from 0 to 1,200 tok/s]

Throughput figures as reported by Google DeepMind and NVIDIA for DiffusionGemma 26B-A4B on a single NVIDIA H100 in low batch size settings. The autoregressive bar is a representative single-GPU, single-stream baseline for a similarly sized model.

📘

Want to push tokens per second even further on the same GPU? Read our guides on 7 LLM Inference Techniques to Reduce Latency and Boost Performance and Tips to Run Cost-Efficient Inference Workloads.

DiffusionGemma Features

DiffusionGemma is more than a fast decoder. It carries the full capability set of the Gemma 4 family into a diffusion model:

  • High-Speed Parallel Generation: Parallel denoising of a 256-token canvas keeps latency low, unlocking per-user generation speeds above 1,100 tokens per second on a single NVIDIA H100 at FP8.
  • Adaptive Inference-Time Compute: Simpler prompts and structured tasks need fewer denoising steps, so the effective tokens-per-second rate rises with easier work.
  • 256K Context Window: Context windows of up to 256K tokens support long documents, large codebases, and extended multi-turn conversations.
  • Multimodal Understanding: Object detection, document and PDF parsing, screen and UI understanding, chart comprehension, multilingual OCR, handwriting recognition, and video understanding through frame sequences.
  • Thinking Mode and Function Calling: A built-in step-by-step reasoning mode and native structured tool use for agentic workflows.
  • Multilingual Coverage: Out-of-the-box support for 35+ languages, pre-trained on 140+ languages.
  • OpenAI-Compatible Serving: Deploys cleanly through vLLM, SGLang, and NVIDIA NIM, exposing a standard /v1/chat/completions endpoint that drops into existing applications.

It is worth being precise about the trade-off. DiffusionGemma is built for speed, and on pure accuracy benchmarks it sits a little behind the autoregressive Gemma 4 26B A4B model it is derived from. The point is that it stays competitive on quality while delivering several times the throughput, which is the right balance for high-volume generation, chat, and code workloads.

[Chart: benchmark comparison of DiffusionGemma 26B A4B and Gemma 4 26B A4B (autoregressive)]

Higher is better. Source: DiffusionGemma 26B-A4B-it model card (Google DeepMind). DiffusionGemma stays close on quality while running several times faster.

How to Deploy DiffusionGemma on Hyperstack

Now, let us walk through the deployment step by step. The whole point of DiffusionGemma is that it fits on a single GPU, so the setup is refreshingly simple.

📘

If you want to scale this across multiple GPUs later, see our guide on How to Run Distributed Inference with vLLM on NVIDIA H100 GPUs. For this tutorial, a single NVIDIA H100 is all we need.

Step 1: Accessing Hyperstack

First, you will need an account on Hyperstack.

  • Go to the Hyperstack website and log in.
  • If you are new, create an account and set up your billing information. Our documentation can guide you through the initial setup.

Step 2: Deploying a New Virtual Machine

From the Hyperstack dashboard, we will launch a new GPU-powered VM.

  • Initiate Deployment: Click the "Deploy New Virtual Machine" button on the dashboard.


  • Select Hardware Configuration: Choose the "1x NVIDIA H100-80G-PCIe" flavour. DiffusionGemma is around 48 GB at BF16, roughly 26 GB once quantised to FP8, and about 13 to 18 GB at NVFP4, so a single 80 GB NVIDIA H100 leaves comfortable headroom for weights, the KV cache, and the diffusion canvas.

  • Choose the Operating System: Select the "Ubuntu Server 22.04 LTS R535 CUDA 12.2 with Docker" image. This provides a ready-to-use environment with all NVIDIA drivers and Docker pre-installed.


  • Select a Keypair: Choose an existing SSH keypair from your account to securely access the VM.
  • Network Configuration: Ensure you assign a Public IP to your Virtual Machine for remote management and API access.
  • Review and Deploy: Double-check your settings and click "Deploy".
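The memory figures in the hardware bullet above can be sanity-checked with a back-of-the-envelope estimate: parameter count times bytes per parameter. This sketch uses the 25.2B total-parameter figure quoted earlier and counts weights only. Real checkpoints differ somewhat, for example because some tensors stay in higher precision and quantised formats carry extra scaling data, so treat the output as a rough guide rather than a measurement.

```python
GB = 1e9  # decimal gigabytes


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in decimal GB."""
    return num_params * bytes_per_param / GB


TOTAL_PARAMS = 25.2e9  # total parameter count quoted earlier in this article
GPU_MEMORY_GB = 80     # NVIDIA H100-80G

for name, nbytes in [("BF16", 2.0), ("FP8", 1.0), ("4-bit", 0.5)]:
    weights = weight_memory_gb(TOTAL_PARAMS, nbytes)
    headroom = GPU_MEMORY_GB - weights
    print(f"{name:>6}: ~{weights:.1f} GB weights, ~{headroom:.1f} GB left on the GPU")
```

Whatever is left after the weights has to cover the KV cache, activations, and the diffusion canvas, which is why the headroom matters.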

Step 3: Accessing Your VM

Once your VM is running, connect to it via SSH.

  1. Locate SSH Details: In the Hyperstack dashboard, find your VM's details and copy its Public IP address.

  2. Connect via SSH: Open a terminal on your local machine and run the following command, replacing the placeholders with your details.

    # Connect to your VM using your private key and the VM's public IP
    ssh -i [path_to_your_ssh_key] ubuntu@[your_vm_public_ip]

Once connected, you will see a welcome message confirming you are logged in. Verify that the GPU is detected and the ephemeral disk is mounted:

# Confirm the NVIDIA H100 is detected
nvidia-smi --query-gpu=name,memory.total --format=csv

# Confirm the ephemeral disk is mounted
df -h /ephemeral

Here is the output we get, showing one NVIDIA H100 with 80 GB and a large ephemeral disk for the model weights:

name, memory.total [MiB]
NVIDIA H100 PCIe, 81559 MiB

Filesystem Size Used Avail Use% Mounted on
/dev/vdb 738G 28K 700G 1% /ephemeral

Step 4: Create a Model Cache Directory

We will cache the DiffusionGemma weights on the high-speed ephemeral disk so that container restarts do not re-download the model.

# Create a directory for the Hugging Face model cache
sudo mkdir -p /ephemeral/hug

# Grant read/write permissions so the container can store weights
sudo chmod -R 0777 /ephemeral/hug

Step 5: Launch the vLLM Server

DiffusionGemma needs the diffusion-capable vLLM build, so we run it with the official vLLM OpenAI image tagged gemma and pass the diffusion sampler settings through --hf-overrides. Because the model fits on one GPU, we use --tensor-parallel-size 1 and run the model in BF16, the configuration Google's developer guide recommends.

# Pull the diffusion-capable vLLM OpenAI image
docker pull vllm/vllm-openai:gemma

# Run DiffusionGemma on a single NVIDIA H100 with the diffusion sampler enabled
docker run -d \
--gpus all \
--ipc=host \
--network host \
--name vllm_diffusiongemma \
-v /ephemeral/hug:/root/.cache/huggingface \
vllm/vllm-openai:gemma \
--model google/diffusiongemma-26B-A4B-it \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.85 \
--attention-backend TRITON_ATTN \
--generation-config vllm \
--hf-overrides '{"diffusion_sampler":"entropy_bound","diffusion_entropy_bound":0.1}' \
--diffusion-config '{"canvas_length":256}' \
--enable-chunked-prefill \
--served-model-name DiffusionGemma \
--host 0.0.0.0 \
--port 8000

Here is what the key flags do:

  • --model google/diffusiongemma-26B-A4B-it: Load the DiffusionGemma weights from Hugging Face.
  • --tensor-parallel-size 1: Run on a single GPU, which is all DiffusionGemma needs.
  • --max-model-len 32768: A practical context length for this demo. The full 256K window needs more GPU memory than a single 80 GB card provides at BF16.
  • --max-num-seqs 4: Keeps the server in the low batch size regime that diffusion sampling is optimised for.
  • --gpu-memory-utilization 0.85: Reserves most of the GPU for the weights and KV cache, matching the value in Google's guide.
  • --attention-backend TRITON_ATTN: The attention backend recommended for the diffusion decoder's bidirectional canvas attention.
  • --generation-config vllm: Use vLLM's sampling defaults rather than the model's bundled generation config.
  • --hf-overrides '{"diffusion_sampler":"entropy_bound","diffusion_entropy_bound":0.1}': Enables the Entropy-Bounded (EB) sampler with the recommended entropy bound of 0.1.
  • --diffusion-config '{"canvas_length":256}': Sets the 256-token diffusion canvas.
  • --served-model-name DiffusionGemma: A clean alias used in API requests.
🐳

Alternative: NVIDIA NIM

If you prefer a fully packaged microservice, DiffusionGemma is also available as an NVIDIA NIM container. It exposes the same OpenAI-compatible API on port 8000 and only needs your NGC API key.

docker run --gpus=all \
-e NGC_API_KEY=$NGC_API_KEY \
-p 8000:8000 \
nvcr.io/nim/google/diffusiongemma-26b-a4b-it:latest

Step 6: Verify the Deployment

Check the container logs to watch the model load. The first run downloads the weights from Hugging Face, which takes a few minutes.

docker logs -f vllm_diffusiongemma

The server is ready once you see: INFO: Application startup complete.
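If you would rather script the wait than watch the logs, you can poll the server until it responds. The sketch below assumes this vLLM build exposes the `/health` route that vLLM's OpenAI-compatible server normally provides; if yours does not, pointing it at `/v1/models` should work in the same way. The helper name and default timings are our own.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 600.0, interval_s: float = 5.0) -> bool:
    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    url = base_url.rstrip("/") + "/health"
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not accepting requests yet; keep waiting.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example, run on the VM itself:
# print(wait_until_ready("http://localhost:8000"))
```

The generous default timeout allows for the first-run weight download mentioned above.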

Next, add a firewall rule in your Hyperstack dashboard to allow inbound TCP traffic on port 8000.


Now test the API from your local machine, replacing the placeholder with your VM's public IP.

# Test the endpoint from your local terminal
curl http://<YOUR_VM_PUBLIC_IP>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "DiffusionGemma",
"messages": [
{"role": "user", "content": "In one sentence, what is discrete diffusion?"}
],
"max_tokens": 128
}'

A successful response returns a JSON object containing the model's reply:

{
"id": "chatcmpl-4f1a9b7c2e3d5a6b",
"object": "chat.completion",
"model": "DiffusionGemma",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Discrete diffusion is a generation method that starts from a block of noise tokens and iteratively denoises them in parallel into clean text, rather than predicting one token at a time."
},
"finish_reason": "stop"
}
]
}

With this output returning cleanly, google/diffusiongemma-26B-A4B-it is successfully deployed on Hyperstack.

Step 7: Hibernating Your VM (Optional)

When you are finished with your workload, hibernate the VM to avoid unnecessary costs:

  • In the Hyperstack dashboard, locate your Virtual Machine.
  • Click the "Hibernate" option.
  • This stops billing for compute resources while preserving your setup, so you can resume later.

Using DiffusionGemma via the OpenAI-Compatible API

Now that the vLLM server is running, we can interact with DiffusionGemma using the standard OpenAI Python client. First, install the library locally:

# Install the OpenAI Python client
pip3 install openai

Then instantiate a client pointing at our vLLM endpoint. vLLM does not enforce an API key by default, so we pass a placeholder.

from openai import OpenAI

# Point the client at the local vLLM server
client = OpenAI(
base_url="http://localhost:8000/v1", # OpenAI-style routes
api_key="EMPTY", # Placeholder, vLLM does not check it
)

A Quick High-Speed Generation

We will start with a simple generation request to confirm everything is wired up. The EB sampler and the temperature schedule are already configured on the server, so the client only needs to send a normal chat request.

# A standard chat completion request
messages = [
{"role": "user", "content": "Explain why the sky is blue in three short sentences."}
]

response = client.chat.completions.create(
model="DiffusionGemma",
messages=messages,
max_tokens=256,
temperature=0.6,
)

print(response.choices[0].message.content)

The model returns a clean, well-formed answer almost instantly:

Sunlight is made up of all colours, which travel as waves of different
lengths. As that light passes through the atmosphere, the shorter blue
wavelengths are scattered far more strongly than the longer red ones.
Because this scattered blue light reaches your eyes from every direction,
the daytime sky appears blue.

Measuring the Throughput Advantage

This is the part that matters. To actually see the diffusion speed advantage, we stream a longer completion, count the generated tokens, and divide by the wall-clock time. This gives us a real tokens-per-second figure on our single NVIDIA H100.

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompt = "List the integers from 1 to 500, separated by commas."

# Stream the response so we can time generation precisely
start = time.perf_counter()
completion_tokens = 0

stream = client.chat.completions.create(
model="DiffusionGemma",
messages=[{"role": "user", "content": prompt}],
max_tokens=1024,
temperature=0.6,
stream=True,
stream_options={"include_usage": True},
)

for chunk in stream:
    # The final chunk carries the usage totals when include_usage is enabled
    if chunk.usage:
        completion_tokens = chunk.usage.completion_tokens

elapsed = time.perf_counter() - start
print(f"Generated {completion_tokens} tokens in {elapsed:.2f}s")
print(f"Throughput: {completion_tokens / elapsed:.1f} tokens/sec")

Here is the result on a single NVIDIA H100:

Generated 1024 tokens in 0.71s
Throughput: 1442.3 tokens/sec

Around 1,450 tokens per second from a single GPU, comfortably past the 1,000 mark. The reason is the diffusion decoder. Instead of one token per forward pass, DiffusionGemma commits 15 to 20 tokens per pass by denoising the 256-token canvas in parallel, and structured prompts like this one resolve in fewer denoising steps, so they run at the fast end of the range. A similarly sized autoregressive model on the same NVIDIA H100 would land around 250 to 300 tokens per second in the same setting, which is why diffusion-based generation is described as several times faster.
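One caveat with the measurement above is that the elapsed time includes the wait for the first token (prompt processing plus network latency), not just decoding. If you also record when the first content chunk arrives, you can separate the two. The sketch below is a hypothetical helper for that bookkeeping; the names `StreamTiming` and `summarise` are ours, and it assumes you capture the three timestamps yourself inside the streaming loop.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    start: float            # time the request was sent
    first_token: float      # time the first content chunk arrived
    end: float              # time the stream finished
    completion_tokens: int  # taken from the final usage chunk


def summarise(t: StreamTiming) -> dict[str, float]:
    """Split a streamed request into time-to-first-token and throughput figures."""
    total = t.end - t.start
    decode = t.end - t.first_token
    return {
        "ttft_s": t.first_token - t.start,
        "end_to_end_tps": t.completion_tokens / total,
        # Simplification: counts all tokens against the post-first-chunk window.
        "decode_tps": t.completion_tokens / decode if decode > 0 else float("inf"),
    }
```

The end-to-end figure is what a user experiences for a single request, while the decode-only figure is closer to what the hardware sustains once generation is under way.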

💡

Recommended diffusion sampler settings: For the best balance of speed and quality, the DiffusionGemma authors recommend the Entropy-Bounded (EB) sampler with a maximum of 48 denoising steps, a temperature schedule that decays linearly from 0.8 to 0.4, an entropy bound of 0.1, and adaptive stopping (sampling terminates early once predictions are confident and stable). Thinking mode is available by adding the <|think|> token to the system prompt, and image inputs are supported with visual token budgets from 70 up to 1120. See the model card best practices for details.
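To picture what a temperature schedule that "decays linearly from 0.8 to 0.4" looks like across 48 denoising steps, here is a small sketch. It is only an illustration of linear interpolation: how the real sampler indexes its steps, and how adaptive stopping interacts with the schedule, are not specified here, so the 0-indexed convention and the function name are assumptions.

```python
def linear_temperature(step: int, max_steps: int = 48,
                       start: float = 0.8, end: float = 0.4) -> float:
    """Temperature at a 0-indexed denoising step, decaying linearly from `start` to `end`."""
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    if max_steps == 1:
        return start
    # Clamp so steps outside [0, max_steps - 1] stay within the schedule's range.
    frac = min(max(step, 0), max_steps - 1) / (max_steps - 1)
    return start + (end - start) * frac


if __name__ == "__main__":
    for step in (0, 12, 24, 36, 47):
        print(f"step {step:>2}: temperature {linear_temperature(step):.3f}")
```

With adaptive stopping, a run that finishes early would simply never reach the later, lower-temperature steps.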

📘

The same vLLM workflow powers our other model deployment guides. If you are weighing your options, see How to Deploy Qwen3.5, How to Run DeepSeek-R1, and Deploy DeepSeek-V4, or get up and running faster with the Hyperstack LLM Inference Toolkit.

Why Deploy DiffusionGemma on Hyperstack?

Hyperstack is a cloud platform purpose-built to accelerate AI and machine learning workloads, and a single-GPU, high-throughput model like DiffusionGemma is exactly the kind of deployment it is built for:

  1. On-Demand NVIDIA H100 GPUs: Spin up a single NVIDIA H100-80G-PCIe in minutes with no waitlist, the exact accelerator behind the 1,100+ tokens per second figure, with transparent pricing billed by the minute.
  2. Fast NVMe Ephemeral Storage: Large ephemeral disks let you cache model weights off the root disk, so container restarts are instant and downloads happen only once.
  3. Pre-Configured CUDA and Docker: The Ubuntu CUDA images ship with NVIDIA drivers and Docker ready to go, so you spend your time on the model and not on the platform.
  4. Cost-Effective and Scalable: Pay only for what you use and hibernate when idle. When you outgrow one GPU, scale out with our guide to distributed vLLM inference.

Deploy DiffusionGemma on Hyperstack Today

Spin up a single NVIDIA H100 in minutes and serve high-throughput text generation with vLLM. You pay by the minute and hibernate the moment you are done.

Get Started on Hyperstack →

FAQs

What is discrete diffusion and why is it fast?

Instead of generating one token at a time, DiffusionGemma denoises a 256-token canvas in parallel. This shifts the bottleneck from memory bandwidth to compute, which lets a GPU like the NVIDIA H100 use its parallel hardware far more fully and reach much higher tokens-per-second rates.

What hardware do I need to deploy DiffusionGemma?

A single NVIDIA H100-80G is the recommended target and is what we use in this guide. The model is around 48 GB at BF16, roughly 26 GB at FP8, or about 13 to 18 GB at NVFP4, so it also fits comfortably on smaller single GPUs such as an NVIDIA L40S or NVIDIA A100-80G if you do not need peak throughput.

How fast is DiffusionGemma?

In low batch size settings it generates over 1,000 tokens per second on a single NVIDIA H100, exceeding 1,100 tokens per second at FP8. That is roughly 4x the throughput of a comparable autoregressive model, and it reaches 700+ tokens per second even on a consumer NVIDIA RTX 5090 at NVFP4.

Is DiffusionGemma multimodal?

Yes. It processes interleaved text, image, and video input and returns text output. Capabilities include document and PDF parsing, chart comprehension, multilingual OCR, handwriting recognition, and video understanding through frame sequences.

Should I use vLLM or NVIDIA NIM?

Both serve the same OpenAI-compatible API on port 8000. vLLM gives you fine-grained control over sampler settings, quantisation, and batching, which is what we use in this tutorial. NVIDIA NIM is a fully packaged container that only needs your NGC API key, which is convenient if you prefer a turnkey microservice.

Fareed Khan

12 Jun 2026


Deploy AntAngelMed 100B Medical LLM on Hyperstack

What is AntAngelMed?

AntAngelMed is the world's first open-source 100B-parameter medical large language model, jointly developed by the Health Information Center of Zhejiang Province, Ant Healthcare, and Zhejiang Anzhen'er Medical AI. Built on the Ling-flash-2.0 Mixture-of-Experts architecture, it houses 100B total parameters while activating only 6.1B parameters per token, allowing it to match the performance of dense models several times its active size while delivering inference speeds exceeding 200 tokens per second on H20-class hardware. With a 128K context window, clinical-grade safety alignment via GRPO reinforcement learning, and #1 rankings on HealthBench (open-source category) and the MedBench leaderboard, AntAngelMed sets a new bar for what an openly available medical AI can do.

A major reason AntAngelMed performs so strongly on clinical benchmarks is its rigorous three-stage training pipeline combined with a highly efficient sparse MoE architecture. Here is how the design works:

  1. Sparse MoE Architecture: Built on Ling-flash-2.0's 1/32 activation-ratio Mixture-of-Experts design. Only ~6.1B of the model's 100B parameters activate per token, delivering large-model intelligence at small-model speeds.
  2. Three-Stage Medical Training: Continual pre-training on large-scale medical corpora (encyclopaedias, peer-reviewed research, clinical text), followed by supervised fine-tuning on multi-source medical instructions, and finally GRPO-based reinforcement learning that shapes empathy, safety boundaries, and evidence-based reasoning.
  3. Advanced MoE Optimisations: Expert granularity tuning, shared-expert ratios, no auxiliary loss with sigmoid routing, Multi-Token Prediction (MTP) layers, QK-Norm, and Partial-RoPE all combine to push small-activation MoE up to 7x more efficient than a dense model of comparable size.
  4. Extended Context via YaRN: Native context support extends to 128K tokens through YaRN extrapolation — essential for ingesting long patient histories, multi-document literature reviews, or extended diagnostic conversations.
  5. Clinical Safety Alignment: GRPO reinforcement learning with task-specific reward models explicitly optimises for empathy, structural clarity, safety boundaries, and reduced hallucinations on complex medical cases.
  6. FP8 + EAGLE3 Inference Acceleration: Optional FP8 quantisation paired with EAGLE3 speculative decoding delivers throughput gains of 45–94% over standard FP8 across reasoning workloads, with no measurable loss in stability.
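To illustrate what sparse activation with sigmoid routing means in practice, here is a toy sketch of a router that scores experts with a sigmoid, keeps the top few, and normalises their weights. The expert count, the `top_k` value, and the function names are hypothetical; this is not AntAngelMed's actual routing code or configuration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route_token(router_logits: list[float], top_k: int) -> dict[int, float]:
    """Pick the top_k experts by sigmoid score and normalise their weights to sum to 1."""
    scores = [sigmoid(z) for z in router_logits]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}


# Toy example: 8 experts, 2 active per token (illustrative numbers only).
weights = route_token([0.2, -1.0, 3.1, 0.0, 2.4, -0.5, 1.1, -2.2], top_k=2)
print("selected experts:", sorted(weights))
print("weight sum:", round(sum(weights.values()), 6))
```

Only the selected experts run for that token, which is why the active parameter count (~6.1B) is so much smaller than the total (100B).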

AntAngelMed Features

AntAngelMed is more than a general chat model fine-tuned on medical data — it has been engineered end-to-end for clinical reasoning. Key features include:

Production Throughput Gains with FP8 + EAGLE3

Inference throughput improvement of FP8 + EAGLE3 speculative decoding over standard FP8 alone, measured at a concurrency of 32 on reasoning workloads.

  • HumanEval (code reasoning): +71%
  • GSM8K (math word problems): +45%
  • Math-500 (advanced maths): +94%

Throughput improvement of the FP8 + EAGLE3 configuration relative to the FP8-only baseline, as reported by the AntAngelMed authors on the Hugging Face model card under a concurrency of 32.

  • Leading Open-Source Medical Performance: Ranks first among open-source models on OpenAI's HealthBench (with a particularly strong lead on HealthBench-Hard) and #1 overall on the MedBench leaderboard across knowledge, reasoning, language, and safety.
  • 6.1B Active / 100B Total Parameters: Matches the performance of ~40B dense models while running roughly 3x faster, thanks to its sparse-activation MoE design.
  • Clinical-Grade Safety and Ethics: Explicitly trained to favour evidence-based reasoning, structured responses, and safety disclaimers — reducing hallucination on complex diagnostic queries.
  • 128K Context Window: Handles long-form clinical documentation, multi-report synthesis, and extended multi-turn conversations in a single context.
  • OpenAI-Compatible API: Deploys cleanly via vLLM and exposes a standard /v1/chat/completions endpoint, making integration with existing healthcare applications straightforward.
  • Bilingual Medical Fluency: Strong performance across English and Chinese medical text, including medical knowledge Q&A, language understanding, and complex clinical reasoning.

Important Considerations Before Deploying a Medical LLM

Before walking through deployment, it is worth pausing on a point that matters more for medical AI than almost any other workload: where your model runs is as important as how well it performs. Medical data is among the most tightly regulated categories of information globally, governed by frameworks such as the EU GDPR, the UK Data Protection Act, the EU AI Act (which classifies clinical-decision-support and diagnostic AI as high-risk when used in a medical-device capacity), and HIPAA in the United States.

If you intend to use AntAngelMed in any setting that touches real patient data or protected health information, the on-demand deployment shown in this tutorial is a starting point for evaluation, not a production blueprint. For clinical workloads, Hyperstack offers a deployment pattern specifically engineered for regulated industries:

  • Secure Private Cloud: Single-tenant, dedicated GPU infrastructure with isolated networking, commissioned per customer and able to be deployed in specific regions and jurisdictions — including EU/EEA jurisdictions for organisations operating under GDPR. No shared GPU memory, no shared VPCs, no noisy neighbours, and clear audit evidence (tenancy models, access logs, control mappings, data residency confirmation) that compliance and legal teams can verify. This is the recommended path for any clinical workload on a 100B-class model like AntAngelMed.

Important Notice for This Tutorial

The walkthrough below uses a standard on-demand GPU VM for demonstration purposes only. All prompts shown are synthetic, generic medical questions — no real patient data, identifiable health information, or protected records are processed at any point. The model and all generated outputs remain within the dedicated GPU VM and are not exposed externally. For any workload involving real clinical data — particularly where EU data residency or audit-grade isolation is required — please contact our team to provision a Secure Private Cloud environment configured for your compliance posture.

How to Deploy AntAngelMed on Hyperstack

With those considerations in place, let's walk through the deployment process step by step.

Step 1: Accessing Hyperstack

First, you'll need an account on Hyperstack.

  • Go to the Hyperstack website and log in.
  • If you are new, create an account and set up your billing information. Our documentation can guide you through the initial setup.

Step 2: Deploying a New Virtual Machine

From the Hyperstack dashboard, we will launch a new GPU-powered VM sized appropriately for a 100B-parameter MoE model.

  • Initiate Deployment: Click the "Deploy New Virtual Machine" button on the dashboard.

  • Select Hardware Configuration: AntAngelMed is approximately 206 GB at BF16, and with KV cache, activations, and headroom for long-context inference, an 8-GPU configuration is recommended. Choose the "8x H100-80G-PCIe" flavour. This gives 640 GB of total GPU memory — comfortable margin for tensor-parallel-size 8 inference at production speeds.

  • Select a Region: The 8x H100-80G-PCIe flavour is available in the CANADA-1 region on Hyperstack's on-demand cloud. Choose CANADA-1 for this tutorial. For production clinical workloads that require EU data residency, the appropriate path is a Secure Private Cloud deployment, which Hyperstack provisions in specific regions and jurisdictions including the EU/EEA.
  • Choose the Operating System: Select the "Ubuntu Server 22.04 LTS R535 CUDA 12.2 with Docker" image. This provides a ready-to-use environment with all NVIDIA drivers and Docker pre-installed.

  • Select a Keypair: Choose an existing SSH keypair from your account to securely access the VM.
  • Network Configuration: Ensure you assign a Public IP to your Virtual Machine for remote management.
  • Review and Deploy: Double-check your settings and click "Deploy".
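The sizing argument in this step can be checked with a few lines of arithmetic. The sketch below is a rough estimate only: it treats all sizes as nominal GB, reuses the 0.90 memory fraction that this guide later passes to vLLM, and uses a hypothetical helper name.

```python
def fits_in_gpu_memory(weights_gb: float, gpus: int, gb_per_gpu: float,
                       utilization: float = 0.90) -> tuple[bool, float]:
    """Return (fits, headroom_gb); headroom is what remains for KV cache and activations."""
    usable_gb = gpus * gb_per_gpu * utilization
    headroom_gb = usable_gb - weights_gb
    return headroom_gb > 0, headroom_gb


# 206 GB of BF16 weights on 8 x 80 GB GPUs, as chosen in this step.
fits, headroom = fits_in_gpu_memory(weights_gb=206, gpus=8, gb_per_gpu=80)
print(f"fits: {fits}, headroom: {headroom:.0f} GB")
```

The headroom figure is what the KV cache can grow into, which is why long-context serving benefits from the full eight-GPU flavour rather than the minimum that merely holds the weights.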

Step 3: Accessing Your VM

Once your VM is running, connect to it via SSH.

  1. Locate SSH Details: In the Hyperstack dashboard, find your VM's details and copy its Public IP address.

  2. Connect via SSH: Open a terminal on your local machine and run:

    # Connect to your VM using your private key and the VM's public IP
    ssh -i [path_to_your_ssh_key] ubuntu@[your_vm_public_ip]

Replace [path_to_your_ssh_key] with the path to your private SSH key and [your_vm_public_ip] with the IP address of your VM. Once connected, you should see a welcome message confirming you are logged in.

Step 4: Create a Model Cache Directory

AntAngelMed weighs around 206 GB. We'll cache it on the VM's high-speed ephemeral disk so that subsequent container restarts do not re-download the weights.

# Create a directory for the Hugging Face model cache
sudo mkdir -p /ephemeral/hug

# Grant full read/write permissions to the directory
sudo chmod -R 0777 /ephemeral/hug

This creates a folder named hug inside the /ephemeral disk and sets permissions so the Docker container can read and write model files.
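Before starting a download of this size, it can be worth confirming that the cache disk has enough free space. A minimal sketch using only the Python standard library follows; the 20 GB safety margin and the helper name are assumptions.

```python
import shutil

MODEL_SIZE_GB = 206  # approximate BF16 checkpoint size quoted in this guide


def has_room(path: str, required_gb: float, margin_gb: float = 20.0) -> bool:
    """True if the filesystem holding `path` has required_gb plus a safety margin free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb + margin_gb
```

On the VM you would call `has_room("/ephemeral/hug", MODEL_SIZE_GB)` after creating the directory; a `False` result means the download is likely to fail partway through.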

Step 5: Launch the vLLM Server

AntAngelMed uses the custom bailing_moe architecture from Ling-flash-2.0, so we need vLLM v0.11.0 or later and the --trust-remote-code flag. We'll pull the official vLLM image and run the model with tensor parallelism across all eight H100s.

# Pull the vLLM OpenAI-compatible image (v0.11.0 recommended by the model authors)
docker pull vllm/vllm-openai:v0.11.0

# Run the container with the AntAngelMed model
docker run -d \
  --gpus all \
  --ipc=host \
  --network host \
  --name vllm_antangelmed \
  -v /ephemeral/hug:/root/.cache/huggingface \
  vllm/vllm-openai:v0.11.0 \
  MedAIBase/AntAngelMed \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --served-model-name AntAngelMed \
  --host 0.0.0.0 \
  --port 8000

Breakdown of the key flags:

  • --gpus all: Use all eight NVIDIA H100 GPUs on the host.
  • --ipc=host: Share the host's IPC namespace for efficient multi-GPU communication.
  • --network host: Expose the container directly on the host network for simpler API access.
  • -v /ephemeral/hug:/root/.cache/huggingface: Mount the cache directory so weights persist across container restarts.
  • MedAIBase/AntAngelMed: Load the model directly from Hugging Face.
  • --tensor-parallel-size 8: Shard the model weights across all eight GPUs.
  • --dtype bfloat16: Use BF16 precision, as recommended by the model authors for NVIDIA hardware.
  • --trust-remote-code: Required to load the custom bailing_moe modelling code.
  • --max-model-len 32768: Sets the maximum context length. Can be raised toward 128K with YaRN if your use case requires it.
  • --gpu-memory-utilization 0.90: Allocate up to 90% of GPU memory for weights and KV cache.
  • --served-model-name AntAngelMed: A clean alias used in API requests.
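If you script your deployments, it can help to keep these flags in one place. The sketch below assembles the same server arguments as a Python list, for example to pass to `subprocess.run` after the `docker run ... vllm/vllm-openai:v0.11.0` prefix. The helper name and its defaults are our own; only flags that appear in the command above are used.

```python
def build_vllm_args(model: str, served_name: str, tensor_parallel: int = 8,
                    max_model_len: int = 32768, gpu_mem_util: float = 0.90,
                    port: int = 8000) -> list[str]:
    """Assemble the vLLM server arguments used in this guide as a list of strings."""
    return [
        model,
        "--tensor-parallel-size", str(tensor_parallel),
        "--dtype", "bfloat16",
        "--trust-remote-code",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", f"{gpu_mem_util:.2f}",
        "--served-model-name", served_name,
        "--host", "0.0.0.0",
        "--port", str(port),
    ]


args = build_vllm_args("MedAIBase/AntAngelMed", "AntAngelMed")
print(" ".join(args))
```

Keeping the context length and memory fraction as parameters makes it easier to experiment with longer contexts without retyping the whole command.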

Step 6: Verify the Deployment

Check the container logs to monitor model loading. The first run will download ~206 GB from Hugging Face, which can take several minutes.

docker logs -f vllm_antangelmed

The server is ready once you see: INFO: Uvicorn running on http://0.0.0.0:8000.
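Instead of watching the logs by hand, you can poll the server until it answers. The sketch below assumes the vLLM OpenAI-compatible server's `/health` route is reachable on port 8000; the helper names, attempt count, and delay are hypothetical defaults.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url: str, attempts: int = 60, delay_s: float = 10.0,
                     probe=http_ok, sleep=time.sleep) -> bool:
    """Poll `url` until the probe succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay_s)
    return False
```

Run on the VM, `wait_until_ready("http://localhost:8000/health")` returns `True` once the server responds, or `False` after the attempts are exhausted. For a first run that has to download the weights, you may need to raise `attempts`.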

Next, add a firewall rule in your Hyperstack dashboard to allow inbound TCP traffic on port 8000.

Now test the API from your local machine, replacing the placeholder IP with your VM's public IP.

# Test the API endpoint from your local terminal.
# top_k and repetition_penalty are vLLM extensions: in a raw JSON body they sit
# at the top level ("extra_body" is only an argument of the OpenAI Python client).
curl http://<YOUR_VM_PUBLIC_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "AntAngelMed",
    "messages": [
      {"role": "system", "content": "You are AntAngelMed, a helpful medical assistant."},
      {"role": "user", "content": "What should I do if I have a headache?"}
    ],
    "max_tokens": 800,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "repetition_penalty": 1.05
  }'

A successful response returns a JSON object containing the model's structured medical reply:

{
  "id": "chatcmpl-7e8a3b2c1d4f5e6a",
  "object": "chat.completion",
  "created": 1771954823,
  "model": "AntAngelMed",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "A headache is a common symptom with many possible causes, ranging from minor (such as tiredness, dehydration, or eye strain) to more serious. Here is some general guidance.\n\n**Self-care measures to try first:**\n- Rest in a quiet, dimly lit room\n- Drink water steadily, as mild dehydration is a frequent trigger\n- Apply a cool compress to the forehead or a warm one to the neck and shoulders\n- Consider an over-the-counter analgesic such as paracetamol or ibuprofen, taken according to the package instructions\n\n**When to seek medical attention:**\nContact a healthcare professional promptly if your headache...",
        ...
      },
      "finish_reason": "stop"
    }
  ],
  ...
}

The response demonstrates exactly the behaviour AntAngelMed was trained for: a structured answer that opens with context, separates self-care from red-flag symptoms, and points the user toward a clinician where appropriate — the kind of safety-aware framing the GRPO stage is designed to reinforce. With this output returning cleanly, MedAIBase/AntAngelMed is successfully deployed on Hyperstack.
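If you would rather run this smoke test from Python without installing extra packages, the same request can be built with the standard library. This is a minimal sketch: the helper names are hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_chat_request(base_url: str, question: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    payload = {
        "model": "AntAngelMed",
        "messages": [
            {"role": "system", "content": "You are AntAngelMed, a helpful medical assistant."},
            {"role": "user", "content": question},
        ],
        "max_tokens": 800,
        "temperature": 0.6,
        "top_p": 0.95,
        # In a raw HTTP body these vLLM-specific options sit at the top level.
        "top_k": 20,
        "repetition_penalty": 1.05,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Bearer EMPTY"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of a chat.completion response."""
    return response_body["choices"][0]["message"]["content"]
```

To send it, pass the request to `urllib.request.urlopen`, decode the body with `json.load`, and hand the resulting dict to `extract_reply`.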

Recommended Sampling Parameters

The AntAngelMed authors recommend the following sampling configuration for general medical Q&A:

# Recommended sampling for medical reasoning
temperature=0.6, top_p=0.95, top_k=20,
repetition_penalty=1.05, max_tokens=16384
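With the OpenAI Python client, `temperature`, `top_p`, and `max_tokens` are regular arguments, while `top_k` and `repetition_penalty` have to travel in `extra_body`, as the examples later in this guide do. The sketch below keeps the recommended values in one dictionary and splits them accordingly; the constant and helper names are our own.

```python
RECOMMENDED_SAMPLING = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "repetition_penalty": 1.05,
    "max_tokens": 16384,
}

# Parameters the OpenAI Python client accepts as named arguments.
STANDARD_KEYS = {"temperature", "top_p", "max_tokens"}


def to_client_kwargs(sampling: dict) -> dict:
    """Split sampling settings into standard arguments and an extra_body dict."""
    kwargs = {k: v for k, v in sampling.items() if k in STANDARD_KEYS}
    extra = {k: v for k, v in sampling.items() if k not in STANDARD_KEYS}
    if extra:
        kwargs["extra_body"] = extra
    return kwargs


print(to_client_kwargs(RECOMMENDED_SAMPLING))
```

The result can be unpacked straight into a call, for example `client.chat.completions.create(model="AntAngelMed", messages=messages, **to_client_kwargs(RECOMMENDED_SAMPLING))`.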

Step 7: Hibernating Your VM (Optional)

When you are finished with your workload, hibernate the VM to avoid incurring unnecessary costs:

  • In the Hyperstack dashboard, locate your Virtual Machine.
  • Click the "Hibernate" option.
  • This stops billing for compute resources while preserving your setup, so you can resume quickly later.

Using AntAngelMed via the OpenAI-Compatible API

Now that the vLLM server is running, we can interact with AntAngelMed using the standard OpenAI Python client. First, install the library locally:

# Install the OpenAI Python client to interact with the vLLM server
pip3 install openai

Then instantiate an OpenAI-compatible client pointing at our vLLM endpoint. vLLM does not enforce an API key by default, so we pass a placeholder.

from openai import OpenAI

# Create an OpenAI-compatible client that points to the vLLM server.
# localhost works when this script runs on the VM itself; from your own machine,
# use http://<YOUR_VM_PUBLIC_IP>:8000/v1 instead.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # API endpoint exposing OpenAI-style routes
    api_key="EMPTY",  # Placeholder key; vLLM does not enforce API keys by default
)

All of the prompts in the examples below are synthetic and general — standard medical-education queries, not real patient interactions. This matches the demonstration posture set out at the beginning of this guide.

Example 1: General Medical Q&A

We'll start with the same prompt used in the model authors' Hugging Face example — a generic question about headache management. This is exactly the kind of self-care guidance use case where AntAngelMed's training shines.

# Define the conversation payload with a system role and a generic user question.
messages = [
    {"role": "system", "content": "You are AntAngelMed, a helpful medical assistant."},
    {"role": "user", "content": "What should I do if I have a headache?"},
]

# Send the chat completion request using the recommended sampling parameters.
chat_response = client.chat.completions.create(
    model="AntAngelMed",  # Served model name from the vLLM container
    messages=messages,
    max_tokens=800,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
        "repetition_penalty": 1.05,
    },
)

print("AntAngelMed response:", chat_response.choices[0].message.content)

This is the kind of response we get back from the model:

AntAngelMed response:
A headache is a common symptom that can stem from many different causes,
most of which are minor and resolve on their own. The right next step
depends on how severe the pain is, how long it has lasted, and what other
symptoms (if any) are present.

**General self-care steps to try first:**
- Rest in a quiet, dimly lit room and try to relax your shoulders and jaw
- Drink water steadily — mild dehydration is one of the most frequent triggers
- Apply a cool compress to the forehead, or a warm compress to the neck and
  shoulders if the pain feels tension-related
- Consider an over-the-counter analgesic such as paracetamol or ibuprofen,
  taken strictly according to the package instructions
- Note possible triggers (skipped meals, screen time, poor sleep, caffeine
  withdrawal, stress) so you can address them

**When to seek medical attention:**
You should contact a healthcare professional promptly if your headache:
- Comes on suddenly and severely ("thunderclap" headache)
- Is accompanied by fever, neck stiffness, confusion, vision changes,
  weakness, slurred speech, or seizures
- Follows a head injury
- Worsens despite rest and over-the-counter treatment, or recurs frequently
- Disrupts your daily activities or sleep

**Important note:**
This information is for general educational purposes only and does not
replace professional medical advice. If you are unsure or concerned about
your symptoms, please consult a qualified healthcare provider.

Notice how the response is structured cleanly into self-care guidance, escalation criteria, and a safety disclaimer. That structure is not accidental — it is a direct consequence of the GRPO stage of training, where the model was rewarded for empathetic, structurally clear, evidence-based answers that respect safety boundaries. This is the behaviour you want from a medical assistant: useful first-line information, paired with clear signals about when human clinical judgement is required.

Example 2: Medical Education and Concept Explanation

AntAngelMed is equally strong on medical-knowledge questions, which is the kind of workload that powers patient-education portals, internal training tools, and clinical documentation helpers.

# A general medical-education question, no patient data involved.
messages = [
    {"role": "system", "content": "You are AntAngelMed, a helpful medical assistant."},
    {"role": "user", "content": "Explain the difference between Type 1 and Type 2 diabetes in simple terms."},
]

chat_response = client.chat.completions.create(
    model="AntAngelMed",
    messages=messages,
    max_tokens=800,
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20, "repetition_penalty": 1.05},
)

print("AntAngelMed response:", chat_response.choices[0].message.content)

The model returns a clear, well-organised explanation:

AntAngelMed response:
Type 1 and Type 2 diabetes are both long-term conditions in which the body
has difficulty managing blood sugar (glucose), but their underlying causes
and typical treatments are quite different.

**Type 1 Diabetes**
- An autoimmune condition: the body's immune system mistakenly attacks the
  insulin-producing cells in the pancreas
- As a result, the pancreas produces little or no insulin
- Usually diagnosed in childhood or adolescence, though it can occur at any age
- Requires lifelong insulin therapy (injections or a pump)
- Not caused by diet or lifestyle

**Type 2 Diabetes**
- The body either does not produce enough insulin or its cells become
  resistant to insulin's effects
- Strongly associated with factors such as excess body weight, physical
  inactivity, family history, and increasing age
- Typically develops in adulthood, although it is increasingly diagnosed
  in younger people
- Often managed initially with lifestyle changes (diet, exercise, weight
  management) and oral medications, with insulin added later if needed

**Key distinction:**
Type 1 is primarily an autoimmune condition affecting insulin production,
whereas Type 2 is largely a metabolic condition affecting how the body
uses insulin. Both require ongoing medical care to prevent complications.

If you or someone you know has been diagnosed with either type, a qualified
healthcare provider can build a personalised management plan.

Again, the structure is consistent: clear categorisation, plain-language framing, balanced coverage, and a closing pointer toward professional care. For applications like patient-facing FAQ tools, internal medical reference, or onboarding content for non-clinical staff, this is precisely the tone and format you want at scale.

Example 3: Disabling Verbose Reasoning for Lower-Latency Responses

For lighter-weight queries where you don't need the model to lay out its reasoning in detail, you can request a more concise output style. This reduces output tokens and improves end-to-end latency.

# A short, factual medical-knowledge question.
messages = [
    {"role": "system", "content": "You are AntAngelMed. Answer concisely and factually."},
    {"role": "user", "content": "What is the normal resting heart rate range for a healthy adult?"},
]

chat_response = client.chat.completions.create(
    model="AntAngelMed",
    messages=messages,
    max_tokens=200,
    temperature=0.6,
    top_p=0.95,
    extra_body={"top_k": 20, "repetition_penalty": 1.05},
)

print("AntAngelMed response:", chat_response.choices[0].message.content)

And the response:

AntAngelMed response:
The normal resting heart rate for a healthy adult is typically between
60 and 100 beats per minute. Well-trained athletes may have resting rates
as low as 40–60 bpm, which is generally considered healthy for them.
Rates persistently above or below the normal range should be discussed
with a healthcare professional.

Short, accurate, and still framed with the appropriate caveat — useful for chatbot interfaces or low-latency API tiers serving high request volumes.
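To see whether a concise system prompt actually lowers latency for your workload, you can time the call and divide the output tokens by the elapsed time. The helpers below are a hypothetical sketch that only measures wall-clock time on the client side, so network round trips are included in the figure.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Output tokens divided by wall-clock time; 0.0 if no time elapsed."""
    return completion_tokens / elapsed_s if elapsed_s > 0 else 0.0
```

With the client from this guide, you could wrap the request as `resp, secs = timed_call(client.chat.completions.create, model="AntAngelMed", messages=messages, max_tokens=200)` and then compute `tokens_per_second(resp.usage.completion_tokens, secs)`.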

A Final Word on Data Handling

Everything shown above runs entirely inside your Hyperstack VM. The model weights are pulled to local storage, inference happens on-instance, and API responses are served back over your VM's network — no inference traffic is routed through Hyperstack-managed inference services, and no data leaves your instance unless your own application sends it elsewhere. For workloads that require stronger guarantees — EU/EEA data residency, dedicated single-tenant infrastructure, audit-ready isolation evidence — please talk to our team about a Secure Private Cloud deployment configured for your compliance posture.

Why Deploy AntAngelMed on Hyperstack?

Hyperstack is a cloud platform purpose-built to accelerate AI and machine learning workloads — and it is particularly well suited to medical AI:

EU Data Residency via Secure Private Cloud
Dedicated, single-tenant GPU infrastructure commissioned in EU/EEA jurisdictions for organisations operating under GDPR and the EU AI Act. The right path for any clinical workload involving real patient data, with audit-ready tenancy and data-residency evidence built in.
Latest-Generation GPUs On-Demand
On-demand access to NVIDIA H100 and other top-tier accelerators — the right hardware for serving 100B-class medical models at production throughput, with billing accurate to the minute.
Tier 3 Data Centres, SOC 2 Type II
All Hyperstack regions run in Tier 3 certified data centres with concurrent maintainability and 99.982% annual uptime, independently audited and SOC 2 Type II certified for security, availability, and data integrity.
Simple Deployment
Pre-configured CUDA and Docker images remove most of the setup overhead, so you spend time on the model and not on the platform.
Transparent, Competitive Pricing
Pay only for what you use with our GPU pricing, plus hibernation to pause billing on idle workloads.

FAQs

What is AntAngelMed?

AntAngelMed is the world's first open-source 100B-parameter medical language model, built on the Ling-flash-2.0 MoE architecture with ~6.1B active parameters per token. It ranks first among open-source models on OpenAI's HealthBench and tops the MedBench leaderboard for Chinese medical AI.

What hardware is required to deploy AntAngelMed?

The BF16 model is approximately 206 GB. We recommend 8x NVIDIA H100-80G-PCIe on Hyperstack (available in the CANADA-1 region), which provides comfortable headroom for tensor-parallel-size 8 inference and long-context KV cache. An INT4 quantised version can run on smaller setups if memory is constrained.

What is AntAngelMed's context window?

AntAngelMed natively supports up to 128K tokens via YaRN extrapolation, making it suitable for long-form clinical documentation, multi-document literature review, and extended diagnostic conversations.

Is it safe to deploy a medical LLM on a public cloud?

For real clinical workloads, deployment design matters as much as the model itself. We recommend running AntAngelMed in a Secure Private Cloud deployment, which provides single-tenant dedicated infrastructure and can be commissioned in EU/EEA jurisdictions for GDPR-aligned data residency. The model and its outputs stay within your dedicated environment and are not exposed externally. For evaluation and synthetic-prompt experimentation, the on-demand tutorial above is a suitable starting point.

What are the main use cases for AntAngelMed?

Strong fits include medical knowledge Q&A, patient-education content generation, clinical documentation assistance, internal training tools for non-clinical staff, and research support — with the important caveat that any clinical-decision-support use must be designed in line with applicable regulations such as the EU AI Act.

How fast is inference on H100 hardware?

The model authors report over 200 tokens per second on H20-class hardware, with the sparse MoE design delivering roughly 3x the throughput of a comparable 36B dense model. H100 deployments with tensor parallelism deliver strong real-world throughput for production medical applications.

Fareed Khan

14 May 2026


NVIDIA Nemotron 3 Nano Omni: Process Video, Audio, and Documents ...

What is NVIDIA Nemotron 3 Nano Omni?

Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 is an open-weight multimodal foundation model engineered to unify video, audio, image, and text understanding in a single, highly efficient reasoning loop. It replaces the traditional fragmented stack of separate vision, speech, and language models with one production-ready model that can transcribe an hour of audio, summarise a two-minute video, or extract structured data from a complex 100-page document — all from the same endpoint. Built on a 30B-A3B hybrid Mixture-of-Experts architecture and supporting context windows up to 256K tokens, Nemotron 3 Nano Omni delivers state-of-the-art accuracy on document intelligence (OCRBenchV2, MMLongBench-Doc), video understanding (Video-MME, WorldSense), and speech benchmarks (VoiceBench), while remaining lightweight enough to run on a single H100 GPU.

A major reason Nemotron 3 Nano Omni achieves such strong multimodal accuracy at this efficiency level is its carefully engineered training pipeline and specialised hybrid architecture. Here is how it works under the hood:

  1. Hybrid Mamba2-Transformer MoE Backbone: Combines Mamba2 layers (efficient long-sequence state-space modelling) with transformer layers (precise reasoning) inside a Mixture-of-Experts decoder, delivering up to 4× memory and compute efficiency over equivalent dense models.
  2. C-RADIOv4-H Vision Encoder: A robust foundation vision encoder that processes high-resolution images and video frames, capable of focusing on specific patches within a full image to maintain OCR-level precision on dense documents.
  3. Parakeet Speech Encoder: A specialised audio encoder built on NVIDIA Granary and Music Flamingo data that goes far beyond simple transcription to capture tone, intent, and acoustic events.
  4. 3D Convolutional Spatiotemporal Processing: Captures motion and temporal context between video frames natively rather than treating frames as independent images, enabling true video understanding.
  5. Efficient Video Sampling (EVS): An inference-time layer that compresses dense visual tokens from many frames into a concise set the LLM can process without overwhelming the context window — halving video prefill VRAM and TTFT.
  6. Native Reasoning Mode: Chain-of-thought reasoning is enabled by default with budget-controlled thinking, making it ideal for complex multi-step document and video analysis tasks.

Nemotron 3 Nano Omni Features

Nemotron 3 Nano Omni is purpose-built to consolidate enterprise multimodal pipelines into a single deployable model. Its standout capabilities include:

  • True Omnimodal I/O: Accepts video (mp4, up to 2 minutes), audio (wav/mp3, up to 1 hour), images (jpeg/png), and text in a single request — no orchestration between separate models required.
  • Sparse MoE Efficiency: Only ~3B of 31B parameters activate per token, giving large-model intelligence at a fraction of the compute cost.
  • 256K Context Window: Reason across long documents, hour-long meeting recordings, or extended video transcripts in a single pass.
  • Best-in-Class Document Intelligence: Leads OCRBenchV2 (67.04) and MMLongBench-Doc (57.5), making it ideal for contracts, financial filings, scientific papers, and scanned forms.
  • Word-Level Timestamped ASR: Native speech-to-text with word-level timing, plus speech instruction following at 89.39 on VoiceBench.
  • Video Understanding at Scale: Up to ~9.2× greater effective system capacity vs. alternative open omni models at the same per-user interactivity threshold, with 72.2 on Video-MME.
  • Flexible Quantisation: Available in BF16 (62 GB), FP8 (33 GB), and NVFP4 (21 GB) — all staying within ~1 point of BF16 accuracy across 9 multimodal benchmarks.
  • Open by Design: Full weights, datasets, and training recipes released under the NVIDIA Open Model Agreement for unrestricted commercial use.
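The checkpoint sizes listed above make it easy to choose a variant for a given GPU. Here is a small sketch that picks the highest-precision variant that fits; the 10 GB reserve for KV cache and activations is an assumption you should tune for your context length, and the helper name is hypothetical.

```python
# Checkpoint sizes in GB as listed above, ordered from highest to lowest precision.
VARIANTS = [("BF16", 62), ("FP8", 33), ("NVFP4", 21)]


def pick_variant(gpu_memory_gb: float, reserve_gb: float = 10.0):
    """Return the highest-precision variant whose weights fit after reserving headroom."""
    for name, size_gb in VARIANTS:
        if size_gb + reserve_gb <= gpu_memory_gb:
            return name
    return None


for memory in (80, 48, 24):
    print(memory, "GB ->", pick_variant(memory))
```

A `None` result means none of the three checkpoints fits on that GPU with the chosen reserve.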

Multimodal Benchmark Performance

Nemotron Nano VL V2 vs. Nemotron 3 Nano Omni — across 11 industry-standard benchmarks

| Benchmark | Nemotron Nano VL V2 | Nemotron 3 Nano Omni | Change |
|---|---|---|---|
| CVBench2D | 78.3 | 83.95 | +7.2% |
| OCRBenchV2 (EN) | 54.8 | 67.04 | +22.3% |
| OSWorld | 11.1 | 47.4 | +327% |
| CharXiv Reasoning | 41.3 | 63.6 | +54.0% |
| MMLongBench-Doc | 38.0 | 57.5 | +51.3% |
| MathVista MINI | 75.5 | 82.8 | +9.7% |
| OCR Reasoning | 33.9 | 54.14 | +59.7% |
| Video-MME | Not supported in VL V2 | 72.2 | NEW |
| WorldSense | Not supported in VL V2 | 55.4 | NEW |
| DailyOmni | Not supported in VL V2 | 74.52 | NEW |
| VoiceBench | Not supported in VL V2 | 89.39 | NEW |
Reading the chart: All scores reported on a 0–100 scale (higher is better). NEW badges mark capabilities introduced with Nemotron 3 Nano Omni — video, joint video+audio, and speech understanding — that the prior VL-only model could not perform.
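The percentage figures in the comparison are relative improvements over the older model's score, not differences in points. A short sketch of that calculation, using three of the score pairs reported above (the function name is our own):

```python
def relative_gain(old: float, new: float) -> float:
    """Percentage change from old to new, rounded to one decimal place."""
    return round((new - old) / old * 100, 1)


# (Nemotron Nano VL V2, Nemotron 3 Nano Omni) scores taken from the comparison above.
scores = {
    "CVBench2D": (78.3, 83.95),
    "OCRBenchV2 (EN)": (54.8, 67.04),
    "MMLongBench-Doc": (38.0, 57.5),
}

for name, (old, new) in scores.items():
    print(f"{name}: +{relative_gain(old, new)}%")
```

This is why OSWorld shows such a large figure: the starting score of 11.1 is small, so a 36-point jump is more than a threefold relative increase.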

How to Deploy Nemotron 3 Nano Omni on Hyperstack

Now, let's walk through the step-by-step process of deploying the necessary infrastructure on Hyperstack to serve Nemotron 3 Nano Omni for production multimodal workloads.

Step 1: Accessing Hyperstack

First, you'll need an account on Hyperstack.

  • Go to the Hyperstack website and log in.
  • If you are new, create an account and set up your billing information. Our documentation can guide you through the initial setup.

Step 2: Deploying a New Virtual Machine

From the Hyperstack dashboard, we will launch a new GPU-powered VM sized for the BF16 variant of the model.

  • Initiate Deployment: Look for the "Deploy New Virtual Machine" button on the dashboard and click it.

  • Select Hardware Configuration: Nemotron 3 Nano Omni in BF16 is ~62 GB on disk and fits comfortably on a single H100 80GB. Choose the "1xH100-80G-PCIe" flavour. This single-GPU footprint is ideal for the Mamba2-Transformer hybrid MoE — the architecture activates only ~3B params per token, so tensor parallelism is unnecessary at this size.

  • Choose the Operating System: Select an "Ubuntu Server 22.04 LTS R570 CUDA 12.8 with Docker" image (or newer).
  • Select a Keypair: Choose an existing SSH keypair from your account to securely access the VM.
  • Network Configuration: Ensure you assign a Public IP to your Virtual Machine. This is crucial for remote management and connecting your local development tools.
  • Attach an Ephemeral Disk: The BF16 weights are ~62 GB. Attach an ephemeral NVMe volume — we'll use it as the Hugging Face cache so model loading is bottlenecked by NVMe, not network.
  • Review and Deploy: Double-check your settings and click the "Deploy" button.

Step 3: Accessing Your VM

Once your VM is running, you can connect to it.

  1. Locate SSH Details: In the Hyperstack dashboard, find your VM's details and copy its Public IP address.

  2. Connect via SSH: Open a terminal on your local machine and use the following command, replacing the placeholders with your information.

    # Connect to your VM using your private key and the VM's public IP
    ssh -i [path_to_your_ssh_key] ubuntu@[your_vm_public_ip]

Replace [path_to_your_ssh_key] with the path to your private SSH key file and [your_vm_public_ip] with the actual IP address of your VM.

Step 4: Create a Model Cache Directory

We'll create a directory on the VM's high-speed ephemeral disk. Storing the 62 GB BF16 checkpoint here means later restarts load the weights from local NVMe instead of downloading them again.

# Create a directory for the Hugging Face model cache
sudo mkdir -p /ephemeral/hug

# Grant full read/write permissions so the Docker container can use it
sudo chmod -R 0777 /ephemeral/hug

# Also create a media directory for sample inputs (PDFs, audio, video)
sudo mkdir -p /ephemeral/media
sudo chmod -R 0777 /ephemeral/media

Step 5: Authenticate with Hugging Face

Nemotron 3 Nano Omni is governed by the NVIDIA Open Model Agreement, so the first download requires a Hugging Face token. Generate a read token at huggingface.co/settings/tokens, then export it on the VM:

# Export your HF token so the vLLM container can pull the weights
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Step 6: Launch the vLLM Server

We will use the official vllm/vllm-openai:v0.20.0-cu129 image. v0.20.0 is the minimum vLLM version that ships the nemotron_v3 reasoning parser and 3D-conv video kernels required by this model, and the -cu129 tag is its CUDA 12.9 build. Note that audio packages are not preinstalled in the upstream image, so we install vllm[audio] at container startup before invoking vllm serve.

# Pull the required vLLM 0.20.0 image (CUDA 12.9 build for Hyperstack's R570 driver)
docker pull vllm/vllm-openai:v0.20.0-cu129

# Launch the multimodal server with audio + video + reasoning enabled
docker run -d \
--gpus all \
--ipc=host \
--network host \
--name vllm_nemotron_omni \
--shm-size 16g \
-e HF_TOKEN="$HF_TOKEN" \
-v /ephemeral/hug:/root/.cache/huggingface \
-v /ephemeral/media:/media:ro \
--entrypoint /bin/bash \
 vllm/vllm-openai:v0.20.0-cu129 -c "
pip install 'vllm[audio]' && \
vllm serve nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--max-model-len 131072 \
--max-num-seqs 16 \
--gpu-memory-utilization 0.92 \
--trust-remote-code \
--allowed-local-media-path / \
--video-pruning-rate 0.5 \
--media-io-kwargs '{\"video\": {\"fps\": 2, \"num_frames\": 256}}' \
--reasoning-parser nemotron_v3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
"

This command instructs Docker to:

  • --gpus all: Expose the H100 to the container.
  • --ipc=host & --shm-size 16g: Required for vLLM's multiprocessing and large multimodal tensors during prefill.
  • -v /ephemeral/hug:/root/.cache/huggingface: Persist the 62 GB BF16 checkpoint on the NVMe ephemeral disk between restarts.
  • -v /ephemeral/media:/media:ro: Read-only mount for input PDFs, audio, and video files we'll use later.
  • pip install vllm[audio]: Adds librosa and audio decoders the upstream image omits — required before any audio request.
  • --max-model-len 131072: 128K-token context, more than enough for hour-long audio transcripts and 100-page PDFs.
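  • --max-num-seqs 16: Conservative concurrency cap for BF16 on a single 80 GB H100. The 62 GB of weights leave ~18 GB of the card free, and only about 11.6 GB of that falls under the 0.92 memory cap (0.92 × 80 GB = 73.6 GB), so KV cache and activations have limited headroom. Increase this value for FP8/NVFP4 deployments.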
  • --video-pruning-rate 0.5: Enables Efficient Video Sampling — drops 50% of redundant video tokens to halve video-prefill VRAM and TTFT.
  • --media-io-kwargs: Sets video sampling to 2 FPS and 256 frames per clip — the recommended setting for 720p video on 80 GB GPUs.
  • --reasoning-parser nemotron_v3: Routes chain-of-thought traces into the proper response field for the Nemotron-3 chat template.
  • --tool-call-parser qwen3_coder: The Nemotron 3 Nano Omni release re-uses the qwen3_coder parser for structured tool/function calls.
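  • --allowed-local-media-path /: Allows the API to load files from local paths (e.g., the /media mount), avoiding base64 round-trips for large inputs. Because / exposes the container's entire filesystem to anyone who can reach the API, consider narrowing this to /media if that is the only directory you read from.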
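
As a rough check on the memory budget behind --max-num-seqs and --gpu-memory-utilization, the sketch below treats the card as exactly 80 GB and the weights as exactly 62 GB (both rounded figures from this guide) and ignores CUDA context and framework overhead. Those simplifications are assumptions, and `kv_budget_gb` is an illustrative name rather than anything vLLM provides.

```python
def kv_budget_gb(total_gb: float, utilization: float, weights_gb: float) -> float:
    """VRAM left for KV cache and activations after loading the weights.

    vLLM caps its own usage at total_gb * utilization, so the headroom is
    measured against that cap rather than against the full card.
    """
    return total_gb * utilization - weights_gb

# Full card minus weights: the ~18 GB figure quoted above.
print(round(kv_budget_gb(80, 1.00, 62), 1))  # 18.0
# With --gpu-memory-utilization 0.92 the usable headroom is smaller.
print(round(kv_budget_gb(80, 0.92, 62), 1))  # 11.6
```

This is why a modest --max-num-seqs is a sensible starting point for BF16: the headroom shrinks quickly once long contexts start filling the KV cache.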
⚠️

Version Warning: Only vllm/vllm-openai:v0.20.0-cu129 (the CUDA 12.9 build) is compatible with Hyperstack's current Ubuntu images, which ship with the R570 driver. Do not use the default :v0.20.0 tag — it is built against CUDA 13.0 and requires the R580+ driver, which is not yet available on Hyperstack. Do not use :latest either. 

Step 7: Verify the Deployment

First, follow the container logs to monitor model loading. The 62 GB BF16 checkpoint takes 6–10 minutes to download on first run, then 60–90 seconds to load from the NVMe cache on subsequent starts.

docker logs -f vllm_nemotron_omni

The server is ready when you see: INFO: Uvicorn running on http://0.0.0.0:8000.
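
Instead of watching the logs by hand, readiness can also be polled from a script. The sketch below is one way to do that. `http_probe` and `wait_until_ready` are hypothetical helpers, and the sketch assumes the server answers `GET /v1/models` with HTTP 200 once the model has loaded.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on connection errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until_ready(probe: Callable[[], bool], attempts: int = 60,
                     interval_s: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() up to `attempts` times, sleeping between failed tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(interval_s)
    return False

# Example (run on the VM itself, so no firewall rule is needed):
# ready = wait_until_ready(lambda: http_probe("http://localhost:8000/v1/models"))
```

Passing the probe and the sleep function as arguments keeps the retry loop easy to test without a live server.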

Next, add a firewall rule in your Hyperstack dashboard to allow inbound TCP traffic on port 8000. Then run a quick text-only smoke test from your local machine:

# Smoke test from your local terminal
curl http://<YOUR_VM_PUBLIC_IP>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16",
"messages": [
{"role": "user", "content": "In one sentence, what modalities can you process?"}
],
"max_tokens": 256,
"temperature": 0.2,
"top_k": 1,
"chat_template_kwargs": {"enable_thinking": false}
}'

If the response comes back with a coherent answer about video, audio, image, and text, your endpoint is live and ready for real workloads.

💡

NVIDIA recommends the following sampling parameters for Nemotron 3 Nano Omni:

# Thinking mode (long-doc analysis, multimodal reasoning, video summarisation)
temperature=0.6, top_p=0.95, max_tokens=20480, reasoning_budget=16384, grace_period=1024

# Instruct (non-thinking) mode for general short-form tasks
temperature=0.2, top_k=1, max_tokens=1024

# ASR / transcription tasks
temperature=1.0, top_k=1
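
One way to keep these presets in one place is a small helper that returns keyword arguments for `client.chat.completions.create`. This is a sketch: `PRESETS` and `sampling_kwargs` are names invented here, and the way `reasoning_budget` and `grace_period` map onto `thinking_token_budget` and `chat_template_kwargs` mirrors the request shape used in the examples later in this guide rather than a documented API contract.

```python
# Recommended values copied from the list above.
PRESETS = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "max_tokens": 20480,
                 "reasoning_budget": 16384, "grace_period": 1024},
    "instruct": {"temperature": 0.2, "top_k": 1, "max_tokens": 1024},
    "asr": {"temperature": 1.0, "top_k": 1},
}

def sampling_kwargs(mode: str) -> dict:
    """Build keyword arguments for an OpenAI-client chat completion call."""
    preset = dict(PRESETS[mode])  # copy so the shared preset is not mutated
    thinking = mode == "thinking"
    extra_body = {"chat_template_kwargs": {"enable_thinking": thinking}}

    if thinking:
        budget = preset.pop("reasoning_budget")
        grace = preset.pop("grace_period")
        extra_body["thinking_token_budget"] = budget + grace
        extra_body["chat_template_kwargs"]["reasoning_budget"] = budget
    if "top_k" in preset:
        # top_k is not a standard OpenAI parameter, so it travels in extra_body.
        extra_body["top_k"] = preset.pop("top_k")

    return {**preset, "extra_body": extra_body}
```

A call then looks like `client.chat.completions.create(model=MODEL, messages=messages, **sampling_kwargs("asr"))`.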

Step 8: Hibernating Your VM (Optional)

When you're done with a workload, you can hibernate the VM to pause compute billing while preserving the entire setup — including the 62 GB cached weights on the ephemeral volume:

  • In the Hyperstack dashboard, locate your VM.
  • Click the "Hibernate" option.
  • Resume any time without redownloading the model.

Use Case 1: Document Intelligence — PDF Extraction at Scale

Document intelligence is where Nemotron 3 Nano Omni really pulls away from older VLMs. With 67.04 on OCRBenchV2 and 57.5 on MMLongBench-Doc, it can read scanned contracts, parse complex financial tables, and extract structured data from multi-column scientific PDFs. The OpenAI-compatible API does not accept raw PDF uploads, so the standard pattern is to render each page to PNG with PyMuPDF and send pages as base64 images.

First, install the local dependencies on your client machine (not the VM):

pip3 install openai pymupdf pillow

Now we'll build a reusable PDF-to-structured-data extractor. The script below renders each page at 200 DPI, sends it to Nemotron 3 Nano Omni with a structured-extraction prompt, and aggregates the page-level outputs into a single JSON document. We use thinking mode here because long documents benefit significantly from the model's chain-of-thought reasoning.

# pdf_extract.py — page-by-page structured extraction with Nemotron 3 Nano Omni
import base64
import json
import fitz  # PyMuPDF
from openai import OpenAI

# Point at your Hyperstack VM
client = OpenAI(
    base_url="http://<YOUR_VM_PUBLIC_IP>:8000/v1",
    api_key="EMPTY",
)
MODEL = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16"

def render_page_to_data_url(pdf_path: str, page_num: int, dpi: int = 200) -> str:
    """Render a single PDF page to a PNG data URL."""
    # Open the document in a context manager so the file handle is released
    with fitz.open(pdf_path) as doc:
        pix = doc[page_num].get_pixmap(dpi=dpi)
        png_bytes = pix.tobytes("png")
    b64 = base64.b64encode(png_bytes).decode("utf-8")
    return f"data:image/png;base64,{b64}"

def extract_page(pdf_path: str, page_num: int) -> dict:
    image_url = render_page_to_data_url(pdf_path, page_num)

    prompt = (
        "You are a document intelligence system. Extract the following from "
        "this page and return a single valid JSON object with these keys:\n"
        "  - page_title (string or null)\n"
        "  - section_headers (list of strings)\n"
        "  - key_facts (list of one-sentence factual statements)\n"
        "  - tables (list of objects with 'caption' and 'rows' as 2D arrays)\n"
        "  - footnotes (list of strings)\n"
        "Respond with ONLY the JSON, no prose."
    )

    response = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=20480,
        temperature=0.6,
        top_p=0.95,
        extra_body={
            "thinking_token_budget": 16384 + 1024,
            "chat_template_kwargs": {
                "enable_thinking": True,
                "reasoning_budget": 16384,
            },
        },
    )

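    raw = response.choices[0].message.content.strip()
    # Strip markdown fences if the model wrapped JSON in them
    if raw.startswith("```"):
        # removeprefix (Python 3.9+) drops only a literal "json" language tag;
        # lstrip("json") would strip any leading run of the characters j, s, o, n
        raw = raw.split("```")[1].removeprefix("json").strip()
    return json.loads(raw)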

if __name__ == "__main__":
    pdf_path = "q4_earnings_report.pdf"
    doc = fitz.open(pdf_path)
    full_extraction = []
    for i in range(len(doc)):
        print(f"Processing page {i+1}/{len(doc)}...")
        full_extraction.append({"page": i + 1, **extract_page(pdf_path, i)})

    with open("extracted.json", "w") as f:
        json.dump(full_extraction, f, indent=2)
    print(f"Extracted {len(full_extraction)} pages → extracted.json")

Running this on a sample Q4 earnings report (page 3, which contains a revenue-by-segment table) produces output like:

{
  "page": 3,
  "page_title": "Revenue by Operating Segment",
  "section_headers": [
    "Segment Performance",
    "Year-over-Year Comparison"
  ],
  "key_facts": [
    "Cloud Services revenue grew 34% year-over-year to $4.82B.",
    "Hardware revenue declined 6% YoY to $1.91B due to supply normalization.",
    "Total Q4 revenue reached $8.14B, a 19% increase versus Q4 prior year."
  ],
  "tables": [
    {
      "caption": "Q4 Revenue by Segment ($ millions)",
      "rows": [
        ["Segment", "Q4 Current", "Q4 Prior", "YoY %"],
        ["Cloud Services", "4,820", "3,597", "+34.0%"],
        ["Hardware", "1,910", "2,032", "-6.0%"],
        ["Software Licensing", "1,410", "1,213", "+16.2%"]
      ]
    }
  ],
  "footnotes": [
    "All figures unaudited. Constant-currency growth shown in Appendix B."
  ]
}

Notice that the model not only OCR'd the table accurately but also reasoned about the numbers — computing year-over-year deltas and pulling them into key_facts rather than just regurgitating cells. This is exactly what MMLongBench-Doc measures, and where Nemotron 3 Nano Omni is currently best-in-class among open omni models.
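
Because model-extracted numbers can be wrong, it is worth recomputing derived columns before trusting them. The sketch below re-derives the YoY column from the sample output above. `parse_number` and `check_yoy` are illustrative helpers, and the column order (segment, current, prior, YoY) is assumed to match that sample.

```python
def parse_number(cell: str) -> float:
    """Convert cells like '4,820' or '+34.0%' to a float."""
    return float(cell.replace(",", "").replace("%", ""))

def check_yoy(rows: list, tolerance: float = 0.1) -> list:
    """Return segment names whose reported YoY % disagrees with the raw figures."""
    mismatches = []
    for segment, current, prior, reported in rows[1:]:  # skip the header row
        cur, prev = parse_number(current), parse_number(prior)
        recomputed = (cur - prev) / prev * 100.0
        if abs(recomputed - parse_number(reported)) > tolerance:
            mismatches.append(segment)
    return mismatches

rows = [
    ["Segment", "Q4 Current", "Q4 Prior", "YoY %"],
    ["Cloud Services", "4,820", "3,597", "+34.0%"],
    ["Hardware", "1,910", "2,032", "-6.0%"],
    ["Software Licensing", "1,410", "1,213", "+16.2%"],
]
print(check_yoy(rows))  # an empty list means every reported delta matches
```

A check like this is cheap to run on every extracted table and flags pages that deserve a second pass or human review.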

💡

Throughput Tip: For multi-page batches, render all pages first, then send them concurrently using asyncio + AsyncOpenAI. With --max-num-seqs 16 on a single H100, you can comfortably process 1–2 pages/sec with reasoning enabled, or 4–6 pages/sec with reasoning disabled for simpler extraction tasks.
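
A minimal sketch of that concurrent pattern follows. It caps in-flight requests with an `asyncio.Semaphore` sized to match --max-num-seqs. `run_bounded` and `fake_extract` are hypothetical names, and `fake_extract` merely stands in for an `AsyncOpenAI` call so that the snippet runs without a server.

```python
import asyncio
from typing import Awaitable, Callable, List

async def run_bounded(page_nums: List[int],
                      worker: Callable[[int], Awaitable[dict]],
                      limit: int = 16) -> List[dict]:
    """Run worker(page) for every page with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(page: int) -> dict:
        async with sem:
            return await worker(page)

    # gather() returns results in input order, so pages stay sorted.
    return await asyncio.gather(*(guarded(p) for p in page_nums))

async def fake_extract(page: int) -> dict:
    """Stand-in for an awaitable request such as
    `await async_client.chat.completions.create(...)`."""
    await asyncio.sleep(0.01)
    return {"page": page + 1}

if __name__ == "__main__":
    results = asyncio.run(run_bounded(list(range(5)), fake_extract, limit=2))
    print(results)
```

Keeping the limit at or below the server's --max-num-seqs avoids queueing requests on the server side, where they would only add latency.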

Use Case 2: Audio Transcription & Understanding

Nemotron 3 Nano Omni's audio stack is built on the NVIDIA Parakeet encoder and supports word-level timestamped ASR, speech instruction following (89.39 on VoiceBench), and full audio-content reasoning. It accepts wav and mp3 at 8 kHz or higher, with single files up to one hour long. Unlike a pure ASR model, you can ask it questions about the audio rather than just transcribing it.

The example below shows two patterns: first, a clean transcription with timestamps; second, a content-understanding query against the same audio. We'll use a meeting recording read from the client machine and sent inline as a base64 data URL.

# audio_pipeline.py — transcribe + understand a meeting recording
import base64
from openai import OpenAI

client = OpenAI(base_url="http://<YOUR_VM_PUBLIC_IP>:8000/v1", api_key="EMPTY")
MODEL = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16"

def audio_to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:audio/wav;base64,{b64}"

audio_url = audio_to_data_url("meeting_q4_planning.wav")

# --- Pattern 1: Timestamped transcription (ASR mode) ---
# NVIDIA recommends temperature=1.0, top_k=1 for ASR tasks.
transcript_resp = client.chat.completions.create(
    model=MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": audio_url}},
            {"type": "text",
             "text": "Transcribe this recording verbatim, starting each utterance "
                     "with its timestamp in [hh:mm:ss] format."},
        ],
    }],
    max_tokens=8192,
    temperature=1.0,
    extra_body={
        "top_k": 1,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print("=== TRANSCRIPT ===")
print(transcript_resp.choices[0].message.content)

# --- Pattern 2: Content-understanding Q&A on the same audio ---
qa_resp = client.chat.completions.create(
    model=MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "audio_url", "audio_url": {"url": audio_url}},
            {"type": "text",
             "text": "List every action item assigned in this meeting along with "
                     "the owner and any mentioned deadline. Use bullet points."},
        ],
    }],
    max_tokens=2048,
    temperature=0.2,
    extra_body={
        "top_k": 1,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print("\n=== ACTION ITEMS ===")
print(qa_resp.choices[0].message.content)

For a 6-minute Q4 planning recording, the transcription pattern produces:

=== TRANSCRIPT ===
[00:00:02] Sarah: Alright, let's get started. Thanks everyone for joining
the Q4 planning sync.
[00:00:09] Sarah: First on the agenda is the inference cost review. Marcus,
do you have the latest numbers?
[00:00:17] Marcus: Yeah, so we're trending about 18% over budget on GPU
spend, mostly driven by the new vision pipeline.
[00:00:26] Sarah: Got it. Can you put together a cost-reduction proposal
by next Friday?
[00:00:31] Marcus: Yep, I'll have it ready by the 15th.
...

And the same audio passed to the Q&A pattern returns a structured action-item list:

=== ACTION ITEMS ===
- Marcus: Prepare a GPU-cost reduction proposal — due Friday the 15th.
- Priya: Benchmark the FP8 checkpoint against BF16 on the eval set —
  due end of next week.
- Sarah: Schedule a follow-up with the platform team to review KV-cache
  utilization — by EOD Monday.
- Marcus + Priya: Co-author a one-page summary for the leadership review
  — due before the next planning sync.

This is the consolidation Omni is built for: one model, one endpoint, both raw transcription and semantic understanding. Previously this would have required a separate Whisper-class ASR model plus an LLM running diarisation-aware summarisation on top of its output.
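
If downstream code needs the transcript as structured records, the bracketed format is straightforward to parse. This is a sketch under the assumption that the model keeps to the `[hh:mm:ss] Speaker: text` layout seen in the sample output, which is not guaranteed. `parse_transcript` is a hypothetical helper.

```python
import re

LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s*([^:]+):\s*(.*)$")

def parse_transcript(text: str) -> list:
    """Turn '[hh:mm:ss] Speaker: words' lines into dicts.

    Lines without a leading timestamp are treated as continuations of the
    previous utterance, which handles wrapped text.
    """
    utterances = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match:
            h, m, s, speaker, words = match.groups()
            utterances.append({
                "start_s": int(h) * 3600 + int(m) * 60 + int(s),
                "speaker": speaker.strip(),
                "text": words,
            })
        elif utterances:
            utterances[-1]["text"] += " " + line
    return utterances

sample = """[00:00:02] Sarah: Alright, let's get started. Thanks everyone for joining
the Q4 planning sync.
[00:00:09] Sarah: First on the agenda is the inference cost review."""
print(parse_transcript(sample))
```

Storing the start time in seconds makes it easy to seek into the original recording or to align utterances with other data.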

💡

Long-Audio Tip: For recordings over 30 minutes, mount the file into the container at /media and pass it as a file:// URL instead of base64. The --allowed-local-media-path / flag we set in Step 6 enables this and avoids inflating the request payload by ~33% from base64 encoding.
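
The sketch below illustrates both options from the tip: sizing a base64 payload and building a `file://` URL. All three helper names are illustrative, and the `file://` helper assumes the file already exists at the given path inside the container.

```python
import base64
from urllib.parse import quote

MIME_BY_SUFFIX = {".wav": "audio/wav", ".mp3": "audio/mpeg"}

def base64_size(n_bytes: int) -> int:
    """Length of the base64 encoding of n_bytes of input (with padding)."""
    return 4 * ((n_bytes + 2) // 3)

def data_url(payload: bytes, suffix: str) -> str:
    """Inline audio bytes as a data URL with the matching MIME type."""
    mime = MIME_BY_SUFFIX[suffix.lower()]
    return f"data:{mime};base64,{base64.b64encode(payload).decode('ascii')}"

def container_file_url(container_path: str) -> str:
    """Build a file:// URL for a path as seen inside the vLLM container."""
    if not container_path.startswith("/"):
        raise ValueError("container path must be absolute")
    return "file://" + quote(container_path)

print(base64_size(3_000_000))  # 4000000, i.e. about 33% larger
print(container_file_url("/media/meeting_q4_planning.wav"))
```

Note that `data_url` picks the MIME type from the file suffix, so mp3 inputs are labelled `audio/mpeg` rather than `audio/wav`.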

Use Case 3: Video Summarisation & Temporal Q&A

Video is where Nemotron 3 Nano Omni's hybrid architecture shows the most dramatic efficiency gains — up to ~9.2× greater effective system capacity versus alternative open omni models at the same per-user interactivity threshold, thanks to 3D-conv spatiotemporal processing and Efficient Video Sampling. The model accepts mp4 files up to two minutes long; for 1080p content, sample at 1 FPS / 128 frames, and for 720p or lower, you can push to 2 FPS / 256 frames (which is what we configured at server launch).

For video tasks, NVIDIA recommends thinking mode — the chain-of-thought helps the model integrate temporal information across frames before answering. Below is a complete pipeline that produces both a dense summary and answers specific temporal questions about a product demo video.

# video_pipeline.py — dense summary + temporal Q&A on a product demo
from openai import OpenAI

client = OpenAI(base_url="http://<YOUR_VM_PUBLIC_IP>:8000/v1", api_key="EMPTY")
MODEL = "nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16"

# Path as seen inside the container: the /media mount from Step 6, backed by
# /ephemeral/media on the VM (Step 4). The server resolves it, so pass a
# literal file:// URL rather than resolving the path on the client machine.
video_url = "file:///media/product_demo.mp4"

REASONING_BUDGET = 16384
GRACE_PERIOD = 1024

def ask_video(question: str, use_audio: bool = True) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": video_url}},
                {"type": "text", "text": question},
            ],
        }],
        max_tokens=20480,
        temperature=0.6,
        top_p=0.95,
        extra_body={
            "thinking_token_budget": REASONING_BUDGET + GRACE_PERIOD,
            "chat_template_kwargs": {
                "enable_thinking": True,
                "reasoning_budget": REASONING_BUDGET,
            },
            # Set to True for video+audio joint reasoning (e.g. tutorials, lectures)
            "mm_processor_kwargs": {"use_audio_in_video": use_audio},
        },
    )
    return response.choices[0].message.content

# 1. Dense scene-by-scene summary
summary = ask_video(
    "Provide a dense, scene-by-scene summary of this product demo. "
    "For each scene, include the visible UI elements, what action the "
    "presenter takes, and any on-screen text."
)
print("=== DENSE SUMMARY ===")
print(summary)

# 2. Specific temporal Q&A — the model must integrate across frames
answer = ask_video(
    "At what point in the demo does the presenter switch from the "
    "dashboard view to the settings panel, and what configuration "
    "change do they make there?"
)
print("\n=== TEMPORAL Q&A ===")
print(answer)

For a 90-second product demo of a hypothetical analytics dashboard, the dense summary returns:

=== DENSE SUMMARY ===
Scene 1 (0:00–0:12) — The video opens on a dark-mode dashboard titled
"Revenue Analytics — Q4 2025." Visible UI elements include a top
navigation bar with tabs labeled Overview, Segments, Forecasts, and
Settings. A line chart in the centre shows weekly revenue trending
upward. The presenter's cursor hovers over the "Segments" tab.

Scene 2 (0:12–0:34) — The presenter clicks "Segments." The view
transitions to a stacked bar chart showing four product lines. On-screen
text reads "Cloud Services now contributes 59% of total revenue." The
presenter narrates the year-over-year growth rates while highlighting
each bar in turn.

Scene 3 (0:34–0:51) — A tooltip appears showing the exact dollar value
for the Cloud Services segment ($4,820M). The presenter clicks a small
gear icon in the top right, and the Settings panel slides in from the
right edge of the screen.

Scene 4 (0:51–1:18) — In the Settings panel, the presenter toggles
"Constant-currency adjustment" from Off to On. The chart in the
background re-renders, and the Cloud Services bar shrinks slightly...

And the temporal-reasoning question returns:

=== TEMPORAL Q&A ===
The presenter switches from the dashboard view to the settings panel at
approximately 0:34, by clicking a small gear icon in the top-right
corner of the dashboard. In the settings panel, the only configuration
change made is toggling "Constant-currency adjustment" from Off to On —
this causes the underlying revenue chart to re-render with adjusted
figures, after which the presenter returns to the dashboard view at
roughly 1:18.

Both responses required the model to track UI state across frames, not just describe individual screenshots. This is precisely what 3D convolutions and EVS are built for, and it's why a single Nemotron 3 Nano Omni call replaces what would otherwise be a video-frame-extractor + per-frame VLM + temporal-reasoning LLM pipeline.

💡

Frame-Sampling Tip: If your video has fast motion (sports, gameplay) or rapidly changing UI, push toward 256 frames at 2 FPS. For static talking-head content, 64–128 frames at 1 FPS gives the same accuracy at half the prefill VRAM. The --video-pruning-rate 0.5 EVS flag we set at launch automatically discards redundant tokens regardless of which sampling rate you choose.
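
To see how the two sampling knobs interact, the sketch below estimates how many frames a clip contributes. It assumes frames are taken at a fixed rate and then capped at `num_frames`, and that EVS removes a flat fraction of the resulting tokens. The real sampler and pruning logic in vLLM may differ, so treat the numbers as a rough guide. `sampled_frames` and `kept_fraction` are illustrative names.

```python
import math

def sampled_frames(duration_s: float, fps: float, num_frames: int) -> int:
    """Frames taken at `fps`, capped at `num_frames`."""
    return min(math.ceil(duration_s * fps), num_frames)

def kept_fraction(pruning_rate: float) -> float:
    """Share of video tokens that survive pruning at the given rate."""
    return 1.0 - pruning_rate

# A 120 s clip (the stated maximum) at the settings from Step 6.
print(sampled_frames(120, fps=2, num_frames=256))  # 240, under the 256 cap
# The same clip at the 1080p recommendation.
print(sampled_frames(120, fps=1, num_frames=128))  # 120
print(kept_fraction(0.5))                          # 0.5
```

Under these assumptions, a two-minute clip never reaches the 256-frame cap at 2 FPS, so the cap only matters if longer clips or higher frame rates are used.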

Disabling "Thinking" Mode for Latency-Sensitive Tasks

Reasoning mode is on by default and is the right choice for long documents, video summarisation, and any multi-step analysis. But for short, latency-sensitive requests — image classification, simple ASR, single-fact extraction — the chain-of-thought adds tokens you don't need. Disable it per-request with the chat_template_kwargs override:

from openai import OpenAI
client = OpenAI(base_url="http://<YOUR_VM_PUBLIC_IP>:8000/v1", api_key="EMPTY")

# Hypothetical invoice text for illustration; substitute your own document
invoice_text = ("Invoice #1042. Amount due: $1,250. Due date: 1 March. "
                "No payment received as of 20 March.")

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16",
    messages=[{"role": "user",
               "content": "Classify this invoice as 'paid', 'overdue', or 'pending' in one word.\n\n"
                          + invoice_text}],
    max_tokens=16,
    temperature=0.2,
    extra_body={
        "top_k": 1,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print(response.choices[0].message.content)
# e.g. "overdue"

For text and image inputs, NVIDIA recommends keeping reasoning mode on. For pure ASR, video, and audio Q&A, try both modes against your eval set — the right answer depends on how much temporal/contextual integration the question requires.

Why Deploy Nemotron 3 Nano Omni on Hyperstack?

Hyperstack is a cloud platform purpose-built to accelerate AI and machine learning workloads, and Nemotron 3 Nano Omni's multimodal pipeline is exactly the kind of GPU-bound, memory-sensitive deployment Hyperstack is engineered for:

  1. On-Demand H100 Access: Spin up a single NVIDIA H100 80GB in minutes, exactly the minimum-spec GPU NVIDIA lists for the BF16 checkpoint.
  2. High-Speed Ephemeral NVMe: 62 GB BF16 weights load from local NVMe in 60 to 90 seconds on warm restarts, with no slow re-downloads from Hugging Face.
  3. Seamless Scale-Up Path: When you outgrow a single H100, move to multi-GPU H100, H200, or B200 flavours without rewriting your serving stack.
  4. Hibernate to Reduce Costs: Pause compute billing between batches while keeping the cached model and your full setup intact. See our GPU pricing for details.
  5. Native vLLM and Docker Workflow: Pre-configured CUDA images mean the docker run command above runs without driver wrangling or kernel patches.

FAQs

What is NVIDIA Nemotron 3 Nano Omni?

Nemotron 3 Nano Omni is NVIDIA's open-weight multimodal model that unifies video, audio, image, and text understanding in a single 30B-parameter Mamba2-Transformer hybrid MoE (with ~3B active parameters per token), released under the NVIDIA Open Model Agreement for commercial use.

Can Nemotron 3 Nano Omni run on a single H100?

Yes. The BF16 checkpoint is 62 GB and fits on a single NVIDIA H100 80GB, which is the minimum spec NVIDIA lists. FP8 (33 GB) and NVFP4 (21 GB) variants run on smaller GPUs with negligible accuracy loss.

What is the maximum context window?

Up to 256K tokens. On a single H100 with BF16, 128K is a comfortable practical setting that leaves enough VRAM for the KV cache plus a reasonable batch size.

What input formats does the model accept?

Video as mp4 up to 2 minutes, audio as wav or mp3 up to 1 hour, images as jpeg/png, and text. PDFs must be rendered to images first — the API does not accept raw PDF uploads.

Why do I need to install vllm[audio] manually?

The upstream vllm/vllm-openai:v0.20.0 Docker image ships without audio decoders to keep the image size down. Audio support requires librosa and related packages, which are pulled in by the [audio] extra. Installing that extra as part of the container's startup command (as shown in Step 6) is the recommended pattern.

Is reasoning mode worth the extra latency?

For long documents, video summarisation, and any task requiring temporal or cross-modal integration — yes, the accuracy gains are substantial and reasoning is on by default. For short, single-fact tasks (classification, simple ASR), disable it via chat_template_kwargs.enable_thinking: false for a cleaner, faster response.

What's the easiest way to scale beyond a single H100?

Switch to the FP8 or NVFP4 quantised checkpoints to free up VRAM for higher concurrency on the same GPU, or move to a multi-GPU Hyperstack flavour and add --tensor-parallel-size N to the vllm serve command. Both paths are drop-in changes that don't require modifying client code.

Fareed Khan

4 May 2026