<img alt="" src="https://secure.insightful-enterprise-intelligence.com/783141.png" style="display:none;">
Reserve here

NVIDIA H100 SXMs On-Demand at $2.40/hour - Reserve from just $1.90/hour. Reserve here

Reserve here

Deploy 8 to 16,384 NVIDIA H100 SXM GPUs on the AI Supercloud. Learn More

alert

We’ve been made aware of a fraudulent website impersonating Hyperstack at hyperstack.my.
This domain is not affiliated with Hyperstack or NexGen Cloud.

If you’ve been approached or interacted with this site, please contact our team immediately at support@hyperstack.cloud.

close
|

Updated on 6 May 2026

Step-by-Step Guide to Deploying Mistral Medium 3.5 128B on Hyperstack


Key Takeaways

  • Mistral Medium 3.5 is Mistral AI's first flagship merged model — a single dense 128B checkpoint that unifies instruction-following, reasoning, coding, and vision in one set of weights
  • It supports a 256k token context window, multilingual output across 24 languages, native function calling, and configurable reasoning effort per request
  • The model scores 77.6% on SWE-Bench Verified and 91.4% on τ³-Telecom, outperforming all previous Mistral coding models including Devstral 2
  • Deployment on an 8×H100-80GB PCIe (NVLink) Hyperstack VM using vLLM nightly is the recommended production setup
  • This tutorial walks through the complete deployment: VM provisioning, vLLM installation, serving, and testing with the OpenAI-compatible API
  • Hyperstack gives you on-demand access to NVIDIA H100 GPUs with pre-configured CUDA images, per-hour billing, and hibernation to pause costs when idle

What is Mistral Medium 3.5?

Mistral Medium 3.5 is Mistral AI's latest open-weight flagship model, released under a Modified MIT License that covers both commercial and non-commercial use. Unlike its predecessors, it is a merged model, a single dense 128B checkpoint that replaces three separate Mistral products at once: Mistral Medium 3.1 (instruction following), Magistral (reasoning) and Devstral 2 (agentic coding). It now powers both Le Chat and Mistral's Vibe coding agent.

  • Architecture: Dense transformer, 128B parameters
  • Context window: 256k tokens
  • Modalities: Text + image input, text output
  • Reasoning: Configurable per request (none / high)
  • Languages: 24, including EN, FR, DE, ZH, JA, KO, AR
  • Precision: FP8
  • License: Modified MIT (open commercial use)

Key Capabilities of Mistral Medium 3.5

Here are the key capabilities of Mistral Medium 3.5:

  • Unified reasoning and instruct. Reasoning effort is toggled per API request — set reasoning_effort="none" for fast chat replies and reasoning_effort="high" for complex agentic runs. No separate model swap required.

  • Vision. Mistral Medium 3.5 includes a vision encoder trained from scratch to handle variable image sizes and aspect ratios, accepting image inputs alongside text.

  • Agentic and function calling. Native tool use with JSON output and strong system-prompt adherence make it well-suited for multi-step agentic pipelines and coding agents.

  • Long context. The 256k context window handles entire codebases, long legal documents, and extended conversation histories in a single pass.

  • EAGLE speculative decoding. Mistral AI released a companion EAGLE model that can be used alongside vLLM to accelerate inference through speculative decoding, reducing latency on interactive workloads.

Why Deploy Mistral Medium 3.5 on Hyperstack?

Hyperstack is a cloud platform purpose-built to accelerate AI and machine learning workloads. Here is why it is the right choice for running Mistral Medium 3.5:

  • Right-sized GPU availability: Hyperstack offers on-demand 8×H100-80GB PCIe (NVLink) VMs, exactly the configuration needed to serve a 128B FP8 model across 8 GPUs with tensor parallelism without overprovisioning.
  • Pre-configured ML environments: Ubuntu 22.04 LTS images ship with NVIDIA drivers, CUDA 12.2 and Docker pre-installed, eliminating environment setup friction.
  • OpenAI-compatible out of the box: Serving with vLLM means any existing OpenAI SDK integration points directly at your Hyperstack endpoint with a one-line base URL change.
  • Cost-efficient on-demand access: Pay only for the hours you use: spin up for a benchmark run, hibernate when idle, and scale to multi-node when you need it.

Mistral Medium 3.5 Hardware Requirements

Mistral Medium 3.5 is a dense 128B FP8 model. At FP8 precision, the weights alone occupy approximately 128 GB of VRAM. An 8×H100-80GB PCIe (NVLink) configuration provides 640 GB of total GPU memory, giving comfortable headroom for the KV cache at the 256k context window.
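
As a quick back-of-envelope check on those figures, here is a small Python sketch (an illustration only; it assumes the 8-way tensor parallelism used later in this tutorial):

# Rough VRAM math behind the figures quoted above
params = 128e9               # 128B parameters
bytes_per_param = 1          # FP8 stores one byte per parameter
weights_gb = params * bytes_per_param / 1e9
per_gpu_gb = weights_gb / 8  # sharded across 8 GPUs with tensor parallelism
print(f"Total weights: {weights_gb:.0f} GB; per GPU: {per_gpu_gb:.0f} GB of 80 GB")
# -> Total weights: 128 GB; per GPU: 16 GB of 80 GB, leaving headroom for KV cache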

This tutorial uses the 8×H100-80GB PCIe (NVLink) flavour on Hyperstack. NVLink interconnects between GPUs on this configuration ensure high-bandwidth all-reduce operations across the tensor-parallel shards, which is critical for low-latency token generation on a 128B dense model.

How to Deploy Mistral Medium 3.5 on Hyperstack

Step 1: Accessing Hyperstack

First, you will need a Hyperstack account. Sign up or log in to access the dashboard.

Step 2: Deploying a New Virtual Machine

From the Hyperstack dashboard, launch a new GPU-powered VM.

Initiate deployment

Click "Deploy New Virtual Machine" on the dashboard.

Select hardware configuration

Choose the 8×H100-80GB PCIe (NVLink) flavour. This provides 640 GB of combined GPU memory with NVLink interconnects — the right balance of memory capacity and bandwidth for a 128B dense model.

Choose the operating system

Select Ubuntu Server 22.04 LTS R535 CUDA 12.2 with Docker. This image comes pre-installed with Ubuntu 22.04 LTS, NVIDIA drivers (R535), CUDA 12.2 and Docker: everything needed to run vLLM without manual environment setup.

Select a keypair

Choose an existing SSH keypair from your account. If you do not have one, follow the instructions in the Hyperstack getting started guide to create one before proceeding.

Network configuration

  • Assign a Public IP to your VM; this is required for SSH access and for reaching the inference API remotely.
  • Enable SSH connections so you can connect to and manage the VM.

Open firewall port

Open port 8000 for inbound TCP traffic. This is the port vLLM will listen on. In production, restrict this port to trusted IP ranges rather than leaving it open to the public internet.

Review and deploy

Verify all settings and click "Deploy". VM initialisation typically takes 2–5 minutes.

Step 3: Connecting to Your VM via SSH

Once the VM is running, connect from your local machine.

Locate the public IP

In the Hyperstack dashboard, open the VM's details page and copy the public IP address.

Connect via SSH

# Connect to your VM using your private key and the VM's public IP
ssh -i [path_to_your_ssh_key] ubuntu@[your_vm_public_ip]

Replace [path_to_your_ssh_key] with the path to your private key file, and [your_vm_public_ip] with the IP from the dashboard.

Once connected, you will see the Ubuntu welcome prompt, confirming you are on the Hyperstack VM.

Step 4: Create a Model Cache Directory

Mistral Medium 3.5 is a 128B FP8 model. Storing the weights on the VM's high-speed ephemeral disk speeds up loading and means the model only downloads once across restarts.

# Create a dedicated cache directory on the ephemeral disk
sudo mkdir -p /ephemeral/hug

# Grant full access so the vLLM process can read and write model files
sudo chmod -R 0777 /ephemeral/hug

# Point Hugging Face at the ephemeral disk so weights actually land there
export HF_HOME=/ephemeral/hug

# Persist across SSH sessions so subsequent commands and reboots use the same cache
echo 'export HF_HOME=/ephemeral/hug' >> ~/.bashrc
source ~/.bashrc

Step 5: Set Your Hugging Face Token

Mistral Medium 3.5 is a gated model on Hugging Face. You must accept the model's license on the model page with your Hugging Face account before downloading.

Once accepted, generate a read token at huggingface.co/settings/tokens, then export it on the VM:

export HF_TOKEN=hf_your_token_here

To make this persistent across sessions, add it to your shell profile:

echo 'export HF_TOKEN=hf_your_token_here' >> ~/.bashrc
source ~/.bashrc
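
Once huggingface_hub is available (the vLLM install in the next step pulls it in as a dependency, or install it directly with pip), you can sanity-check the token before kicking off the long model download:

# Confirms HF_TOKEN is set and valid; HfApi reads the token from the environment
from huggingface_hub import HfApi

print(HfApi().whoami()["name"])  # prints your Hugging Face username on success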

Step 6: Install vLLM Nightly

Mistral Medium 3.5 requires vLLM nightly and its associated dependencies (mistral_common >= 1.11.1, transformers >= 5.4.0). The nightly build is the only version with full support for the Mistral 3.5 architecture.

First, install uv — the fast Python package manager used in Mistral's official setup instructions:

curl -Ls https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

Then install vLLM nightly:

uv pip install -U vllm \
   --torch-backend=auto \
   --extra-index-url https://wheels.vllm.ai/nightly

This will automatically pull in the required mistral_common and transformers versions. Verify both after installation:

python -c "import mistral_common; print(mistral_common.__version__)"
python -c "import transformers; print(transformers.__version__)"

You should see mistral_common >= 1.11.1 and transformers >= 5.4.0 printed. If either is below the required version, re-run the install command above.

Important: Do not use an older stable vLLM release. Mistral AI's model card explicitly requires the nightly build for this model — older versions will fail to load the architecture or produce degraded output.

Step 7: Launch the vLLM Inference Server

With vLLM installed, start the OpenAI-compatible inference server. The following command distributes the model across all 8 H100s using tensor parallelism:

vllm serve mistralai/Mistral-Medium-3.5-128B \
  --tensor-parallel-size 8 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000

Here is what each flag does:

  • --tensor-parallel-size 8: Shards the model across all 8 H100 GPUs
  • --tool-call-parser mistral: Parses tool calls into the structured format the OpenAI client expects; required for function calling in Step 13
  • --enable-auto-tool-choice: Allows the model to autonomously decide when to invoke tools provided via the OpenAI tools API
  • --reasoning-parser mistral: Separates reasoning traces from the final response; required for the reasoning-mode examples in Step 11
  • --max-model-len 32768: Sets the maximum context to 32,768 tokens for this deployment
  • --gpu-memory-utilization 0.90: Reserves 90% of GPU VRAM for model weights and KV cache
  • --host 0.0.0.0: Binds to all network interfaces so the API is reachable externally
  • --port 8000: Serves on port 8000

On --max-model-len: Mistral Medium 3.5 supports up to 256k tokens natively. However, serving the full 256k context requires substantial KV cache memory. Starting at --max-model-len 32768 is recommended for an 8×H100-80GB setup. You can increase this — try 65536 or 131072 — and monitor VRAM headroom with nvidia-smi until you find the practical ceiling for your workload.
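
To get a feel for how context length translates into KV-cache memory, here is a rough sizing sketch. The layer and head counts below are illustrative placeholders, not published specs for Mistral Medium 3.5; substitute the real values from the model's config.json after download:

# KV-cache sizing sketch. These architecture numbers are PLACEHOLDERS,
# not Mistral Medium 3.5's actual config; read them from config.json.
num_layers = 60      # hypothetical
num_kv_heads = 8     # hypothetical (grouped-query attention)
head_dim = 128       # hypothetical
kv_bytes = 2         # K and V at 1 byte each (FP8 KV cache)

bytes_per_token = num_layers * num_kv_heads * head_dim * kv_bytes
for ctx in (32_768, 65_536, 131_072):
    print(f"{ctx:>7} tokens -> {ctx * bytes_per_token / 1e9:.1f} GB KV cache per sequence")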

To run the server in the background so it persists after you disconnect:

nohup vllm serve mistralai/Mistral-Medium-3.5-128B \
  --tensor-parallel-size 8 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 \
  --port 8000 > vllm.log 2>&1 &

Monitor startup logs with:

tail -f vllm.log

Step 8: Monitor Startup

Model download and initialisation take 10–20 minutes on the first run, depending on network speed. The weights are large, so allow time for the full download before expecting the server to respond.

Watch the logs:

tail -f vllm.log

The server is ready when you see:

INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

You can verify GPU utilisation during loading:

nvidia-smi

All 8 GPUs should show high memory usage (~70–75 GB each) once the model is fully loaded.
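
If you are scripting the deployment, a small polling loop can wait for readiness instead of watching logs manually. This sketch uses only the Python standard library and the /v1/models endpoint queried in the next step:

# Poll the vLLM endpoint until it answers; adjust host/port if you changed them
import time
import urllib.request

URL = "http://localhost:8000/v1/models"

while True:
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            if resp.status == 200:
                print("vLLM server is ready")
                break
    except Exception:
        pass  # server still starting; keep waiting
    time.sleep(15)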

Step 9: Verify the Deployment

From inside the VM, confirm the server is responding:

curl http://localhost:8000/v1/models

You should receive a JSON response listing mistralai/Mistral-Medium-3.5-128B as an available model.

From your local machine, test the full round-trip with a chat completion. Replace [YOUR_VM_PUBLIC_IP] with your VM's IP address:

curl http://[YOUR_VM_PUBLIC_IP]:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "mistralai/Mistral-Medium-3.5-128B",
    "messages": [
      {
        "role": "user",
        "content": "Explain the difference between dense and mixture-of-experts transformer architectures in two paragraphs."
      }
    ],
    "max_tokens": 400,
    "temperature": 0.3
  }'

A successful response looks like this:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1771900000,
  "model": "mistralai/Mistral-Medium-3.5-128B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "A dense transformer activates all of its parameters for every token in every forward pass..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 210,
    "total_tokens": 238
  }
}

Step 10: Interacting via the Python OpenAI Client

Install the OpenAI Python client on your local machine or inside the VM:

pip install openai

Create a Python script (for example mistral_medium_test.py) and point it at your Hyperstack endpoint:

from openai import OpenAI

client = OpenAI(
    base_url="http://[YOUR_VM_PUBLIC_IP]:8000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5-128B",
    messages=[
        {
            "role": "system",
            "content": "You are an expert software engineer. Be concise and precise."
        },
        {
            "role": "user",
            "content": "Write a Python class implementing a thread-safe LRU cache with a configurable max size."
        }
    ],
    max_tokens=600,
    temperature=0.3,
    top_p=0.95,
)

print(response.choices[0].message.content)

Run it:

python mistral_medium_test.py

Step 11: Using Reasoning Mode

One of Mistral Medium 3.5's signature features is configurable reasoning effort. Pass reasoning_effort via the extra_body parameter:

from openai import OpenAI

client = OpenAI(
    base_url="http://[YOUR_VM_PUBLIC_IP]:8000/v1",
    api_key="EMPTY",
)

# High reasoning effort — for complex agentic or math tasks
response = client.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5-128B",
    messages=[
        {
            "role": "user",
            "content": "A train travels at 80 km/h. It needs to cover 340 km. If it stops for 15 minutes every 2 hours, how long does the full journey take?"
        }
    ],
    max_tokens=800,
    temperature=0.7,
    top_p=0.95,
    extra_body={"reasoning_effort": "high"},
)

print(response.choices[0].message.content)
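
Because the server was launched with --reasoning-parser mistral, vLLM returns the reasoning trace separately from the final answer. In recent vLLM builds this is exposed as a reasoning_content field on the message; the attribute name can vary by version, so the sketch below reads it defensively:

# Reasoning trace and final answer arrive as separate fields when a
# reasoning parser is enabled (field name per recent vLLM builds)
msg = response.choices[0].message
print("Reasoning trace:", getattr(msg, "reasoning_content", None))
print("Final answer:", msg.content)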

Use reasoning_effort="none" for fast, direct replies where extended chain-of-thought is not needed:

# Fast mode — for chat replies, summarisation, simple Q&A
response = client.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5-128B",
    messages=[
        {"role": "user", "content": "What is the capital of Portugal?"}
    ],
    max_tokens=50,
    temperature=0.2,
    extra_body={"reasoning_effort": "none"},
)

print(response.choices[0].message.content)

Step 12: Streaming Responses

For chat applications and long outputs, enable streaming for real-time token delivery:

from openai import OpenAI

client = OpenAI(
    base_url="http://[YOUR_VM_PUBLIC_IP]:8000/v1",
    api_key="EMPTY",
)

stream = client.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5-128B",
    messages=[
        {
            "role": "user",
            "content": "Explain how transformer attention mechanisms work, step by step."
        }
    ],
    max_tokens=600,
    temperature=0.5,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)

print()

Step 13: Using Function Calling

Mistral Medium 3.5 has native function calling support. Here is a complete example:

from openai import OpenAI
import json

client = OpenAI(
    base_url="http://[YOUR_VM_PUBLIC_IP]:8000/v1",
    api_key="EMPTY",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city name, e.g. London"
                  },
                  "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                  }
                },
                "required": ["city"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5-128B",
    messages=[
        {"role": "user", "content": "What is the weather like in Paris today?"}
    ],
    tools=tools,
    tool_choice="auto",
    temperature=0.2,
)

tool_call = response.choices[0].message.tool_calls[0]
print(f"Function called: {tool_call.function.name}")
print(f"Arguments: {tool_call.function.arguments}")

Step 14: Hibernating Your VM (Optional but Recommended)

When you are done with your workload, hibernate the VM to pause compute billing without losing your setup:

  • In the Hyperstack dashboard, locate your running VM.
  • Select the "Hibernate" option.
  • Confirm. Compute billing stops immediately, and your disk and configuration are preserved.

FAQs

What is Mistral Medium 3.5?

Mistral Medium 3.5 is Mistral AI's first flagship merged model — a single dense 128B checkpoint that combines instruction following, reasoning, coding, vision, and function calling in one set of weights. It replaces Mistral Medium 3.1, Magistral, and Devstral 2 as Mistral AI's primary production model.

What makes it a "merged" model?

Previous Mistral models were separate specialised checkpoints — one for instruction following, one for reasoning, one for coding. Mistral Medium 3.5 unifies these capabilities through a two-stage post-training process that trains domain specialists independently and then merges them into a single model, so you get strong performance across all tasks without choosing a variant.

Why does the tutorial use vLLM nightly instead of the stable release?

Mistral AI's model card explicitly requires vLLM nightly for Mistral Medium 3.5. The stable release does not include support for the Mistral 3.5 architecture. Using the stable release will result in load failures or degraded output.

What context length can I run on an 8×H100-80GB setup?

At --gpu-memory-utilization 0.90 with --tensor-parallel-size 8, the VRAM left after loading the weights supports roughly 32k–131k tokens of context, depending on batch size and concurrent requests. Start at 32768, increase in steps, and use nvidia-smi to confirm you have enough free VRAM before accepting production traffic.

What is the EAGLE model and should I use it?

Mistral AI released a companion EAGLE speculative decoding model that can be loaded alongside vLLM to accelerate inference by predicting multiple tokens per step. It reduces latency on interactive workloads at the cost of slightly higher VRAM usage. It is recommended for interactive chat deployments but not required for batch processing.

Is the API OpenAI-compatible?

Yes. vLLM exposes a fully OpenAI-compatible REST API. Any client that targets the OpenAI API can point at your Hyperstack endpoint by changing only base_url — the Python openai SDK, LangChain, LlamaIndex, and any other OpenAI-compatible tooling all work without code changes.

What license does Mistral Medium 3.5 use?

Mistral Medium 3.5 is released under a Modified MIT License that permits both commercial and non-commercial use, with exceptions for companies above a certain revenue threshold. Review the full license before deploying in a commercial context.
