mistralai/Mistral-Medium-3.5-128B
Mistral Medium 3.5 (128B) dense vision-language model with native FP8 weights and 256K context
Overview
Mistral-Medium-3.5 is a 128B dense vision-language model from Mistral AI. The
weights ship pre-quantized to FP8 (E4M3) with the vision tower, multimodal
projector, and lm_head retained in BF16. Image input is supported up to
1540x1540 (Pixtral-style encoder, patch size 14). Context length is 256K
via YaRN scaling (factor 64x over the 4K base).
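The figures above can be cross-checked with a little arithmetic. The sketch below only restates the numbers quoted in this overview; the helper names are illustrative, and the patch count ignores any patch merging the projector may apply downstream.

```python
# Figures taken from the overview above.
BASE_CONTEXT = 4 * 1024   # 4K base context window
YARN_FACTOR = 64          # YaRN scaling factor
MAX_IMAGE_SIDE = 1540     # maximum image side, in pixels
PATCH_SIZE = 14           # encoder patch size, in pixels


def extended_context(base: int, factor: int) -> int:
    """Context length after scaling the base window by `factor`."""
    return base * factor


def patch_grid(side: int, patch: int) -> tuple:
    """Return (patches per side, total patches) for a square image.

    Assumes the side is an exact multiple of the patch size.
    """
    per_side = side // patch
    return per_side, per_side * per_side


context = extended_context(BASE_CONTEXT, YARN_FACTOR)
per_side, total = patch_grid(MAX_IMAGE_SIDE, PATCH_SIZE)
print(context)          # 262144
print(per_side, total)  # 110 12100
```

The first result matches the default --max-model-len of 262144.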
Reasoning is opt-in per request via reasoning_effort: "high" — when set,
the model emits [THINK]...[/THINK] blocks that the Mistral reasoning
parser surfaces as message.reasoning_content. Tool calling uses the
[AVAILABLE_TOOLS] / [TOOL_CALLS] chat-template tokens.
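As a rough illustration of the tool-calling flow, the sketch below builds a request in the OpenAI chat-completions shape and parses tool calls out of a response message. The get_weather tool, its schema, and the sample response are hypothetical and hand-written for illustration; the server is assumed to be launched with --enable-auto-tool-choice and --tool-call-parser mistral.

```python
import json


def build_tool_request(question: str) -> dict:
    """Keyword arguments suitable for client.chat.completions.create(**kwargs)."""
    # Hypothetical tool: the name and schema are illustrative only.
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": "mistralai/Mistral-Medium-3.5-128B",
        "messages": [{"role": "user", "content": question}],
        "tools": [weather_tool],
        "tool_choice": "auto",
    }


def parse_tool_calls(message: dict) -> list:
    """Extract (name, arguments) pairs from a response message dict."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # Arguments arrive as a JSON-encoded string in the OpenAI schema.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


request = build_tool_request("What is the weather in Paris?")

# Hand-written message in the OpenAI response shape, not real model output.
sample_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
    }],
}
print(parse_tool_calls(sample_message))
```

With the openai Python SDK, the response message is an object rather than a dict; msg.model_dump() should give a dict that parse_tool_calls can consume.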
Prerequisites
- Hardware: 8xH200 (recommended), 4xB200, or 2-8xMI300X. A single B200 or MI300X also fits the weights (~134 GB raw) but leaves little room for the KV cache at 256K context; see the AMD MI300X and Troubleshooting sections below.
- vLLM nightly (Mistral 3.5 architecture support has not yet shipped in a stable release).
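To see why a single GPU is tight, here is a back-of-the-envelope headroom estimate. It is a sketch under stated assumptions: the per-GPU capacities are nominal datasheet values (192 GB for MI300X, 141 GB for H200), 0.9 is assumed as the memory-utilization fraction, and activation memory, graph capture, and allocator overhead are all ignored.

```python
WEIGHTS_GB = 134.0  # raw weight size quoted in the hardware bullet above


def kv_headroom_gb(gpu_mem_gb: float, num_gpus: int,
                   weights_gb: float = WEIGHTS_GB,
                   gpu_memory_utilization: float = 0.9) -> float:
    """Memory left across the pool after loading weights, in GB.

    A coarse upper bound on KV-cache space: activations and runtime
    overhead are not subtracted.
    """
    budget = gpu_mem_gb * num_gpus * gpu_memory_utilization
    return budget - weights_gb


# Nominal per-GPU capacities (assumed datasheet values, not measured).
pools = {
    "1xMI300X": (192.0, 1),
    "8xH200": (141.0, 8),
}
for name, (mem_gb, count) in pools.items():
    print(name, round(kv_headroom_gb(mem_gb, count), 1), "GB")
```

Under these assumptions a single MI300X is left with under 40 GB for the KV cache and everything else, which is consistent with the advice to cap --max-model-len on one GPU.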
Install vLLM
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly
This pulls in mistral_common >= 1.11.1 and transformers >= 5.4.0 automatically.
Launch command
8xH200 (or 8xB200):
vllm serve mistralai/Mistral-Medium-3.5-128B \
--tensor-parallel-size 8 \
--tokenizer_mode mistral --config_format mistral --load_format mistral \
--enable-auto-tool-choice --tool-call-parser mistral \
--reasoning-parser mistral
Useful flags:
- --max-model-len: default 262144; lower it (e.g. 65536) to free VRAM for larger batch sizes on tighter GPU pools.
- --language-model-only: skip the vision encoder entirely for text-only workloads.
- --mm-encoder-tp-mode data: run the small vision encoder data-parallel instead of tensor-parallel, which avoids the all-reduce overhead.
- --limit-mm-per-prompt.image N: cap images per request.
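If you launch the server from a script, it can help to assemble these flags programmatically. The helper below is a hypothetical sketch (build_serve_command is not part of vLLM); it only combines flags already shown in this guide and assumes the vision-only flags should be dropped when --language-model-only is set.

```python
import shlex
from typing import List, Optional


def build_serve_command(tp: int = 8,
                        max_model_len: Optional[int] = None,
                        text_only: bool = False,
                        images_per_prompt: Optional[int] = None) -> List[str]:
    """Assemble a `vllm serve` argv list from the flags in this guide."""
    cmd = [
        "vllm", "serve", "mistralai/Mistral-Medium-3.5-128B",
        "--tensor-parallel-size", str(tp),
        "--tokenizer_mode", "mistral",
        "--config_format", "mistral",
        "--load_format", "mistral",
        "--enable-auto-tool-choice",
        "--tool-call-parser", "mistral",
        "--reasoning-parser", "mistral",
    ]
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    if text_only:
        # Skip the vision encoder; vision-specific flags are not needed.
        cmd.append("--language-model-only")
    else:
        cmd += ["--mm-encoder-tp-mode", "data"]
        if images_per_prompt is not None:
            cmd += ["--limit-mm-per-prompt.image", str(images_per_prompt)]
    return cmd


# Print a shell-safe command line for a text-only, 64K-context server.
print(shlex.join(build_serve_command(max_model_len=65536, text_only=True)))
```

The returned list can be passed directly to subprocess.run, which avoids shell-quoting issues.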
EAGLE speculative decoding
Mistral ships a dedicated EAGLE draft head at
mistralai/Mistral-Medium-3.5-128B-EAGLE.
It is not enabled in the default launch command; toggle the spec_decoding feature in the command builder to add it.
Mistral's recommended serve command (from the EAGLE model card):
vllm serve mistralai/Mistral-Medium-3.5-128B --tensor-parallel-size 8 \
--tool-call-parser mistral --enable-auto-tool-choice --reasoning-parser mistral \
--max_num_batched_tokens 16384 --max_num_seqs 128 --gpu_memory_utilization 0.8 \
--speculative_config '{"model":"mistralai/Mistral-Medium-3.5-128B-EAGLE","num_speculative_tokens":3,"method":"eagle","max_model_len":"65536"}'
The draft model is a 2-layer Mistral-style head trained on the 128B target; it shares the tokenizer and runs at TP=8 alongside the target.
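The inline JSON passed to --speculative_config is easy to mis-quote in a shell. One option, sketched below, is to generate it with json.dumps and quote it with shlex. The helper name is hypothetical; the keys and default values mirror the serve command above, including max_model_len being passed as a string.

```python
import json
import shlex


def speculative_config_arg(num_speculative_tokens: int = 3,
                           max_model_len: int = 65536) -> str:
    """Serialize the EAGLE speculative-decoding config as compact JSON."""
    config = {
        "model": "mistralai/Mistral-Medium-3.5-128B-EAGLE",
        "num_speculative_tokens": num_speculative_tokens,
        "method": "eagle",
        # The serve command above passes this value as a string; mirrored here.
        "max_model_len": str(max_model_len),
    }
    return json.dumps(config, separators=(",", ":"))


arg = speculative_config_arg()
# shlex.quote wraps the JSON so it survives shell word splitting.
print("--speculative_config", shlex.quote(arg))
```

Changing num_speculative_tokens here is only a convenience for experimentation; the value 3 is the one given in the command above.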
Client usage
Reasoning + tool calling against the OpenAI-compatible endpoint:
from openai import OpenAI

# vLLM does not check the API key by default; any placeholder works.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

resp = client.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5-128B",
    messages=[{"role": "user", "content": "Plan a 3-day Paris trip."}],
    # Opt in to reasoning for this request only.
    extra_body={"reasoning_effort": "high"},
    temperature=0.7,
    max_tokens=4096,
)

msg = resp.choices[0].message
# reasoning_content is set by the reasoning parser and may be absent.
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)
Image input (vision):
# Reuses the `client` created in the previous example.
resp = client.chat.completions.create(
    model="mistralai/Mistral-Medium-3.5-128B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://..."}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=512,
)
print(resp.choices[0].message.content)
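For local files, the image_url field can also carry a base64 data URL instead of an HTTP link. The helpers below are a minimal sketch using only the standard library; the function names are illustrative, and resizing images to the supported resolution is left to the caller.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a `data:` URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(path: str, prompt: str) -> dict:
    """Build a user message with one local image followed by a text prompt."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
            {"type": "text", "text": prompt},
        ],
    }
```

Usage would look like messages=[image_message("photo.png", "Describe this image.")], where photo.png is a placeholder path.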
AMD MI300X (ROCm)
On MI300X, the command builder defaults to fitting the model on a single GPU, which is prone to OOM failures. Use one of the configurations below instead.
Single GPU - Use --tensor-parallel-size 1 and limit context, e.g.
--max-model-len 131072. Without that limit, KV-cache allocation fails at the
default 262144 context even in text-only mode. You can also reduce
--max-num-batched-tokens or change --gpu-memory-utilization if you still hit OOM.
Here is a text-only example:
docker run --device=/dev/kfd --device=/dev/dri \
--security-opt seccomp=unconfined --group-add video \
--privileged --ipc=host -p 8000:8000 \
-e VLLM_ROCM_USE_AITER=1 \
-e SAFETENSORS_FAST_GPU=1 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai-rocm:nightly mistralai/Mistral-Medium-3.5-128B \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral \
--tensor-parallel-size 1 \
--no-enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser mistral \
--reasoning-parser mistral \
--max-model-len 131072 \
--language-model-only
Text-only - Set --tensor-parallel-size to the number of GPUs you wish to use:
docker run --device=/dev/kfd --device=/dev/dri \
--security-opt seccomp=unconfined --group-add video \
--privileged --ipc=host -p 8000:8000 \
-e VLLM_ROCM_USE_AITER=1 \
-e SAFETENSORS_FAST_GPU=1 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai-rocm:nightly mistralai/Mistral-Medium-3.5-128B \
--tokenizer_mode mistral \
--config_format mistral \
--load_format mistral \
--tensor-parallel-size 8 \
--no-enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser mistral \
--reasoning-parser mistral \
--language-model-only
For lm_eval against a text-only server, pass fix_mistral_regex=True in
--model_args (Mistral 3.5 tokenizer quirk).
Multimodal (text + image) - Enable image inputs and cap the number of images per prompt (1 in this example):
docker run --device=/dev/kfd --device=/dev/dri \
--security-opt seccomp=unconfined --group-add video \
--privileged --ipc=host -p 8000:8000 \
-e SAFETENSORS_FAST_GPU=1 \
-e VLLM_ROCM_USE_AITER=1 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai-rocm:nightly mistralai/Mistral-Medium-3.5-128B \
--tokenizer_mode mistral --config_format mistral --load_format mistral \
--tensor-parallel-size 8 \
--no-enable-prefix-caching \
--enable-mm \
--limit-mm-per-prompt '{"image": 1}' \
--enable-auto-tool-choice --tool-call-parser mistral \
--reasoning-parser mistral
MI300X benchmarking (text-only)
Serving benchmark: vllm bench serve with 100 requests, max concurrency 32,
1024 input + 1024 output tokens per request.
Accuracy: lm_eval GSM8k (5-shot), local-completions backend.
| TP | Output tok/s | Mean TTFT (ms) | Mean TPOT (ms) | GSM8k flexible | GSM8k strict |
|---|---|---|---|---|---|
| 1 | 438 | 5719 | 55.2 | 0.928 | 0.876 |
| 2 | 530 | 5310 | 45.0 | 0.923 | 0.867 |
| 4 | 729 | 3317 | 33.0 | 0.928 | 0.873 |
| 8 | 915 | 2522 | 26.1 | 0.929 | 0.876 |
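To read the scaling trend off the table, the sketch below computes throughput speedup, parallel efficiency, and per-GPU throughput relative to TP=1. The numbers are copied from the table; the function and field names are illustrative.

```python
# (TP size, output tok/s, mean TTFT ms, mean TPOT ms), copied from the table above.
ROWS = [
    (1, 438, 5719, 55.2),
    (2, 530, 5310, 45.0),
    (4, 729, 3317, 33.0),
    (8, 915, 2522, 26.1),
]


def scaling_summary(rows):
    """Compute speedup and efficiency relative to the first row."""
    base_tp, base_tput = rows[0][0], rows[0][1]
    summary = []
    for tp, tput, _ttft, _tpot in rows:
        speedup = tput / base_tput
        summary.append({
            "tp": tp,
            "speedup": speedup,
            # Efficiency: achieved speedup divided by the ideal linear speedup.
            "efficiency": speedup / (tp / base_tp),
            "tok_s_per_gpu": tput / tp,
        })
    return summary


summary = scaling_summary(ROWS)
for row in summary:
    print(f"TP={row['tp']}: {row['speedup']:.2f}x speedup, "
          f"{row['efficiency']:.0%} efficiency, "
          f"{row['tok_s_per_gpu']:.1f} tok/s per GPU")
```

At this workload, TP=8 delivers roughly 2.1x the output throughput of TP=1, so higher TP mainly buys lower TTFT and TPOT rather than proportional throughput.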
Troubleshooting
- OOM at full 256K context on H200 or MI300X: drop --max-model-len to 131072 or 65536, or set --language-model-only if you don't need vision.
- reasoning_effort rejected: only "none" and "high" are accepted by the chat template; anything else raises an exception.
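To catch an invalid reasoning_effort before the request reaches the server, a small client-side check can help. This is a hypothetical helper, assuming the two accepted values stated above are the complete set.

```python
# Values the chat template accepts, per the troubleshooting note above.
ACCEPTED_REASONING_EFFORT = ("none", "high")


def reasoning_extra_body(effort: str = "none") -> dict:
    """Return an `extra_body` dict, rejecting unsupported effort values early."""
    if effort not in ACCEPTED_REASONING_EFFORT:
        raise ValueError(
            f"reasoning_effort must be one of {ACCEPTED_REASONING_EFFORT}, "
            f"got {effort!r}"
        )
    return {"reasoning_effort": effort}


print(reasoning_extra_body("high"))  # {'reasoning_effort': 'high'}
```

The result can be passed as extra_body=reasoning_extra_body("high") in the client call.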