vLLM v0.23.0: Model Runner V2, Multi-Tier KV Offloading, and the Growing Rust Frontend

The vLLM v0.23.0 release landed last week with 408 commits from 200 contributors, and it packs several changes that directly affect how you serve models in production. Model Runner V2 is now the default for Llama and Mistral dense models, multi-tier KV cache offloading got an object-store secondary tier, and the experimental Rust frontend added features that make it closer to production-ready. Here’s what matters and how it affects your deployment.

Model Runner V2: Default for More Architectures

Model Runner V2 (MRv2) was already the default for Qwen3 dense models. In v0.23.0, it expands to cover Llama and Mistral dense models as well. MRv2 brings several improvements under the hood:

A FlashInfer-based sampler replaces the legacy sampler implementation, which reduces sampling latency. Breakable CUDA graphs mean the engine can break out of the graph for dynamic workloads (like requests with widely varying sequence lengths) without sacrificing the throughput gains that CUDA graphs provide for the common case. Pipeline-parallel bubble elimination reduces idle time when using pipeline parallelism across multiple GPUs — a meaningful improvement when you’re running large models that don’t fit on a single GPU.
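
To get a feel for why pipeline bubbles matter, here is a back-of-the-envelope sketch. It uses the textbook idle fraction for a simple fill-and-drain pipeline schedule, (p - 1) / (m + p - 1) for p stages and m micro-batches. That formula is a generic model chosen for illustration, not a description of how MRv2 schedules work, and the function name is made up.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a simple fill-and-drain pipeline schedule."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("need at least one stage and one micro-batch")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


# With 4 pipeline stages, more in-flight micro-batches shrink the bubble.
for m in (1, 4, 16):
    print(f"4 stages, {m:>2} micro-batches: {bubble_fraction(4, m):.0%} idle")
```

Under this model a single stage has no bubble at all, and the idle share grows with pipeline depth, which is why bubble reduction matters most for models split across several GPUs.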

If you’re serving Llama or Mistral models with vLLM, you get these improvements automatically after upgrading. If you rely on features not yet supported in MRv2, vLLM falls back to MRv1 transparently.

# Upgrade vLLM
pip install vllm==0.23.0

# Launch a Llama 3.1 70B server — MRv2 is now the default
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9

Multi-Tier KV Cache Offloading

One of the more significant architectural changes in this release is the expansion of the multi-tier KV cache offloading framework. The previous release introduced a Python filesystem secondary tier, which sits behind the CPU-memory tier and lets KV cache spill to a filesystem. v0.23.0 adds an object-store secondary tier, so KV cache pages can also be offloaded to remote storage (S3, GCS, or any object-store-compatible backend) when CPU memory isn’t enough.

The Hybrid Memory Allocator (HMA) is now enabled by default for capable connectors, and the per-request offloading policy hook (on_new_request) lets you decide which requests get offloaded based on priority, latency budget, or model size. This is useful when you’re serving multiple models or handling a mix of short and long-context requests: you can keep short requests in GPU memory and offload long-context ones to CPU or remote storage.
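
The kind of decision such a per-request policy makes can be sketched in plain Python. The release notes name the hook but this post does not show its signature, so the snippet below is standalone: RequestInfo, should_offload, and both thresholds are hypothetical names and values, not part of vLLM.

```python
from dataclasses import dataclass


@dataclass
class RequestInfo:
    """Hypothetical view of an incoming request; not a vLLM type."""
    prompt_tokens: int
    priority: int              # 0 = most urgent (assumed convention)
    latency_budget_ms: float


def should_offload(req: RequestInfo,
                   long_context_tokens: int = 8192,
                   tight_budget_ms: float = 200.0) -> bool:
    """Return True if this request's KV cache may leave GPU memory."""
    # Urgent or latency-critical requests stay on the GPU.
    if req.priority == 0 or req.latency_budget_ms < tight_budget_ms:
        return False
    # Long-context requests are the ones worth offloading.
    return req.prompt_tokens >= long_context_tokens


short = RequestInfo(prompt_tokens=512, priority=1, latency_budget_ms=1000.0)
long_ctx = RequestInfo(prompt_tokens=32000, priority=1, latency_budget_ms=5000.0)
print(should_offload(short), should_offload(long_ctx))
```

A real policy would plug logic like this into the connector's hook; check the vLLM source for the actual arguments it receives.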

# Multi-tier KV cache offloading with an object-store secondary tier
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --kv-transfer-config '{
    "kv_connector": "OffloadingConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
      "spec_name": "TieringOffloadingSpec",
      "cpu_bytes_to_use": 10000000000,
      "secondary_tiers": [{
        "type": "obj",
        "bucket": "your-bucket",
        "endpoint_override": "your-s3-endpoint:9000",
        "access_key": "YOUR_ACCESS_KEY",
        "secret_key": "YOUR_SECRET_KEY",
        "scheme": "http"
      }]
    }
  }'
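
Hand-quoting nested JSON in a shell is easy to get wrong, so one option is to build the same configuration in Python and let json.dumps and shlex.join handle the escaping. This is a sketch that reuses the keys from the command above; the environment variable names KV_S3_ACCESS_KEY and KV_S3_SECRET_KEY are placeholders invented for this example.

```python
import json
import os
import shlex

# Same keys as the shell example; credentials are read from the environment
# (the variable names are placeholders chosen for this sketch).
kv_transfer_config = {
    "kv_connector": "OffloadingConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "spec_name": "TieringOffloadingSpec",
        "cpu_bytes_to_use": 10 * 10**9,  # 10 GB, matching the shell example
        "secondary_tiers": [{
            "type": "obj",
            "bucket": "your-bucket",
            "endpoint_override": "your-s3-endpoint:9000",
            "access_key": os.environ.get("KV_S3_ACCESS_KEY", "YOUR_ACCESS_KEY"),
            "secret_key": os.environ.get("KV_S3_SECRET_KEY", "YOUR_SECRET_KEY"),
            "scheme": "http",
        }],
    },
}

command = [
    "vllm", "serve", "Qwen/Qwen2.5-7B-Instruct",
    "--kv-transfer-config", json.dumps(kv_transfer_config),
]
# shlex.join quotes the JSON so the line can be pasted into a shell.
print(shlex.join(command))
```

Note that the printed command contains the secret key in clear text, so treat the output like any other credential.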

The Rust Frontend Grows Up

The experimental Rust frontend has been gaining features steadily. In v0.23.0, it added critical pieces needed for real-world serving: a streaming generate endpoint, dynamic LoRA endpoints, and server-introspection endpoints (/version and /server_info). The server-router extension hook and request-ID headers make it viable for load-balanced deployments behind a reverse proxy.

New tool-call parsers landed for InternLM2, hy_v3, Phi-4-mini, and Gemma 4, expanding the set of models that work with function calling through the Rust frontend. The Rust frontend is still marked as experimental, but it’s closing in on feature parity with the Python OpenAI-compatible server — and it offers lower tail latency for request routing and serialization.

# Build the Rust frontend binary
cd vllm/rust && cargo build --release

# Point vLLM to the Rust frontend via environment variable
export VLLM_RUST_FRONTEND_PATH=/path/to/vllm/target/release/vllm-frontend
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct --port 8080

# The Rust server exposes the same OpenAI-compatible API
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Explain KV cache offloading in LLM inference",
    "max_tokens": 256,
    "stream": true
  }'
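
The same streaming request can be consumed from Python with only the standard library. This sketch assumes the server emits the usual OpenAI-style server-sent events ("data: {json}" lines ending with "data: [DONE]"), which follows from the API compatibility described above but is not something this post verifies for the Rust frontend; parse_sse_line and stream_completion are names made up for the example.

```python
import json
import urllib.request
from typing import Optional


def parse_sse_line(raw: bytes) -> Optional[str]:
    """Return the text delta from one streamed line, or None if there is none."""
    line = raw.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None                      # blank separator or comment line
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None                      # end-of-stream sentinel
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    return choices[0].get("text") if choices else None


def stream_completion(prompt: str, base_url: str = "http://localhost:8080") -> None:
    """POST a streaming completion request and print tokens as they arrive."""
    body = json.dumps({
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "prompt": prompt,
        "max_tokens": 256,
        "stream": True,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        for raw in response:             # the response yields one line at a time
            text = parse_sse_line(raw)
            if text:
                print(text, end="", flush=True)
    print()
```

With a server running on port 8080, calling stream_completion("Explain KV cache offloading in LLM inference") prints the completion incrementally.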

DeepSeek-V4 Production Readiness

Following its introduction in v0.22.0, DeepSeek-V4 received a substantial hardening pass. The sparse Multi-head Latent Attention (MLA) metadata is now decoupled from DeepSeek-V3.2, which means both models can coexist in the same deployment without conflicts. A TRTLLM-gen attention kernel, EPLB (Expert Parallel Load Balancing) support for the Mega-MoE architecture, and selective prefix-cache retention for the sliding-window KV cache all contribute to better throughput and memory efficiency.

The model has also been detached from torch.compile, and its attention and RoPE (Rotary Position Embedding) paths have been refactored. If you’re running DeepSeek-V4 in production, this release resolves several stability and performance issues from the initial support.

New Models and Transformers v5 Compatibility

v0.23.0 adds support for several new models: Step-3.7-Flash, Cosmos3 Reasoner, Gemma 4 Unified (encoder-free), JetBrains Mellum v2, Granite Speech Plus, and Cohere Mini Code. The Gemma 4 support includes native ViT linear layers, vision-embedder exclusion from quantization, and fixes for multi-GPU and batched processing.

On the compatibility side, vLLM now targets Transformers v5, with vendored MiniCPM-V/O processors and compatibility fixes for several models. If you’re using a recent version of the Transformers library, this release should integrate more smoothly.

Speculative Decoding Improvements

Speculative decoding, where a smaller draft model proposes tokens that a larger target model verifies in parallel, got several fixes and optimizations. Causal DFlash speculative decoding is now supported, and lookahead slots are now allocated properly, which fixes a correctness issue from previous versions. The drafter can select its attention backend independently of the target model, and attention-group splitting gives you more flexibility in configuring your speculative decoding pipeline.

from vllm import LLM, SamplingParams

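# Speculative decoding with a draft model.
# Recent vLLM releases take the draft settings as a speculative_config dict
# rather than the older speculative_model / num_speculative_tokens kwargs.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    speculative_config={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "num_speculative_tokens": 5,
        "max_model_len": 8192,
    },
)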

# Tokens are proposed by the 8B draft and verified by the 70B target
outputs = llm.generate(
    ["Write a concise explanation of speculative decoding."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
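
To make the propose-and-verify idea concrete, here is a toy version of one speculative step under greedy decoding. It is a conceptual sketch, not vLLM's implementation: the two "models" are plain functions over integer tokens, verification is a loop instead of a single batched forward pass, and every name is invented for the example.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken,
                     k: int) -> List[int]:
    """Run one draft-and-verify round and return the tokens to append."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each position in order.
    accepted: List[int] = []
    for token in proposed:
        expected = target(prefix + accepted)
        if token != expected:
            accepted.append(expected)   # replace the first mismatch and stop
            return accepted
        accepted.append(token)

    # 3. All k proposals matched, so the target contributes one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


# Toy "models": the target counts up by one; the draft agrees except that
# it skips a number whenever the previous token is a multiple of 3.
def target(tokens: List[int]) -> int:
    return tokens[-1] + 1


def draft(tokens: List[int]) -> int:
    return tokens[-1] + (2 if tokens[-1] % 3 == 0 else 1)


sequence = [1]
while len(sequence) < 10:
    sequence += speculative_step(sequence, draft, target, k=3)
print(sequence)
```

The point of the sketch is that, with greedy decoding, the final sequence is exactly what the target alone would have produced; the draft only changes how many target-verified tokens are emitted per round.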

Hardware-Specific Optimizations

Beyond the headline features, there are targeted performance improvements across hardware stacks. On NVIDIA, CUTLASS FP8 scaled-mm padding bypass gives a ~20% improvement for certain workloads, and MoE-permute buffer pre-allocation adds 9–14% throughput for mixture-of-experts models. The Triton MoE backend is now the default on Hopper GPUs.

AMD ROCm support caught up with native W4A16 and fused-MoE W4A16 kernels for RDNA3 (gfx1100). Intel XPU gained a transparent sleep mode, block_fp8_moe quantization, and Triton selective-scan operations. CPU inference got zentorch-accelerated W8A8/W4A16 on AMD Zen CPUs, plus Triton-based top-k/top-p sampling.

What This Means for Your Deployments

If you’re running vLLM in production, v0.23.0 is a worthwhile upgrade. MRv2 becoming the default for Llama and Mistral dense models, two widely used open model families, means lower sampling latency without any configuration changes. Multi-tier KV offloading opens the door to serving larger models or more concurrent requests on the same hardware. And the Rust frontend is approaching the point where you can consider it for latency-sensitive production workloads.

If you’re still on v0.22.x, check the v0.22.1 patch release notes first — it fixed a few regressions in multi-node data-parallel serving and DeepSeek-V4 initialization. Then upgrade to v0.23.0 for the feature improvements.
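
Before upgrading, it can help to confirm which version is actually installed in the serving environment. The snippet below is a small sketch using importlib.metadata from the standard library; upgrade_advice is a made-up helper, and the advice for versions older than v0.22.0 is an assumption added for completeness rather than something the release notes state.

```python
import re
from importlib import metadata


def upgrade_advice(installed: str) -> str:
    """Map an installed vLLM version string to the upgrade path described above."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", installed)
    if match is None:
        return f"unrecognized version string: {installed}"
    version = tuple(int(part) for part in match.groups())
    if version >= (0, 23, 0):
        return "already on v0.23.0 or newer"
    if version >= (0, 22, 0):
        return "read the v0.22.1 notes, then upgrade to v0.23.0"
    # Assumption: older installs should review the intermediate releases too.
    return "older than v0.22.x; review intermediate release notes first"


try:
    print(upgrade_advice(metadata.version("vllm")))
except metadata.PackageNotFoundError:
    print("vllm is not installed in this environment")
```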

The vLLM project continues to move fast — 83k stars on GitHub and growing. The release notes span everything from kernel-level optimizations to serving architecture changes, and the breadth of supported hardware (NVIDIA, AMD, Intel, CPU, TPU) reflects the reality that production LLM serving runs on diverse infrastructure.
