Qwen3.7-Max: Built for the Agent Era, Not the Chat Era

Qwen just dropped Qwen3.7-Max, and it’s not another incremental chatbot upgrade. This model is purpose-built for something different: being an agent. While most frontier models are optimized for single-turn question answering or chat, Qwen3.7-Max targets the exploding demand for models that can do things — write and debug code across entire repositories, automate office workflows, and sustain coherent execution across hundreds or even thousands of tool calls.

Released May 20, 2026, Qwen3.7-Max sits squarely in the “agent era” of AI. The benchmarks tell part of the story, but the real signal is in how it was trained and what it can sustain over time. Let’s break down what matters.

What Makes Qwen3.7-Max Different

Most LLM benchmarks measure what happens in a single interaction. You ask a question, you get an answer, you score it. Agent work is fundamentally different — a model might need to read files, run tests, edit code, debug failures, and iterate for hours. Qwen3.7-Max was trained specifically for this loop using what the team calls environment scaling: exposing the model to a massive diversity of real-world agent environments during training, then measuring generalization to entirely unseen environments at evaluation time.

The key insight mirrors pretraining itself: just as language models generalize from diverse text, agent capabilities generalize from diverse training environments. The results suggest this actually works.
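
One way to picture that evaluation protocol is a split made at the level of whole environments rather than individual tasks, so nothing the model is scored on was available during training. The sketch below is purely illustrative: the environment names, the `split_environments` helper, and the 20% held-out fraction are assumptions, not details published by the Qwen team.

```python
import random

def split_environments(env_ids, held_out_fraction=0.2, seed=0):
    """Split environment IDs so held-out environments never appear in training."""
    ids = sorted(set(env_ids))      # deduplicate and fix an order for reproducibility
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_held_out = max(1, int(len(ids) * held_out_fraction))
    return ids[n_held_out:], ids[:n_held_out]   # (train, held_out)

# Hypothetical environment names, for illustration only.
envs = [f"env-{i:03d}" for i in range(50)]
train_envs, eval_envs = split_environments(envs)

# The two sets must not overlap, otherwise "unseen" would not mean unseen.
assert not set(train_envs) & set(eval_envs)
print(len(train_envs), "training environments,", len(eval_envs), "held out")
```

Splitting by environment ID, rather than by task, is what makes the held-out score a measure of generalization instead of memorization.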

The Benchmarks That Matter for Agents

Qwen3.7-Max was evaluated against Opus-4.6 Max, Kimi K2.6 Thinking, GLM-5.1 Thinking, and DeepSeek V4 Pro Max. Here’s where it stands out:

Coding Agent Performance

Terminal Bench 2.0: 69.7 — ahead of DeepSeek V4 Pro Max (67.9) and Opus-4.6 Max (65.4)
SWE-Pro: 60.6 — top of the leaderboard, edging out Kimi K2.6 (59.5)
SWE-Multilingual: 78.3 — leading, ahead of Opus-4.6 Max (77.5)
SciCode: 53.5 — best in class, ahead of Kimi K2.6 (52.2)

On SWE-Verified, the most widely cited SWE benchmark, Qwen3.7-Max scores 80.4, essentially tied with Opus-4.6 Max (80.8) and DeepSeek V4 Pro Max (80.6): all three sit within 0.4 points of each other.

General Agent Benchmarks

MCP-Mark: 60.8 — best-in-class on tool-use through Model Context Protocol
Skillsbench: 59.2 — significant lead over Kimi K2.6 (56.2)
SpreadSheetBench-v1: 87.0 — near the top for office automation tasks
Kernel Bench L3: 1.98x median speedup with 96% win rate — generating production-grade GPU kernels

Reasoning

The reasoning numbers are arguably the most striking:

GPQA Diamond: 92.4 — ahead of Opus-4.6 Max (91.3)
HMMT 2026 Feb: 97.1 — top score, beating Opus-4.6 Max (96.2)
Apex: 44.5 (6.2 points ahead of DeepSeek V4 Pro Max at 38.3, and 10.0 ahead of Opus-4.6 Max at 34.5)

The 35-Hour Kernel Optimization Experiment

Raw benchmarks are one thing. What’s genuinely novel is the 35-hour autonomous kernel optimization experiment. The team gave Qwen3.7-Max a real task: optimize the Extend Attention operator in SGLang for a T-Head ZW-M890 PPU — a hardware platform the model had never seen during training. No profiling data, no documentation, no example kernels.

Over 35 hours of continuous execution, the model made 1,158 tool calls across 432 kernel evaluations. It wrote code, compiled it, diagnosed failures, identified bottlenecks through runtime profiling, and redesigned the kernel architecture multiple times. The result: a 10x geometric mean speedup over the Triton reference implementation.
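
For readers unfamiliar with the metric, a geometric mean speedup averages per-workload speedup ratios multiplicatively, so one unusually favorable input shape cannot dominate the headline number. The sketch below shows how such a figure is computed. The workload names and latencies are made up for illustration and are not data from the experiment.

```python
import math

def geometric_mean_speedup(baseline_ms, optimized_ms):
    """Geometric mean of per-workload speedups (baseline time / optimized time)."""
    ratios = [baseline_ms[k] / optimized_ms[k] for k in baseline_ms]
    # Averaging logs and exponentiating is equivalent to the n-th root of the product.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical latencies in milliseconds; NOT data from the Qwen experiment.
baseline = {"short_prefix": 8.0, "long_prefix": 40.0, "mixed_batch": 27.0}
optimized = {"short_prefix": 2.0, "long_prefix": 5.0, "mixed_batch": 1.0}

speedup = geometric_mean_speedup(baseline, optimized)
print(f"{speedup:.2f}x")
```

With these made-up ratios (4x, 8x, 27x), the arithmetic mean would be 13x while the geometric mean is roughly 9.5x, which is why the geometric mean is the more conservative summary for speedups.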

For comparison, under identical conditions: GLM-5.1 reached 7.3x, Kimi K2.6 reached 5.0x, DeepSeek V4 Pro reached 3.3x, and Qwen3.6-Plus managed only 1.1x. Qwen3.7-Max was still finding meaningful improvements after 30+ hours; it didn’t plateau early and coast.

Cross-Scaffold Generalization

One persistent problem with agent models is overfitting to a specific scaffold. A model might perform brilliantly with one framework but fall apart with another. Qwen3.7-Max addresses this through a training architecture that decouples tasks, harnesses, and verifiers — the same task is paired with diverse harnesses during training, forcing the model to learn generalizable strategies rather than scaffold-specific shortcuts.
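
A minimal sketch of what that decoupling could look like as a data model follows. Every name here (`Task`, `Harness`, `Verifier`, `build_episodes`, and the example tasks and tool names) is hypothetical and chosen for illustration; the actual training infrastructure has not been published.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable

@dataclass(frozen=True)
class Task:
    name: str
    prompt: str

@dataclass(frozen=True)
class Harness:
    name: str
    tools: tuple[str, ...]          # tool names this scaffold exposes to the model

@dataclass(frozen=True)
class Verifier:
    name: str
    check: Callable[[str], bool]    # scores the final output, independent of the harness

def build_episodes(tasks, harnesses, verifier_for):
    """Pair every task with every harness; the verifier depends only on the task."""
    return [(t, h, verifier_for[t.name]) for t, h in product(tasks, harnesses)]

# Hypothetical tasks, harnesses, and verifiers, for illustration only.
tasks = [
    Task("fix-bug", "Make the failing test pass."),
    Task("add-cli-flag", "Add a --verbose flag."),
]
harnesses = [
    Harness("harness-a", ("bash", "edit")),
    Harness("harness-b", ("run_command", "apply_patch", "read_file")),
]
verifiers = {
    "fix-bug": Verifier("tests-pass", lambda out: "PASSED" in out),
    "add-cli-flag": Verifier("flag-present", lambda out: "--verbose" in out),
}

episodes = build_episodes(tasks, harnesses, verifiers)
print(len(episodes), "episodes")
```

Because the verifier is keyed on the task alone, the same success criterion applies whichever harness the model is dropped into, so a strategy that only works under one tool vocabulary earns no extra credit.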

In practice, Qwen3.7-Max works with Claude Code, OpenClaw, and Qwen Code out of the box. The API supports both OpenAI-compatible and Anthropic-compatible protocols, so integration with existing tooling is straightforward.

Getting Started with the API

Qwen3.7-Max is available (or coming soon, depending on region) through Alibaba Cloud Model Studio. The API follows the standard OpenAI chat completions format with a few additions for agentic use:

from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

messages = [
    {"role": "user", "content": "Refactor this module to use async/await throughout."}
]

completion = client.chat.completions.create(
    model="qwen3.7-max",
    messages=messages,
    extra_body={
        "enable_thinking": True,
        # preserve_thinking keeps reasoning from prior turns
        # recommended for multi-step agent tasks
        "preserve_thinking": True,
    },
    stream=True,
)

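for chunk in completion:
    if not chunk.choices:
        # A chunk can arrive with no choices (for example a usage-only chunk); skip it
        continue
    delta = chunk.choices[0].delta
    # reasoning_content is not a standard OpenAI field, so read it defensively
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        # Internal reasoning: useful for debugging agent behavior
        pass
    if delta.content:
        print(delta.content, end="", flush=True)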

The preserve_thinking flag is worth highlighting. For agentic workflows where the model makes multiple tool calls across many turns, preserving the chain-of-thought reasoning from previous turns helps maintain coherence. Without it, long-running agents tend to lose context and repeat mistakes.
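
To make "multiple tool calls across many turns" concrete, here is a sketch of the client-side half of such a loop: executing the calls the model requests and appending the results to the conversation in the OpenAI chat completions message format. The `word_count` tool and the `run_tool_calls` helper are hypothetical, and the assistant turn is hand-written so the example runs offline.

```python
import json

# Hypothetical local tool; a real agent would expose file, shell, or test-runner tools.
def word_count(text: str) -> int:
    return len(text.split())

TOOLS = {"word_count": word_count}

def run_tool_calls(tool_calls, registry=TOOLS):
    """Execute each requested call and return OpenAI-style `tool` messages."""
    results = []
    for call in tool_calls:
        name = call["function"]["name"]
        try:
            if name not in registry:
                raise ValueError(f"unknown tool {name!r}")
            # Arguments arrive as a JSON-encoded string, not a dict.
            args = json.loads(call["function"]["arguments"])
            content = json.dumps({"result": registry[name](**args)})
        except Exception as exc:
            # Report the failure back to the model instead of crashing the loop.
            content = json.dumps({"error": str(exc)})
        results.append({"role": "tool", "tool_call_id": call["id"], "content": content})
    return results

# Hand-written stand-in for an assistant turn that requests a tool call.
assistant_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "word_count", "arguments": '{"text": "agents need tools"}'},
    }],
}

messages = [{"role": "user", "content": "How many words are in 'agents need tools'?"}]
messages.append(assistant_turn)
messages.extend(run_tool_calls(assistant_turn["tool_calls"]))
print(messages[-1])
```

In a live agent, the extended `messages` list would be sent back through `client.chat.completions.create(...)` with a `tools` schema, repeating until the model replies without `tool_calls`. How `preserve_thinking` interacts with the history you resend is worth confirming in the Model Studio documentation rather than assuming.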

Using Qwen3.7-Max with Claude Code

Because the API is Anthropic-compatible, you can drop Qwen3.7-Max directly into Claude Code as a backend:

npm install -g @anthropic-ai/claude-code

export ANTHROPIC_MODEL="qwen3.7-max"
export ANTHROPIC_SMALL_FAST_MODEL="qwen3.7-max"
export ANTHROPIC_BASE_URL="https://dashscope-intl.aliyuncs.com/apps/anthropic"
export ANTHROPIC_AUTH_TOKEN="your_api_key"

claude

This is a practical pattern if you want to benchmark Qwen3.7-Max against Claude’s own models on real-world coding tasks — same interface, same tooling, different backend.
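
If you would rather script against the Anthropic-compatible endpoint than go through Claude Code, the request can be sketched with the standard library alone. This is a sketch under assumptions: the `/v1/messages` path is what Anthropic clients conventionally append to the base URL, the bearer header mirrors how `ANTHROPIC_AUTH_TOKEN` is normally sent, and neither has been verified against this endpoint. The `build_messages_request` helper is hypothetical, and the example only builds the request without sending it.

```python
import json
import os
import urllib.request

BASE_URL = "https://dashscope-intl.aliyuncs.com/apps/anthropic"  # from the setup above

def build_messages_request(prompt, api_key, model="qwen3.7-max", max_tokens=1024):
    """Build (but do not send) an Anthropic Messages API style request."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/messages",  # assumed path, not confirmed for this endpoint
        data=json.dumps(body).encode("utf-8"),
        headers={
            "content-type": "application/json",
            "anthropic-version": "2023-06-01",
            "authorization": f"Bearer {api_key}",  # assumed to mirror ANTHROPIC_AUTH_TOKEN
        },
        method="POST",
    )

req = build_messages_request(
    "Say hello.",
    os.environ.get("ANTHROPIC_AUTH_TOKEN", "your_api_key"),
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req), then json.load() the response.
```

Keeping the request construction separate from the network call makes it easy to inspect the payload, or to point the same code at a different Anthropic-compatible backend when comparing models.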

What to Watch For

Qwen3.7-Max is a proprietary model with no open weights. It’s available through Alibaba Cloud Model Studio, with endpoints in Beijing, Singapore, and Virginia. The OpenClaw integration config lists a 1M-token context window, though the default API context limit may differ. Pricing hasn’t been publicly detailed yet, but expect it to be competitive with other frontier proprietary models.

The “environment scaling” training approach is the most interesting technical contribution. The team claims that scaling is predictable, in the sense that performance gains on any subset of benchmarks reliably predict gains on the rest. If that holds, it suggests a path to more systematic agent capability development. The upcoming technical report should provide more detail on this methodology.

The bottom line: if you’re building agent systems and evaluating backbone models, Qwen3.7-Max deserves a spot in your benchmarking pipeline. The cross-scaffold generalization and long-horizon coherence are real differentiators, and the Anthropic-compatible API means you can test it with minimal integration work.