The open-source LLM landscape just got a new heavyweight contender. Z.ai (Zhipu AI) released GLM-5.2, a 753B-parameter mixture-of-experts model that currently sits at the top of Artificial Analysis’s Intelligence Index among all open-weight models. With an MIT license, a solid 1M-token context window, and strong benchmark scores across reasoning, coding, and agentic tasks, GLM-5.2 is worth understanding — especially if you’re evaluating models for production workloads or running inference yourself.
The Numbers at a Glance
GLM-5.2 is a text-only, reasoning-capable model built on a DeepSeek Sparse Attention (DSA) architecture. Here are the key specs:
- Total parameters: 753B (MoE)
- Active parameters per token: 40B
- Context window: 1M tokens
- License: MIT — fully open, no regional restrictions
- Output speed: ~112 tokens/second
- API pricing: $1.40 per 1M input tokens, $4.40 per 1M output tokens
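To make the pricing and speed figures concrete, here is a rough back-of-the-envelope sketch. The prices, throughput, and parameter counts come from the list above; the function names and the example request size are hypothetical, and the timing assumes a single stream sustaining the quoted ~112 tokens/second, which real deployments may not match.

```python
# Figures quoted in the spec list above.
INPUT_PRICE_PER_M = 1.40      # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 4.40     # USD per 1M output tokens
OUTPUT_TOKENS_PER_SEC = 112   # approximate API output speed
TOTAL_PARAMS_B = 753
ACTIVE_PARAMS_B = 40


def request_cost(input_tokens, output_tokens):
    """Estimated API cost in USD for one request."""
    return (input_tokens / 1e6) * INPUT_PRICE_PER_M + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M


def generation_seconds(output_tokens):
    """Estimated time to stream the output at the quoted speed."""
    return output_tokens / OUTPUT_TOKENS_PER_SEC


def active_fraction():
    """Share of the total parameters that is active for each token."""
    return ACTIVE_PARAMS_B / TOTAL_PARAMS_B


if __name__ == "__main__":
    # Hypothetical request: a 200k-token prompt with a 4k-token answer.
    print(f"cost: ${request_cost(200_000, 4_000):.4f}")
    print(f"time: {generation_seconds(4_000):.1f} s")
    print(f"active share: {active_fraction():.1%}")
```

Note that for long-context requests the input side dominates the bill even though output tokens cost more per token.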
The MIT license is notable. While many “open-weight” models use custom licenses with usage restrictions, GLM-5.2 ships under one of the most permissive open-source licenses available. That means you can deploy it commercially, fine-tune it, and redistribute it without concerns about acceptable-use clauses, provided the copyright and license notice are preserved.
Architecture: IndexShare for Efficient Long Context
The most technically interesting aspect of GLM-5.2 isn’t its parameter count; it’s how Z.ai made a 1M context window computationally practical. The model builds on the DeepSeek Sparse Attention (DSA) pattern, where a lightweight “lightning indexer” selects the top-k most relevant tokens per query, reducing core attention cost from O(L²) to O(Lk), where L is the sequence length and k is the number of selected tokens.
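The top-k selection idea can be illustrated for a single query in plain Python. This is a toy sketch, not GLM-5.2's kernel: the indexer scores are simply passed in as a list (in the real model a learned indexer produces them), and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def topk_indices(index_scores, k):
    """Positions of the k highest indexer scores, returned in position order."""
    k = min(k, len(index_scores))
    ranked = sorted(range(len(index_scores)), key=lambda i: index_scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, keys, values, index_scores, k):
    """Softmax attention for one query, restricted to the top-k selected tokens."""
    selected = topk_indices(index_scores, k)
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum((w / total) * values[i][d] for w, i in zip(weights, selected))
        for d in range(dim)
    ]
    return selected, output


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    scores = [0.1, 0.9, 0.8, 0.2]  # stand-in for indexer output
    print(sparse_attention([1.0, 0.0], keys, values, scores, k=2))
```

Only `k` keys are touched per query in the attention step, which is where the O(Lk) cost comes from; scoring every token to choose them is a separate cost.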
The problem: the indexer itself still has O(L²) complexity, and it runs independently at every transformer layer. Z.ai observed that the top-k selections produced by neighboring layers are highly similar. Their solution, called IndexShare, partitions layers into groups of four. Only the first layer in each group runs its own indexer; the remaining three simply reuse those indices. This eliminates 75% of indexer computations with negligible quality degradation, delivering a 2.9× reduction in per-token FLOPs at 1M context length.
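The grouping scheme described above can be sketched in a few lines. The group size of four comes from the text; the layer count of 8 in the example is purely illustrative, and the function names are hypothetical rather than taken from Z.ai's code.

```python
def indexer_layers(num_layers, group_size=4):
    """Layers that run their own indexer: the first layer of each group."""
    return [layer for layer in range(num_layers) if layer % group_size == 0]


def index_source(layer, group_size=4):
    """The layer whose top-k indices a given layer reuses."""
    return layer - (layer % group_size)


def indexer_savings(num_layers, group_size=4):
    """Fraction of indexer runs avoided compared with one run per layer."""
    runs = len(indexer_layers(num_layers, group_size))
    return 1.0 - runs / num_layers


if __name__ == "__main__":
    layers = 8  # illustrative only, not GLM-5.2's real depth
    print(indexer_layers(layers))
    print([index_source(layer) for layer in range(layers)])
    print(f"{indexer_savings(layers):.0%} of indexer runs avoided")
```

With groups of four, three of every four layers skip the indexer, which is the 75% figure quoted above. The 2.9× per-token FLOPs reduction is a measured result from Z.ai and is not something this sketch reproduces.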
GLM-5.2 also improves its Multi-Token Prediction (MTP) layer for speculative decoding. By applying IndexShare to the MTP layer as well and introducing KV cache sharing across MTP steps, the acceptance length for speculative decoding increases by up to 20%. This directly translates to faster inference without sacrificing quality.
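For readers new to the term, acceptance length is the number of drafted tokens the main model confirms in one verification pass. The sketch below shows greedy verification as used in generic speculative decoding; it is not GLM-5.2's MTP implementation, and the function names and token IDs are made up for illustration.

```python
def accepted_length(draft_tokens, target_tokens):
    """Number of leading draft tokens that match the main model's own choices."""
    count = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        count += 1
    return count


def mean_accepted_length(steps):
    """Average acceptance length over (draft, target) pairs, one pair per step."""
    lengths = [accepted_length(draft, target) for draft, target in steps]
    return sum(lengths) / len(lengths)


if __name__ == "__main__":
    steps = [
        ([11, 12, 13], [11, 12, 13]),  # all three drafts accepted
        ([21, 22, 23], [21, 99, 23]),  # mismatch at the second token
        ([31, 32, 33], [77, 32, 33]),  # first draft rejected
    ]
    print(mean_accepted_length(steps))
```

A longer average acceptance length means more tokens are emitted per forward pass of the large model, which is why the improvement shows up as faster inference.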
Benchmark Performance
GLM-5.2 is positioned as a model for “long-horizon” tasks — multi-hour coding sessions, complex debugging, and agentic workflows where the model needs to sustain quality over long contexts. The benchmarks reflect this focus.
Coding Benchmarks
On coding tasks, GLM-5.2 is the strongest open-source model tested:
- SWE-bench Pro: 62.1 — ahead of Qwen3.7-Max (60.6) and DeepSeek-V4-Pro (55.4), trailing only Claude Opus 4.8 (69.2)
- Terminal Bench 2.1: 81.0 — a massive jump from GLM-5.1’s 63.5, closing in on Claude Opus 4.8 (85.0)
- DeepSWE: 46.2 — up from 18.0 in GLM-5.1, surpassing DeepSeek-V4-Pro (8.0) and Gemini 3.1 Pro (10.0)
- FrontierSWE: 74.4 — trails Claude Opus 4.8 by just 1 percentage point, ahead of GPT-5.5 (72.6)
Reasoning Benchmarks
- AIME 2026: 99.2 — the highest score among all models listed, including Claude Opus 4.8 (95.7) and GPT-5.5 (98.3)
- GPQA-Diamond: 91.2
- HMMT Feb. 2026: 92.5
- CritPt: 20.9 — matching Claude Opus 4.8, second only to GPT-5.5 (27.1)
Agentic Benchmarks
GLM-5.2 scores 76.8 on MCP-Atlas (Public Set), competitive with Claude Opus 4.8 (77.8) and ahead of DeepSeek-V4-Pro (73.6). On Tool-Decathlon it scores 48.2, though this is an area where Claude Opus 4.8 (59.9) and GPT-5.5 (55.6) still lead significantly.
Important note: Some competitor scores in the benchmark table (marked with asterisks in the original data) represent max reasoning effort modes, not default settings. Direct comparisons should account for the reasoning budget each model was allowed.
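To see the gaps at a glance, the scores quoted above for GLM-5.2 and Claude Opus 4.8 can be tabulated with a short script. It uses only numbers from this article, so the caveat about reasoning effort applies to every row; the helper name is hypothetical.

```python
# (GLM-5.2, Claude Opus 4.8) scores as quoted in this article.
SCORES = {
    "SWE-bench Pro": (62.1, 69.2),
    "Terminal Bench 2.1": (81.0, 85.0),
    "AIME 2026": (99.2, 95.7),
    "CritPt": (20.9, 20.9),
    "MCP-Atlas": (76.8, 77.8),
    "Tool-Decathlon": (48.2, 59.9),
}


def gaps_to_opus(scores=SCORES):
    """Points by which Claude Opus 4.8 leads GLM-5.2 (negative: GLM-5.2 leads)."""
    return {name: round(opus - glm, 1) for name, (glm, opus) in scores.items()}


if __name__ == "__main__":
    for name, gap in sorted(gaps_to_opus().items(), key=lambda item: item[1]):
        print(f"{name:20s} {gap:+.1f}")
```

In this subset, Tool-Decathlon is the only benchmark where the gap exceeds ten points, and AIME 2026 is the only one where GLM-5.2 leads.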
Flexible Effort Levels
GLM-5.2 introduces configurable thinking effort, similar to the low/high/max modes available in Claude and DeepSeek. Users can explicitly trade off reasoning depth against speed and cost. At lower effort levels, GLM-5.2 delivers strong performance with lower latency and token consumption; at the max level, it allocates additional computation for harder problems. According to the Z.ai blog, at comparable token budgets its capability falls between Claude Opus 4.7 and Claude Opus 4.8.
Serving GLM-5.2 Locally
GLM-5.2 has first-class support across the major inference frameworks. Here’s how to get it running with vLLM:
```shell
# Install vLLM (v0.23.0+ required)
pip install vllm

# Start the server; this loads the 753B MoE model.
# A model this size will not fit on a single GPU. vLLM's --tensor-parallel-size
# flag is the usual way to shard it; the right value depends on your hardware.
vllm serve "zai-org/GLM-5.2"

# Query via the OpenAI-compatible API
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-5.2",
    "messages": [
      {
        "role": "user",
        "content": "Explain the IndexShare architecture for sparse attention"
      }
    ]
  }'
```
The model is also supported by SGLang (v0.5.13+), Transformers (v0.5.12+), and KTransformers (v0.5.12+). An FP8 quantized variant is available at zai-org/GLM-5.2-FP8 for reduced memory footprint.
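The same chat request can be issued from Python with only the standard library. This is a minimal sketch that assumes the vLLM server from the commands above is listening on localhost:8000 and returns the usual OpenAI-style response shape; the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, model="zai-org/GLM-5.2", base_url="http://localhost:8000/v1"):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumes the OpenAI-compatible layout: choices[0].message.content
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("Explain the IndexShare architecture for sparse attention"))
```

Splitting request construction from sending keeps the payload easy to inspect or log before anything goes over the network.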
Where GLM-5.2 Fits in the Landscape
GLM-5.2 lands at an interesting moment in the open-weight LLM race. DeepSeek-V4-Pro still holds the title of most-discussed open-weight model, but GLM-5.2 outperforms it on most coding benchmarks (SWE-bench Pro: 62.1 vs 55.4, Terminal Bench 2.1: 81.0 vs 64.0). Qwen3.7-Max remains competitive but is a proprietary API-only model. The real competition for GLM-5.2 is the closed-source frontier, Claude Opus 4.8 and GPT-5.5, where it trails by single-digit margins on many benchmarks while offering full model weight access under MIT terms.
The IndexShare architecture is a meaningful engineering contribution. At 1M context length, most sparse attention implementations become bottlenecked by the indexer itself, not just the core attention computation. By demonstrating that cross-layer index reuse works at production scale without quality loss, Z.ai has produced a technique that other model developers will likely adopt.
Things to Watch
- SWE-Marathon gap: On ultra-long-horizon tasks (building compilers, optimizing kernels), GLM-5.2 scores 13.0 compared to Claude Opus 4.8’s 26.0. The 1M context works well, but extended multi-day engineering tasks still favor the frontier closed models.
- Verbosity: Artificial Analysis rates GLM-5.2 as “somewhat verbose”: it generated 140M output tokens to complete the Intelligence Index, versus an average of 110M, or roughly 27% more. This has practical implications for cost, since you’re paying for more output tokens per task.
- API pricing: At $4.40 per 1M output tokens, GLM-5.2 is on the expensive side for an open-weight model. Running it self-hosted with vLLM or SGLang is likely more cost-effective at scale.
- Text-only for now: Unlike multimodal competitors, GLM-5.2 handles text input and output only. If your use case requires vision or document understanding, you’ll need a different model or a vision pipeline feeding into it.
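The verbosity and pricing points above combine into a simple estimate. The sketch below prices both token volumes at GLM-5.2's output rate to isolate the effect of verbosity; it is not a comparison of actual bills across models, since other models charge different rates. The function names are hypothetical.

```python
OUTPUT_PRICE_PER_M = 4.40   # USD per 1M output tokens, from the pricing above
GLM_TOKENS_M = 140          # output tokens (millions) for the Intelligence Index
AVERAGE_TOKENS_M = 110      # average across models, as quoted above


def output_cost(tokens_millions, price_per_million=OUTPUT_PRICE_PER_M):
    """Output-token cost in USD for a given volume in millions of tokens."""
    return tokens_millions * price_per_million


def verbosity_overhead(tokens_millions=GLM_TOKENS_M, baseline_millions=AVERAGE_TOKENS_M):
    """Extra output volume relative to the baseline, as a fraction."""
    return tokens_millions / baseline_millions - 1.0


if __name__ == "__main__":
    print(f"GLM-5.2 volume:  ${output_cost(GLM_TOKENS_M):.2f}")
    print(f"average volume:  ${output_cost(AVERAGE_TOKENS_M):.2f}")
    print(f"extra output:    {verbosity_overhead():.1%}")
```

At a fixed price per token, the extra volume translates directly into the same percentage of extra output spend.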
Getting Started
If you want to try GLM-5.2 without self-hosting, the Z.ai chat interface offers a playground. For API access, the Z.ai API platform provides OpenAI-compatible endpoints. The model weights are on HuggingFace under the MIT license, and the GitHub repository contains additional resources and deployment examples. The full technical details are available in the GLM-5 technical report and the IndexShare paper.