This Week in AI: GLM-5.1, Gemma 4, and the Self-Evolving MiniMax-M2.7

This week marks one of the most consequential stretches in open-weight AI we’ve seen in months. From a 754-billion-parameter agentic powerhouse to Google’s latest Gemma family and a model that improved its own tooling by 30% with no human in the loop, the landscape is shifting fast. I’ve been tracking these releases closely, and today I want to walk you through the five most impactful model launches of the past seven days: what they do, how they benchmark, and why they matter for your day-to-day engineering work.

1. GLM-5.1: Z.ai’s 754B Agentic Flagship

The biggest story this week is GLM-5.1 from Z.ai — a 754-billion-parameter Mixture-of-Experts model released under the permissive MIT license. What makes GLM-5.1 remarkable isn’t just its size; it’s the philosophy behind it. While most frontier models plateau after an initial burst of problem-solving, GLM-5.1 is explicitly designed to sustain effectiveness over long agentic horizons.

The model achieves state-of-the-art performance on SWE-Bench Pro and leads its predecessor GLM-5 by significant margins on NL2Repo (full repository generation) and Terminal-Bench 2.0 (real-world terminal tasks). On AIME 2026, it scores 95.3 — competitive with Claude Opus 4.6 (95.6) and not far behind Gemini 3.1 Pro (98.2) and GPT-5.4 (98.7). On the harder HLE benchmark with tools, it reaches 52.3, essentially matching the proprietary frontier.

The key innovation is what Z.ai calls “long-horizon agentic reasoning.” The model breaks down ambiguous problems, runs experiments, reads results, identifies blockers, revisits its reasoning, and revises its strategy through repeated iteration. It sustains optimization over hundreds of rounds and thousands of tool calls. For anyone building AI-powered development tools, this is a model worth watching closely.
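
To make that loop concrete, here’s a minimal, runnable sketch of the plan-experiment-revise cycle. Every helper here is a toy stand-in I’ve invented for illustration; Z.ai hasn’t published GLM-5.1’s actual agent scaffold.

```python
# Toy sketch of a long-horizon agentic loop. The "experiment" and the
# revision heuristic are invented stand-ins, not Z.ai's real scaffold.

def run_experiment(plan):
    # Stand-in for executing code or tools and reading the results:
    # here the task counts as solved once the plan has enough steps.
    solved = len(plan) >= 4
    return {"solved": solved, "blocker": None if solved else "missing step"}

def revise_plan(plan, blocker):
    # Stand-in for the model revisiting its reasoning after a blocker.
    return plan + [f"address: {blocker}"]

def solve(task, max_rounds=100):
    plan = [f"decompose: {task}"]    # break the ambiguous problem down
    for _ in range(max_rounds):      # sustained iteration, not one shot
        result = run_experiment(plan)
        if result["solved"]:
            return plan
        plan = revise_plan(plan, result["blocker"])
    return plan

print(solve("fix failing integration test"))
```

The point isn’t the toy logic; it’s the control flow. A model built for long horizons has to stay productive inside that loop for hundreds of rounds instead of degrading after the first few.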

2. Google Gemma 4: Open Multimodal at Every Scale

Google released the Gemma 4 family this week, and it’s a serious upgrade. The lineup includes four sizes — E2B, E4B, 26B-A4B (Mixture-of-Experts), and 31B (Dense) — all released under the Apache 2.0 license. Every model in the family is multimodal, processing text and images natively, with the smaller E2B and E4B models also supporting video and audio input.

The architectural highlights are worth noting for anyone deploying models in production:

  • Hybrid attention: Interleaves local sliding-window attention with full global attention, keeping the final layer always global (sketched after this list). This keeps attention compute and KV-cache memory close to a lightweight model’s without sacrificing long-context awareness.
  • 256K context window on the medium models (31B and 26B-A4B), 128K on the small ones, which is generous by open-weight standards.
  • Configurable thinking modes for reasoning tasks across all sizes.
  • Native function calling and system prompt support — critical for agentic workflows.
  • Multilingual support in over 140 languages out of the box.
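
To picture the hybrid attention layout, here’s a small sketch of an interleaved schedule. The layer count and the 5:1 local-to-global ratio are my assumptions for illustration; the only detail taken from Google’s description is that the final layer is always global.

```python
# Sketch of an interleaved local/global attention schedule. The layer
# count (48) and 5:1 ratio are assumptions, not published Gemma 4
# internals; the one stated constraint is a global final layer.

def attention_schedule(num_layers=48, local_per_global=5):
    kinds = []
    for i in range(num_layers):
        if (i + 1) % (local_per_global + 1) == 0:
            kinds.append("global")    # full attention over the entire context
        else:
            kinds.append("sliding")   # cheap local sliding-window attention
    kinds[-1] = "global"              # enforce: last layer is always global
    return kinds

schedule = attention_schedule()
print(schedule[:6], "...", schedule[-1])   # five sliding layers, then a global one
```

The payoff of a schedule like this is that most layers only attend within a window (cheap and cache-friendly), while the periodic global layers and the final global layer keep information flowing across the full 256K context.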

The 31B dense model has already accumulated over 2.24 million downloads on Hugging Face. For local deployment, the E2B and E4B are specifically optimized for on-device execution on phones and laptops. If you’re building anything that needs to run models at the edge, this is your new baseline.

3. MiniMax-M2.7: The Self-Evolving Model

MiniMax released M2.7, a 229B-parameter model with a genuinely novel angle: MiniMax bills it as the first model to participate deeply in its own evolution. During development, M2.7 was allowed to update its own memory, build complex skills for RL experiments, and improve its own learning process based on experiment results.

The results are striking. An internal version of M2.7 autonomously optimized a programming scaffold over 100+ rounds — analyzing failure trajectories, modifying code, running evaluations, and deciding to keep or revert — achieving a 30% performance improvement without human intervention. On MLE-Bench Lite (22 ML competitions), M2.7 achieved a 66.6% medal rate, second only to Claude Opus 4.6 and GPT-5.4.
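
Here’s a toy version of that keep-or-revert loop. The scaffold representation and the scoring function are invented stand-ins, since MiniMax hasn’t published the internal harness.

```python
import random

# Toy keep-or-revert loop in the spirit of M2.7's scaffold optimization:
# mutate, evaluate, keep only improvements. The scaffold and scorer are
# invented stand-ins, not MiniMax's actual setup.

def score(scaffold):
    # Stand-in for running the evaluation suite over failure trajectories.
    target = [0.8, 0.2, 0.5]
    return -sum((s - t) ** 2 for s, t in zip(scaffold, target))

random.seed(0)
scaffold = [0.0, 0.0, 0.0]
best = score(scaffold)
for _ in range(100):                              # "100+ rounds"
    candidate = [s + random.gauss(0, 0.1) for s in scaffold]
    new = score(candidate)
    if new > best:                                # keep the modification...
        scaffold, best = candidate, new
    # ...otherwise revert: the previous scaffold stays in place

print(f"best score after 100 rounds: {best:.4f}")
```

What M2.7 adds on top of this skeleton is the interesting part: the candidate mutations are code changes the model writes itself after reading its own failure trajectories.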

For software engineering specifically, M2.7 scored 56.22% on SWE-Bench Pro (matching GPT-5.3-Codex), 76.5% on SWE Multilingual, and 57.0% on Terminal-Bench 2.0. MiniMax claims it has cut live production incident recovery time to under three minutes using this model. M2.7 also introduces Agent Teams: native multi-agent collaboration with stable role identity and autonomous decision-making.

4. Qwen3.5-27B-Claude-Opus-Reasoning-Distilled: Community Fine-Tuning Done Right

Sometimes the most interesting models come from the community. This fine-tune of Qwen3.5-27B distills reasoning patterns from Claude Opus 4.6 into a smaller, more deployable package. With over 578,000 downloads and 2,600 likes on Hugging Face, it’s clearly resonated with developers.

What makes it notable is the approach: the creator used Unsloth for efficient fine-tuning and published the complete training notebook, codebase, and a comprehensive PDF guide. The fine-tune also ships a repaired chat template: the official Qwen template crashed because its Jinja logic didn’t handle the “developer” role (commonly sent by coding agents like Claude Code and OpenCode), and with the fix in place agents can run continuously for over nine minutes without interruption.
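
If you hit the same crash with a model whose template hasn’t been patched yet, one generic workaround (not necessarily this fine-tune’s exact fix) is to fold “developer” messages into a role the template accepts before rendering. A minimal sketch, with a placeholder repo ID:

```python
from transformers import AutoTokenizer

# Generic workaround for chat templates that choke on the "developer"
# role: remap it to "system" before rendering. This is a sketch of the
# idea, not the fine-tune's actual template fix; the repo ID below is
# a placeholder.

tok = AutoTokenizer.from_pretrained(
    "your-org/Qwen3.5-27B-Claude-Opus-Reasoning-Distilled"
)

messages = [
    {"role": "developer", "content": "Always answer with a unified diff."},
    {"role": "user", "content": "Rename foo() to bar() in utils.py."},
]

# Fold unsupported "developer" messages into "system" ones.
safe = [{**m, "role": "system"} if m["role"] == "developer" else m
        for m in messages]

prompt = tok.apply_chat_template(safe, tokenize=False, add_generation_prompt=True)
print(prompt)
```

The fine-tune bakes the equivalent fix into the template itself, so agents talking to it don’t need a client-side workaround.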

This is a pattern I expect to see more of: taking a strong base model, distilling reasoning capabilities from a frontier model, and releasing both the weights and the methodology. It’s practical, reproducible, and genuinely useful for developers who need strong reasoning in a smaller footprint.

5. Netflix VOID: Video Object and Interaction Deletion

Rounding out the list is something different: VOID from Netflix. It’s a video-to-video model built on CogVideoX (5B parameters) that removes objects from videos — but with a twist. It doesn’t just paint over the object; it handles physical interactions. Remove a person who was holding a cup, and the cup falls. Remove a support beam, and objects resting on it settle naturally.

The technique uses a “quadmask” conditioning approach with four values encoding the primary object to remove, overlap regions, affected regions (falling objects, displaced items), and background to keep. Released under Apache 2.0 with a full GitHub repo and Colab notebook, it requires an A100 (40GB+ VRAM) for inference. For anyone working in video post-production, VFX, or content moderation, this is a production-ready tool.
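
To make the conditioning concrete, here’s a sketch of a quadmask for a single frame. The four categories come straight from the description above, but the integer codes and the toy geometry are my assumptions; check the VOID repo for the real encoding.

```python
import numpy as np

# Sketch of a per-pixel quadmask for one frame. The four categories come
# from VOID's description; the integer codes (0-3) and the toy geometry
# are assumptions, not the repo's actual encoding.

KEEP, REMOVE, OVERLAP, AFFECTED = 0, 1, 2, 3

h, w = 480, 720
quadmask = np.full((h, w), KEEP, dtype=np.uint8)   # background to keep

quadmask[100:300, 200:350] = REMOVE    # the person to delete
quadmask[260:300, 320:350] = OVERLAP   # pixels shared by the person and the cup
quadmask[300:420, 300:380] = AFFECTED  # where the cup may fall once unsupported

# One such mask per frame conditions the video-to-video model.
counts = {name: int((quadmask == code).sum())
          for name, code in [("keep", KEEP), ("remove", REMOVE),
                             ("overlap", OVERLAP), ("affected", AFFECTED)]}
print(counts)
```

The “affected” region is what separates this from plain inpainting: it marks pixels the model must re-synthesize with plausible physics rather than simply paint over.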

What This Means for Engineers

Three trends stand out from this week’s releases:

  1. Agentic capability is table stakes. GLM-5.1, MiniMax-M2.7, and Gemma 4 all emphasize long-horizon reasoning, tool use, and autonomous problem-solving. If you’re evaluating models for production agents, these benchmarks (SWE-Bench Pro, Terminal-Bench, NL2Repo) are the ones to watch.
  2. The open-weight gap is closing fast. Look at the HLE-with-tools scores: GLM-5.1’s 52.3 essentially matches Claude Opus 4.6 (53.1), Gemini 3.1 Pro (51.4), and GPT-5.4 (52.1). Six months ago, that gap was enormous. Today it’s within measurement noise.
  3. Distillation is democratizing frontier reasoning. The Qwen3.5-Opus distill shows you can take reasoning patterns from an API model priced at $150 per million tokens and bake them into a model you can run locally. That changes the economics of AI-powered tooling dramatically.

Next week should bring even more movement — the pace of releases is accelerating, not slowing down. If you’re building AI-powered products, now is the time to establish your evaluation pipelines and benchmarking infrastructure. The models are moving too fast to rely on intuition alone.
