This week in AI has been a landmark for open-weight models. Google dropped Gemma 4, a family of multimodal models that run from phones to servers, all under the permissive Apache 2.0 license. Within days, Z.ai unveiled GLM-5.1, a 754B-parameter Mixture-of-Experts model designed specifically for agentic engineering—and released it under MIT. Together, these two releases signal a shift: the most capable AI models are no longer locked behind APIs.
If you’ve been waiting for the right moment to deploy open-weight models in production, this is it. Let’s break down what’s new, how these models perform, and how you can start running them today.
Google Gemma 4: Open-Weight Multimodal at Every Scale
Google DeepMind released the Gemma 4 family on April 3rd, and it’s already racked up over 1.3 million downloads on Hugging Face. The headline: four model sizes, all multimodal, all open-weight, all Apache 2.0.
The Model Lineup
- Gemma 4 E2B — 2.3B effective parameters (5.1B with embeddings). Text, image, video, and audio input. 128K context window. Designed for phones and edge devices.
- Gemma 4 E4B — 4.5B effective parameters (8B with embeddings). Same multimodal coverage as E2B, with more headroom. Runs on laptops.
- Gemma 4 26B-A4B — 27B total parameters, Mixture-of-Experts architecture activating only 4B per token. Text and image input. 256K context window. Sweet spot for consumer GPUs.
- Gemma 4 31B — 30.7B parameters, dense architecture. Text and image input. 256K context window. The flagship for workstations and servers.
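To get a quick sense of what fits where, here is a rough picker based on bf16 weight size (2 bytes per parameter, using the total parameter counts from the lineup above). This is a back-of-envelope sketch: it ignores runtime overhead like KV cache and activations, and quantized builds shrink these numbers considerably.

```python
# Rough bf16 weight sizes for the Gemma 4 lineup (total params, in billions).
LINEUP = {
    "gemma4:e2b": 5.1,      # effective 2.3B; 5.1B with embeddings
    "gemma4:e4b": 8.0,      # effective 4.5B; 8B with embeddings
    "gemma4:26b-a4b": 27.0, # MoE: 27B total, 4B active per token
    "gemma4:31b": 30.7,     # dense flagship
}

def pick_model(vram_gb: float):
    """Return the largest variant whose bf16 weights (2 bytes/param) fit."""
    fits = {m: p * 2 for m, p in LINEUP.items() if p * 2 <= vram_gb}
    return max(fits, key=fits.get) if fits else None

print(pick_model(24))  # → gemma4:e4b (16 GB of weights; the 27B MoE needs 54 GB)
```

Note that the MoE model still loads all 27B parameters even though only 4B activate per token; sparsity buys you compute, not memory.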
What Makes Gemma 4 Different
Three things set Gemma 4 apart from previous open-weight releases:
1. Hybrid Attention Architecture. Gemma 4 interleaves local sliding-window attention with full global attention across its layers, and the final layer always uses global attention. This gives you the speed and low memory footprint of a small model without sacrificing deep context understanding—critical for those 256K-token windows.
2. Native System Prompts and Function Calling. Gemma 4 introduces native system role support and built-in function calling. This isn’t an afterthought—it’s baked into the training. You can build agentic workflows that call tools, parse structured outputs, and maintain multi-turn conversations without prompt engineering hacks.
3. Configurable Thinking Modes. All models in the family support configurable reasoning/thinking modes. You can toggle deeper reasoning on or off depending on your latency budget, which is a game-changer for production deployments where you need to balance quality against response time.
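Points 2 and 3 can be sketched together. The tool-call JSON shape and the `enable_thinking` option name below are illustrative assumptions, not Gemma 4's documented interface; the dispatch pattern is the part that carries over:

```python
import json

# A tool the model is allowed to call, plus a registry for dispatch.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

# Hypothetical per-request options: thinking off for a tight latency budget
# (the option name is an assumption, not a documented flag).
options = {"enable_thinking": False, "max_new_tokens": 512}

# Suppose the model emits a structured tool call like this:
model_output = '{"tool": "get_weather", "arguments": {"city": "Berlin"}}'
call = json.loads(model_output)
result = TOOLS[call["tool"]](**call["arguments"])
print(result)  # → Sunny in Berlin
```

In a real agentic loop you would append `result` as a tool message and let the model continue the turn.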
Running Gemma 4 Locally
Thanks to Ollama’s recent MLX integration for Apple Silicon (previewed March 30th), running Gemma 4 on Mac hardware is straightforward:
```bash
# Pull and run the MoE model (activates only 4B params)
ollama run gemma4:26b-a4b

# Or run the dense 31B flagship
ollama run gemma4:31b

# For edge devices, the E4B model
ollama run gemma4:e4b
```
For Python deployments using Hugging Face Transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "google/gemma-4-31B-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Explain the sliding window attention pattern."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn header before generating
    return_tensors="pt",
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
GLM-5.1: The Open-Weight Model Built for Agentic Engineering
Released on April 9th by Z.ai, GLM-5.1 is a 754B-parameter Mixture-of-Experts model with a clear mission: dominate at agentic coding tasks. And it delivers—achieving state-of-the-art results on SWE-Bench Pro and outperforming its predecessor GLM-5 by wide margins on NL2Repo (repository generation) and Terminal-Bench 2.0 (real-world terminal tasks).
The Long-Horizon Advantage
What makes GLM-5.1 special isn’t just first-pass accuracy. Most AI models—including strong ones—plateau quickly. They apply familiar techniques, get some initial gains, then stall. Giving them more time doesn’t help.
GLM-5.1 is explicitly designed to break that pattern. It stays effective over hundreds of rounds and thousands of tool calls. The model breaks down complex problems, runs experiments, reads results, identifies blockers, revisits its reasoning, and revises its strategy. The longer it runs, the better the result.
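That loop structure can be sketched in a few lines. Here `call_model` and the single `search` tool are stand-in stubs, not GLM-5.1's actual interface; the point is the shape of the loop: act through a tool, read the observation, revise, repeat.

```python
# Minimal long-horizon agent loop: act via tools until the model finishes.
def run_agent(task, call_model, tools, max_rounds=200):
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_rounds):
        step = call_model(transcript)  # dict: {"action": ..., "args"/"result": ...}
        if step["action"] == "finish":
            return step["result"]
        observation = tools[step["action"]](**step.get("args", {}))
        transcript.append({"role": "tool", "content": str(observation)})
    return None  # round budget exhausted

# Toy stand-in model: searches once, reads the result, then finishes.
def fake_model(transcript):
    if any(m["role"] == "tool" for m in transcript):
        return {"action": "finish", "result": "done"}
    return {"action": "search", "args": {"query": "blocker"}}

print(run_agent("fix the bug", fake_model, {"search": lambda query: ["hit"]}))  # → done
```

A production loop would add error handling, tool-output truncation, and a token budget, but the revise-and-retry cycle is the core of what "long-horizon" means here.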
Benchmark Performance
Against frontier closed-source models, GLM-5.1 holds its own:
- AIME 2026: 95.3 (vs. GPT-5.4 at 98.7, Claude Opus 4.6 at 95.6)
- HLE with Tools: 52.3 (vs. Claude Opus 4.6 at 53.1, GPT-5.4 at 52.1)
- SWE-Bench Verified: Competitive with Claude Opus 4.6
These numbers from an open-weight, MIT-licensed model are remarkable. For coding-specific tasks, GLM-5.1 is in the same conversation as the best closed-source models—and you can run it on your own infrastructure.
Deploying GLM-5.1
Given its 754B parameter count, GLM-5.1 requires significant hardware for local inference. Here’s how to approach it:
```bash
# Use the Z.ai API (simplest approach)
curl https://api.z.ai/v1/chat/completions \
  -H "Authorization: Bearer $ZAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.1",
    "messages": [
      {"role": "user", "content": "Refactor this function to be more idiomatic Go"}
    ]
  }'
```
For self-hosted deployment, use vLLM with tensor parallelism across multiple GPUs:
```bash
# 754B bf16 weights are roughly 1.5 TB, so this needs a multi-node cluster
# (e.g. four 8x80GB nodes); adjust parallelism to your hardware
python -m vllm.entrypoints.openai.api_server \
  --model zai-org/GLM-5.1 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 4 \
  --max-model-len 32768 \
  --dtype bfloat16
```
The Local Deployment Landscape in April 2026
These releases don’t exist in a vacuum. The tooling ecosystem for running open-weight models has matured dramatically:
- Ollama + MLX — Apple Silicon users can now run models with MLX acceleration, delivering significant speedups over the previous llama.cpp backend. Gemma 4’s smaller models run comfortably on M-series Macs.
- vLLM — Still the go-to for production GPU deployments. Supports tensor parallelism, quantization (GPTQ, AWQ, GGUF), and OpenAI-compatible serving.
- llama.cpp — The universal fallback. Gemma 4’s GGUF quantizations are already available, letting you run the 26B-A4B model on as little as 8GB VRAM.
- Hugging Face Transformers — Native support for both Gemma 4 and GLM-5.1 out of the box, with optimized attention implementations.
Other Notable Releases This Week
- Netflix Void Model — Netflix open-sourced a video-to-video model on Hugging Face, targeting content transformation and style transfer workflows.
- VoxCPM2 by OpenBMB — A new text-to-speech model gaining traction, with high-quality multilingual voice synthesis.
- OmniVoice by k2-fsa — Another strong TTS entry with 200K+ downloads in its first week, designed for real-time voice applications.
- Qwen3.5-27B Reasoning Distillations — Community fine-tunes distilling reasoning capabilities from Claude 4.6 Opus into the Qwen3.5 architecture continue to trend, showing the power of open-weight model composability.
What This Means for Engineers
The practical takeaway is straightforward: open-weight models have caught up to closed-source alternatives for most production use cases. Here’s my recommendation matrix:
- Edge/mobile deployment: Gemma 4 E2B or E4B. Apache 2.0, multimodal, 128K context. A no-brainer.
- Consumer GPU coding assistant: Gemma 4 26B-A4B. The MoE architecture means you’re only activating 4B parameters per token—fast inference with frontier-level quality.
- Production API replacement: GLM-5.1 via Z.ai’s API or self-hosted. MIT license, state-of-the-art agentic coding, competitive with Claude and GPT.
- Workstation-class general purpose: Gemma 4 31B dense. The most capable all-rounder in the family with 256K context.
Getting Started
The fastest path to trying these models:
```bash
# 1. Install Ollama (if you haven't)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull Gemma 4
ollama pull gemma4:26b-a4b

# 3. Start chatting
ollama run gemma4:26b-a4b "Write a Go HTTP handler that validates JSON input"

# 4. For GLM-5.1, use the Z.ai API
pip install openai
# Then set ZAI_API_KEY in your environment
```
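Once the key is set, the request mirrors the earlier curl call. This sketch builds it with the standard library so nothing is sent until you uncomment the last lines; the endpoint is assumed to be OpenAI-compatible, as the curl example suggests.

```python
import json
import os
import urllib.request

payload = {
    "model": "glm-5.1",
    "messages": [{"role": "user", "content": "Say hello in Go"}],
}
req = urllib.request.Request(
    "https://api.z.ai/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {os.environ.get('ZAI_API_KEY', '')}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```

The `openai` package wraps exactly this request; point its `base_url` at the Z.ai endpoint and the rest of the client API works unchanged.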
Both models are available now on Hugging Face with full weights, model cards, and evaluation results. The era of choosing between capability and openness is over—you can have both.
Key Takeaways
- Google Gemma 4 brings multimodal capabilities (text, image, audio, video) to open-weight models at every scale, from 2B to 31B parameters, all Apache 2.0 licensed.
- GLM-5.1 is the first open-weight model explicitly optimized for long-horizon agentic engineering, achieving state-of-the-art on SWE-Bench Pro under an MIT license.
- The deployment tooling (Ollama MLX, vLLM, llama.cpp) has caught up—you can run these models locally with minimal setup.
- Apache 2.0 and MIT licensing means no restrictions on commercial use. These are production-ready models.
Have you tried either model yet? I’d love to hear about your deployment experiences—drop a comment below or reach out on X.