The open-weight AI landscape just had one of its most significant weeks in recent memory. Three major model families dropped updates within days of each other — DeepSeek V4, Kimi K2.6, and Qwen 3.6 — each pushing the boundaries of what’s possible with models you can actually download and run yourself. Here’s what you need to know.
DeepSeek V4: Million-Token Context Goes Open Source
DeepSeek released two Mixture-of-Experts models under the V4 banner, both under the MIT license, one of the most permissive licenses available. The headline model, DeepSeek-V4-Pro, packs 1.6T total parameters with 49B activated per token (roughly 3%), while DeepSeek-V4-Flash comes in at 284B total with just 13B activated.
The real story here is the architecture. DeepSeek V4 introduces a hybrid attention mechanism combining Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA). This dramatically reduces the compute cost of long-context processing — at 1M tokens, V4-Pro uses only a fraction of the KV cache compared to V3.2. Both models support a one million token context window natively, putting them in the same conversation as Gemini’s long-context capabilities — but fully open.
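To see why compressing the KV cache matters at this scale, here is a back-of-envelope estimate of what an *uncompressed* standard-attention cache costs at 1M tokens. The layer count, KV head count, and head dimension below are illustrative assumptions, not published DeepSeek V4 numbers:

```python
# Back-of-envelope KV-cache size for a standard-attention decoder.
# Dimensions are illustrative assumptions, not DeepSeek V4's actual config.

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2 for keys + values; bytes_per_elem=2 assumes FP16/BF16.
    return tokens * layers * kv_heads * head_dim * 2 * bytes_per_elem

full = kv_cache_bytes(tokens=1_000_000, layers=60, kv_heads=8, head_dim=128)
print(f"uncompressed KV cache at 1M tokens: {full / 2**30:.1f} GiB")

# A scheme that keeps, say, 1/8 of the KV entries would instead need:
print(f"with 8x compression:                {full / 8 / 2**30:.1f} GiB")
```

Even with these modest assumptions, a raw million-token cache lands in the hundreds of GiB, which is why hybrid compressed-attention schemes like CSA/HCA are the enabling ingredient rather than an optimization detail.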
The architecture also incorporates multi-head residual connections (mHC) for more stable training across deep layers, and the team used the Muon optimizer for faster convergence. Pre-training was done on 32T tokens, followed by a novel post-training pipeline that cultivates domain-specific experts independently before consolidating them through on-policy distillation.
For the Pro model’s maximum reasoning mode (V4-Pro-Max), DeepSeek claims it’s the best open-source model available, with top-tier coding benchmarks and performance approaching leading closed-source models on reasoning and agentic tasks.
Kimi K2.6: The Agentic MoE Model
Moonshot AI’s Kimi K2.6 takes a different approach — it’s a native multimodal agentic model built from the ground up for long-horizon tasks. With 1T total parameters and 32B activated, it’s designed for practical autonomous work rather than just chat completions.
The key differentiators:
- Long-Horizon Coding: Robust performance across Rust, Go, and Python, spanning frontend, DevOps, and performance optimization tasks.
- Coding-Driven Design: Can transform prompts and visual inputs into production-ready interfaces with structured layouts and animations.
- Agent Swarm: Scales to 300 sub-agents executing 4,000 coordinated steps, decomposing tasks into parallel domain-specific workflows.
- Proactive Orchestration: Powers 24/7 background agents that manage schedules, execute code, and orchestrate cross-platform operations autonomously.
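The swarm pattern above boils down to a planner that decomposes a task into domain-specific subtasks and fans them out to parallel sub-agents. This is a minimal sketch of that orchestration shape only, not Kimi K2.6's actual implementation; `run_subagent` is a stand-in for a real model call:

```python
# Minimal sketch of swarm-style decomposition: split a task into
# domain-specific subtasks and run the "sub-agents" in parallel.
# Not Kimi K2.6's implementation; run_subagent is a placeholder.
from concurrent.futures import ThreadPoolExecutor

def run_subagent(subtask: str) -> str:
    # A real system would invoke a model with its own tool loop here.
    return f"done: {subtask}"

def orchestrate(task: str, domains: list[str]) -> list[str]:
    subtasks = [f"{task} [{d}]" for d in domains]
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        return list(pool.map(run_subagent, subtasks))

results = orchestrate("migrate service to Rust", ["frontend", "devops", "perf"])
print(results)
```

A production swarm adds result aggregation, retries, and inter-agent messaging on top of this fan-out skeleton.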
Kimi K2.6 is already the second most downloaded model on Hugging Face this week with 376k downloads, reflecting strong developer interest in agentic AI capabilities. The model is released under a modified MIT license and ships with compressed-tensors support for efficient deployment.
Qwen 3.6: Small Model, Big Architecture
Alibaba’s Qwen team released Qwen 3.6-35B-A3B, and the numbers tell an interesting story: 35B total parameters with only 3B activated per token. That’s an incredibly efficient MoE ratio, making it the most practical model for local deployment among this week’s releases.
What makes Qwen 3.6 architecturally unique is its Gated DeltaNet attention mechanism. Instead of relying purely on standard attention layers, it uses a hybrid layout: ten blocks, each consisting of three (Gated DeltaNet → MoE) pairs, capped by a single final (Gated Attention → MoE) block. Most layers therefore get linear-attention properties, dramatically reducing memory and compute, while the final block retains full attention for tasks that need precise global recall.
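The layer schedule described above can be written out explicitly. The labels here are descriptive, not actual Qwen module names:

```python
# Sketch of the described hybrid schedule: ten blocks of three
# (Gated DeltaNet -> MoE) pairs, then one final (Gated Attention -> MoE).
# Labels are descriptive, not Qwen's actual module names.

def build_schedule(n_blocks=10, reps_per_block=3):
    schedule = []
    for _ in range(n_blocks):
        for _ in range(reps_per_block):
            schedule.append(("gated_deltanet", "moe"))
    schedule.append(("gated_attention", "moe"))  # full attention at the end
    return schedule

layers = build_schedule()
print(len(layers))  # 31 pairs total
print(f"{sum(k == 'gated_deltanet' for k, _ in layers) / len(layers):.0%} linear")
```

Thirty of the thirty-one pairs use the linear-attention path, which is where the memory savings come from.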
The MoE configuration is equally aggressive: 256 experts with 8 routed + 1 shared activated per token. Despite its small active footprint, it supports 262k token context natively, extensible to over 1M tokens. Licensed under Apache 2.0, it’s already the most downloaded model on Hugging Face with 1.18M downloads.
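A toy version of that routing decision looks like this: each token selects its top-8 of 256 experts by router score, and the shared expert fires unconditionally. Real routers also softmax the selected logits into mixing weights and add load-balancing losses, both omitted here:

```python
# Toy top-k MoE routing: top-8 of 256 routed experts plus one
# always-on shared expert per token. Mixing weights and
# load-balancing losses are omitted for clarity.
import numpy as np

N_EXPERTS, TOP_K = 256, 8

def route(router_logits):
    # Indices of the TOP_K highest-scoring experts for one token.
    routed = np.argsort(router_logits)[-TOP_K:]
    return routed, "shared"  # the shared expert fires for every token

rng = np.random.default_rng(0)
experts, shared = route(rng.standard_normal(N_EXPERTS))
print(sorted(int(e) for e in experts), shared)
```

Only 9 of 257 experts run per token, which is how a 35B-parameter model ends up with a ~3B active footprint.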
Qwen 3.6 also introduces Thinking Preservation — the ability to retain reasoning context from historical messages across a conversation, reducing overhead in iterative development workflows. This is particularly useful for agentic coding sessions where the model needs to maintain its reasoning chain across multiple tool calls.
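Conceptually, thinking preservation just means keeping the reasoning segment in the message history instead of stripping it before the next turn. A minimal sketch, with a message schema that is an assumption rather than Qwen's actual chat template:

```python
# Illustrative sketch of "thinking preservation": keep the model's
# reasoning in the history so later turns can build on it.
# The message schema is an assumption, not Qwen's chat template.

def append_turn(history, answer, thinking=None, preserve_thinking=True):
    turn = {"role": "assistant", "content": answer}
    if thinking and preserve_thinking:
        turn["thinking"] = thinking  # retained for subsequent turns
    history.append(turn)
    return history

history = [{"role": "user", "content": "Refactor the parser."}]
append_turn(history, "Done.", thinking="The grammar is LL(1), so ...")
print(sorted(history[-1]))
```

In an agentic coding loop, this keeps the model from re-deriving the same analysis on every tool call.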
Running These Models Locally
The most practical path for running these models locally is through Ollama, which recently added MLX-powered inference on Apple Silicon for significant speedups. Here’s how to get started with the smaller variants:
```shell
# Install Ollama (macOS)
brew install ollama

# Run Qwen 3.6 (3B active - runs on most hardware)
ollama run qwen3.6:35b-a3b

# Run DeepSeek V4 Flash (13B active - needs ~16GB RAM)
ollama run deepseek-v4-flash

# Check available models
ollama list
```
For the larger models like DeepSeek V4-Pro (49B active) and Kimi K2.6 (32B active), you’ll want at least 32GB of RAM or a dedicated GPU. The DeepSeek V4 models are available in FP8 mixed precision, which helps with memory requirements. GGUF quantized versions are also available through community contributors like Unsloth for even smaller memory footprints.
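Once a model is pulled, you can also hit Ollama's local REST API directly from scripts, using nothing beyond the standard library. This assumes the server is running on its default port and reuses the Qwen tag from the commands above:

```python
# Query a locally running Ollama server via its /api/generate endpoint.
# Assumes `ollama run qwen3.6:35b-a3b` has pulled the model and the
# server is listening on the default port 11434.
import json
import urllib.request

def build_payload(prompt, model="qwen3.6:35b-a3b"):
    # stream=False returns one JSON object instead of newline-delimited chunks
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, model="qwen3.6:35b-a3b", host="http://localhost:11434"):
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

if __name__ == "__main__":
    print(generate("Explain mixture-of-experts in one sentence."))
```

Swap the model tag for `deepseek-v4-flash` to target the other local-friendly release.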
The Bigger Picture: What This Means
This week’s releases highlight three converging trends in open-weight AI:
- Mixture-of-Experts is the new default. Every major release uses MoE to deliver large-model capabilities with small-model compute costs. The 3B active parameter footprint of Qwen 3.6 would have been considered a toy model two years ago — now it’s competitive on real-world benchmarks.
- Million-token context is becoming table stakes. Both DeepSeek V4 and Qwen 3.6 support 1M+ token contexts, with novel attention mechanisms (CSA, HCA, Gated DeltaNet) that make this practical rather than just a marketing number.
- Agentic capabilities are the new frontier. Kimi K2.6’s agent swarm architecture and Qwen 3.6’s thinking preservation both target real-world autonomous coding workflows, not just chat completions.
On the closed-source side, OpenAI made waves by arguing that SWE-bench Verified no longer measures frontier coding capabilities — a tacit acknowledgment that benchmarks are saturating. As models get better, we need better ways to evaluate them. The community will likely need new benchmarks that test multi-file reasoning, long-horizon task completion, and real-world agentic workflows.
Next Steps
If you’re a developer looking to stay current, here’s what I’d recommend:
- Download Qwen 3.6-35B-A3B via Ollama and test it on your coding workflows — it runs on most modern laptops and the 3B active parameter count makes it surprisingly fast.
- Read DeepSeek’s technical report on the V4 hybrid attention mechanism if you’re interested in efficient long-context architectures.
- Experiment with Kimi K2.6’s agentic capabilities if you’re building autonomous workflows — the swarm architecture is particularly interesting for parallel task decomposition.
- Keep an eye on how the community responds to OpenAI’s benchmark critique — new evaluation frameworks are likely coming.
The pace of open-weight model releases continues to accelerate. The gap between open and closed-source models is narrowing faster than most predicted, and this week’s releases are a clear signal that the best AI models may soon be the ones you can download, inspect, and run on your own hardware.
Sources
- DeepSeek-V4-Pro Model Card — Hugging Face
- Kimi K2.6 Model Card — Hugging Face
- Qwen 3.6-35B-A3B Model Card — Hugging Face
- Ollama is now powered by MLX on Apple Silicon in preview — Ollama Blog (March 30, 2026)
- Why SWE-bench Verified no longer measures frontier coding capabilities — OpenAI (April 26, 2026)