We skipped two weeks. The leaderboard didn’t wait. Kimi K2.6 crashed onto BenchLM at #2 overall. DeepSeek V4-Flash got its community benchmarks and the news isn’t great for single-GPU setups. Mistral shipped a 128B dense model that nobody with consumer hardware can run. And through all of it, Qwen 3.6 quietly held every position that actually matters on the hardware most people own.
Here’s what changed, what didn’t, and what it means for running models at home.
Kimi K2.6: Open Weights, Closed Hardware Requirements
Moonshot AI’s Kimi K2.6 is the biggest new entry. A trillion-parameter MoE with 32 billion active per token, open weights, 256K context window. It scores 84 on BenchLM, slotting in at #2 behind DeepSeek V4-Pro. On Artificial Analysis it hits an Intelligence Index of 54—well above the open-weight median of 30.
The coding numbers are the headline. 80.2% on SWE-Bench Verified. 58.6% on SWE-Bench Pro—beating GPT-5.4 (xhigh) at 57.7%. An Elo of 1520 on GDPval-AA, up from K2.5’s 1309. For agentic coding, this is the best open-weight model available right now.
The catch: 1T total parameters with 32B active. Even with aggressive quantization, you’re looking at multi-GPU server territory. A Q4 quantization still puts you around 500GB+ for the full model weights. The MoE architecture means all expert weights need to be in memory, not just the active 32B. This isn’t running on your RTX 4090.
Cerebras is already offering hosted inference, which tells you something about the target deployment. For this column, K2.6 is important as a benchmark reference—it shows where open weights can reach—but it’s not a consumer hardware model. Not yet.
DeepSeek V4-Flash: The Verdict Is In
Three weeks ago, V4-Flash was our wildcard. Now the community quantizations and benchmarks have arrived, and the picture is clearer.
The practical minimum for V4-Flash is around 33GB at heavy quantization—that’s two RTX 4090s or one RTX 6000 Ada. A reasonable INT4 build needs roughly four RTX 4090s. Q8 quantization—where quality stays within 1-2 points of full precision—requires 42-46GB.
For single-GPU consumer setups, V4-Flash remains out of reach. The 13B active parameters are efficient for inference compute, but the full 284B parameter set still needs to live in memory. MoE architecture continues to be the fundamental tension in local AI: great for compute efficiency, terrible for memory efficiency.
V4-Flash drops from wildcard status to “cloud or multi-GPU only” in our rankings.
Mistral Medium 3.5: Impressive but Irrelevant (for Us)
Mistral shipped Medium 3.5 on April 29—a 128B dense model. It scores 77.6% on SWE-Bench Verified. Strong EU-friendly licensing. Good coding performance.
But 128B dense means ~256GB in BF16. Even at INT4 quantization you need roughly 64GB minimum, and real-world deployment wants four H100 80GB GPUs. Consumer GPUs need not apply. Moving on.
Qwen 3.6: Still the Consumer GPU Champion
While the big models grabbed headlines, Qwen 3.6 kept doing what matters most for this column: running well on hardware people actually own.
The 35B-A3B MoE remains the speed story. On an RTX 3090 (a used card you can find for ~$800), it generates at 112 tokens per second with the full 262K context window. On an RTX 5060 Ti (16GB), it runs at 47-51 tok/s at 160K context. Only 3B parameters active per token means your GPU does less work while the model punches far above its apparent weight class.
The 27B dense model still leads for raw coding quality among models that fit on a single 24GB card. SWE-Bench Verified 77.2%, Terminal-Bench and SkillsBench advantages over the MoE sibling. Slower at ~30-40 tok/s on an RTX 4090, but when you need the best possible output for a coding agent, that trade-off makes sense.
Three weeks later, the Qwen 3.6 independent benchmarking question from week 13 has mostly resolved. Community usage has been extensive enough that the model’s quality is well-established, even if formal third-party SWE-Bench reproduction with standardized scaffolding is still patchy. People are shipping code with these models daily. The numbers are real enough.
Gemma 4: Quietly Holding Ground
Google’s Gemma 4 didn’t get any splashy updates in the past three weeks, but the models have settled into their niches.
The 31B dense model holds its #3 position on Arena AI among open models. AIME 2026 score of 89.2% remains the best in its weight class. On a 24GB card, it needs about 20GB at 4-bit quantization, generating at 30+ tok/s.
The 26B MoE continues to be the 16GB sweet spot—18GB at 4-bit, 40+ tok/s. Apache 2.0 licensing. For non-coding workloads on mid-range hardware, nothing has displaced it.
The E2B and E4B variants deserve a mention for edge deployment. E2B fits in ~1.5GB RAM and runs at 60+ tok/s on edge hardware. If you’re building something that needs to run on a phone or a Raspberry Pi, these are the best options available under a permissive license.
The Consumer GPU Ceiling
A pattern emerged over these three weeks. The biggest leaps in open-weight capability—Kimi K2.6, DeepSeek V4-Pro, Mistral Medium 3.5—all landed above the consumer GPU line. The trillion-parameter MoE models and 128B+ dense models are genuine breakthroughs, but they need server hardware.
Meanwhile, the models that actually run on a 16-24GB card haven’t changed much. Qwen 3.6 and Gemma 4 are iterating, not leaping. That’s not a complaint—these models are genuinely useful for daily work. But the gap between “what open weights can do” and “what open weights can do on your desktop” is widening.
The MoE architecture is the bottleneck. A model with 32B active parameters sounds consumer-friendly until you realize the full 1T parameter set needs to live in VRAM. Sparse activation saves compute but not memory. Until quantization or offloading techniques close this gap, the biggest open-weight models will remain cloud-hosted for most users.
Updated Rankings: What Runs on Your Hardware
| Model | Type | Total / Active Params | Context | Best Benchmark | Speed (RTX 4090, Q4) |
|---|---|---|---|---|---|
| Qwen 3.6 27B | Dense | 27B / 27B | 262K+ | SWE-Bench V. 77.2% | ~30-40 tok/s |
| Gemma 4 31B | Dense | 31B / 31B | 256K | AIME 89.2%, Arena #3 | ~30 tok/s |
| Qwen 3.6 35B-A3B | MoE | 35B / ~3B | 262K+ | SWE-Bench V. 73.4% | ~100-120 tok/s |
| Gemma 4 26B-A4B | MoE | 26B / 3.8B | 256K | Arena #6 | ~40-50 tok/s |
| Nemotron 3 Nano | MoE | 31.6B / ~3.6B | 1M | — | ~50+ tok/s |
| GPT-OSS 20B | MoE | 21B / 3.6B | 131K | — | ~45 tok/s |
| Gemma 4 E4B | Dense | 4B / 4B | 128K | — | ~85 tok/s |
Removed from consumer rankings: DeepSeek V4-Flash (needs 33GB+ minimum, multi-GPU for usable quality), Llama 4 Scout (55GB+ at Q4, doesn’t fit single consumer card).
Not ranked (too large): Kimi K2.6 (1T/32B, server only), DeepSeek V4-Pro (1.6T/49B, server only), Mistral Medium 3.5 (128B dense, server only).
My Picks This Week
Best for coding quality (24GB GPU): Qwen 3.6 27B. Three weeks of heavy community use confirmed what the benchmarks claimed. SWE-Bench 77.2% holds up in practice. Dense architecture keeps deployment simple.
Best for general intelligence (24GB GPU): Gemma 4 31B. Arena AI #3, AIME 89.2%, Apache 2.0. For reasoning, math, and analysis on a single card, this is still the one.
Best for speed (16-24GB GPU): Qwen 3.6 35B-A3B. 100-120 tok/s on a 4090, 47-51 tok/s on a 16GB card. Only 3B active parameters but benchmarks far above what that number suggests. The best ratio of quality-to-latency in the open-weight world.
Best for 16GB cards: Gemma 4 26B MoE. Fits in 18GB at Q4, Apache 2.0, competitive with models twice its size. The default recommendation if you have an RTX 4060 Ti 16GB or RTX 4080.
Best for edge/mobile: Gemma 4 E2B. 1.5GB RAM, 60+ tok/s, Apache 2.0. Nothing else comes close at this size.
What to Watch
Kimi K2.6 quantization experiments. At 1T parameters with 32B active, aggressive quantization could theoretically bring this into dual-4090 territory. Early GGUF quantizations are starting to appear. If someone gets it running at acceptable quality on ~48GB total VRAM, the consumer rankings reshuffle immediately.
Qwen 3.7 rumors. Codersera and others are tracking hints of a Qwen 3.7 release. If Alibaba follows their recent cadence, another iteration focused on the 27B-35B range could arrive within weeks.
Gemma 4 vLLM fix. The FlashAttention compatibility issue we’ve been tracking since week 9 still isn’t fully resolved. llama.cpp remains the pragmatic serving choice for Gemma 4. This matters for anyone building production workflows around these models.
The memory bandwidth problem. Every major new open-weight model is MoE. Every MoE model needs its full parameter count in memory. Consumer GPUs max at 24GB (or 48GB on the 5090). The research community is working on offloading and speculative expert loading, but until those techniques mature, the biggest open-weight models stay behind the consumer line. This is the structural bottleneck to watch.
Three weeks, three trillion-parameter models, and the best thing you can run on your RTX 4090 is still a 27-35B parameter model from Alibaba. That’s not a failure of progress—the frontier is genuinely moving. It’s just moving faster at the top of the hardware stack than at the bottom. For consumer GPU users, the win is that models in the 27-35B range keep getting better at what they do. The ceiling might not be lifting, but the floor is rising.