The open-weight landscape split further this week. Xiaomi’s MiMo-V2-Flash proved you can squeeze 309 billion parameters onto a single RTX 4090. Meanwhile, Zhipu AI’s GLM-5 set new benchmarks that require hardware most of us will never own. And Llama 4 Scout remains the model everyone wants to run locally but can’t.
Here’s what actually changed for consumer hardware users.
MiMo-V2-Flash: The Consumer GPU Breakthrough
Xiaomi released MiMo-V2-Flash with an MIT license and a promise: frontier-class reasoning on consumer hardware. The model packs 309 billion total parameters but activates only 15 billion per token through its MoE architecture.
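A quick sketch of what that sparsity means in practice, using only the 309B-total / 15B-active figures from the release (the "~2 FLOPs per active parameter" rule is a standard rough estimate, not a MiMo-specific number):

```python
# Back-of-envelope MoE arithmetic for MiMo-V2-Flash
# (309B total / 15B active, per the release).
TOTAL_PARAMS = 309e9   # every expert must be stored
ACTIVE_PARAMS = 15e9   # parameters actually touched per token

# Rough rule of thumb: ~2 FLOPs per active parameter per generated token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
flops_per_token = 2 * ACTIVE_PARAMS

print(f"Active fraction per token: {active_fraction:.1%}")   # ~4.9%
print(f"Approx. FLOPs per token:   {flops_per_token:.1e}")   # 3.0e+10
```

Memory footprint is driven by total parameters; per-token compute is driven by active parameters. That asymmetry is the whole MoE bet.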
The benchmark numbers that matter:
- 94.1% on AIME 2025 (mathematical reasoning—within 0.5% of GPT-5 High)
- 89.2% on GPQA Diamond (scientific reasoning)
- 150 tokens/second on optimized infrastructure
But those server-side numbers don’t tell the consumer story. Community testing shows what actually happens on real hardware:
RTX 4090 Performance (24GB VRAM)
| Quantization | VRAM Used | Speed | Quality |
|---|---|---|---|
| Q3_K_S | ~22GB | 20-25 t/s | Good for reasoning |
| IQ3_XS | ~20GB | 25-30 t/s | Slight quality drop |
| INT8 | ~24GB | 18-22 t/s | Best quality fit |
These speeds are fast enough for interactive use. Three architecture choices make this possible:
- Hybrid attention with 128-token sliding windows reduces KV-cache storage by 6x
- Multi-Token Prediction generates multiple tokens per forward pass
- Aggressive MoE sparsity keeps only 15B parameters hot at any time
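The sliding-window effect is easy to see from first principles: KV-cache size scales linearly with the number of cached positions, so capping positions at the window size caps the cache no matter how long the conversation runs. The layer/head/dim values below are illustrative placeholders, not MiMo's actual config, and the overall 6x figure reflects a hybrid mix where only some layers use the window:

```python
# Per-layer KV-cache cost: 2 (keys + values) * layers * kv_heads
# * head_dim * bytes_per_element * cached positions.
# Config values here are ILLUSTRATIVE, not MiMo-V2-Flash's real ones.

def kv_cache_bytes(positions, n_layers=48, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * positions

full = kv_cache_bytes(positions=8192)     # full attention over 8K context
windowed = kv_cache_bytes(positions=128)  # 128-token sliding window
print(f"Full 8K cache:  {full / 1e6:.0f} MB")
print(f"Windowed cache: {windowed / 1e6:.1f} MB")
print(f"Reduction:      {full / windowed:.0f}x per windowed layer")  # 64x
```

All-windowed layers would give a 64x reduction at 8K context; blending windowed and full-attention layers lands at the quoted 6x overall.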
The catch: context length. MiMo-V2-Flash’s aggressive memory optimizations limit practical context to around 8K tokens on consumer GPUs. For longer conversations or document analysis, you’ll need more VRAM or accept significant slowdowns from CPU offloading.
GLM-5: Frontier Performance, Datacenter Requirements
Zhipu AI’s GLM-5 set the new high-water mark for open models in February. The 744B parameter MoE model (40B active) beats Claude Opus 4.5 on Humanity’s Last Exam and holds the #1 spot on SWE-bench Verified at 77.8%.
Notable: the entire model trained on Huawei Ascend chips using MindSpore. Zero NVIDIA dependency. That’s a technical achievement worth acknowledging regardless of where you stand on geopolitics.
But running GLM-5 locally is a different story:
| Format | Memory Required | Hardware Needed |
|---|---|---|
| FP16/BF16 | ~1.5 TB | 8x H200 minimum |
| 2-bit GGUF | ~300 GB | Single 24GB GPU + 300GB RAM |
| 1-bit | ~180 GB | Fits, but impractically slow |
The 2-bit quantization works but requires serious CPU offloading. You can technically run GLM-5 on a 24GB GPU with 300GB of system RAM using MoE offloading, but expect single-digit tokens per second. It’s a proof of concept, not a daily driver.
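You can sanity-check these memory figures with the naive lower bound of parameters times bits per weight. Real GGUF quants mix bit-widths across tensors and carry scale metadata, so actual files run noticeably larger than this estimate:

```python
# Naive quantized-weight footprint: params * bits / 8.
# This is a LOWER BOUND; real GGUF files mix bit-widths and add
# scales/metadata, so they come in larger.

def naive_size_gb(n_params_billion: float, bits: float) -> float:
    return n_params_billion * bits / 8

for bits in (16, 4, 2):
    print(f"GLM-5 (744B) at {bits}-bit: ~{naive_size_gb(744, bits):.0f} GB")
```

The 16-bit estimate lands at 1,488 GB, matching the ~1.5 TB in the table; the 2-bit lower bound of 186 GB sits under the ~300 GB a practical 2-bit GGUF actually occupies.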
This is the growing divide in “open-weight” models: technically available weights that require datacenter hardware to actually run.
Llama 4 Scout: So Close, Yet So Far
Meta’s Llama 4 Scout remains the most discussed model that nobody can run locally. The 17B active parameter MoE (109B total) needs approximately 55GB VRAM at Q4 quantization—more than double what an RTX 4090 provides.
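The ~55GB figure follows from the same bits-per-weight arithmetic, since all 109B parameters must be resident even though only 17B are active per token (real Q4 variants average slightly above 4 bits per weight, so this is the floor):

```python
# Scout's Q4 weight footprint from first principles: all 109B params
# must be stored, regardless of the 17B active per token.
total_params_b = 109
q4_gb = total_params_b * 4 / 8  # ~4 bits/weight; real Q4 runs a bit higher
print(f"Q4 weights: ~{q4_gb:.1f} GB vs 24 GB on an RTX 4090")
```

And that's before the KV cache, which Scout's 10M-token context window can inflate dramatically.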
The frustration is real because the benchmarks are excellent:
- 88.8% on ChartQA (image understanding)
- 94.4% on DocVQA (document parsing)
- 10M token context window
Some users report success with Unsloth’s 1.78-bit quantization fitting in 24GB VRAM at around 20 tokens/second. But 1.78-bit quantization involves significant quality tradeoffs—you’re not getting the full Scout experience.
The practical situation:
| Hardware | Scout Status |
|---|---|
| RTX 4090 (24GB) | Too large at useful quantizations |
| Mac Studio 128GB | Runs at Q4, competitive speed |
| Dual GPU (48GB) | Works but complex setup |
| 1.78-bit quant | Fits but degraded quality |
For most users, Llama 4 Scout is a “wait for distillation” model. Meta will likely release smaller variants trained on Scout outputs—that’s when consumer hardware users get access.
The Updated Local AI Tier List
Based on this week’s findings and community benchmarks:
24GB VRAM (RTX 3090/4090)
| Best For | Model | Speed | Why |
|---|---|---|---|
| Reasoning | MiMo-V2-Flash (Q3) | 20-30 t/s | 94.1% AIME, fits in VRAM |
| Coding | GLM-4.7-Flash | 60-80 t/s | 59.2% SWE-Bench Verified |
| General | Qwen3.5-35B-A3B | 60-100 t/s | Best overall balance |
| Quality | Qwen3.5-27B (8-bit) | 20-25 t/s | Dense model, no MoE shortcuts |
12-16GB VRAM (RTX 4070/4080)
| Best For | Model | Speed | Why |
|---|---|---|---|
| Balanced | Qwen3.5-35B-A3B (4-bit) | 40-60 t/s | MoE efficiency |
| Coding | GLM-4.7-Flash (4-bit) | 45-60 t/s | Still dominates SWE-Bench |
| General | Gemma 3 27B QAT | 20-30 t/s | Runs in 14.1GB |
The “Wait List” (Too Large for Consumer GPUs)
| Model | Why Wait |
|---|---|
| Llama 4 Scout | Needs 55GB+ at Q4 |
| GLM-5 | Needs 300GB+ for usable speeds |
| Llama 4 Maverick | Needs 200GB+ |
What This Means
The open-weight ecosystem is bifurcating. On one side: models like MiMo-V2-Flash, Qwen 3.5, and GLM-4.7 that are genuinely usable on consumer hardware. On the other: “open-weight” frontier models that require enterprise infrastructure.
This isn’t necessarily bad. GLM-5’s existence proves open models can compete at the absolute frontier. The research and techniques will filter down to smaller, more accessible models over time.
But it does mean the phrase “open-weight” needs context. Open weights you can download are different from open weights you can run.
For this week’s practical takeaways:
- MiMo-V2-Flash is the new reasoning champion for 24GB GPUs—94.1% on AIME at 20+ t/s is remarkable
- GLM-4.7-Flash remains the coding king if you primarily work with code
- Qwen 3.5’s MoE models stay the general-purpose default for balance of speed and quality
- Llama 4 Scout isn’t a consumer model yet—wait for distillations or buy a Mac Studio
What You Can Do
If you have 24GB VRAM and want reasoning: Download MiMo-V2-Flash from Hugging Face, grab the Q3_K_S GGUF, and run through Ollama or llama.cpp. Expect 20-30 t/s and competitive performance with GPT-5 on math and science reasoning.
If you want the fastest coding assistant: GLM-4.7-Flash at 4-bit still beats everything else on SWE-Bench Verified while hitting 60+ t/s.
If you’re frustrated with Llama 4 Scout: Check Unsloth’s 1.78-bit quantizations if you’re willing to accept quality tradeoffs. Otherwise, wait for Meta’s inevitable smaller variants.
If you have Mac hardware: M4 Max machines and the Mac Studio remain the best value for local inference. 128GB of unified memory runs Scout at full Q4 quality with room for context. The RTX 4090 can’t match that capacity ceiling despite faster raw throughput on smaller models.
The home GPU leaderboard updates frequently—check it before downloading anything to verify current rankings. The open-weight space moves fast, and what was optimal last month may have been eclipsed.