The open-weight landscape split further this week. Xiaomi’s MiMo-V2-Flash proved you can squeeze 309 billion parameters onto a single RTX 4090. Meanwhile, Zhipu AI’s GLM-5 set new benchmarks that require hardware most of us will never own. And Llama 4 Scout remains the model everyone wants to run locally but can’t.
Here’s what actually changed for consumer hardware users.
MiMo-V2-Flash: The Consumer GPU Breakthrough
Xiaomi released MiMo-V2-Flash with an MIT license and a promise: frontier-class reasoning on consumer hardware. The model packs 309 billion total parameters but activates only 15 billion per token through its MoE architecture.
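A quick sketch of what that sparsity means in practice, using only the 309B-total / 15B-active figures from the release (the "~2 FLOPs per active parameter" rule is a standard rough estimate, not a MiMo-specific number):

```python
# Back-of-envelope MoE arithmetic for MiMo-V2-Flash
# (309B total / 15B active, per the release).
TOTAL_PARAMS = 309e9   # every expert must be stored
ACTIVE_PARAMS = 15e9   # parameters actually touched per token

# Rough rule of thumb: ~2 FLOPs per active parameter per generated token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
flops_per_token = 2 * ACTIVE_PARAMS

print(f"Active fraction per token: {active_fraction:.1%}")   # ~4.9%
print(f"Approx. FLOPs per token:   {flops_per_token:.1e}")   # 3.0e+10
```

Memory footprint is driven by total parameters; per-token compute is driven by active parameters. That asymmetry is the whole MoE bet.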
The benchmark numbers that matter:
- 94.1% on AIME 2025 (mathematical reasoning—within 0.5% of GPT-5 High)
- 89.2% on GPQA Diamond (scientific reasoning)
- 150 tokens/second on optimized infrastructure
But those server-side numbers don’t tell the consumer story. Community testing shows what actually happens on real hardware:
RTX 4090 Performance (24GB VRAM)
| Quantization | VRAM Used | Speed | Quality |
|---|---|---|---|
| Q3_K_S | ~22GB | 20-25 t/s | Good for reasoning |
| IQ3_XS | ~20GB | 25-30 t/s | Slight quality drop |
| INT8 | ~24GB | 18-22 t/s | Best quality fit |
These speeds are fast enough for interactive use. Three architecture choices make this possible:
- Hybrid attention with 128-token sliding windows reduces KV-cache storage by 6x
- Multi-Token Prediction generates multiple tokens per forward pass
- Aggressive MoE sparsity keeps only 15B parameters hot at any time
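The sliding-window effect is easy to see from first principles: KV-cache size scales linearly with the number of cached positions, so capping positions at the window size caps the cache no matter how long the conversation runs. The layer/head/dim values below are illustrative placeholders, not MiMo's actual config, and the overall 6x figure reflects a hybrid mix where only some layers use the window:

```python
# Per-layer KV-cache cost: 2 (keys + values) * layers * kv_heads
# * head_dim * bytes_per_element * cached positions.
# Config values here are ILLUSTRATIVE, not MiMo-V2-Flash's real ones.

def kv_cache_bytes(positions, n_layers=48, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * positions

full = kv_cache_bytes(positions=8192)     # full attention over 8K context
windowed = kv_cache_bytes(positions=128)  # 128-token sliding window
print(f"Full 8K cache:  {full / 1e6:.0f} MB")
print(f"Windowed cache: {windowed / 1e6:.1f} MB")
print(f"Reduction:      {full / windowed:.0f}x per windowed layer")  # 64x
```

All-windowed layers would give a 64x reduction at 8K context; blending windowed and full-attention layers lands at the quoted 6x overall.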
The catch: context length. MiMo-V2-Flash’s aggressive memory optimizations limit practical context to around 8K tokens on consumer GPUs. For longer conversations or document analysis, you’ll need more VRAM or accept significant slowdowns from CPU offloading.
GLM-5: Frontier Performance, Datacenter Requirements
Zhipu AI’s GLM-5 set the new high-water mark for open models in February. The 744B parameter MoE model (40B active) beats Claude Opus 4.5 on Humanity’s Last Exam and holds the #1 spot on SWE-bench Verified at 77.8%.
Notable: the entire model trained on Huawei Ascend chips using MindSpore. Zero NVIDIA dependency. That’s a technical achievement worth acknowledging regardless of where you stand on geopolitics.
But running GLM-5 locally is a different story:
| Format | Memory Required | Hardware Needed |
|---|---|---|
| FP16/BF16 | ~1.5 TB | 8x H200 minimum |
| 2-bit GGUF | ~300 GB | Single 24GB GPU + 300GB RAM |
| 1-bit | ~180 GB | Fits, but impractically slow |
The 2-bit quantization works but requires serious CPU offloading. You can technically run GLM-5 on a 24GB GPU with 300GB of system RAM using MoE offloading, but expect single-digit tokens per second. It’s a proof of concept, not a daily driver.
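You can sanity-check these memory figures with the naive lower bound of parameters times bits per weight. Real GGUF quants mix bit-widths across tensors and carry scale metadata, so actual files run noticeably larger than this estimate:

```python
# Naive quantized-weight footprint: params * bits / 8.
# This is a LOWER BOUND; real GGUF files mix bit-widths and add
# scales/metadata, so they come in larger.

def naive_size_gb(n_params_billion: float, bits: float) -> float:
    return n_params_billion * bits / 8

for bits in (16, 4, 2):
    print(f"GLM-5 (744B) at {bits}-bit: ~{naive_size_gb(744, bits):.0f} GB")
```

The 16-bit estimate lands at 1,488 GB, matching the ~1.5 TB in the table; the 2-bit lower bound of 186 GB sits under the ~300 GB a practical 2-bit GGUF actually occupies.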
This is the growing divide in “open-weight” models: technically available weights that require datacenter hardware to actually run.
Llama 4 Scout: So Close, Yet So Far
Meta’s Llama 4 Scout remains the most discussed model that nobody can run locally. The 17B active parameter MoE (109B total) needs approximately 55GB VRAM at Q4 quantization—more than double what an RTX 4090 provides.
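The ~55GB figure follows from the same bits-per-weight arithmetic, since all 109B parameters must be resident even though only 17B are active per token (real Q4 variants average slightly above 4 bits per weight, so this is the floor):

```python
# Scout's Q4 weight footprint from first principles: all 109B params
# must be stored, regardless of the 17B active per token.
total_params_b = 109
q4_gb = total_params_b * 4 / 8  # ~4 bits/weight; real Q4 runs a bit higher
print(f"Q4 weights: ~{q4_gb:.1f} GB vs 24 GB on an RTX 4090")
```

And that's before the KV cache, which Scout's 10M-token context window can inflate dramatically.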
The frustration is real because the benchmarks are excellent:
- 88.8% on ChartQA (image understanding)
- 94.4% on DocVQA (document parsing)
- 10M token context window
Some users report success with Unsloth’s 1.78-bit quantization fitting in 24GB VRAM at around 20 tokens/second. But 1.78-bit quantization involves significant quality tradeoffs—you’re not getting the full Scout experience.
The practical situation:
| Hardware | Scout Status |
|---|---|
| RTX 4090 (24GB) | Too large at useful quantizations |
| Mac Studio 128GB | Runs at Q4, competitive speed |
| Dual GPU (48GB) | Works but complex setup |
| 1.78-bit quant | Fits but degraded quality |
For most users, Llama 4 Scout is a “wait for distillation” model. Meta will likely release smaller variants trained on Scout outputs—that’s when consumer hardware users get access.
The Updated Local AI Tier List
Based on this week’s findings and community benchmarks:
24GB VRAM (RTX 3090/4090)
| Best For | Model | Speed | Why |
|---|---|---|---|
| Reasoning | MiMo-V2-Flash (Q3) | 20-30 t/s | 94.1% AIME, fits in VRAM |
| Coding | GLM-4.7-Flash | 60-80 t/s | 59.2% SWE-Bench Verified |
| General | Qwen3.5-35B-A3B | 60-100 t/s | Best overall balance |
| Quality | Qwen3.5-27B (8-bit) | 20-25 t/s | Dense model, no MoE shortcuts |
12-16GB VRAM (RTX 4070/4080)
| Best For | Model | Speed | Why |
|---|---|---|---|
| Balanced | Qwen3.5-35B-A3B (4-bit) | 40-60 t/s | MoE efficiency |
| Coding | GLM-4.7-Flash (4-bit) | 45-60 t/s | Still dominates SWE-Bench |
| General | Gemma 3 27B QAT | 20-30 t/s | Runs in 14.1GB |
The “Wait List” (Too Large for Consumer GPUs)
| Model | Why Wait |
|---|---|
| Llama 4 Scout | Needs 55GB+ at Q4 |
| GLM-5 | Needs 300GB+ for usable speeds |
| Llama 4 Maverick | Needs 200GB+ |
What This Means
The open-weight ecosystem is bifurcating. On one side: models like MiMo-V2-Flash, Qwen 3.5, and GLM-4.7 that are genuinely usable on consumer hardware. On the other: “open-weight” frontier models that require enterprise infrastructure.
This isn’t necessarily bad. GLM-5’s existence proves open models can compete at the absolute frontier. The research and techniques will filter down to smaller, more accessible models over time.
But it does mean the phrase “open-weight” needs context. Open weights you can download are different from open weights you can run.
For this week’s practical takeaways:
- MiMo-V2-Flash is the new reasoning champion for 24GB GPUs—94.1% on AIME at 20+ t/s is remarkable
- GLM-4.7-Flash remains the coding king if you primarily work with code
- Qwen 3.5’s MoE models stay the general-purpose default for balance of speed and quality
- Llama 4 Scout isn’t a consumer model yet—wait for distillations or buy a Mac Studio
What You Can Do
If you have 24GB VRAM and want reasoning: Download MiMo-V2-Flash from Hugging Face, grab the Q3_K_S GGUF, and run through Ollama or llama.cpp. Expect 20-30 t/s and competitive performance with GPT-5 on math and science reasoning.
If you want the fastest coding assistant: GLM-4.7-Flash at 4-bit still beats everything else on SWE-Bench Verified while hitting 60+ t/s.
If you’re frustrated with Llama 4 Scout: Check Unsloth’s 1.78-bit quantizations if you’re willing to accept quality tradeoffs. Otherwise, wait for Meta’s inevitable smaller variants.
If you have Mac hardware: M4 Max machines and the Mac Studio remain the best value for local inference. 128GB of unified memory runs Scout at full Q4 quality with room for context. The RTX 4090 can’t match that capacity ceiling despite faster raw throughput on smaller models.
The home GPU leaderboard updates frequently—check it before downloading anything to verify current rankings. The open-weight space moves fast, and what was optimal last month may have been eclipsed.