Mistral Small 4 arrived with fanfare this month. It unifies reasoning, vision, and coding into one model, ships under an Apache 2.0 license, and beats GPT-OSS 120B on multiple benchmarks. The catch: you’ll need enterprise hardware to run it.
Meanwhile, Qwen 3.5’s MoE models continue to dominate consumer GPUs. A used $800 RTX 3090 now delivers 112 tokens per second with frontier-competitive quality. That’s the real story this week.
Mistral Small 4: The Three-in-One Model
Mistral released Small 4 on March 17 under Apache 2.0. The pitch: one model that combines Magistral’s reasoning, Pixtral’s vision, and Devstral’s coding capabilities.
The architecture is clever. 119 billion total parameters with 128 expert subnetworks, but only 4 experts active per token—roughly 6 billion parameters per forward pass. Add a 256K context window and native multimodal understanding.
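The arithmetic behind "6 billion active parameters" is worth making explicit. A minimal sketch, assuming roughly 2.4B non-expert parameters (attention, embeddings, router) since Mistral hasn’t published an exact breakdown:

```python
# Back-of-the-envelope: active parameters per token in a sparse MoE.
# The 2.4B "shared" figure is an assumption for illustration only.
total_params = 119e9      # reported total
num_experts = 128
active_experts = 4        # experts routed per token
shared = 2.4e9            # assumed non-expert params (attention, embeddings, router)

expert_pool = total_params - shared
per_expert = expert_pool / num_experts
active = shared + active_experts * per_expert
print(f"~{active / 1e9:.1f}B parameters active per token")
```

Every token still touches the shared layers, but only 4 of the 128 expert subnetworks, which is how a 119B model does forward passes at roughly 6B-parameter cost.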
The benchmark numbers back up the hype:
| Benchmark | Small 4 | GPT-OSS 120B |
|---|---|---|
| GPQA Diamond | 71.2% | 68.4% |
| MMLU-Pro | 78.0% | 76.3% |
| LiveCodeBench | Higher | Lower |
| Output Length | 1.6K chars | 5.8K chars |
That last row matters. In Mistral’s testing, Small 4 completes equivalent tasks with far shorter outputs than GPT-OSS 120B (1.6K characters versus 5.8K). Since APIs bill per output token, concise responses translate directly into lower cost per response.
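At a typical ~4 characters per token (an assumption; actual tokenizer ratios vary by model), the table’s output lengths translate into billed tokens roughly like this:

```python
# Rough cost implication of output length.
# chars_per_token is an assumed average; real tokenizers vary.
chars_small4, chars_gptoss = 1600, 5800
chars_per_token = 4

tokens_small4 = chars_small4 / chars_per_token   # ~400 output tokens
tokens_gptoss = chars_gptoss / chars_per_token   # ~1450 output tokens
ratio = tokens_gptoss / tokens_small4
print(f"~{ratio:.1f}x more output tokens billed per response")
```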
The Consumer Hardware Problem
Here’s where excitement meets reality. Small 4’s 119B parameters need:
| Format | Memory Required | Hardware |
|---|---|---|
| BF16 | ~242 GB | 4x H100 |
| NVFP4 | ~66 GB | DGX Spark or 3x RTX 4090 |
| INT4 | ~60-70 GB | Dual RTX 4090 (complex) |
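These figures line up with simple weight-size arithmetic. A quick sketch (this counts weights only; quantization scale metadata, KV cache, and runtime overhead are why the table’s numbers run a few GB higher):

```python
# Weight memory floor per format: parameter count times bytes per parameter.
# Excludes KV cache and runtime overhead, so real requirements are higher.
params = 119e9
bytes_per_param = {"BF16": 2.0, "NVFP4": 0.5, "INT4": 0.5}

weight_gb = {fmt: params * b / 1e9 for fmt, b in bytes_per_param.items()}
for fmt, gb in weight_gb.items():
    print(f"{fmt}: ~{gb:.0f} GB of weights alone")
```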
A DGX Spark—NVIDIA’s “entry-level” AI workstation—runs Small 4 acceptably thanks to its 128GB of unified memory. But it costs $3,999.
For most users, Small 4 remains an API model. Mistral’s pricing at $0.15/1M input and $0.60/1M output is competitive—5-7x cheaper than GPT-5.4 Mini. But “open weights” doesn’t mean “runs on your GPU.”
Qwen 3.5 35B-A3B: The Actual Consumer Champion
While Mistral grabbed headlines, Qwen 3.5 quietly became the model to beat on consumer hardware. The 35B-A3B variant exemplifies why MoE architectures matter for local inference.
The numbers:
| GPU | Speed | Notes |
|---|---|---|
| RTX 3090 (24GB) | 112 t/s | Full 262K context |
| RTX 5060 Ti (16GB) | 44 t/s | 100K context |
| RTX 4090 (24GB) | 100-130 t/s | Varies by quant |
112 tokens per second on a used $800 card. That’s faster than reading speed. The architecture—256 experts with only 3 billion parameters active per token—keeps VRAM usage manageable while maintaining quality.
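The same weight-size arithmetic explains why a 35B MoE fits comfortably on a 24GB card. A rough sketch (4-bit quantization approximated as 0.5 bytes per weight; scale metadata adds a bit on top):

```python
# Why Qwen3.5-35B-A3B fits a 24GB GPU: only the quantized weights must be
# resident, and 4-bit cuts them ~4x vs FP16. Estimate excludes KV cache.
total_params = 35e9
q4_bytes_per_param = 0.5   # ~4 bits/weight; per-group scales push this slightly up

weights_gb = total_params * q4_bytes_per_param / 1e9
print(f"Q4 weights: ~{weights_gb:.1f} GB, leaving headroom for KV cache in 24 GB")
```

The 3B active parameters per token are what make it fast; the 4-bit weights are what make it fit.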
Benchmark comparisons show the 35B-A3B punching above its weight:
- TAU2-Bench: 81.2% (22.7 points better than Qwen3-235B)
- AndroidWorld: 71.1% (GUI understanding)
- ScreenSpot Pro: 68.6% (UI interaction)
- SWE-bench Verified: 69.2% (code generation)
That SWE-bench score on an $800 GPU setup deserves attention. It’s approaching what required datacenter hardware six months ago.
The Consumer GPU Tier List (Updated)
Based on this week’s community benchmarks:
24GB VRAM (RTX 3090/4090)
| Best For | Model | Speed | Why |
|---|---|---|---|
| Speed + Quality | Qwen3.5-35B-A3B | 100-130 t/s | Best overall balance |
| Reasoning | MiMo-V2-Flash (Q3) | 20-30 t/s | 94.1% AIME |
| Coding | GLM-4.7-Flash | 60-80 t/s | 59.2% SWE-Bench |
| Dense Quality | Qwen3.5-27B (8-bit) | 20-25 t/s | No MoE shortcuts |
12-16GB VRAM (RTX 4060/4070/4080)
| Best For | Model | Speed | Why |
|---|---|---|---|
| Balanced | Qwen3.5-35B-A3B (4-bit) | 40-60 t/s | MoE efficiency |
| Coding | GLM-4.7-Flash (4-bit) | 45-60 t/s | SWE-Bench leader |
| General | Gemma 3 27B QAT | 20-30 t/s | Fits in 14.1GB |
8GB VRAM (RTX 4060 Laptop, etc.)
| Best For | Model | Speed | Why |
|---|---|---|---|
| General | Qwen3-14B (4-bit) | 30-40 t/s | Best quality/VRAM ratio |
| Small Tasks | Gemma 3 4B | 50-80 t/s | Fits in 4.2GB |
| Coding | GLM-4.7-Flash (2-bit) | 25-35 t/s | Quality tradeoffs |
Mac vs RTX: The Divide Clarifies
This week’s testing reinforced a pattern. Mac systems with unified memory excel at running larger models that don’t fit in VRAM. RTX cards win on throughput for models that do fit.
| Hardware | Strength | Weakness |
|---|---|---|
| M4 Max (128GB) | Runs Llama 4 Scout at Q4 | Slower per-token than RTX |
| RTX 4090 (24GB) | 130 t/s on Qwen3.5-35B | Can’t fit Scout |
| RTX 3090 (24GB) | $800 used, 112 t/s | Same limits as 4090 |
If your target model fits in 24GB VRAM at useful quantization, RTX wins on speed and cost. If you need 50GB+ for models like Llama 4 Scout, Mac Studio remains the accessible option.
What This Means
The open-weight ecosystem continues splitting into two tiers:
Consumer-accessible: Models designed for efficient inference on 8-24GB VRAM. Qwen 3.5’s MoE family, GLM-4.7 Flash, MiMo-V2-Flash. These deliver genuine value on hardware people actually own.
Consumer-aspirational: Models like Mistral Small 4, Llama 4 Scout, and GLM-5. Technically open weights, practically requiring $4,000+ workstations or cloud deployment.
Both tiers matter. Small 4’s existence pushes the industry forward. Its techniques will filter into smaller models. But today’s practical choice for most users remains clear: Qwen 3.5’s MoE variants for general use, GLM-4.7 Flash for coding, MiMo-V2-Flash for reasoning.
What You Can Do
If you want the fastest general assistant: Grab Qwen3.5-35B-A3B from Hugging Face. At Q4 quantization on a 24GB card, expect 100+ t/s with quality that rivals much larger models.
If coding is your priority: GLM-4.7-Flash still leads SWE-bench Verified among consumer-runnable models. Run through Ollama for easy setup.
If you’re buying hardware: A used RTX 3090 at $700-800 delivers 90% of the RTX 4090’s practical value for local LLMs. The extra CUDA cores matter less than the identical 24GB VRAM limit.
If Mistral Small 4 appeals: Use the API at $0.15/1M input. It’s cheaper than running your own 4x H100 setup and you get full BF16 quality. Self-hosting makes sense when you hit API cost thresholds—for Small 4, that’s significant volume.
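What "significant volume" means can be sketched with amortization math. Every figure here beyond the API prices is an assumption, chosen only to show the shape of the calculation:

```python
# Hedged break-even sketch: Mistral API vs. self-hosting on 4x H100.
# Hardware price, amortization window, and the even in/out split are
# assumptions; power, cooling, and ops labor are ignored entirely.
api_in, api_out = 0.15, 0.60           # $/1M tokens (Mistral API pricing)
hardware = 4 * 30_000                  # assumed ~$30k per H100 system share
amortized_daily = hardware / (3 * 365) # 3-year straight-line depreciation

# Cost of one "token pair" (1 input + 1 output token) via the API:
pair_cost = (api_in + api_out) / 1e6
breakeven_pairs = amortized_daily / pair_cost
print(f"~{breakeven_pairs / 1e6:.0f}M in+out token pairs per day to break even")
```

Even under these generous assumptions (ignoring electricity and ops), break-even sits in the low hundreds of millions of tokens per day, which is well beyond hobbyist use.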
The home GPU leaderboard tracks which models perform best on specific hardware configurations. Check before downloading—the landscape shifts weekly.