Open-Weight LLM Showdown Week 7: Mistral Small 4 Impresses but Stays Out of Reach

Mistral Small 4's 119B MoE unifies reasoning, vision, and coding—but needs datacenter hardware. Qwen 3.5 35B-A3B remains the consumer GPU king at 112 t/s.


Mistral Small 4 arrived with fanfare this month. It unifies reasoning, vision, and coding into one model, ships under Apache 2.0, and beats GPT-OSS 120B on multiple benchmarks. The catch: you’ll need enterprise hardware to run it.

Meanwhile, Qwen 3.5’s MoE models continue to dominate consumer GPUs. A used $800 RTX 3090 now delivers 112 tokens per second with frontier-competitive quality. That’s the real story this week.

Mistral Small 4: The Three-in-One Model

Mistral released Small 4 on March 17 under Apache 2.0. The pitch: one model that combines Magistral’s reasoning, Pixtral’s vision, and Devstral’s coding capabilities.

The architecture is clever. 119 billion total parameters with 128 expert subnetworks, but only 4 experts active per token—roughly 6 billion parameters per forward pass. Add a 256K context window and native multimodal understanding.
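The routing described above can be sketched in a few lines. This is a toy illustration of top-k expert selection, not Mistral’s actual implementation; only the expert count (128) and the number of active experts (4) come from the figures quoted above.

```python
import math
import random

NUM_EXPERTS = 128  # expert subnetworks, per Mistral's stated config
TOP_K = 4          # experts activated per token

def route_token(router_logits, k=TOP_K):
    """Select the k highest-scoring experts and softmax-normalize their gate weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
chosen = route_token(logits)
print([i for i, _ in chosen])  # indices of the 4 experts this token is routed to
```

The point of the sketch: each token touches only 4 of the 128 experts, which is why the per-token compute stays near 6B parameters despite the 119B total.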

The benchmark numbers back up the hype:

| Benchmark | Small 4 | GPT-OSS 120B |
| --- | --- | --- |
| GPQA Diamond | 71.2% | 68.4% |
| MMLU-Pro | 78.0% | 76.3% |
| LiveCodeBench | Higher | Lower |
| Output Length | 1.6K chars | 5.8K chars |

That last row matters. Mistral’s testing shows Small 4 completing equivalent tasks with far shorter responses than GPT-OSS 120B, and concise responses translate directly into a lower bill on output-priced APIs.

The Consumer Hardware Problem

Here’s where excitement meets reality. Small 4’s 119B parameters need:

| Format | Memory Required | Hardware |
| --- | --- | --- |
| BF16 | ~242 GB | 4x H100 |
| NVFP4 | ~66 GB | DGX Spark or 3x RTX 4090 |
| INT4 | ~60-70 GB | Dual RTX 4090 (complex) |
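The memory column follows directly from parameter count times bits per weight; the quoted totals add a few GB for KV cache and runtime buffers on top. A quick sketch (the 4.25 effective bits for NVFP4, covering per-block scale factors, is an assumption):

```python
def weights_gb(params_billion, bits_per_weight):
    """Raw weight storage in GB; the runtime adds KV cache and buffers on top."""
    return params_billion * bits_per_weight / 8

# Small 4's 119B parameters in each format from the table above
for fmt, bits in [("BF16", 16), ("NVFP4", 4.25), ("INT4", 4)]:
    print(f"{fmt}: ~{weights_gb(119, bits):.0f} GB of weights")
```

BF16 comes out at 238 GB of weights alone, which lines up with the ~242 GB total once runtime overhead is included.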

A DGX Spark—NVIDIA’s “entry-level” AI workstation—runs Small 4 acceptably with 128GB unified memory. But a DGX Spark costs $3,999.

For most users, Small 4 remains an API model. Mistral’s pricing at $0.15/1M input and $0.60/1M output is competitive—5-7x cheaper than GPT-5.4 Mini. But “open weights” doesn’t mean “runs on your GPU.”
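At those rates, per-request cost is simple arithmetic. A sketch using Mistral’s quoted prices, with a made-up workload (the prompt size, response size, and request volume are illustrative assumptions, not Mistral figures):

```python
IN_PRICE = 0.15 / 1e6   # $/input token, Mistral's quoted Small 4 rate
OUT_PRICE = 0.60 / 1e6  # $/output token

def request_cost(in_tokens, out_tokens):
    """Dollar cost of one API request at the quoted per-token rates."""
    return in_tokens * IN_PRICE + out_tokens * OUT_PRICE

# Hypothetical workload: 2K-token prompt, 500-token reply, 10K requests/day
daily = 10_000 * request_cost(2_000, 500)
print(f"${daily:.2f}/day")  # roughly $6/day at these assumed volumes
```

Shorter outputs matter here: the output side of the bill scales linearly with response length, so Small 4’s concise answers cut the larger of the two per-token rates.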

Qwen 3.5 35B-A3B: The Actual Consumer Champion

While Mistral grabbed headlines, Qwen 3.5 quietly became the model to beat on consumer hardware. The 35B-A3B variant exemplifies why MoE architectures matter for local inference.

The numbers:

| GPU | Speed | Notes |
| --- | --- | --- |
| RTX 3090 (24GB) | 112 t/s | Full 262K context |
| RTX 5060 Ti (16GB) | 44 t/s | 100K context |
| RTX 4090 (24GB) | 100-130 t/s | Varies by quant |

112 tokens per second on a used $800 card. That’s faster than reading speed. The architecture—256 experts with only 3 billion parameters active per token—keeps VRAM usage manageable while maintaining quality.
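That throughput is consistent with decode being memory-bandwidth bound: each generated token requires reading roughly the active parameters from VRAM, not all 35B. A back-of-envelope sketch, where the RTX 3090’s ~936 GB/s bandwidth and the 50% effective-bandwidth factor are both assumptions on my part:

```python
def decode_tps(active_params_billion, bits_per_weight, bandwidth_gbs, efficiency=0.5):
    """Rough ceiling on tokens/sec when decode is bound by reading weights."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gbs * 1e9 * efficiency / bytes_per_token

# 3B active params at 4-bit on an RTX 3090 (~936 GB/s, assumed)
print(f"~{decode_tps(3, 4, 936):.0f} t/s ceiling")
```

The estimate lands around 312 t/s; the observed 112 t/s sits comfortably under it because the sketch ignores KV-cache reads, attention, and routing overhead. Run the same arithmetic with 35B dense parameters and the ceiling drops by an order of magnitude, which is the whole MoE argument.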

Benchmark comparisons show the 35B-A3B punching above its weight:

  • TAU2-Bench: 81.2% (22.7 points better than Qwen3-235B)
  • AndroidWorld: 71.1% (GUI understanding)
  • ScreenSpot Pro: 68.6% (UI interaction)
  • SWE-bench Verified: 69.2% (code generation)

That SWE-bench score on an $800 GPU setup deserves attention. It’s approaching what required datacenter hardware six months ago.

The Consumer GPU Tier List (Updated)

Based on this week’s community benchmarks:

24GB VRAM (RTX 3090/4090)

| Best For | Model | Speed | Why |
| --- | --- | --- | --- |
| Speed + Quality | Qwen3.5-35B-A3B | 100-130 t/s | Best overall balance |
| Reasoning | MiMo-V2-Flash (Q3) | 20-30 t/s | 94.1% AIME |
| Coding | GLM-4.7-Flash | 60-80 t/s | 59.2% SWE-bench |
| Dense Quality | Qwen3.5-27B (8-bit) | 20-25 t/s | No MoE shortcuts |

12-16GB VRAM (RTX 4060/4070/4080)

| Best For | Model | Speed | Why |
| --- | --- | --- | --- |
| Balanced | Qwen3.5-35B-A3B (4-bit) | 40-60 t/s | MoE efficiency |
| Coding | GLM-4.7-Flash (4-bit) | 45-60 t/s | SWE-bench leader |
| General | Gemma 3 27B QAT | 20-30 t/s | Fits in 14.1GB |

8GB VRAM (RTX 4060 Laptop, etc.)

| Best For | Model | Speed | Why |
| --- | --- | --- | --- |
| General | Qwen3-14B (4-bit) | 30-40 t/s | Best quality/VRAM ratio |
| Small Tasks | Gemma 3 4B | 50-80 t/s | 4.2GB RAM |
| Coding | GLM-4.7-Flash (2-bit) | 25-35 t/s | Quality tradeoffs |

Mac vs RTX: The Divide Clarifies

This week’s testing reinforced a pattern. Mac systems with unified memory excel at running larger models that don’t fit in VRAM. RTX cards win on throughput for models that do fit.

| Hardware | Strength | Weakness |
| --- | --- | --- |
| M4 Max (128GB) | Runs Llama 4 Scout at Q4 | Slower per-token than RTX |
| RTX 4090 (24GB) | 130 t/s on Qwen3.5-35B | Can’t fit Scout |
| RTX 3090 (24GB) | $800 used, 112 t/s | Same limits as 4090 |

If your target model fits in 24GB VRAM at useful quantization, RTX wins on speed and cost. If you need 50GB+ for models like Llama 4 Scout, Mac Studio remains the accessible option.
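That fits-in-VRAM test reduces to the same weights-times-bits arithmetic as before. A quick helper, where the 85% usable-VRAM fraction (leaving room for KV cache and the runtime) and the ~109B size for a Scout-class model are assumptions:

```python
def fits(params_billion, bits_per_weight, vram_gb, usable=0.85):
    """True if the quantized weights fit in the usable fraction of VRAM (assumed 85%)."""
    return params_billion * bits_per_weight / 8 <= vram_gb * usable

print(fits(35, 4, 24))   # Qwen3.5-35B-A3B at Q4 on a 24GB card -> fits
print(fits(109, 4, 24))  # a Scout-class ~109B model (size assumed) -> does not
```

Anything that fails this check on a 24GB card is where unified-memory Macs take over.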

What This Means

The open-weight ecosystem continues splitting into two tiers:

Consumer-accessible: Models designed for efficient inference on 8-24GB VRAM. Qwen 3.5’s MoE family, GLM-4.7 Flash, MiMo-V2-Flash. These deliver genuine value on hardware people actually own.

Consumer-aspirational: Models like Mistral Small 4, Llama 4 Scout, and GLM-5. Technically open weights, practically requiring $4,000+ workstations or cloud deployment.

Both tiers matter. Small 4’s existence pushes the industry forward. Its techniques will filter into smaller models. But today’s practical choice for most users remains clear: Qwen 3.5’s MoE variants for general use, GLM-4.7 Flash for coding, MiMo-V2-Flash for reasoning.

What You Can Do

If you want the fastest general assistant: Grab Qwen3.5-35B-A3B from Hugging Face. At Q4 quantization on a 24GB card, expect 100+ t/s with quality that rivals much larger models.

If coding is your priority: GLM-4.7-Flash still leads SWE-bench Verified among consumer-runnable models. Run through Ollama for easy setup.

If you’re buying hardware: A used RTX 3090 at $700-800 delivers 90% of the RTX 4090’s practical value for local LLMs. Local inference is largely memory-bound, so the 4090’s extra CUDA cores matter less than the 24GB VRAM ceiling both cards share.

If Mistral Small 4 appeals: Use the API at $0.15/1M input. It’s cheaper than running your own 4x H100 setup and you get full BF16 quality. Self-hosting makes sense when you hit API cost thresholds—for Small 4, that’s significant volume.
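“Significant volume” can be made concrete with a break-even calculation. A sketch using Mistral’s quoted API rates, where the $240/day figure for a rented 4x H100 node and the 4:1 input-to-output token ratio are placeholder assumptions:

```python
API_IN = 0.15 / 1e6   # $/input token, Mistral's quoted rate
API_OUT = 0.60 / 1e6  # $/output token

RIG_DAILY = 240.0  # assumed daily cost of a rented 4x H100 node

def breakeven_out_tokens(in_out_ratio=4.0):
    """Output tokens/day at which API spend matches the rig,
    assuming in_out_ratio input tokens per output token."""
    cost_per_out = API_OUT + in_out_ratio * API_IN
    return RIG_DAILY / cost_per_out

print(f"{breakeven_out_tokens():,.0f} output tokens/day to break even")
```

Under these assumptions the crossover sits around 200 million output tokens per day, which is why the API is the sane default for almost everyone.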

The home GPU leaderboard tracks which models perform best on specific hardware configurations. Check before downloading—the landscape shifts weekly.