Mistral Small 4 arrived with fanfare this month. It unifies reasoning, vision, and coding into one model, ships under an Apache 2.0 license, and beats GPT-OSS 120B on multiple benchmarks. The catch: you’ll need enterprise hardware to run it.
Meanwhile, Qwen 3.5’s MoE models continue to dominate consumer GPUs. A used $800 RTX 3090 now delivers 112 tokens per second with frontier-competitive quality. That’s the real story this week.
Mistral Small 4: The Three-in-One Model
Mistral released Small 4 on March 17 under Apache 2.0. The pitch: one model that combines Magistral’s reasoning, Pixtral’s vision, and Devstral’s coding capabilities.
The architecture is clever. 119 billion total parameters with 128 expert subnetworks, but only 4 experts active per token—roughly 6 billion parameters per forward pass. Add a 256K context window and native multimodal understanding.
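The arithmetic behind "6 billion active parameters" is worth making explicit. A minimal sketch, assuming roughly 2.4B non-expert parameters (attention, embeddings, router) since Mistral hasn’t published an exact breakdown:

```python
# Back-of-the-envelope: active parameters per token in a sparse MoE.
# The 2.4B "shared" figure is an assumption for illustration only.
total_params = 119e9      # reported total
num_experts = 128
active_experts = 4        # experts routed per token
shared = 2.4e9            # assumed non-expert params (attention, embeddings, router)

expert_pool = total_params - shared
per_expert = expert_pool / num_experts
active = shared + active_experts * per_expert
print(f"~{active / 1e9:.1f}B parameters active per token")
```

Every token still touches the shared layers, but only 4 of the 128 expert subnetworks, which is how a 119B model does forward passes at roughly 6B-parameter cost.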
The benchmark numbers back up the hype:
| Benchmark | Small 4 | GPT-OSS 120B |
|---|---|---|
| GPQA Diamond | 71.2% | 68.4% |
| MMLU-Pro | 78.0% | 76.3% |
| LiveCodeBench | Higher | Lower |
| Output Length | 1.6K chars | 5.8K chars |
That last row matters. In Mistral’s testing, Small 4 completes equivalent tasks with far shorter outputs than GPT-OSS 120B (1.6K characters versus 5.8K). Since APIs bill per output token, concise responses translate directly into lower cost per response.
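At a typical ~4 characters per token (an assumption; actual tokenizer ratios vary by model), the table’s output lengths translate into billed tokens roughly like this:

```python
# Rough cost implication of output length.
# chars_per_token is an assumed average; real tokenizers vary.
chars_small4, chars_gptoss = 1600, 5800
chars_per_token = 4

tokens_small4 = chars_small4 / chars_per_token   # ~400 output tokens
tokens_gptoss = chars_gptoss / chars_per_token   # ~1450 output tokens
ratio = tokens_gptoss / tokens_small4
print(f"~{ratio:.1f}x more output tokens billed per response")
```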
The Consumer Hardware Problem
Here’s where excitement meets reality. Small 4’s 119B parameters need:
| Format | Memory Required | Hardware |
|---|---|---|
| BF16 | ~242 GB | 4x H100 |
| NVFP4 | ~66 GB | DGX Spark or 3x RTX 4090 |
| INT4 | ~60-70 GB | Dual RTX 4090 (complex) |
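These figures line up with simple weight-size arithmetic. A quick sketch (this counts weights only; quantization scale metadata, KV cache, and runtime overhead are why the table’s numbers run a few GB higher):

```python
# Weight memory floor per format: parameter count times bytes per parameter.
# Excludes KV cache and runtime overhead, so real requirements are higher.
params = 119e9
bytes_per_param = {"BF16": 2.0, "NVFP4": 0.5, "INT4": 0.5}

weight_gb = {fmt: params * b / 1e9 for fmt, b in bytes_per_param.items()}
for fmt, gb in weight_gb.items():
    print(f"{fmt}: ~{gb:.0f} GB of weights alone")
```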
A DGX Spark—NVIDIA’s “entry-level” AI workstation—runs Small 4 acceptably thanks to its 128GB of unified memory. But it costs $3,999.
For most users, Small 4 remains an API model. Mistral’s pricing at $0.15/1M input and $0.60/1M output is competitive—5-7x cheaper than GPT-5.4 Mini. But “open weights” doesn’t mean “runs on your GPU.”
Qwen 3.5 35B-A3B: The Actual Consumer Champion
While Mistral grabbed headlines, Qwen 3.5 quietly became the model to beat on consumer hardware. The 35B-A3B variant exemplifies why MoE architectures matter for local inference.
The numbers:
| GPU | Speed | Notes |
|---|---|---|
| RTX 3090 (24GB) | 112 t/s | Full 262K context |
| RTX 5060 Ti (16GB) | 44 t/s | 100K context |
| RTX 4090 (24GB) | 100-130 t/s | Varies by quant |
112 tokens per second on a used $800 card. That’s faster than reading speed. The architecture—256 experts with only 3 billion parameters active per token—keeps VRAM usage manageable while maintaining quality.
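The same weight-size arithmetic explains why a 35B MoE fits comfortably on a 24GB card. A rough sketch (4-bit quantization approximated as 0.5 bytes per weight; scale metadata adds a bit on top):

```python
# Why Qwen3.5-35B-A3B fits a 24GB GPU: only the quantized weights must be
# resident, and 4-bit cuts them ~4x vs FP16. Estimate excludes KV cache.
total_params = 35e9
q4_bytes_per_param = 0.5   # ~4 bits/weight; per-group scales push this slightly up

weights_gb = total_params * q4_bytes_per_param / 1e9
print(f"Q4 weights: ~{weights_gb:.1f} GB, leaving headroom for KV cache in 24 GB")
```

The 3B active parameters per token are what make it fast; the 4-bit weights are what make it fit.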
Benchmark comparisons show the 35B-A3B punching above its weight:
- TAU2-Bench: 81.2% (22.7 points better than Qwen3-235B)
- AndroidWorld: 71.1% (GUI understanding)
- ScreenSpot Pro: 68.6% (UI interaction)
- SWE-bench Verified: 69.2% (code generation)
That SWE-bench score on an $800 GPU setup deserves attention. It’s approaching what required datacenter hardware six months ago.
The Consumer GPU Tier List (Updated)
Based on this week’s community benchmarks:
24GB VRAM (RTX 3090/4090)
| Best For | Model | Speed | Why |
|---|---|---|---|
| Speed + Quality | Qwen3.5-35B-A3B | 100-130 t/s | Best overall balance |
| Reasoning | MiMo-V2-Flash (Q3) | 20-30 t/s | 94.1% AIME |
| Coding | GLM-4.7-Flash | 60-80 t/s | 59.2% SWE-Bench |
| Dense Quality | Qwen3.5-27B (8-bit) | 20-25 t/s | No MoE shortcuts |
12-16GB VRAM (RTX 4060/4070/4080)
| Best For | Model | Speed | Why |
|---|---|---|---|
| Balanced | Qwen3.5-35B-A3B (4-bit) | 40-60 t/s | MoE efficiency |
| Coding | GLM-4.7-Flash (4-bit) | 45-60 t/s | SWE-Bench leader |
| General | Gemma 3 27B QAT | 20-30 t/s | Fits in 14.1GB |
8GB VRAM (RTX 4060 Laptop, etc.)
| Best For | Model | Speed | Why |
|---|---|---|---|
| General | Qwen3-14B (4-bit) | 30-40 t/s | Best quality/VRAM ratio |
| Small Tasks | Gemma 3 4B | 50-80 t/s | Fits in 4.2GB |
| Coding | GLM-4.7-Flash (2-bit) | 25-35 t/s | Quality tradeoffs |
Mac vs RTX: The Divide Clarifies
This week’s testing reinforced a pattern. Mac systems with unified memory excel at running larger models that don’t fit in VRAM. RTX cards win on throughput for models that do fit.
| Hardware | Strength | Weakness |
|---|---|---|
| M4 Max (128GB) | Runs Llama 4 Scout at Q4 | Slower per-token than RTX |
| RTX 4090 (24GB) | 130 t/s on Qwen3.5-35B | Can’t fit Scout |
| RTX 3090 (24GB) | $800 used, 112 t/s | Same limits as 4090 |
If your target model fits in 24GB VRAM at useful quantization, RTX wins on speed and cost. If you need 50GB+ for models like Llama 4 Scout, Mac Studio remains the accessible option.
What This Means
The open-weight ecosystem continues splitting into two tiers:
Consumer-accessible: Models designed for efficient inference on 8-24GB VRAM. Qwen 3.5’s MoE family, GLM-4.7 Flash, MiMo-V2-Flash. These deliver genuine value on hardware people actually own.
Consumer-aspirational: Models like Mistral Small 4, Llama 4 Scout, and GLM-5. Technically open weights, practically requiring $4,000+ workstations or cloud deployment.
Both tiers matter. Small 4’s existence pushes the industry forward. Its techniques will filter into smaller models. But today’s practical choice for most users remains clear: Qwen 3.5’s MoE variants for general use, GLM-4.7 Flash for coding, MiMo-V2-Flash for reasoning.
What You Can Do
If you want the fastest general assistant: Grab Qwen3.5-35B-A3B from Hugging Face. At Q4 quantization on a 24GB card, expect 100+ t/s with quality that rivals much larger models.
If coding is your priority: GLM-4.7-Flash still leads SWE-bench Verified among consumer-runnable models. Run through Ollama for easy setup.
If you’re buying hardware: A used RTX 3090 at $700-800 delivers 90% of the RTX 4090’s practical value for local LLMs. The extra CUDA cores matter less than the identical 24GB VRAM limit.
If Mistral Small 4 appeals: Use the API at $0.15/1M input. It’s cheaper than running your own 4x H100 setup and you get full BF16 quality. Self-hosting makes sense when you hit API cost thresholds—for Small 4, that’s significant volume.
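What "significant volume" means can be sketched with amortization math. Every figure here beyond the API prices is an assumption, chosen only to show the shape of the calculation:

```python
# Hedged break-even sketch: Mistral API vs. self-hosting on 4x H100.
# Hardware price, amortization window, and the even in/out split are
# assumptions; power, cooling, and ops labor are ignored entirely.
api_in, api_out = 0.15, 0.60           # $/1M tokens (Mistral API pricing)
hardware = 4 * 30_000                  # assumed ~$30k per H100 system share
amortized_daily = hardware / (3 * 365) # 3-year straight-line depreciation

# Cost of one "token pair" (1 input + 1 output token) via the API:
pair_cost = (api_in + api_out) / 1e6
breakeven_pairs = amortized_daily / pair_cost
print(f"~{breakeven_pairs / 1e6:.0f}M in+out token pairs per day to break even")
```

Even under these generous assumptions (ignoring electricity and ops), break-even sits in the low hundreds of millions of tokens per day, which is well beyond hobbyist use.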
The home GPU leaderboard tracks which models perform best on specific hardware configurations. Check before downloading—the landscape shifts weekly.