Two major releases landed this week: Mistral Small 4 dropped at GTC with 128 experts under Apache 2.0, and DeepSeek V4 finally emerged from its months-long stealth mode. Meanwhile, dual RTX 5090 benchmarks confirm what enthusiasts suspected—consumer hardware can now match enterprise GPUs on 70B inference.
Here’s what matters.
Mistral Small 4: The New Efficiency King
Announced at GTC on March 16, Mistral Small 4 is a 119B parameter MoE model that activates only 6B parameters per forward pass. That’s 128 experts with 4 active per token—a design choice that makes it remarkably efficient.
The headline numbers:
- 119B total parameters, ~6B active per token (8B including embeddings)
- 256K context window—up from Small 3’s 128K
- Apache 2.0 license—the most permissive option
- Multimodal input—text and images
What sets Small 4 apart is configurable reasoning: you can toggle between fast, low-latency responses for simple tasks and deep, reasoning-intensive outputs for complex problems. Per Mistral’s reported numbers, this delivers 40% lower latency and triple the throughput of Small 3.
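Mistral has not published the exact API surface for this toggle, so the request shape below is illustrative only: the `reasoning_mode` field name and its values are assumptions for the sake of a concrete example, not a documented parameter.

```python
import json

def chat_request(prompt: str, deep: bool) -> dict:
    """Build a chat-completion payload for Mistral Small 4.

    NOTE: 'reasoning_mode' is a hypothetical parameter name used for
    illustration; check Mistral's API docs for the real toggle.
    """
    return {
        "model": "mistral-small-4",
        "messages": [{"role": "user", "content": prompt}],
        # Hypothetical toggle: "fast" for low latency, "deep" for hard problems.
        "reasoning_mode": "deep" if deep else "fast",
    }

print(json.dumps(chat_request("Plan a zero-downtime database migration.", deep=True), indent=2))
```

The point of a toggle like this is that one deployment can serve both cheap chat traffic and slow, deliberate reasoning without swapping models.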
Performance Comparison
On standard benchmarks, Small 4 positions itself between Small 3.2 and Large 3:
| Benchmark | Small 3.2 | Small 4 | Large 3 |
|---|---|---|---|
| MMLU | 80.5% | 83.2% | 85.5% |
| HumanEval | 92.9% | 94.1% | 95.8% |
| Arena Hard | 43.1% | 67.4% | 78.2% |
| IFEval | 82.3% | 88.7% | 92.1% |
The Arena Hard jump—from 43% to 67%—represents a significant improvement in real-world conversational ability.
Local Deployment
For local inference, Small 4’s MoE architecture makes the model surprisingly runnable on consumer hardware. At INT4 quantization it occupies roughly 40GB of VRAM: achievable on dual RTX 4090s (48GB combined), while a single 32GB 5090 needs a lower-bit quant or partial offload to system RAM.
Ollama support is already available: `ollama run mistral-small-4`
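Beyond the CLI, Ollama also serves a local HTTP API on port 11434, which is handy for scripting. A minimal sketch in Python, assuming the `mistral-small-4` tag above is the one your local Ollama knows:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def extract_text(raw: bytes) -> str:
    """Ollama's non-streaming reply is one JSON object; the generated
    text lives in its 'response' field."""
    return json.loads(raw)["response"]

req = build_generate_request("mistral-small-4", "Why is sparse activation fast?")
# With the server running: urllib.request.urlopen(req).read() returns JSON
# bytes, and extract_text(...) yields the model's answer.
print(json.loads(req.data)["model"])  # -> mistral-small-4
```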
DeepSeek V4: The Wait Is Over
After missed release windows in mid-February, late February, and early March, DeepSeek V4 finally launched around March 3. The developer community’s reaction has been mixed—enthusiasm about capabilities, skepticism about self-reported benchmarks.
- ~1 trillion total parameters, ~32B active
- 1 million token context window
- Native multimodal (text, image, video input; image generation)
- MIT license
V4 introduces what DeepSeek calls “Manifold-Constrained Hyper-Connections” for training stability at trillion-parameter scale, plus “Engram Conditional Memory” for efficient retrieval over million-token contexts.
Benchmark Claims (Unverified)
Leaked benchmarks suggest V4 is competitive with current frontier models:
- HumanEval: ~90% (would match Claude Opus 4.6)
- SWE-bench Verified: 80%+ (top tier for code)
- MATH: 92.4% (if accurate, best-in-class)
All V4 benchmark claims remain unverified until DeepSeek publishes official reports. The community has been burned before by inflated numbers.
The Practical Reality
V4’s 32B active parameters make it more demanding than V3’s 21B. Even with aggressive quantization, you’re looking at:
- RTX 5090 (32GB): Tight fit at INT4, limited context
- Dual 5090 (64GB): Comfortable at INT4, reasonable context
- Mac Studio M4 Ultra 512GB: Full precision possible
For most local users, V3.2 remains the practical choice. V4 is more relevant for API access or enterprise deployments.
Dual RTX 5090: Consumer Hardware Hits Enterprise Territory
The most surprising development this week came from dual GPU benchmarks. Two RTX 5090s running Ollama now match H100 performance on 70B models—at a fraction of the cost.
The numbers:
- DeepSeek-R1 70B: 33 tokens/second at 30K context
- Llama 3.3 70B: 27 tokens/second (matching H100)
- Cost comparison: 2× 5090 ($4K MSRP, $10K+ scalped) vs H100 ($30K+)
Important caveat: Ollama doesn’t parallelize inference across GPUs—it just pools VRAM. You won’t see 2× speedup from 2 cards. What you get is the ability to run larger models without spilling to CPU RAM.
For 110B+ models like Qwen 3.5 full, dual 5090s still struggle. GPU utilization caps at 20%, and inference drops to 7 tokens/second. Enterprise hardware retains its edge at the largest scales.
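These fit-or-spill results follow from simple arithmetic: a model’s weight footprint is roughly total parameters × bits ÷ 8, before KV cache and runtime overhead. A quick sanity check:

```python
def quantized_weight_gb(total_params_b: float, bits: float) -> float:
    """Back-of-envelope weight footprint in GB: params x bits / 8.

    Real quantized files vary (mixed-precision layers, metadata), and
    KV cache for long contexts adds several GB on top of this.
    """
    return total_params_b * bits / 8

# A 70B model at 4-bit needs ~35 GB for weights alone, which is why it
# spills past one 32GB 5090 but sits comfortably in 64GB of pooled VRAM.
print(quantized_weight_gb(70, 4))  # -> 35.0
```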
The Sweet Spots
Based on current benchmarks:
| Setup | Best Model Class | Tokens/sec | Notes |
|---|---|---|---|
| Single RTX 5090 | 32B dense | 61-65 | Qwen 3.5 32B optimal |
| Single RTX 5090 | 30B MoE | 234 | Qwen 3 MoE screams |
| Dual RTX 5090 | 70B quantized | 27-33 | H100 territory |
| Single RTX 4090 | 27B dense | 35-45 | Still the value king |
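To translate those rates into wall-clock feel, here is how long a 1,000-token answer takes at each setup’s decode rate from the table (prompt prefill excluded):

```python
def seconds_for(tokens: int, tok_per_sec: float) -> float:
    """Generation time at a steady decode rate (prefill not included)."""
    return tokens / tok_per_sec

# Decode rates taken from the table above.
for setup, rate in [("5090 / 32B dense", 61),
                    ("5090 / 30B MoE", 234),
                    ("dual 5090 / 70B", 33)]:
    print(f"{setup}: {seconds_for(1000, rate):.1f}s per 1,000 tokens")
# Prints roughly 16.4s, 4.3s, and 30.3s respectively.
```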
Updated Rankings
Combining leaderboard data with this week’s releases:
For Coding
- Qwen 3.5 - GPQA Diamond 88.4%, LiveCodeBench leader
- Mistral Small 4 - HumanEval 94.1%, configurable depth
- DeepSeek V4 - SWE-bench 80%+ (if benchmarks hold)
For Reasoning
- Kimi K2.5 - IFEval 94.0%, AIME 96.1%
- Qwen 3.5 - Best GPQA Diamond (88.4%)
- Llama 4 Scout - 10M context for document reasoning
For Speed/Efficiency
- Mistral Small 4 - 40% lower latency than Small 3
- Gemma 3 27B - Dense architecture, no MoE overhead
- Qwen 3.5 Small - 9B runs everywhere
For Local Deployment
- Qwen 3.5-9B - Best quality under 10B
- Mistral Small 4 Q4 - Tight fit on a single 5090
- Gemma 3 27B Q4 - 14GB with QAT
Hardware Recommendations (March 21, 2026)
RTX 5090 Owners
You have options now:
- Mistral Small 4 Q4 - The new efficiency standard
- Qwen 3.5-32B full - Best dense model at this scale
- Llama 4 Scout INT8 - When you need 10M context
RTX 4090 Owners
Still the practical sweet spot:
- Mistral Small 4 Q4 - Tight but works
- Gemma 3 27B Q4 - 14GB leaves room for context
- Qwen 3.5-9B - Quality that rivals 70B from 2024
Dual GPU Enthusiasts
If you can acquire two 5090s:
- DeepSeek-R1 70B - Full reasoning model at 33 tok/s
- Llama 3.3 70B - H100-matching inference
- Qwen 3.5-70B - The frontier, locally
Mac Users
Unified memory continues to differentiate:
- M4 Max 128GB: Llama 4 Scout usable, V4 at reduced context
- M4 Ultra 512GB: Everything fits, eventually
The Bottom Line
This week marked a shift. Mistral Small 4 proves that Apache-licensed, MoE-based models can compete with proprietary options while running on consumer hardware. DeepSeek V4’s arrival—despite benchmark skepticism—adds another trillion-parameter option to the open-weight ecosystem.
The dual 5090 benchmarks are perhaps most significant. Consumer hardware matching H100 performance on 70B models wasn’t expected this soon. Yes, you still can’t buy a 5090 at MSRP. But the performance ceiling for home labs keeps rising.
For most users, the practical action remains unchanged: Qwen 3.5-9B via Ollama handles 90% of tasks. When you need more, Mistral Small 4 and Gemma 3 27B offer excellent quality-to-resource ratios.
What to Try This Week
- Mistral Small 4 - `ollama run mistral-small-4` and test configurable reasoning
- DeepSeek V4 via API - If self-hosting is impractical, try the hosted version first
- Dual GPU owners - Benchmark DeepSeek-R1 70B at extended context
- Everyone else - Qwen 3.5-9B remains the default recommendation
Next week: We’ll see if V4’s benchmarks hold under independent testing, and whether the 5090 supply situation improves.