Two major releases landed this week: Mistral Small 4 dropped at GTC with 128 experts under Apache 2.0, and DeepSeek V4 finally emerged from its months-long stealth mode. Meanwhile, dual RTX 5090 benchmarks confirm what enthusiasts suspected—consumer hardware can now match enterprise GPUs on 70B inference.
Here’s what matters.
Mistral Small 4: The New Efficiency King
Announced at GTC on March 16, Mistral Small 4 is a 119B parameter MoE model that activates only 6B parameters per forward pass. That’s 128 experts with 4 active per token—a design choice that makes it remarkably efficient.
The headline numbers:
- 119B total parameters, ~6B active per token (8B including embeddings)
- 256K context window—up from Small 3’s 128K
- Apache 2.0 license—the most permissive option
- Multimodal input—text and images
What sets Small 4 apart is configurable reasoning: you can toggle between fast, low-latency responses for simple tasks and deep, reasoning-intensive outputs for complex problems. Per Mistral’s reported numbers, this delivers 40% lower latency and triple the throughput of Small 3.
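Mistral has not published the exact API surface for this toggle, so the request shape below is illustrative only: the `reasoning_mode` field name and its values are assumptions for the sake of a concrete example, not a documented parameter.

```python
import json

def chat_request(prompt: str, deep: bool) -> dict:
    """Build a chat-completion payload for Mistral Small 4.

    NOTE: 'reasoning_mode' is a hypothetical parameter name used for
    illustration; check Mistral's API docs for the real toggle.
    """
    return {
        "model": "mistral-small-4",
        "messages": [{"role": "user", "content": prompt}],
        # Hypothetical toggle: "fast" for low latency, "deep" for hard problems.
        "reasoning_mode": "deep" if deep else "fast",
    }

print(json.dumps(chat_request("Plan a zero-downtime database migration.", deep=True), indent=2))
```

The point of a toggle like this is that one deployment can serve both cheap chat traffic and slow, deliberate reasoning without swapping models.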
Performance Comparison
On standard benchmarks, Small 4 positions itself between Small 3.2 and Large 3:
| Benchmark | Small 3.2 | Small 4 | Large 3 |
|---|---|---|---|
| MMLU | 80.5% | 83.2% | 85.5% |
| HumanEval | 92.9% | 94.1% | 95.8% |
| Arena Hard | 43.1% | 67.4% | 78.2% |
| IFEval | 82.3% | 88.7% | 92.1% |
The Arena Hard jump—from 43% to 67%—represents a significant improvement in real-world conversational ability.
Local Deployment
For local inference, Small 4’s MoE architecture makes the model surprisingly runnable on consumer hardware. At INT4 quantization it occupies roughly 40GB of VRAM: achievable on dual RTX 4090s (48GB combined), while a single 32GB 5090 needs a lower-bit quant or partial offload to system RAM.
Ollama support is already available: `ollama run mistral-small-4`
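Beyond the CLI, Ollama also serves a local HTTP API on port 11434, which is handy for scripting. A minimal sketch in Python, assuming the `mistral-small-4` tag above is the one your local Ollama knows:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def extract_text(raw: bytes) -> str:
    """Ollama's non-streaming reply is one JSON object; the generated
    text lives in its 'response' field."""
    return json.loads(raw)["response"]

req = build_generate_request("mistral-small-4", "Why is sparse activation fast?")
# With the server running: urllib.request.urlopen(req).read() returns JSON
# bytes, and extract_text(...) yields the model's answer.
print(json.loads(req.data)["model"])  # -> mistral-small-4
```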
DeepSeek V4: The Wait Is Over
After missed release windows in mid-February, late February, and early March, DeepSeek V4 finally launched around March 3. The developer community’s reaction has been mixed—enthusiasm about capabilities, skepticism about self-reported benchmarks.
- ~1 trillion total parameters, ~32B active
- 1 million token context window
- Native multimodal (text, image, video input; image generation)
- MIT license
V4 introduces what DeepSeek calls “Manifold-Constrained Hyper-Connections” for training stability at trillion-parameter scale, plus “Engram Conditional Memory” for efficient retrieval over million-token contexts.
Benchmark Claims (Unverified)
Leaked benchmarks suggest V4 is competitive with current frontier models:
- HumanEval: ~90% (would match Claude Opus 4.6)
- SWE-bench Verified: 80%+ (top tier for code)
- MATH: 92.4% (if accurate, best-in-class)
All V4 benchmark claims remain unverified until DeepSeek publishes official reports. The community has been burned before by inflated numbers.
The Practical Reality
V4’s 32B active parameters make it more demanding than V3’s 21B. Even with aggressive quantization, you’re looking at:
- RTX 5090 (32GB): Tight fit at INT4, limited context
- Dual 5090 (64GB): Comfortable at INT4, reasonable context
- Mac Studio M4 Ultra 512GB: Full precision possible
For most local users, V3.2 remains the practical choice. V4 is more relevant for API access or enterprise deployments.
Dual RTX 5090: Consumer Hardware Hits Enterprise Territory
The most surprising development this week came from dual GPU benchmarks. Two RTX 5090s running Ollama now match H100 performance on 70B models—at a fraction of the cost.
The numbers:
- DeepSeek-R1 70B: 33 tokens/second at 30K context
- Llama 3.3 70B: 27 tokens/second (matching H100)
- Cost comparison: 2× 5090 ($4K MSRP, $10K+ scalped) vs H100 ($30K+)
Important caveat: Ollama doesn’t parallelize inference across GPUs—it just pools VRAM. You won’t see 2× speedup from 2 cards. What you get is the ability to run larger models without spilling to CPU RAM.
For 110B+ models like Qwen 3.5 full, dual 5090s still struggle. GPU utilization caps at 20%, and inference drops to 7 tokens/second. Enterprise hardware retains its edge at the largest scales.
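These fit-or-spill results follow from simple arithmetic: a model’s weight footprint is roughly total parameters × bits ÷ 8, before KV cache and runtime overhead. A quick sanity check:

```python
def quantized_weight_gb(total_params_b: float, bits: float) -> float:
    """Back-of-envelope weight footprint in GB: params x bits / 8.

    Real quantized files vary (mixed-precision layers, metadata), and
    KV cache for long contexts adds several GB on top of this.
    """
    return total_params_b * bits / 8

# A 70B model at 4-bit needs ~35 GB for weights alone, which is why it
# spills past one 32GB 5090 but sits comfortably in 64GB of pooled VRAM.
print(quantized_weight_gb(70, 4))  # -> 35.0
```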
The Sweet Spots
Based on current benchmarks:
| Setup | Best Model Class | Tokens/sec | Notes |
|---|---|---|---|
| Single RTX 5090 | 32B dense | 61-65 | Qwen 3.5 32B optimal |
| Single RTX 5090 | 30B MoE | 234 | Qwen 3 MoE screams |
| Dual RTX 5090 | 70B quantized | 27-33 | H100 territory |
| Single RTX 4090 | 27B dense | 35-45 | Still the value king |
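To translate those rates into wall-clock feel, here is how long a 1,000-token answer takes at each setup’s decode rate from the table (prompt prefill excluded):

```python
def seconds_for(tokens: int, tok_per_sec: float) -> float:
    """Generation time at a steady decode rate (prefill not included)."""
    return tokens / tok_per_sec

# Decode rates taken from the table above.
for setup, rate in [("5090 / 32B dense", 61),
                    ("5090 / 30B MoE", 234),
                    ("dual 5090 / 70B", 33)]:
    print(f"{setup}: {seconds_for(1000, rate):.1f}s per 1,000 tokens")
# Prints roughly 16.4s, 4.3s, and 30.3s respectively.
```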
Updated Rankings
Combining leaderboard data with this week’s releases:
For Coding
- Qwen 3.5 - GPQA Diamond 88.4%, LiveCodeBench leader
- Mistral Small 4 - HumanEval 94.1%, configurable depth
- DeepSeek V4 - SWE-bench 80%+ (if benchmarks hold)
For Reasoning
- Kimi K2.5 - IFEval 94.0%, AIME 96.1%
- Qwen 3.5 - Best GPQA Diamond (88.4%)
- Llama 4 Scout - 10M context for document reasoning
For Speed/Efficiency
- Mistral Small 4 - 40% lower latency than Small 3
- Gemma 3 27B - Dense architecture, no MoE overhead
- Qwen 3.5 Small - 9B runs everywhere
For Local Deployment
- Qwen 3.5-9B - Best quality under 10B
- Mistral Small 4 Q4 - Tight fit on a single 5090
- Gemma 3 27B Q4 - 14GB with QAT
Hardware Recommendations (March 21, 2026)
RTX 5090 Owners
You have options now:
- Mistral Small 4 Q4 - The new efficiency standard
- Qwen 3.5-32B full - Best dense model at this scale
- Llama 4 Scout INT8 - When you need 10M context
RTX 4090 Owners
Still the practical sweet spot:
- Mistral Small 4 Q4 - Tight but works
- Gemma 3 27B Q4 - 14GB leaves room for context
- Qwen 3.5-9B - Quality that rivals 70B from 2024
Dual GPU Enthusiasts
If you can acquire two 5090s:
- DeepSeek-R1 70B - Full reasoning model at 33 tok/s
- Llama 3.3 70B - H100-matching inference
- Qwen 3.5-70B - The frontier, locally
Mac Users
Unified memory continues to differentiate:
- M4 Max 128GB: Llama 4 Scout usable, V4 at reduced context
- M4 Ultra 512GB: Everything fits, eventually
The Bottom Line
This week marked a shift. Mistral Small 4 proves that Apache-licensed, MoE-based models can compete with proprietary options while running on consumer hardware. DeepSeek V4’s arrival—despite benchmark skepticism—adds another trillion-parameter option to the open-weight ecosystem.
The dual 5090 benchmarks are perhaps most significant. Consumer hardware matching H100 performance on 70B models wasn’t expected this soon. Yes, you still can’t buy a 5090 at MSRP. But the performance ceiling for home labs keeps rising.
For most users, the practical action remains unchanged: Qwen 3.5-9B via Ollama handles 90% of tasks. When you need more, Mistral Small 4 and Gemma 3 27B offer excellent quality-to-resource ratios.
What to Try This Week
- Mistral Small 4 - `ollama run mistral-small-4` and test configurable reasoning
- DeepSeek V4 via API - If self-hosting is impractical, try the hosted version first
- Dual GPU owners - Benchmark DeepSeek-R1 70B at extended context
- Everyone else - Qwen 3.5-9B remains the default recommendation
Next week: We’ll see if V4’s benchmarks hold under independent testing, and whether the 5090 supply situation improves.