A few months ago, matching GPT-5 on reasoning benchmarks required an API subscription to a frontier lab. That changed this week when MiroMind AI released MiroThinker 72B, an open-source research agent that beats OpenAI’s GPT-5-high on Humanity’s Last Exam and achieves 81.9% on GAIA—putting it in frontier territory for complex reasoning tasks.
The catch? There isn’t one. The model weights are on HuggingFace, the code is on GitHub, and you can run it locally if you have the hardware.
The Numbers
MiroThinker 72B posts strong results across multiple benchmarks:
| Benchmark | MiroThinker 72B | GPT-5-high |
|---|---|---|
| GAIA-Val-165 | 81.9% | ~82% |
| Humanity’s Last Exam (HLE) | 37.7% | 35.2% |
| BrowseComp | 47.1% | — |
| BrowseComp-ZH | 55.6% | — |
The GAIA benchmark tests general AI assistant capabilities requiring reasoning, multi-modality, web browsing, and tool use. Humanity’s Last Exam (HLE) is designed to be extremely difficult—questions contributed by experts across fields to challenge AI systems. MiroThinker’s 37.7% beats GPT-5-high’s 35.2% on the text-only subset.
What Makes It Different
MiroThinker doesn’t just scale model size. The team at MiroMind AI introduced what they call “interactive scaling”—a third axis of performance improvement alongside parameter count and context length.
The concept: train the model to handle deeper and more frequent interactions with its environment. Rather than generating answers directly, MiroThinker runs verification cycles, calls tools to gather information, and refines its reasoning based on feedback. The arXiv paper shows performance gains of 8-10 points as interaction depth increases.
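The interact–verify–refine loop described above can be sketched in a few lines. Everything here is illustrative—`decide_next_action` is a stand-in for the model’s policy, not MiroThinker’s actual interface—but it shows the control flow: gather evidence via tools, feed observations back, and only commit to an answer once the evidence supports one.

```python
def decide_next_action(question, observations, force_answer=False):
    """Stand-in for the model's policy: keep searching until the
    gathered evidence contains what we need, then answer."""
    if force_answer or any("42" in obs for obs in observations):
        return {"type": "answer", "text": "42"}
    return {"type": "tool", "tool": "search", "query": question}

def run_agent(question, tools, max_tool_calls=600):
    """Interactive loop: alternate tool calls and reasoning, refining
    the answer from feedback, up to a fixed tool-call budget."""
    observations = []
    for _ in range(max_tool_calls):
        action = decide_next_action(question, observations)
        if action["type"] == "answer":
            return action["text"]
        # Execute the chosen tool and fold its output back into context.
        observations.append(tools[action["tool"]](action["query"]))
    # Budget exhausted: commit to the best answer available.
    return decide_next_action(question, observations, force_answer=True)["text"]

# Usage with a fake search tool standing in for real web access:
answer = run_agent("meaning of life?", {"search": lambda q: "the answer is 42"})
```

The `max_tool_calls=600` default mirrors the per-task tool-call budget reported for MiroThinker; a production agent would replace both stub functions with real model and tool calls.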
The technical setup supports this approach:
- 256K context window for extended reasoning chains
- Up to 600 tool calls per task for deep research
- Recency-based context retention that preserves recent observations while managing memory
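The retention policy in the last bullet can be sketched as a token-budgeted sliding window. This is an assumption about the general technique (drop oldest observations first), not MiroMind’s exact implementation, and the 4-characters-per-token estimate is a stand-in for a real tokenizer:

```python
def retain_recent(observations, budget=256_000,
                  count_tokens=lambda s: len(s) // 4):
    """Keep the most recent observations that fit in the token budget.

    Recency-based retention: older tool outputs are evicted first so
    the newest evidence always stays inside the context window.
    """
    kept, used = [], 0
    for obs in reversed(observations):   # walk newest-first
        cost = count_tokens(obs)
        if used + cost > budget:
            break                        # everything older is dropped
        kept.append(obs)
        used += cost
    return list(reversed(kept))          # restore chronological order

# Ten observations, but a budget that only fits the newest three:
history = [f"obs-{i}: " + "x" * 100 for i in range(10)]
window = retain_recent(history, budget=100)
```

With a 256K-token budget the window is large, but on research tasks running hundreds of tool calls, eviction of stale observations is what keeps the loop going.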
The training pipeline combines supervised fine-tuning, direct preference optimization (DPO), and group relative policy optimization (GRPO) reinforcement learning. The result is a model that learns when to explore versus when to commit to an answer.
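The GRPO stage is worth a closer look, since it is what removes the need for a separate value network: each sampled response is scored relative to the mean and standard deviation of its own group. A minimal sketch of that advantage computation (the standard GRPO formulation; MiroThinker’s exact reward setup is in the paper):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize each response's reward
    against its own group's mean and std, so responses compete with
    their siblings instead of a learned baseline."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, scored 1.0 (correct) / 0.0 (wrong):
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct samples get positive advantage, incorrect ones negative, and the advantages in each group sum to zero—the policy is pushed toward whatever distinguished the winners within that group.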
Why This Matters
Open-source models have been narrowing the gap with commercial alternatives for a while now. What makes MiroThinker notable is where it’s competitive: complex reasoning and research tasks that require sustained multi-step analysis.
Previous open models excelled at simpler benchmarks while struggling on tasks requiring extended chains of thought. MiroThinker’s interactive scaling approach suggests a path forward that doesn’t require ever-larger parameter counts—it requires smarter interaction patterns.
For researchers and developers, this means frontier-level research capabilities without API costs or usage restrictions. For organizations concerned about data privacy, it means keeping sensitive research queries on local infrastructure.
Running It Yourself
MiroThinker comes in multiple sizes:
- MiroThinker-1.7-mini (30B) — Fits on consumer GPUs with quantization
- MiroThinker-72B — Requires ~140GB VRAM at full precision, or ~40GB quantized
- MiroThinker-235B — Datacenter scale
The 72B version is the sweet spot for most users with capable hardware; a quantized build should run on dual RTX 3090s, or on a single RTX 4090 with more aggressive quantization.
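The VRAM figures above follow from simple arithmetic—parameter count times bytes per parameter. A back-of-the-envelope estimate (weights only; KV cache and activations add overhead that grows with context length):

```python
def weight_vram_gb(n_params_b, bits_per_param):
    """Rough VRAM needed just for the model weights, in decimal GB.
    Excludes KV cache and activation memory, which grow with context."""
    bytes_total = n_params_b * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

fp16 = weight_vram_gb(72, 16)   # full precision (bf16/fp16): 144 GB
int4 = weight_vram_gb(72, 4)    # 4-bit quantization: 36 GB
```

That 144 GB at full precision lines up with the ~140 GB quoted above, and 36 GB for 4-bit weights explains why the quantized figure lands near 40 GB once runtime overhead is included.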
The GitHub repository includes tool integrations for web search, code execution, and file operations. You’ll need to configure API keys for any external tools you want the agent to use.
What This Means
The frontier model moat is leaking. MiroThinker demonstrates that sophisticated training approaches can compensate for smaller parameter counts, and that open-source projects can deliver research-grade reasoning capabilities.
This doesn’t mean GPT-5 is obsolete—OpenAI’s model still leads on many benchmarks and offers convenience that local deployment can’t match. But for research tasks where you need transparency, reproducibility, or privacy, MiroThinker is now a credible alternative.
The broader trend: AI reasoning capabilities are becoming commoditized. Models that seemed impossibly advanced a year ago now run on prosumer hardware. If you’re building applications that require complex reasoning, the cost of that capability just dropped significantly.
What You Can Do
If you have capable hardware:
- Download from HuggingFace
- Start with the 30B version to test the workflow
- Configure tool integrations for your research needs
If you don’t have the hardware:
- Wait for hosted inference options (several providers are already setting up MiroThinker endpoints)
- Try the online demo to evaluate capabilities
For security-sensitive research:
- MiroThinker’s tool-calling capabilities introduce the same risks as any AI agent with system access
- Sandbox thoroughly before connecting to production systems
- The model has no built-in safeguards against harmful queries—that’s your responsibility
The message is clear: the era of open-source frontier AI agents has arrived. Whether that’s exciting or concerning depends on your perspective. Either way, it’s happening.