The week of March 18-25 may be remembered as the moment open-source AI stopped playing catch-up. NVIDIA released local-first agent infrastructure at GTC. OpenAI shipped Apache 2.0 models for the first time since GPT-2. And a 9-billion-parameter model from Alibaba beat competitors thirteen times its size.
NVIDIA Goes Local-First with Nemotron 3
At GTC 2026, NVIDIA announced Nemotron 3 Super—a 120 billion parameter model with only 12 billion parameters active per token during inference. The design choice matters: a Mixture-of-Experts architecture delivers large-model capability at small-model compute cost.
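The compute savings come from routing: a small gating network picks a handful of experts for each token, and only those experts' weights actually run. A minimal sketch of top-k routing (toy sizes, illustrative only—not Nemotron's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS, TOP_K, D = 8, 2, 16   # toy sizes; production models are far larger

# Each "expert" is a small feed-forward weight matrix.
experts = [rng.standard_normal((D, D)) for _ in range(N_EXPERTS)]
router = rng.standard_normal((D, N_EXPERTS))  # the gating network

def moe_forward(x):
    """Route one token vector x through only TOP_K of the N_EXPERTS."""
    logits = x @ router
    top = np.argsort(logits)[-TOP_K:]        # indices of the chosen experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over chosen experts only
    # Only TOP_K expert matmuls run, not N_EXPERTS: that is the compute saving.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.standard_normal(D))
print(y.shape)  # (16,)
```

With 2 of 8 experts active, per-token compute scales with the 2 chosen experts while capacity scales with all 8—the same trade Nemotron 3 Super makes at 12B-active-of-120B scale.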
The model runs locally through NemoClaw, NVIDIA’s new open-source agent framework. The pitch is direct: run autonomous AI agents on RTX PCs and DGX systems without paying per-token cloud fees.
On PinchBench—a benchmark for agentic task performance—Nemotron 3 Super scored 85.6%, making it the top open model in its class. Multi-Token Prediction delivers 3x faster inference by predicting several tokens per forward pass rather than one.
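A rough sense of where a ~3x figure can come from: if the model drafts K tokens per pass and a verification step accepts each draft with some probability, the expected tokens per pass follows simple accounting. The accept rate and K below are hypothetical, not NVIDIA's numbers:

```python
# Toy throughput model for multi-token prediction with draft-and-verify.
# Assumptions (not from NVIDIA): K draft tokens per pass, each accepted
# independently with probability p, checked left to right; the first
# rejected position is replaced by the verifier's own token, so every
# pass yields at least one token.

def expected_tokens_per_pass(k: int, p: float) -> float:
    # E[accepted drafts] = sum over i of P(first i drafts all accepted),
    # plus the one guaranteed token from the verifier.
    return sum(p ** i for i in range(1, k + 1)) + 1

baseline = 1.0                          # one token per pass without MTP
mtp = expected_tokens_per_pass(k=4, p=0.8)
print(round(mtp / baseline, 2))         # 3.36
```

Under these toy assumptions, 4 drafts at an 80% accept rate yield about 3.4 tokens per forward pass—the same ballpark as the claimed 3x.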
The models are available now through Ollama, LM Studio, and llama.cpp.
Why it matters: NVIDIA is betting that local inference will matter more than cloud inference. That’s a significant strategic shift from the company that sells most of its chips to hyperscalers.
OpenAI Returns to Open Source
OpenAI released gpt-oss-120b and gpt-oss-20b—its first open-weight models since GPT-2 in 2019. Both ship under Apache 2.0, allowing commercial use and derivative works without restriction.
The architecture uses Mixture-of-Experts: gpt-oss-120b activates 5.1 billion parameters per token (from 117B total), and gpt-oss-20b activates 3.6 billion (from 21B total). The 120B model achieves near-parity with o4-mini on reasoning benchmarks while running on a single 80GB GPU. The 20B model runs on devices with just 16GB of memory.
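The single-80GB-GPU claim is plausible as back-of-the-envelope arithmetic if the weights ship in a 4-bit format (OpenAI's release materials mention MXFP4 quantization; the sketch below ignores KV cache, activations, and any layers kept at higher precision):

```python
# Rough memory check: does a 117B-parameter model fit in 80 GB at 4 bits?
params_120b = 117e9           # total parameters in gpt-oss-120b
bytes_per_param_4bit = 0.5    # 4-bit quantization (e.g. MXFP4)

weights_gb = params_120b * bytes_per_param_4bit / 1e9
print(round(weights_gb, 1))   # 58.5 -- under 80 GB, with headroom for KV cache
```

At 16-bit precision the same weights would need ~234 GB, which is why the quantized format is what makes single-GPU deployment possible.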
Both models include adjustable reasoning effort—low, medium, or high—letting developers trade latency for accuracy.
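In practice the knob is a line in the system prompt. A hedged sketch of how a developer might build the request—the `Reasoning: <level>` convention follows OpenAI's published prompt format for these models, but verify against the model card before relying on it:

```python
# Sketch: toggling gpt-oss reasoning effort via the system prompt.
# The "Reasoning: <level>" line is taken from OpenAI's published prompt
# format for gpt-oss; confirm against the model card before shipping.

VALID_EFFORTS = ("low", "medium", "high")

def build_messages(user_prompt: str, effort: str = "medium") -> list[dict]:
    """Build a chat payload with the requested reasoning effort."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}")
    return [
        {"role": "system", "content": f"Reasoning: {effort}"},
        {"role": "user", "content": user_prompt},
    ]

msgs = build_messages("Prove that sqrt(2) is irrational.", effort="high")
print(msgs[0]["content"])  # Reasoning: high
```

Low effort answers faster; high effort spends more tokens thinking before responding—the latency/accuracy trade the release describes.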
Why it matters: OpenAI built its early reputation on open research before pivoting to closed models in 2020. This release signals a competitive response to Meta’s Llama success and the broader shift toward open-weight models in enterprise deployments.
Qwen 3.5: Small Models, Big Results
Alibaba’s Qwen 3.5 Small series includes four models at 0.8B, 2B, 4B, and 9B parameters. The 9B model produced the headline: it beats gpt-oss-120b on MMLU-Pro (82.5 vs 80.8), GPQA Diamond (81.7 vs 80.1), and multilingual MMMLU (81.2 vs 78.2).
That’s a 9B model outperforming one with 13x its total parameter count—though since gpt-oss-120b activates only 5.1B parameters per token, the per-token compute gap is smaller than the headline number suggests.
Every model is natively multimodal, supporting text, images, and video through the same weights without separate vision adapters. All four ship under Apache 2.0.
On video understanding benchmarks, the 9B scores 84.5 on Video-MME, significantly ahead of Gemini 2.5 Flash-Lite at 74.6.
Why it matters: Efficient small models matter more than ever for on-device deployment. If a 9B model can match or beat 100B+ models on key benchmarks, the economics of AI deployment shift dramatically.
ByteDance Opens Deer-Flow
Deer-Flow is ByteDance’s new open-source agent architecture for tasks that span “several minutes to multiple hours.” The framework handles research, coding, and creative production through hierarchical sub-agents that can work in parallel or sequence.
Four foundational pillars:
- Sandboxing: Controlled environments for safe code execution
- Memory systems: Persistent state that keeps agents consistent across long-running tasks
- Tools and skills: Domain-specific capabilities
- Hierarchical sub-agents: Task decomposition and coordination
The main agent can deploy sub-agents for specific project segments—essential for large-scale work where multiple steps require coordination.
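The decomposition pattern itself is easy to sketch: independent segments fan out in parallel, then a dependent step consumes their results. A toy illustration of that split (not Deer-Flow's actual API; each sub-agent here is a stub where a real framework would make an LLM call):

```python
from concurrent.futures import ThreadPoolExecutor

def sub_agent(task: str) -> str:
    """Stub sub-agent; in a real framework this wraps a model call."""
    return f"result({task})"

def main_agent(goal: str) -> str:
    # 1. Decompose the goal into segments (hard-coded here; a real
    #    planner model would produce this list).
    segments = [f"{goal}/research", f"{goal}/code", f"{goal}/review"]

    # 2. Run the independent segments in parallel...
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(sub_agent, segments[:2]))

    # 3. ...then run the dependent step sequentially on their combined output.
    return sub_agent(f"{segments[2]} given {partials}")

out = main_agent("ship-feature")
print(out)
```

The same shape—plan, fan out, join, continue—is what lets a hierarchical framework keep multi-hour tasks coordinated instead of serializing every step.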
Why it matters: Long-running autonomous agents are the current frontier. Deer-Flow provides infrastructure for the kind of multi-hour tasks that most frameworks can’t handle reliably.
GitHub Trends
Beyond the major releases, the broader open-source ecosystem continues growing:
- Ollama crossed 162,000 stars
- Open WebUI hit 124,000 stars with 282 million downloads
- Dify reached 130,000 stars
- Obra/superpowers, the agentic skills framework, sits at 92,100 stars
OpenClaw remains the breakout story of 2026, now past 210,000 stars after going viral in January. But the security incidents around it—covered elsewhere—show the risks of rapid open-source adoption.
What This Means
The pattern this week: major players are racing to open-source competitive models.
For developers: Local inference options have never been better. Nemotron 3 through NemoClaw, gpt-oss through standard inference frameworks, and Qwen 3.5 through essentially anything—you can run production-grade models without cloud dependencies.
For enterprises: The Apache 2.0 licensing on these releases removes commercial restrictions. You can fine-tune, deploy, and modify without legal complexity.
For privacy: Local models mean your data never leaves your infrastructure. As cloud AI faces increasing scrutiny, local deployment becomes a competitive advantage.
What You Can Do
Try Nemotron 3 locally:
ollama run nemotron3-super
Run gpt-oss on your hardware: Both models work through Ollama, LM Studio, and llama.cpp. The 20B model fits on consumer GPUs with 16GB VRAM.
Test Qwen 3.5 multimodal: The 9B model handles text, images, and video. Available through Hugging Face with Apache 2.0 licensing.
Explore Deer-Flow: If you’re building agents that need to run for hours, ByteDance’s hierarchical architecture provides a starting point.
Open source isn’t waiting for permission anymore. The best models are being released openly, and the infrastructure to run them locally is mature. The question isn’t whether to use open models—it’s which ones fit your use case.