On December 22, 2025, Chinese AI company Zhipu AI released GLM-4.7. Within weeks, it was outscoring Claude Sonnet 4.5 on tool-use benchmarks. Access costs $3 per month. The lightweight version runs on a laptop.
This is what the AI race looks like now: frontier-competitive models from China, open weights, consumer pricing.
The Benchmarks That Matter
GLM-4.7 isn’t just competitive — it leads in several categories that matter for real-world coding work:
Coding Performance
| Benchmark | GLM-4.7 | DeepSeek-V3.2 | Best Competitor |
|---|---|---|---|
| SWE-bench Verified | 73.8% | 73.1% | MiMo-V2-Flash (73.4%) |
| LiveCodeBench-v6 | 84.9% | 83.3% | Kimi K2 (83.1%) |
| τ²-Bench (tool use) | 84.7% | — | Claude Sonnet 4.5 (lower) |
SWE-bench measures ability to resolve real GitHub issues. LiveCodeBench tests algorithmic reasoning. τ²-Bench evaluates multi-step tool invocation — the kind of work AI coding assistants actually do.
On that last metric, GLM-4.7 beats Claude Sonnet 4.5 and sets the open-source state of the art.
Math and Reasoning
| Benchmark | Score |
|---|---|
| AIME 2025 | 95.7% |
| HMMT Feb 2025 | 97.1% |
| GPQA-Diamond | 85.7% |
| Humanity’s Last Exam | 42.8% |
The 42.8% on Humanity’s Last Exam is a 41% relative improvement over GLM-4.6. This benchmark tests complex reasoning with tool use — exactly the capability that makes AI useful for serious work.
What Makes GLM-4.7 Different
Native Chain-of-Thought
Most models require prompting to “think step by step.” GLM-4.7 has reasoning built into its inference cycle. Zhipu calls this “Interleaved Thinking” — the model automatically pauses to reason through complex problems before responding.
You don’t have to ask it to think. It just does.
Preserved Thinking Across Turns
Here’s the feature that matters for coding assistants: GLM-4.7 maintains its reasoning chains across multiple conversation turns instead of resetting.
Anyone who’s used AI coding tools knows the frustration: you explain a complex codebase, the AI understands it, then by the next message it has forgotten everything and you start over. GLM-4.7 specifically targets this problem.
Turn-Level Thinking Control
Developers can control when and how deeply the model reasons. For quick tasks, skip the deep thinking. For complex multi-file refactors, let it work through the problem thoroughly.
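A minimal sketch of what that toggle looks like in practice. It follows the `thinking` parameter exposed by Zhipu's GLM-4.5/4.6 chat completions API; the endpoint, model name, and field names are assumptions for GLM-4.7, so verify them against the current docs.

```bash
# Hypothetical sketch: per-request reasoning control via the
# "thinking" parameter from the GLM-4.5/4.6 API. Endpoint and
# model name are assumptions for GLM-4.7.

# Quick task: skip the deep reasoning pass.
curl -s https://api.z.ai/api/paas/v4/chat/completions \
  -H "Authorization: Bearer $ZAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.7",
    "thinking": {"type": "disabled"},
    "messages": [{"role": "user", "content": "Rename usr_nm to user_name in this function."}]
  }'

# Complex multi-file refactor: let it reason through the problem first.
curl -s https://api.z.ai/api/paas/v4/chat/completions \
  -H "Authorization: Bearer $ZAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.7",
    "thinking": {"type": "enabled"},
    "messages": [{"role": "user", "content": "Refactor this module to break the circular import."}]
  }'
```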
The Specs
| Specification | Value |
|---|---|
| Parameters | ~400 billion |
| Context Window | 200,000 tokens |
| Max Output | 128,000 tokens |
| Inference Speed | 55 tokens/second |
| Weights | Open (Hugging Face, ModelScope) |
| API Access | $3/month via chat.z.ai |
That 200K context window with 128K output is massive. For comparison, most models top out at 4K-8K output tokens. This matters for generating entire files or large refactors.
GLM-4.7-Flash: Run It Locally
Released January 19, 2026, GLM-4.7-Flash brings this capability to consumer hardware.
Architecture
- 30 billion parameters total
- 3 billion active per token (Mixture of Experts)
- Optimized for RTX 3090 and Apple Silicon
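Those targets are plausible on the arithmetic alone: a 4-bit quantization needs roughly 4.5 bits per weight once you include scaling metadata (a typical GGUF Q4 average, not an official figure), so the weights fit inside an RTX 3090's 24 GB, and only the ~3 billion active parameters are touched per token, which is what keeps generation fast. A back-of-envelope check:

```bash
# Rough weight-memory estimate for a 4-bit quantized 30B model.
# 4.5 bits/weight is an assumed Q4-class average including
# quantization overhead, not a published spec.
awk 'BEGIN {
  params = 30e9            # total parameters
  bits_per_weight = 4.5    # assumed Q4-class quantization
  gb = params * bits_per_weight / 8 / 1e9
  printf "~%.1f GB of weights, inside a 24 GB RTX 3090\n", gb
}'
# => ~16.9 GB of weights, inside a 24 GB RTX 3090
```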
Real-World Performance
| Hardware | Speed |
|---|---|
| M4 Max MacBook Pro | 82 tokens/sec |
| Various configs | 43-81 tokens/sec |
For context, that’s faster than most cloud API responses once you account for network latency. Local inference with frontier-competitive quality.
How to Run It
The model is available on Hugging Face. With Ollama or LM Studio:
```bash
# If available in Ollama library
ollama run glm4.7-flash

# Or download GGUF from Hugging Face and load in LM Studio
```
Check the Hugging Face model card for the latest quantized versions optimized for your hardware.
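Once the model is pulled, Ollama serves an OpenAI-compatible API on localhost, so you can smoke-test it with curl. The `glm4.7-flash` tag is a placeholder for whatever name the model actually lands under:

```bash
# Smoke-test the local model via Ollama's OpenAI-compatible endpoint.
# "glm4.7-flash" is a placeholder tag; use the name you actually pulled.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm4.7-flash",
    "messages": [{"role": "user", "content": "Reverse a string in Python, one line."}]
  }'
```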
The $3 vs $200 Question
Techloy ran a piece titled “America’s $200 AI Coding Tool Just Met a $3 Chinese Rival.” The comparison is stark:
Premium AI Coding Assistants:
- GitHub Copilot: $19/month (individual), $39/month (business)
- Cursor Pro: $20/month
- Claude Pro: $20/month
GLM-4.7:
- API access: $3/month
- Local (Flash): Free beyond the one-time hardware cost
The quality gap that justified premium pricing is narrowing. When an open-weight model beats Claude on tool-use benchmarks, the value proposition shifts.
What This Means
For Developers
You have options now. GLM-4.7 integrates with Claude Code, Cline, OpenCode, and Roo Code. You can swap backends without changing workflows. The Flash version means you can code with AI assistance on a plane, in a coffee shop with bad WiFi, anywhere.
For the Industry
The moat around frontier AI capabilities is eroding faster than anyone predicted. Six months ago, matching Claude on benchmarks required billion-dollar training runs. Now it requires downloading weights from Hugging Face.
For Privacy
Open weights mean you can inspect what the model does. Local inference means your code never leaves your machine. For proprietary codebases, this matters enormously.
Chinese origin may raise concerns for some use cases. But the weights are public, the architecture is documented, and local inference doesn’t phone home. The privacy calculus is better than cloud APIs regardless of who trained the model.
The Caveats
Benchmarks aren’t everything. Real-world performance on your specific codebase may differ. The 400B parameter full model requires serious hardware to self-host. Chinese models may have different content policies and training biases than Western alternatives.
But the trend is clear: frontier-quality AI is becoming a commodity. The question isn’t whether open models will catch up — they already have on key metrics. The question is what that means for the companies charging premium prices for capabilities you can now run locally.
Try It
Cloud: Sign up at chat.z.ai ($3/month)
Local: Download GLM-4.7-Flash from Hugging Face, run via Ollama or LM Studio
In Your Editor: Configure Claude Code, Cline, or your preferred coding assistant to use the GLM-4.7 API endpoint
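For Claude Code specifically, the usual pattern for swapping backends is a pair of environment variables. The sketch below follows Zhipu's published Anthropic-compatible endpoint, but treat the URL and variable names as assumptions to confirm against the current docs:

```bash
# Point Claude Code at GLM-4.7 instead of Anthropic's API.
# URL follows Zhipu's documented Claude Code integration; verify
# against the current docs before relying on it.
export ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic"
export ANTHROPIC_AUTH_TOKEN="your-zai-api-key"   # placeholder key
claude   # launch Claude Code as usual; requests now go to GLM-4.7
```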
The future of AI-assisted coding might not be a $200/month subscription. It might be a model running on your laptop, trained in China, beating American frontier models on the benchmarks that matter.
Welcome to 2026.