A year ago, AI models attempting the US Math Olympiad produced answers riddled with circular arguments, unsupported guesses, and incoherent structure. In 2026, GPT-5.4 scored 95.24% on USAMO, essentially saturating one of the hardest mathematical reasoning benchmarks in existence.
The jump from guessing to proof-writing took twelve months.
## The Results
MathArena’s USAMO 2026 evaluation tested leading AI models on six problems designed for the top 250 high school mathematicians in the United States. These problems require multi-step proofs, not pattern matching—the kind of rigorous reasoning that AI has historically struggled with.
The leaderboard:
| Model | Score |
|---|---|
| GPT-5.4 (xhigh) | 95.24% |
| Gemini 3.1 Pro | 74.4% |
| Claude Opus 4.6 | 47.0% |
| Step-3.5-Flash (open source) | 44.6% |
GPT-5.4 produced complete, well-structured proofs for nearly every problem. Its only significant failure was Problem 5, where it incorrectly argued the statement was false and constructed an invalid counterexample.
For context: on USAMO 2025, even the best models exhibited “basic errors, including circular arguments, unsupported guesses, and very poor overall structure.” The same evaluation methodology now shows models generating complete mathematical proofs rather than fumbling through steps.
## Problem-by-Problem Breakdown
The six USAMO problems tested different mathematical domains:
Problem 1 (Average: 70.8%): Required the rearrangement inequality. Most models struggled with this classical technique.
Problem 2 (Average: 40.5%): The hardest combinatorial problem. Successful solutions required the Kraft-McMillan inequality and existential proofs—not strategies that can be brute-forced.
Problem 3 (Average: 25.0%): The most difficult geometry problem. Only GPT-5.4 achieved full marks. Models defaulted to coordinate-bashing rather than synthetic geometric approaches.
Problem 4 (Average: 93.5%): The easiest problem, involving digit constraints. Nearly all models succeeded through carry-propagation analysis.
Problem 5 (Average: 58.3%): Required identifying Miquel points and spiral similarity arguments.
Problem 6 (Average: 49.4%): Connected Fibonacci numbers and Euler’s totient function. Required familiarity with Vieta Jumping, a technique even many human competitors find challenging.
The variation matters. GPT-5.4 excels at algebraic manipulation but shares other models’ weakness in synthetic geometry. These gaps help calibrate what “95%” actually means.
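For readers unfamiliar with the techniques named above, here are the two classical inequalities in their standard textbook form (statements only; these are not the competition problems themselves):

```latex
% Rearrangement inequality: for a_1 <= ... <= a_n, b_1 <= ... <= b_n,
% and any permutation \sigma of {1, ..., n},
\sum_{i=1}^{n} a_i b_{n+1-i} \;\le\; \sum_{i=1}^{n} a_i b_{\sigma(i)} \;\le\; \sum_{i=1}^{n} a_i b_i

% Kraft-McMillan inequality: any uniquely decodable code over an
% alphabet of size r with codeword lengths \ell_1, ..., \ell_n satisfies
\sum_{i=1}^{n} r^{-\ell_i} \;\le\; 1
```

Competition problems rarely ask for these results directly; the difficulty lies in recognizing that a problem reduces to them.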
## How the Evaluation Works
MathArena’s methodology addresses the persistent problem of AI judges favoring their own outputs.
The team built a semi-automated grading pipeline using three judge models: GPT-5.4, Gemini 3.1 Pro, and Opus 4.6. Each solution was reformatted to remove stylistic differences, then evaluated by all three judges. When scores agreed within two points, the minimum was taken. When they diverged, human reviewers adjudicated.
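A minimal sketch of that aggregation rule, assuming integer scores on the standard 0–7 olympiad scale (the exact scale and judge interface are not specified in the write-up, so both are assumptions here):

```python
def aggregate(scores, agreement_window=2):
    """Combine per-judge scores for one solution.

    scores maps judge name -> integer score (assumed 0-7, the
    standard olympiad scale).  If all judges agree within
    `agreement_window` points, take the minimum score; otherwise
    flag the solution for human adjudication.
    Returns (final_score_or_None, needs_human).
    """
    values = list(scores.values())
    if max(values) - min(values) <= agreement_window:
        return min(values), False
    return None, True
```

Taking the minimum rather than the mean is the conservative choice: a solution only earns a point if every judge would grant it.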
The human verification found the automated pipeline required minimal correction—final scores shifted by at most two points, and only for three solutions. But it also revealed bias: Gemini 3.1 Pro and Opus 4.6 both inflated scores for their own outputs, while GPT-5.4 was the most reliable judge with minimal self-bias.
This matters for anyone relying on AI to evaluate AI. Self-preference isn’t just a theoretical concern; it showed up in rigorous testing.
## What This Means
Three implications:
Mathematical reasoning is no longer a hard capability barrier. USAMO problems are designed to separate mathematically gifted humans from everyone else. If GPT-5.4 can earn 95% of the available marks with valid proofs, the question shifts from “can AI do hard math?” to “what remains that AI can’t do?”
The open-source gap is real. Step-3.5-Flash, the strongest open model, scored 44.6%—less than half GPT-5.4’s performance. For researchers, educators, or anyone who can’t pay for frontier API access, the capability ceiling is significantly lower.
Benchmark saturation is arriving faster than expected. USAMO was supposed to be hard. When a benchmark gets saturated, we need harder benchmarks—but creating olympiad-level problems is expensive and slow. The evaluation infrastructure for AI capabilities is struggling to keep pace with the capabilities themselves.
## The Caveats
Before declaring that AI has “solved” mathematical reasoning:
USAMO problems are known formats. Even fresh problems follow predictable structures. The training data for frontier models includes decades of competition mathematics. Whether GPT-5.4 can handle genuinely novel mathematical questions—the kind research mathematicians work on—remains unclear.
Proofs aren’t always elegant. Multiple reviewers noted that AI-generated solutions prefer coordinate-bashing and algebraic manipulation over conceptual insight. The proofs work, but they’re not necessarily how a skilled mathematician would approach the problem.
95% isn’t 100%. GPT-5.4 still made a significant error on Problem 5, arguing confidently for an invalid counterexample. In domains where mathematical correctness matters—formal verification, theorem proving for critical systems—that error rate matters.
## What You Can Do
If you work in math education: The calculus on AI detection just changed again. Students now have access to models that can write valid olympiad-level proofs, not just copy homework answers.
If you build AI applications: Mathematical reasoning capabilities have improved faster than most roadmaps predicted. Tasks involving multi-step logical deduction that seemed out of reach a year ago may be worth revisiting.
If you evaluate AI capabilities: Self-preference bias in LLM judges is measurable and significant. Any evaluation pipeline using AI judges should either use the strongest model (currently GPT-5.4) or implement multi-model jury approaches with human verification.
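To make the self-preference check concrete, one simple measurement is, for each judge, the mean score it assigns to its own model’s outputs minus the mean it assigns to everyone else’s. This is an illustrative sketch, not MathArena’s actual pipeline; the data layout is assumed:

```python
from statistics import mean

def self_preference_bias(grades):
    """Estimate self-preference per judge.

    grades is a list of (judge, solution_author, score) triples.
    Returns, for each judge, the mean score it gives its own
    model's solutions minus the mean it gives other models';
    positive values indicate self-preference.
    """
    bias = {}
    for judge in {j for j, _, _ in grades}:
        own = [s for j, a, s in grades if j == judge and a == judge]
        others = [s for j, a, s in grades if j == judge and a != judge]
        if own and others:  # need both populations to compare
            bias[judge] = mean(own) - mean(others)
    return bias
```

A persistently positive value for a judge is exactly the pattern MathArena reported for Gemini 3.1 Pro and Opus 4.6.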
The ceiling moved. Plan accordingly.