Forward looking: How powerful the foundational models will be at the end of 2026.

Most AI predictions fail for one of two reasons: they are too vague to test, or too dramatic to be useful.

This post makes exactly one prediction, and makes it testable.

Conclusion First

Prediction (two parts, same date): By the end of 2026, frontier foundation model-agent systems will:

Generate research-level math proofs better than the median professional mathematician.
Verify math proofs better than the median professional mathematician.

Important scope note: this prediction is about natural-language mathematical proving and reviewing, not formal proof workflows in systems like Lean.

Why This Is A Real Possibility

In 2025, we saw models move from frequent failure on hard mathematical reasoning tasks to nontrivial success on problems that require long chains of formal thought.

A concrete anchor is arXiv:2601.23245: an agent based on Gemini Deep Think (2025 release) was reported to produce genuinely creative mathematical computation in proving new results.

Mathematics is the key domain because it has strict correctness standards:

Statements are explicit.
Proofs can be checked step by step.
Contradictions eventually surface.

If models can repeatedly produce correct nontrivial proofs and diagnose proof errors with high precision, that is a capability jump, not just stylistic fluency.

I also expect the rate of model progress in math to accelerate from 2025 to 2026. The systems are stronger, training and evaluation loops are improving (less bugs), model-assisted data generation and critique create compounding gains, and model can help improve its own training and evaluation pipeline (codex-5.3 helps its own training).

Part 1: Proof Generation Better Than The Median Mathematician

By December 2026, on hard unseen mathematics problems, frontier agentic systems should:

Produce more correct nontrivial proofs than a randomly sampled professional mathematician under similar time budgets.
Recover from failure and repair proof plans through iterative self-critique.
Sustain quality across multiple areas (for example algebra, combinatorics, and analysis), not just one benchmark.

Why Verification May Shift From Humans To AI

Historically, humans were the final safety layer for mathematical correctness. By end of 2026, that may no longer be true in many workflows.

Another key point: if a model can already generate creative, hard, long proofs, it must already have nontrivial internal verification ability. Without some ability to check consistency across many dependent steps, it cannot reliably produce those long-form arguments at all.

Part 2: Proof Verification Better Than The Median Mathematician

By December 2026, frontier systems should verify proofs better than a randomly sampled professional mathematician, especially on long technical arguments.

Reasons this can happen:

AI verifiers can run many independent checking passes at low cost.
AI systems can cross-check proofs using different representations (natural language, symbolic steps, and formalized fragments, or use Lean).
Verification loops can be automated and repeated until contradictions are exhausted.
Human reviewers are sparse, slow, and inconsistent on long technical arguments.

If AI systems become better at both finding subtle errors and avoiding reviewer fatigue, human verification becomes optional for a large fraction of advanced proof workflows.

What Would Falsify This Prediction

By December 2026, this prediction is wrong if we do not see robust evidence for both parts:

Generation: systems do not beat the median professional mathematician on hard unseen proof tasks.
Verification: systems do not verify advanced proofs more reliably than the median professional mathematician.
Output quality at scale: no flood of AI-generated proofs in 2026 with most outputs solid under technical review.

Failure signs would include:

impressive one-off demos that do not replicate,
strong novelty but weak correctness under audit,
no clear 2026 surge in AI-generated proofs, or a surge dominated by low-quality proofs,
verifier performance collapsing on long or unfamiliar proofs.