AI for Mathematics — Proving, Formalizing & Conjecturing
Mathematics is the cleanest stress-test for machine reasoning: a claim is true or false, and a formal proof is mechanically checkable. That property makes math a flagship sub-discipline of ai-for-science — unlike wet-lab biology, the verifier is free, exact, and incorruptible. The 2024–2026 frontier splits into two research cultures that are now converging: informal large-language-model (LLM) reasoning that writes human-style natural-language proofs, and formal systems that emit machine-verified proofs in Lean, Isabelle, or Coq. The dominant winning pattern across both is neuro-symbolic: an LLM proposes, a symbolic engine verifies, and a search loop closes the gap.
This module covers the AI/ML angle only — the learning systems, their training loops, and what their benchmark numbers do and do not mean.
1. Theorem proving
Formal proving (machine-checked)
AlphaProof (see alphaproof) is Google DeepMind's flagship formal prover. It couples a pre-trained language model with the AlphaZero reinforcement-learning (RL) algorithm, operating inside the Lean proof environment so that every candidate proof is automatically verified [1]. Training relied on autoformalizing roughly 80 million natural-language problem statements into Lean to bootstrap an RL self-play loop [1]. At the 2024 International Mathematical Olympiad (IMO), AlphaProof — together with AlphaGeometry 2 — solved four of six problems for 28 out of 42 points, exactly the silver-medal threshold and one point shy of gold; this was the first time any AI reached an IMO medal tier [1]. The full method was published in Nature in 2025 [2]. AlphaProof handled the algebra and number-theory problems; geometry went to AlphaGeometry 2.
AlphaGeometry / AlphaGeometry2 is a specialist neuro-symbolic geometry system: a language model proposes auxiliary constructions, a symbolic deduction engine grinds out consequences. AlphaGeometry2 raised the IMO-2000–2024 geometry solve rate to 84% (from 54% for the first version), surpassing the average human gold medalist on that problem set [3]. At IMO 2025 it solved the geometry problem (P2) directly from natural language in about 20 seconds [3].
DeepSeek-Prover is the leading open-weight formal prover. DeepSeek-Prover-V2 (671B, April 2025) trains an RL proof generator against a Lean 4 verifier, using DeepSeek-V3 to decompose theorems into subgoals before recursive proof search [4]. It reaches 88.9% pass rate on the miniF2F-test benchmark (up from ~63.5% for V1.5) and solved 49 of 658 problems on PutnamBench, with weights released openly [4]. Kimina-Prover is a parallel open effort applying RL to large formal-reasoning models, shipping an open-sourced Kimina-Autoformalizer for natural-language → Lean 4 translation [5].
Informal proving (natural-language, LLM)
By 2025 frontier LLMs began matching elite humans in informal olympiad reasoning. Gemini Deep Think (advanced version) officially scored 35/42 at IMO 2025 — solving five of six problems, the first AI to receive an officially verified gold score, working end-to-end in natural language inside the 4.5-hour window [6]. OpenAI separately reported a 35/42 result with an experimental model but without formal IMO certification [6]. These are informal proofs graded by humans, not machine-checked, which is the central caveat distinguishing them from AlphaProof's formal output.
2. Autoformalization
Autoformalization is the translation of natural-language mathematics into a formal language (usually Lean 4) so it can be machine-verified. It is the load-bearing bridge between the informal and formal cultures — and currently the weakest link. AlphaProof's 80M-statement training pipeline is itself an autoformalization engine [1].
Benchmarks such as ProofNet (undergraduate-level autoformalization + proving) and its corrected variant ProofNet#, plus miniF2F, are standard test beds. The honest state of the art: even frontier models struggle to emit well-typed Lean 4, but sampling many candidates and filtering ill-typed outputs yields large gains, with accuracy continuing to rise as the formalization-attempt budget grows [5]. Dedicated 7B autoformalizers (e.g. Kimina-Autoformalizer) target this task directly [5]. The unsolved problem is faithfulness: a statement can compile in Lean yet not mean what the English meant — paraphrase-induced mistranslation remains an active failure mode.
3. LLM reasoning on math benchmarks
A ladder of benchmarks now tracks frontier-model math ability:
- MATH / AIME — competition problems with short numeric answers. Frontier reasoning models have largely saturated MATH and now post strong AIME scores; because these problems are widely published, contamination (the model having seen them in pre-training) is a serious confound, so high scores must be read with caution.
- Putnam / PutnamBench — undergraduate competition; PutnamBench provides formal (Lean) targets, making it contamination-resistant. DeepSeek-Prover-V2 solved 49 of 658 [4].
- FrontierMath (Epoch AI) — hundreds of original, expert-crafted problems spanning number theory to algebraic geometry, many requiring hours or days for an expert; the dataset (338 problems across Tiers 1–4) is unpublished to resist contamination [7]. It is the toughest standard math benchmark; even strong models historically scored in the low double digits, though leaders have since climbed substantially as of 2026.
The AI Mathematical Olympiad (AIMO) Prize (funded by XTX Markets) runs Kaggle competitions toward an IMO-gold-level open model. The first Progress Prize (2024) was won by Team Numina (NuminaMath), receiving $131,072; the award was presented at IMO 2024 by Terence Tao [8]. The second Progress Prize was won by Nvidia's NemoSkills team, and a third prize has since launched [8].
What "solving" means — the honest note. Three distinct claims hide behind the word: (a) a machine-checked formal proof (AlphaProof, DeepSeek-Prover) — the strongest guarantee; (b) a human-graded natural-language proof (Gemini/OpenAI at IMO 2025) — verified by people, not machines; (c) a correct final answer on a numeric benchmark (MATH, AIME) — which says nothing about whether the reasoning was valid. Contamination threatens (c) most, formal benchmarks resist it best, and the gap between "produces the right number" and "produces a sound proof" is the field's defining tension.
4. Conjecture generation & mathematical discovery
Beyond proving given statements, AI is now used to discover new mathematics.
FunSearch (DeepMind, Nature 2023) pairs an LLM with an automated evaluator inside an evolutionary search over programs (not proofs). It produced genuinely new constructions for the cap set problem in extremal combinatorics — the largest improvement to the asymptotic lower bound in 20 years — and new heuristics for online bin packing [9]. The key trick: the LLM proposes program code, and an external scorer provides ground truth, sidestepping LLM hallucination.
DeepMind × Oxford/Sydney (Nature 2021) used ML to guide mathematician intuition, yielding a new theorem in knot theory (a connection between algebraic and geometric invariants) and progress on the Kazhdan–Lusztig conjecture in representation theory [10]. Here the AI surfaced candidate relationships; humans proved them.
The Ramanujan Machine (Nature 2021) automatically generates conjectured continued- fraction formulas for fundamental constants (π, e, Catalan's constant, ζ values) by numerically matching values — producing conjectures without proofs, some later proven by others [11]. It predates the LLM era but exemplifies the discovery-by-search paradigm.
5. The neuro-symbolic pattern
The unifying architecture across nearly every flagship result:
LLM proposer → candidate (proof step / construction / program)
│
▼
formal verifier (Lean / symbolic engine / evaluator) → exact reward
│
▼
search / RL loop (tree search, evolutionary, AlphaZero) → improved proposer
The verifier is what makes math special: it converts an unreliable generative model into a sound system, because nothing leaves the loop unless it checks. This is the same proposer-verifier-search skeleton that powers self-improving ML systems described in aifs-ml-self, specialized to a domain with a free oracle.
Systems table
| System | Approach | Result | Year |
|---|---|---|---|
| AlphaProof + AlphaGeometry 2 | LLM + AlphaZero RL in Lean; symbolic geometry | IMO 2024 silver, 28/42, 4/6 solved [1][2] | 2024 |
| AlphaGeometry2 | LLM proposer + symbolic deduction | 84% of 2000–2024 IMO geometry; P2 at IMO 2025 in ~20s [3] | 2025 |
| DeepSeek-Prover-V2 (671B) | RL + subgoal decomposition, Lean 4 verifier | 88.9% miniF2F-test; 49/658 PutnamBench [4] | 2025 |
| Kimina-Prover / Autoformalizer | RL formal reasoning + NL→Lean translation | Open-weight prover + autoformalizer [5] | 2025 |
| Gemini Deep Think (advanced) | Informal LLM, parallel reasoning | IMO 2025 gold, 35/42, 5/6 (human-graded) [6] | 2025 |
| FunSearch | LLM + evolutionary program search | New cap-set lower bound; bin-packing heuristics [9] | 2023 |
| Ramanujan Machine | Numerical conjecture search | New continued-fraction conjectures for constants [11] | 2021 |
| DeepMind × Oxford | ML-guided intuition | Knot-theory theorem; KL-conjecture progress [10] | 2021 |
Benchmark table
| Benchmark | What it tests | Contamination risk | Notes |
|---|---|---|---|
| MATH / AIME | Competition, numeric answer | High (published) | Largely saturated by frontier reasoners |
| Putnam / PutnamBench | Undergrad competition; Lean targets | Low (formal) | DeepSeek-Prover-V2: 49/658 [4] |
| miniF2F | Olympiad/AIME formal proving (Lean) | Low (formal) | DeepSeek-Prover-V2: 88.9% [4] |
| ProofNet / ProofNet# | Autoformalization + proving | Low | Hardest on faithful formalization [5] |
| FrontierMath (Epoch) | Original research-level problems | Very low (unpublished) [7] | Toughest standard benchmark |
Open problems
- Verification cost. Formal proving guarantees soundness but is expensive: AlphaProof reportedly needed long compute per IMO problem, and DeepSeek-Prover's top scores use thousands of samples per theorem [4]. Scaling proof search remains the bottleneck.
- Novel vs. memorized. On published benchmarks (MATH, AIME) it is genuinely hard to separate reasoning from recall. Contamination-resistant, unpublished sets like FrontierMath [7] and formal targets like PutnamBench are the field's response, but a rigorous "is this new mathematics?" test is unsolved.
- The autoformalization gap. Translating natural-language math into faithful, well- typed Lean is the chokepoint connecting informal LLM reasoning to formal verification. Sampling-and-filtering helps but meaning-preservation under paraphrase is unsolved [5].
- Informal-but-unchecked proofs. IMO-gold natural-language proofs (Gemini, OpenAI) are human-graded, not machine-verified [6] — closing the loop to machine-checkable output for full proofs (not just final answers) is the next frontier.
- Discovery beyond benchmarks. FunSearch and the conjecture machines show AI can produce genuinely new results [9][11], but only in narrow, scorable niches; general open-ended mathematical discovery remains aspirational.
Sources
- https://deepmind.google/blog/ai-solves-imo-problems-at-silver-medal-level/ (2026-06-14)
- https://www.nature.com/articles/s41586-025-09833-y (2026-06-14)
- https://arxiv.org/abs/2502.03544 (2026-06-14)
- https://arxiv.org/abs/2504.21801 (2026-06-14)
- https://arxiv.org/abs/2504.11354 (2026-06-14)
- https://deepmind.google/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/ (2026-06-14)
- https://epoch.ai/frontiermath/the-benchmark (2026-06-14)
- https://aimoprize.com/updates/2024-07-20-progress-prize-results (2026-06-14)
- https://www.nature.com/articles/s41586-023-06924-6 (2026-06-14)
- https://www.nature.com/articles/d41586-021-03593-1 (2026-06-14)
- https://www.nature.com/articles/s41586-021-03229-4 (2026-06-14)