AI for Mathematics — Proving, Formalizing & Conjecturing

Mathematics is the cleanest stress-test for machine reasoning: a claim is true or false, and a formal proof is mechanically checkable. That property makes math a flagship sub-discipline of ai-for-science — unlike wet-lab biology, the verifier is free, exact, and incorruptible. The 2024–2026 frontier splits into two research cultures that are now converging: informal large-language-model (LLM) reasoning that writes human-style natural-language proofs, and formal systems that emit machine-verified proofs in Lean, Isabelle, or Coq. The dominant winning pattern across both is neuro-symbolic: an LLM proposes, a symbolic engine verifies, and a search loop closes the gap.

This module covers the AI/ML angle only — the learning systems, their training loops, and what their benchmark numbers do and do not mean.

1. Theorem proving

Formal proving (machine-checked)

AlphaProof (see alphaproof) is Google DeepMind's flagship formal prover. It couples a pre-trained language model with the AlphaZero reinforcement-learning (RL) algorithm, operating inside the Lean proof environment so that every candidate proof is automatically verified [1]. Training relied on autoformalizing roughly 80 million natural-language problem statements into Lean to bootstrap an RL self-play loop [1]. At the 2024 International Mathematical Olympiad (IMO), AlphaProof — together with AlphaGeometry 2 — solved four of six problems for 28 out of 42 points, exactly the silver-medal threshold and one point shy of gold; this was the first time any AI reached an IMO medal tier [1]. The full method was published in Nature in 2025 [2]. AlphaProof handled the algebra and number-theory problems; geometry went to AlphaGeometry 2.

AlphaGeometry / AlphaGeometry2 is a specialist neuro-symbolic geometry system: a language model proposes auxiliary constructions, a symbolic deduction engine grinds out consequences. AlphaGeometry2 raised the IMO-2000–2024 geometry solve rate to 84% (from 54% for the first version), surpassing the average human gold medalist on that problem set [3]. At IMO 2025 it solved the geometry problem (P2) directly from natural language in about 20 seconds [3].

DeepSeek-Prover is the leading open-weight formal prover. DeepSeek-Prover-V2 (671B, April 2025) trains an RL proof generator against a Lean 4 verifier, using DeepSeek-V3 to decompose theorems into subgoals before recursive proof search [4]. It reaches 88.9% pass rate on the miniF2F-test benchmark (up from ~63.5% for V1.5) and solved 49 of 658 problems on PutnamBench, with weights released openly [4]. Kimina-Prover is a parallel open effort applying RL to large formal-reasoning models, shipping an open-sourced Kimina-Autoformalizer for natural-language → Lean 4 translation [5].

Informal proving (natural-language, LLM)

By 2025 frontier LLMs began matching elite humans in informal olympiad reasoning. Gemini Deep Think (advanced version) officially scored 35/42 at IMO 2025 — solving five of six problems, the first AI to receive an officially verified gold score, working end-to-end in natural language inside the 4.5-hour window [6]. OpenAI separately reported a 35/42 result with an experimental model but without formal IMO certification [6]. These are informal proofs graded by humans, not machine-checked, which is the central caveat distinguishing them from AlphaProof's formal output.

2. Autoformalization

Autoformalization is the translation of natural-language mathematics into a formal language (usually Lean 4) so it can be machine-verified. It is the load-bearing bridge between the informal and formal cultures — and currently the weakest link. AlphaProof's 80M-statement training pipeline is itself an autoformalization engine [1].

Benchmarks such as ProofNet (undergraduate-level autoformalization + proving) and its corrected variant ProofNet#, plus miniF2F, are standard test beds. The honest state of the art: even frontier models struggle to emit well-typed Lean 4, but sampling many candidates and filtering ill-typed outputs yields large gains, with accuracy continuing to rise as the formalization-attempt budget grows [5]. Dedicated 7B autoformalizers (e.g. Kimina-Autoformalizer) target this task directly [5]. The unsolved problem is faithfulness: a statement can compile in Lean yet not mean what the English meant — paraphrase-induced mistranslation remains an active failure mode.

3. LLM reasoning on math benchmarks

A ladder of benchmarks now tracks frontier-model math ability:

MATH / AIME — competition problems with short numeric answers. Frontier reasoning models have largely saturated MATH and now post strong AIME scores; because these problems are widely published, contamination (the model having seen them in pre-training) is a serious confound, so high scores must be read with caution.
Putnam / PutnamBench — undergraduate competition; PutnamBench provides formal (Lean) targets, making it contamination-resistant. DeepSeek-Prover-V2 solved 49 of 658 [4].
FrontierMath (Epoch AI) — hundreds of original, expert-crafted problems spanning number theory to algebraic geometry, many requiring hours or days for an expert; the dataset (338 problems across Tiers 1–4) is unpublished to resist contamination [7]. It is the toughest standard math benchmark; even strong models historically scored in the low double digits, though leaders have since climbed substantially as of 2026.

The AI Mathematical Olympiad (AIMO) Prize (funded by XTX Markets) runs Kaggle competitions toward an IMO-gold-level open model. The first Progress Prize (2024) was won by Team Numina (NuminaMath), receiving $131,072; the award was presented at IMO 2024 by Terence Tao [8]. The second Progress Prize was won by Nvidia's NemoSkills team, and a third prize has since launched [8].

What "solving" means — the honest note. Three distinct claims hide behind the word: (a) a machine-checked formal proof (AlphaProof, DeepSeek-Prover) — the strongest guarantee; (b) a human-graded natural-language proof (Gemini/OpenAI at IMO 2025) — verified by people, not machines; (c) a correct final answer on a numeric benchmark (MATH, AIME) — which says nothing about whether the reasoning was valid. Contamination threatens (c) most, formal benchmarks resist it best, and the gap between "produces the right number" and "produces a sound proof" is the field's defining tension.

4. Conjecture generation & mathematical discovery

Beyond proving given statements, AI is now used to discover new mathematics.

FunSearch (DeepMind, Nature 2023) pairs an LLM with an automated evaluator inside an evolutionary search over programs (not proofs). It produced genuinely new constructions for the cap set problem in extremal combinatorics — the largest improvement to the asymptotic lower bound in 20 years — and new heuristics for online bin packing [9]. The key trick: the LLM proposes program code, and an external scorer provides ground truth, sidestepping LLM hallucination.

DeepMind × Oxford/Sydney (Nature 2021) used ML to guide mathematician intuition, yielding a new theorem in knot theory (a connection between algebraic and geometric invariants) and progress on the Kazhdan–Lusztig conjecture in representation theory [10]. Here the AI surfaced candidate relationships; humans proved them.

The Ramanujan Machine (Nature 2021) automatically generates conjectured continued- fraction formulas for fundamental constants (π, e, Catalan's constant, ζ values) by numerically matching values — producing conjectures without proofs, some later proven by others [11]. It predates the LLM era but exemplifies the discovery-by-search paradigm.

5. The neuro-symbolic pattern

The unifying architecture across nearly every flagship result:

  LLM proposer  →  candidate (proof step / construction / program)
        │
        ▼
  formal verifier  (Lean / symbolic engine / evaluator)  →  exact reward
        │
        ▼
  search / RL loop  (tree search, evolutionary, AlphaZero)  →  improved proposer

The verifier is what makes math special: it converts an unreliable generative model into a sound system, because nothing leaves the loop unless it checks. This is the same proposer-verifier-search skeleton that powers self-improving ML systems described in aifs-ml-self, specialized to a domain with a free oracle.

Systems table

System	Approach	Result	Year
AlphaProof + AlphaGeometry 2	LLM + AlphaZero RL in Lean; symbolic geometry	IMO 2024 silver, 28/42, 4/6 solved [1][2]	2024
AlphaGeometry2	LLM proposer + symbolic deduction	84% of 2000–2024 IMO geometry; P2 at IMO 2025 in ~20s [3]	2025
DeepSeek-Prover-V2 (671B)	RL + subgoal decomposition, Lean 4 verifier	88.9% miniF2F-test; 49/658 PutnamBench [4]	2025
Kimina-Prover / Autoformalizer	RL formal reasoning + NL→Lean translation	Open-weight prover + autoformalizer [5]	2025
Gemini Deep Think (advanced)	Informal LLM, parallel reasoning	IMO 2025 gold, 35/42, 5/6 (human-graded) [6]	2025
FunSearch	LLM + evolutionary program search	New cap-set lower bound; bin-packing heuristics [9]	2023
Ramanujan Machine	Numerical conjecture search	New continued-fraction conjectures for constants [11]	2021
DeepMind × Oxford	ML-guided intuition	Knot-theory theorem; KL-conjecture progress [10]	2021

Benchmark table

Benchmark	What it tests	Contamination risk	Notes
MATH / AIME	Competition, numeric answer	High (published)	Largely saturated by frontier reasoners
Putnam / PutnamBench	Undergrad competition; Lean targets	Low (formal)	DeepSeek-Prover-V2: 49/658 [4]
miniF2F	Olympiad/AIME formal proving (Lean)	Low (formal)	DeepSeek-Prover-V2: 88.9% [4]
ProofNet / ProofNet#	Autoformalization + proving	Low	Hardest on faithful formalization [5]
FrontierMath (Epoch)	Original research-level problems	Very low (unpublished) [7]	Toughest standard benchmark

Open problems

Verification cost. Formal proving guarantees soundness but is expensive: AlphaProof reportedly needed long compute per IMO problem, and DeepSeek-Prover's top scores use thousands of samples per theorem [4]. Scaling proof search remains the bottleneck.
Novel vs. memorized. On published benchmarks (MATH, AIME) it is genuinely hard to separate reasoning from recall. Contamination-resistant, unpublished sets like FrontierMath [7] and formal targets like PutnamBench are the field's response, but a rigorous "is this new mathematics?" test is unsolved.
The autoformalization gap. Translating natural-language math into faithful, well- typed Lean is the chokepoint connecting informal LLM reasoning to formal verification. Sampling-and-filtering helps but meaning-preservation under paraphrase is unsolved [5].
Informal-but-unchecked proofs. IMO-gold natural-language proofs (Gemini, OpenAI) are human-graded, not machine-verified [6] — closing the loop to machine-checkable output for full proofs (not just final answers) is the next frontier.
Discovery beyond benchmarks. FunSearch and the conjecture machines show AI can produce genuinely new results [9][11], but only in narrow, scorable niches; general open-ended mathematical discovery remains aspirational.

Sources

https://deepmind.google/blog/ai-solves-imo-problems-at-silver-medal-level/ (2026-06-14)
https://www.nature.com/articles/s41586-025-09833-y (2026-06-14)
https://arxiv.org/abs/2502.03544 (2026-06-14)
https://arxiv.org/abs/2504.21801 (2026-06-14)
https://arxiv.org/abs/2504.11354 (2026-06-14)
https://deepmind.google/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/ (2026-06-14)
https://epoch.ai/frontiermath/the-benchmark (2026-06-14)
https://aimoprize.com/updates/2024-07-20-progress-prize-results (2026-06-14)
https://www.nature.com/articles/s41586-023-06924-6 (2026-06-14)
https://www.nature.com/articles/d41586-021-03593-1 (2026-06-14)
https://www.nature.com/articles/s41586-021-03229-4 (2026-06-14)