AI for AI — Autonomous ML Research & Algorithm Discovery

The reflexive case of ai-for-science: using AI/ML to accelerate ML research itself — discovering algorithms, running experiments, and (the hype frontier) "designing the next model."

Parent: ai-for-science · Sibling: aifs-mathematics

This is the page where AI is both the tool and the subject. It splits into three honest tiers: (1) narrow algorithm discovery — real, verified, published wins; (2) autonomous "AI scientists" — impressive pipelines whose quality is contested; (3) recursive self-improvement — mostly a thought experiment with thin empirical support. Keeping these tiers separate is the whole discipline; conflating them is where the hype lives.


1. Algorithm discovery by AI (the strongest evidence)

The most defensible "AI improves AI/computing" results come from search systems that produce programs or constructions which are then independently verifiable. Verification is what makes these trustworthy: a found object either multiplies matrices correctly or it does not.

AlphaTensor (2022)

DeepMind framed matrix multiplication as a single-player game ("TensorGame") and trained an AlphaZero-style RL agent to find low-rank tensor decompositions [1]. It rediscovered many known schemes and, notably, improved on Strassen's two-level algorithm for 4×4 matrices in a finite field (mod 2) for the first time since 1969, and produced hardware-tuned schemes 10–20% faster on specific GPUs/TPUs [1][2]. Caveat: the headline 4×4 "mod-2" result is over a specific field, not a universal improvement; much of the practical speedup is hardware-specific.
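
The "checkable object" point can be illustrated with the classic case AlphaTensor's search generalizes: Strassen's 2×2 scheme, which uses 7 multiplications instead of 8. The sketch below is not AlphaTensor code; the function names are illustrative. Because the scheme is bilinear in the two inputs, checking it on all pairs of unit matrices is an exact verification, not a spot check.

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices with Strassen's 7 products (naive needs 8)."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]


def naive_2x2(A, B):
    """Reference implementation: the textbook row-by-column product."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def unit_matrix(i, j):
    """2x2 matrix with a single 1 at position (i, j)."""
    return [[1 if (r, c) == (i, j) else 0 for c in range(2)] for r in range(2)]


def verify_on_basis(candidate):
    """Check the candidate against the reference on all 16 pairs of unit matrices."""
    cells = [(i, j) for i in range(2) for j in range(2)]
    return all(
        candidate(unit_matrix(*a), unit_matrix(*b)) == naive_2x2(unit_matrix(*a), unit_matrix(*b))
        for a in cells for b in cells
    )


print(verify_on_basis(strassen_2x2))  # True
```

A searched scheme that drops or mis-signs a single term fails this check immediately, which is why results in this tier are easy to trust once published.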

AlphaDev (2023)

An RL agent operating directly on assembly instructions found shorter routines for fixed-length sorting (sort-3/4/5) via two new "swap" and "copy" moves, plus faster hashing for 9–16 byte inputs [3][4]. The wins are real and were upstreamed into the LLVM libc++ standard library — the first change to its sorting routines in over a decade, and the first from an AI-discovered algorithm [4]. Caveat: gains are large only for very short fixed-size sequences (~70% for tiny inputs, ~1.7% above 250k elements) [4]; this is micro-optimization of hot leaf routines, not a new asymptotic class.
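
Fixed-length sorting routines are small enough to verify exhaustively, which is part of why these wins are credible. The sketch below abstracts a sort-3 routine as a sequence of compare-exchange steps and checks it on every input ordering. It is an illustration only: AlphaDev works at the assembly level, and the network shown is a textbook one, not AlphaDev's output.

```python
from itertools import permutations

# Hypothetical example: a textbook 3-element sorting network,
# written as (i, j) compare-exchange steps with i < j.
SORT3_NETWORK = [(0, 1), (1, 2), (0, 1)]


def apply_network(network, values):
    """Run the compare-exchange steps in order and return the resulting list."""
    v = list(values)
    for i, j in network:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v


def sorts_all_inputs(network, n):
    """Exhaustive check: the network must sort every permutation of n distinct keys."""
    return all(apply_network(network, p) == sorted(p)
               for p in permutations(range(n)))


print(sorts_all_inputs(SORT3_NETWORK, 3))      # True
print(sorts_all_inputs([(0, 1), (1, 2)], 3))   # False: one step too few
```

For sort-3/4/5 the input space is tiny, so a candidate routine can be proven correct by enumeration before anyone benchmarks it.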

FunSearch (2023)

The first system to pair an LLM with an evaluator + evolutionary loop and produce a new result on an open problem: improved constructions for the cap set problem (largest improvement to the asymptotic lower bound in ~20 years) and better heuristics for online bin packing [5][6]. Key design idea: evolve the program that generates the solution, not the solution itself — which makes outputs interpretable and verifiable [5]. Caveat: narrow problems with cheap, exact evaluators; the LLM proposes, the verifier disposes.
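
The "evolve the program, not the solution" idea can be sketched for the cap set problem: candidate programs are priority functions, a fixed greedy builder turns a priority function into a set, and an exact verifier scores the result. This is a minimal sketch under stated assumptions, not FunSearch's code: the two hand-written candidates stand in for LLM proposals, and all function names are illustrative.

```python
from itertools import combinations, product


def third_point(a, b):
    """Return the unique c with a + b + c == 0 (mod 3) in every coordinate."""
    return tuple((-x - y) % 3 for x, y in zip(a, b))


def is_cap_set(points):
    """Exact verifier: no line (three distinct points summing to 0 mod 3) lies in the set."""
    s = set(points)
    return all(third_point(a, b) not in s for a, b in combinations(s, 2))


def greedy_cap_set(priority, n):
    """Visit vectors of F_3^n in descending priority, keeping each one that adds no line."""
    ordered = sorted(product(range(3), repeat=n), key=priority, reverse=True)
    chosen, blocked = [], set()
    for p in ordered:
        if p in blocked:
            continue
        # Any point completing a line with p and an earlier choice is now forbidden.
        blocked.update(third_point(p, q) for q in chosen)
        chosen.append(p)
    return chosen


def evaluate(priority, n):
    """Score a candidate program: cap-set size, or 0 if the verifier rejects the output."""
    result = greedy_cap_set(priority, n)
    return len(result) if is_cap_set(result) else 0


# Hypothetical candidates standing in for LLM-proposed programs.
def lexicographic(v):
    return 0  # constant priority: keeps the default enumeration order


def prefer_no_twos(v):
    return -sum(1 for x in v if x == 2)


def select_best(candidates, n):
    """One selection step of the loop: keep the candidate the evaluator scores highest."""
    return max(candidates, key=lambda f: evaluate(f, n))


best = select_best([lexicographic, prefer_no_twos], n=3)
print(best.__name__, evaluate(best, 3))
```

In the real system the surviving programs are fed back to the LLM as context for the next round of proposals; the evaluator, not the LLM, decides what survives.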

AlphaEvolve (2025)

The current flagship: a Gemini-powered evolutionary coding agent that mutates whole code files under evaluator feedback [7][8]. Reported results span practical infra (recovered ~0.7% of Google's worldwide compute via a data-center scheduling heuristic, a Verilog/TPU arithmetic-circuit simplification, a ~23% speedup to a Gemini training kernel) and math (applied to 50+ open problems; rediscovered the best-known construction in ~75% and improved state-of-the-art in ~20%) [8][7]. Two widely-cited specifics: a denser 11-dimensional kissing-number configuration (592 → 593) and a slightly improved Erdős minimum-overlap bound [8][9].

The contested headline. AlphaEvolve's "48 scalar multiplications for 4×4 complex matrices, first improvement over Strassen in 56 years" is the most over-quoted claim on this page. Critics note that Winograd (1967) already achieved 48 multiplications for 4×4 matrices whose entries commute, and that Waksman's 1970 algorithm uses 46 over commutative rings with division by 2 [10][9]. The honest framing: AlphaEvolve's scheme is a genuine improvement in the noncommutative setting where division by 2 is allowed (two-level Strassen needs 49 there), but it is not a clean "beat a 56-year record" story [9][10]. This is the single most important caveat in this module.
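
Why the noncommutative setting matters can be made concrete with a little arithmetic. A scheme that multiplies k×k matrices with r scalar multiplications, and never assumes entries commute, can be applied recursively to matrix blocks, which gives an algorithm with O(n^(log_k r)) multiplications. Schemes that rely on commuting entries cannot be recursed this way, because matrix blocks do not commute. The sketch below only computes that exponent; the function name is illustrative.

```python
import math


def recursive_exponent(block_size, multiplications):
    """Exponent w such that recursing a block_size x block_size scheme costs O(n**w) multiplications."""
    return math.log(multiplications) / math.log(block_size)


print(round(recursive_exponent(2, 7), 4))   # 2.8074  (Strassen)
print(round(recursive_exponent(4, 49), 4))  # 2.8074  (two-level Strassen, same exponent)
print(round(recursive_exponent(4, 48), 4))  # 2.7925  (a 48-multiplication 4x4 scheme)
```

The gap between 2.8074 and 2.7925 is small, which is consistent with treating the result as a real but narrow improvement.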


2. Autonomous "AI scientists" (impressive pipelines, contested quality)

These systems chain hypothesis generation → experiment design → code execution → analysis → paper writing into a closed loop. The engineering is real; the scientific value of the output is where evidence gets thin.

Sakana "AI Scientist" (v1, 2024 → v2, 2025)

v1 generated end-to-end ML papers for ~$15 each. v2 replaced the linear pipeline with Best-First Tree Search (BFTS) over experiment branches and removed reliance on human-authored code templates [11]. The marquee event: Sakana submitted three fully AI-generated papers to an ICLR 2025 workshop ("I Can't Believe It's Not Better"), and one passed peer review — reportedly scoring above the workshop average — making it the first documented AI-generated paper to clear human review [12]. Sakana itself withdrew the paper before publication and flagged it as a process experiment.
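
For readers unfamiliar with best-first search over a tree of experiment branches, here is a generic sketch. It is not Sakana's implementation: the `expand` and `score` functions are hypothetical stand-ins (in a real pipeline, scoring a node would mean running an experiment), and the toy problem exists only to make the control flow visible.

```python
import heapq
from itertools import count


def best_first_search(root, expand, score, budget):
    """Always expand the highest-scoring unexpanded node, for at most `budget` expansions."""
    tie = count()  # tie-breaker so the heap never has to compare nodes directly
    frontier = [(-score(root), next(tie), root)]  # negate: heapq is a min-heap
    best = root
    for _ in range(budget):
        if not frontier:
            break
        _, _, node = heapq.heappop(frontier)
        for child in expand(node):
            if score(child) > score(best):
                best = child
            heapq.heappush(frontier, (-score(child), next(tie), child))
    return best


# Toy problem (hypothetical): a node is a tuple of step sizes; the goal is a total of 5.
def expand(node):
    return [] if len(node) >= 3 else [node + (step,) for step in (1, 2, 3)]


def score(node):
    return -abs(sum(node) - 5)


print(best_first_search((), expand, score, budget=2))  # (3, 2)
```

Compared with a linear pipeline, the search can abandon a weak branch and return to a more promising sibling, at the cost of needing a score that is worth trusting.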

Honest critique. An independent evaluation of v1 found roughly 42% of proposed experiments failed on coding errors, manuscripts carried a median of ~5 (often outdated) citations, and several contained hallucinated numerical results, placeholder text, and missing figures [13]. Even the v2 paper that passed review was later noted to contain hallucinations and overstated novelty. The accepted-at-a-workshop milestone is real but narrow: workshop bar < main-conference bar, and "passed review" ≠ "correct and novel science."

The broader agentic-research wave

A cottage industry of "AI co-scientist" and "research agent" systems now exists (Google's AI co-scientist, FutureHouse, various open-source scaffolds). The pattern is consistent: strong at ideation and boilerplate, weak at rigorous novelty assessment and at not fooling itself. The binding constraint is evaluation integrity, not generation throughput.


3. Benchmarks: how good are agents at real ML research?

Because demos are easy to cherry-pick, three 2024–2025 benchmarks try to measure agent ML-research ability against human baselines with verifiable scoring.

Benchmark | Author | Task | Headline result
MLE-bench | OpenAI (2024) | 75 Kaggle ML-engineering competitions | Best setup (o1-preview + AIDE scaffolding) reaches ≥ bronze-medal level in 16.9% of competitions [14][15]
RE-Bench | METR (2024) | 7 open-ended ML R&D environments, vs 61 human experts | At a 2 h budget agents score ~4× humans; at 8 h humans narrowly lead; at 32 h humans score ~2× the top agent [16][17]
PaperBench | OpenAI (2025) | Replicate 20 ICML 2024 papers from scratch (8,316 graded sub-tasks) | Best agent (Claude 3.5 Sonnet + scaffold) averages 21.0%; on a 3-paper subset ML PhDs hit 41.4% vs o1's 26.6% [18][19]

Reading the table. The consistent finding: agents are fast and cheap but plateau. They win at short horizons where speed dominates, and lose at long horizons that reward sustained reasoning, debugging, and judgment [16][18]. None of these benchmarks shows agents matching expert researchers on open-ended work. (Scores rise quickly with each model generation, so treat any specific number as a snapshot, not a ceiling.)


4. Neural architecture search / AutoML — mostly subsumed

Classical NAS (RL-controller or evolutionary search over architectures, ca. 2017–2020) is now largely historical. The frontier moved on for two reasons: (1) scaling laws made "bigger Transformer + more data" outperform searched exotic architectures; (2) architecture innovation shifted to human + LLM-assisted component design (attention variants, normalization, mixture-of-experts routing) rather than blind search. AutoML survives in production as hyperparameter / pipeline optimization (Optuna-style) and as the scaffolding inside systems like AIDE that drive the agents in §3, not as a research frontier in its own right. The lesson: architecture search pays off only when it has a cheap, faithful objective to optimize; scaling provided a cheaper path to capability.
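
The hyperparameter-optimization niche where AutoML survives looks roughly like the loop below. This is a dependency-free sketch using plain random search as a stand-in; Optuna itself offers smarter samplers and pruning. The objective is a hypothetical toy function in place of "train a model and return validation loss", and the parameter names and ranges are assumptions.

```python
import math
import random


def toy_validation_loss(lr, dropout):
    """Hypothetical stand-in for a training run; minimized near lr=1e-3, dropout=0.2."""
    return (math.log10(lr) + 3) ** 2 + (dropout - 0.2) ** 2


def random_search(objective, n_trials, seed=0):
    """Sample configurations independently and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "lr": 10 ** rng.uniform(-5, -1),   # log-uniform learning rate
            "dropout": rng.uniform(0.0, 0.5),
        }
        loss = objective(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


params, loss = random_search(toy_validation_loss, n_trials=50)
print(params, loss)
```

The loop only works because the objective is cheap and faithful; when each evaluation is a full training run of a frontier model, the same search becomes unaffordable.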

5. Self-improvement / model-designs-model (keep grounded)

This is the speculative tier, and precision matters most here.

What is real (2025–2026): LLMs already help design pieces of their own training — proposing data-curation / filtering heuristics, generating synthetic training data, drafting reward-model rubrics, and (via AlphaEvolve §1) optimizing training kernels. These are concrete, bounded contributions where a human still owns the objective and the verification.

What is debated: "recursive self-improvement" (RSI), meaning a model that autonomously rewrites itself into a more capable model, with gains compounding. Workshops (ICLR 2026) now study RSI seriously, but the empirical loops that exist update prompts, data, or peripheral code under human direction; none has been shown to improve core capability without a human in the loop. Two grounding facts cut against runaway narratives: (a) models collapse when trained recursively on their own ungrounded outputs (documented in Nature, 2024), so self-generated data needs an external signal; (b) every demonstrated loop relies on a verifier or human to prevent drift. The honest position: useful self-improvement is real and bounded; recursive, unbounded self-improvement remains unproven and should be flagged as hype when asserted as imminent.
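
Grounding fact (a) has a simple toy analogue: repeatedly fit a Gaussian to samples drawn from the previous fit, and the fitted spread shrinks toward zero. The sketch below is a deliberately small illustration, not a reproduction of the Nature experiments; the sample sizes, the `real_fraction` parameter, and the function name are assumptions chosen to make the effect visible.

```python
import random
import statistics


def recursive_fit(generations, n_samples, real_fraction=0.0, seed=0):
    """Refit a Gaussian each generation, mostly on samples from the previous fit.

    real_fraction is the share of each generation's data drawn from the true
    distribution N(0, 1), standing in for an external grounding signal.
    Returns the fitted standard deviation after each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    n_real = int(real_fraction * n_samples)
    history = []
    for _ in range(generations):
        data = [rng.gauss(0.0, 1.0) for _ in range(n_real)]               # fresh real data
        data += [rng.gauss(mu, sigma) for _ in range(n_samples - n_real)]  # model's own output
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        history.append(sigma)
    return history


ungrounded = recursive_fit(generations=300, n_samples=10)
grounded = recursive_fit(generations=300, n_samples=10, real_fraction=0.5)
print(f"ungrounded final std: {ungrounded[-1]:.3g}")
print(f"grounded final std:   {grounded[-1]:.3g}")
```

With no real data the fitted distribution loses its tails and degenerates; mixing in fresh samples from the true distribution each generation keeps the spread from vanishing, which is the toy version of "self-generated data needs an external signal."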


6. Who is betting on this — the lab landscape

A short, skeptical map of the organizations whose entire thesis rests on the §5 self-improvement loop actually scaling. Automated AI research (this whole page) is the methodological core of their strategy — make AI good enough at AI research that progress compounds.

  • recursive-superintelligence (RSI) — the purest embodiment of this page's thesis: a reported ~$4.65B-valued startup (Richard Socher; Yuandong Tian / 田渊栋, ex-Meta FAIR RL/reasoning lead; Tim Rocktäschel, ex-DeepMind; ViT co-author Alexey Dosovitskiy) whose entire product is the §5 self-improvement loop — "models that build models," pre-product as of 2026, with NVIDIA and AMD co-investing. The cleanest test case for whether the loop scales.
  • safe-superintelligence (SSI) — Ilya Sutskever's "straight shot to safe superintelligence," explicitly no intermediate products; raised at a reported ~$32B valuation with zero shipped product — a pure bet on the research path.
  • meta-superintelligence-labs (MSL) — Meta's 2025 superintelligence unit under Alexandr Wang, anchored by a $14.3B Scale AI deal and reported nine-figure talent packages; mission "personal superintelligence."
  • Incumbents — Google DeepMind (explicit AGI mission), OpenAI (chartered around superintelligence; its Superalignment team was disbanded in 2024), and xAI pursue the same goal. Per this wiki's big-company rule they get no dedicated single-product page here.

The honest gap: none of these has demonstrated the recursive, unbounded self-improvement the superintelligence thesis requires (see §5). What they demonstrably have is capital, compute, talent, and the automated-AI-research tooling catalogued above. Treat "ASI" as a stated goal and capital allocation, not an achieved or imminent result — the evidence on this page supports bounded, verifier-gated self-improvement and nothing stronger.


Systems table (verification status)

System | What it discovered / did | Year | Verified?
AlphaTensor | Faster matrix-mult tensor decompositions; 4×4 improvement over Strassen in GF(2) | 2022 | Yes: published in Nature; decompositions checkable [1]
AlphaDev | Shorter sort-3/4/5 assembly; faster small hashing | 2023 | Yes: merged into LLVM libc++ [4]
FunSearch | New cap-set constructions; better online bin-packing heuristics | 2023 | Yes: published in Nature; constructions verifiable [5]
AlphaEvolve | Data-center & kernel optimizations; 50+ math problems; contested 4×4 matrix claim | 2025 | Partly: infra wins verified; "56-year record" framing disputed [9][10]
AI Scientist v2 (Sakana) | Full auto-generated ML paper; one cleared an ICLR workshop review | 2025 | Weakly: milestone real, paper later found to contain hallucinations [12][13]

Open problems

  • Novelty vs interpolation. When an LLM-driven search "discovers" something, is it new, or surfacing a known-but-forgotten result (the AlphaEvolve / Waksman question)? Distinguishing the two requires literature grounding these systems are weak at [9][13].
  • Evaluation integrity. Results are only as trustworthy as their verifier. Algorithm discovery has cheap exact checkers (its strength); open-ended "AI scientist" output has no such oracle, which is exactly why its quality is contested [13][18].
  • Reproducibility & contamination. Benchmark scores are inflated by pre-training contamination (MLE-bench explicitly studies this) and by scaffolding differences; a "score" reported without its scaffold and data cutoff is nearly meaningless [14].
  • The self-fooling failure mode. Autonomous pipelines optimize for looking successful (passing review, reporting gains) and will hallucinate confirming numbers unless externally grounded [13]. Human-in-the-loop verification, not autonomy, is currently the value driver — see also ai-for-science's "paradigm enhancement vs transition" framing.

Sources

  1. https://www.nature.com/articles/s41586-022-05172-4 (2026-06-14) — AlphaTensor, Nature
  2. https://deepmind.google/blog/discovering-novel-algorithms-with-alphatensor/ (2026-06-14)
  3. https://www.nature.com/articles/s41586-023-06004-9 (2026-06-14) — AlphaDev, Nature
  4. https://deepmind.google/blog/alphadev-discovers-faster-sorting-algorithms/ (2026-06-14)
  5. https://www.nature.com/articles/s41586-023-06924-6 (2026-06-14) — FunSearch, Nature
  6. https://deepmind.google/blog/funsearch-making-new-discoveries-in-mathematical-sciences-using-large-language-models/ (2026-06-14)
  7. https://arxiv.org/abs/2506.13131 (2026-06-14) — AlphaEvolve paper
  8. https://deepmind.google/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/ (2026-06-14)
  9. https://en.wikipedia.org/wiki/AlphaEvolve (2026-06-14) — incl. Waksman/Winograd caveats
  10. https://arxiv.org/abs/2506.13242 (2026-06-14) — 4×4 in 48 non-complex multiplications (context)
  11. https://github.com/sakanaai/ai-scientist-v2 (2026-06-14)
  12. https://sakana.ai/ai-scientist-first-publication/ (2026-06-14)
  13. https://arxiv.org/abs/2502.14297 (2026-06-14) — independent critique of AI Scientist v1
  14. https://arxiv.org/abs/2410.07095 (2026-06-14) — MLE-bench
  15. https://openai.com/index/mle-bench/ (2026-06-14)
  16. https://arxiv.org/abs/2411.15114 (2026-06-14) — RE-Bench
  17. https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/ (2026-06-14)
  18. https://arxiv.org/abs/2504.01848 (2026-06-14) — PaperBench
  19. https://openai.com/index/paperbench/ (2026-06-14)
Last compiled: 2026-06-14