AI for AI — Autonomous ML Research & Algorithm Discovery
The reflexive case of ai-for-science: using AI/ML to accelerate ML research itself — discovering algorithms, running experiments, and (the hype frontier) "designing the next model."
Parent: ai-for-science · Sibling: aifs-mathematics
This is the page where AI is both the tool and the subject. It splits into three honest tiers: (1) narrow algorithm discovery — real, verified, published wins; (2) autonomous "AI scientists" — impressive pipelines whose quality is contested; (3) recursive self-improvement — mostly a thought experiment with thin empirical support. Keeping these tiers separate is the whole discipline; conflating them is where the hype lives.
1. Algorithm discovery by AI (the strongest evidence)
The most defensible "AI improves AI/computing" results come from search systems that produce programs or constructions which are then independently verifiable. Verification is what makes these trustworthy: a found object either multiplies matrices correctly or it does not.
AlphaTensor (2022)
DeepMind framed matrix multiplication as a single-player game ("TensorGame") and trained an AlphaZero-style RL agent to find low-rank tensor decompositions [1]. It rediscovered many known schemes and, notably, improved on Strassen's two-level algorithm for 4×4 matrices over the two-element finite field (arithmetic mod 2), using 47 multiplications instead of 49, the first such improvement since 1969. It also produced hardware-tuned schemes 10–20% faster on specific GPUs/TPUs [1][2]. Caveat: the headline 4×4 result holds only in mod-2 arithmetic and is not a universal improvement; much of the practical speedup is hardware-specific.
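Why verification is cheap here can be seen on the classic small case. The sketch below is not AlphaTensor output: it encodes Strassen's well-known seven-product scheme for 2×2 matrices and checks it against the schoolbook definition on random integer inputs. The function names (`naive_2x2`, `strassen_2x2`, `check_scheme`) are illustrative, not taken from any cited system.

```python
import random

def naive_2x2(a, b):
    """Schoolbook 2x2 product: 8 scalar multiplications."""
    return [
        [a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]],
        [a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]],
    ]

def strassen_2x2(a, b):
    """Strassen's scheme: 7 scalar multiplications, at the cost of more additions."""
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4, m1 - m2 + m3 + m6],
    ]

def check_scheme(candidate, trials=1000, seed=0):
    """Accept a candidate only if it matches the reference on every random trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [[rng.randint(-9, 9) for _ in range(2)] for _ in range(2)]
        b = [[rng.randint(-9, 9) for _ in range(2)] for _ in range(2)]
        if candidate(a, b) != naive_2x2(a, b):
            return False
    return True

print(check_scheme(strassen_2x2))
```

Random testing is evidence rather than proof; an exact check would expand the candidate into a tensor decomposition and compare it entry by entry with the matrix-multiplication tensor. Either way the checker is far simpler than the search that produced the candidate.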
AlphaDev (2023)
An RL agent operating directly on assembly instructions found shorter routines for fixed-length sorting (sort-3/4/5) via two new "swap" and "copy" moves, plus faster hashing for 9–16 byte inputs [3][4]. The wins are real and were upstreamed into the LLVM libc++ standard library — the first change to its sorting routines in over a decade, and the first from an AI-discovered algorithm [4]. Caveat: gains are large only for very short fixed-size sequences (~70% for tiny inputs, ~1.7% above 250k elements) [4]; this is micro-optimization of hot leaf routines, not a new asymptotic class.
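To make "fixed-length sorting routine" concrete, here is a minimal sketch of a three-element sorting network with an exhaustive checker. It is a textbook three-comparator network written in Python, not AlphaDev's assembly, and the names `compare_exchange`, `sort3`, and `verify_sort3` are illustrative. The point is that the input space is small enough to check completely.

```python
from itertools import permutations, product

def compare_exchange(v, i, j):
    """One comparator: leave the smaller of v[i], v[j] at index i."""
    if v[i] > v[j]:
        v[i], v[j] = v[j], v[i]

def sort3(values):
    """Sort exactly three items with a fixed sequence of three comparators."""
    v = list(values)
    compare_exchange(v, 0, 1)
    compare_exchange(v, 1, 2)  # the maximum is now at index 2
    compare_exchange(v, 0, 1)
    return v

def verify_sort3(sorter):
    """Exhaustive check over all orderings of distinct keys and all 0/1 inputs."""
    for p in permutations((1, 2, 3)):
        if sorter(p) != [1, 2, 3]:
            return False
    for bits in product((0, 1), repeat=3):  # covers ties
        if sorter(bits) != sorted(bits):
            return False
    return True

print(verify_sort3(sort3))
```

Because the routine is branch-light and fixed-size, shaving even one instruction matters when it is called from the innermost loop of a larger sort, which is the regime the caveat above describes.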
FunSearch (2023)
The first system to pair an LLM with an evaluator + evolutionary loop and produce a new result on an open problem: improved constructions for the cap set problem (largest improvement to the asymptotic lower bound in ~20 years) and better heuristics for online bin packing [5][6]. Key design idea: evolve the program that generates the solution, not the solution itself — which makes outputs interpretable and verifiable [5]. Caveat: narrow problems with cheap, exact evaluators; the LLM proposes, the verifier disposes.
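The "evolve the program, not the solution" idea can be sketched for cap sets as follows. A greedy builder adds points of F_3^n in order of a priority function, and an exact checker confirms that no three chosen points lie on a line. In a FunSearch-style loop the priority function is the part an LLM would rewrite; `zero_count_priority` here is a hand-written, hypothetical placeholder, and all names are illustrative rather than taken from the FunSearch code.

```python
from itertools import combinations, product

def third_point(a, b):
    """The unique point that completes a line through distinct a and b in F_3^n."""
    return tuple((-x - y) % 3 for x, y in zip(a, b))

def is_cap_set(points):
    """Exact verifier: points are distinct and no three of them are collinear."""
    pts = set(points)
    if len(pts) != len(points):
        return False
    return all(third_point(a, b) not in pts for a, b in combinations(points, 2))

def greedy_cap_set(n, priority):
    """Add points in descending priority, skipping any that would complete a line."""
    candidates = sorted(product(range(3), repeat=n), key=priority, reverse=True)
    chosen, blocked = [], set()
    for p in candidates:
        if p in blocked:
            continue
        for q in chosen:
            blocked.add(third_point(p, q))
        chosen.append(p)
    return chosen

def zero_count_priority(v):
    """Hypothetical placeholder heuristic: prefer vectors with more zeros."""
    return v.count(0)

cap = greedy_cap_set(3, zero_count_priority)
print(len(cap), is_cap_set(cap))
```

The evaluator's score would simply be `len(cap)` for sets that pass `is_cap_set`. The discovered artifact is a short, readable function, which is why the outputs are interpretable as well as verifiable.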
AlphaEvolve (2025)
The current flagship: a Gemini-powered evolutionary coding agent that mutates whole code files under evaluator feedback [7][8]. Reported results span practical infra (recovered ~0.7% of Google's worldwide compute via a data-center scheduling heuristic, a Verilog/TPU arithmetic-circuit simplification, a ~23% speedup to a Gemini training kernel) and math (applied to 50+ open problems; rediscovered the best-known construction in ~75% and improved state-of-the-art in ~20%) [8][7]. Two widely-cited specifics: a denser 11-dimensional kissing-number configuration (592 → 593) and a slightly improved Erdős minimum-overlap bound [8][9].
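The control flow of "mutate under evaluator feedback" can be sketched in a few lines. This is a toy stand-in under stated assumptions: candidates are tuples of numbers rather than code files, mutation is Gaussian noise rather than an LLM edit, and the evaluator is a made-up objective. The names `evolve`, `toy_evaluate`, and `toy_mutate` are hypothetical and do not describe AlphaEvolve's implementation.

```python
import random

def evolve(evaluate, mutate, seed_candidate, generations=200, population=8, seed=0):
    """Keep a small pool of the best-scoring candidates; mutate a random member each step."""
    rng = random.Random(seed)
    pool = [(evaluate(seed_candidate), seed_candidate)]
    best_history = []
    for _ in range(generations):
        parent = rng.choice(pool)[1]
        child = mutate(parent, rng)
        pool.append((evaluate(child), child))        # the evaluator is the only judge
        pool.sort(key=lambda scored: scored[0], reverse=True)
        del pool[population:]                        # drop everything below the top-k
        best_history.append(pool[0][0])
    return pool[0][1], best_history

# Toy objective (assumption): higher is better, maximum 0 at the target point.
TARGET = (1.0, -2.0, 0.5)

def toy_evaluate(candidate):
    return -sum((c - t) ** 2 for c, t in zip(candidate, TARGET))

def toy_mutate(candidate, rng):
    return tuple(c + rng.gauss(0.0, 0.3) for c in candidate)

best, history = evolve(toy_evaluate, toy_mutate, (0.0, 0.0, 0.0))
print(history[0], history[-1])
```

Because the pool always retains its best member, the best score never decreases. Everything the loop learns comes from `evaluate`, so a weak or gameable evaluator yields confidently wrong winners.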
The contested headline. AlphaEvolve's "48 scalar multiplications for 4×4 complex matrices, first improvement over Strassen in 56 years" is the most over-quoted claim on this page. Critics note that Winograd (1967) already achieved 48 multiplications for 4×4, and that Waksman's 1970 algorithm uses 46 over commutative rings with division by 2 [10][9]. Both of those schemes rely on the matrix entries commuting, so they cannot be applied recursively to block matrices, whose blocks do not commute; Strassen's scheme and AlphaEvolve's can. The honest framing: AlphaEvolve's scheme is a genuine improvement in the noncommutative setting over rings that allow division by 2, but it is not a clean "beat a 56-year record" story [9][10]. This is the single most important caveat on this page.
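The size of the disputed gain is easy to put in numbers. A scheme that multiplies k×k matrices with m scalar multiplications, and that remains valid when the entries are themselves matrices (the noncommutative case), yields a recursive algorithm with exponent log_k(m). The sketch below computes that exponent for the multiplication counts mentioned above; the helper name `recursion_exponent` is illustrative.

```python
import math

def recursion_exponent(block_size, multiplications):
    """Exponent w such that recursive use of the scheme costs O(n**w) multiplications."""
    return math.log(multiplications) / math.log(block_size)

schemes = {
    "schoolbook 4x4": 64,
    "Strassen applied twice (7 * 7)": 49,
    "48-multiplication 4x4 scheme": 48,
}

for name, mults in schemes.items():
    print(f"{name}: {mults} multiplications, exponent {recursion_exponent(4, mults):.4f}")
# schoolbook 4x4: 64 multiplications, exponent 3.0000
# Strassen applied twice (7 * 7): 49 multiplications, exponent 2.8074
# 48-multiplication 4x4 scheme: 48 multiplications, exponent 2.7925
```

The exponent only applies to schemes that can be recursed, which is why the commutative 48- and 46-multiplication results are a different kind of object from a noncommutative 48.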
2. Autonomous "AI scientists" (impressive pipelines, contested quality)
These systems chain hypothesis generation → experiment design → code execution → analysis → paper writing into a closed loop. The engineering is real; the scientific value of the output is where evidence gets thin.
Sakana "AI Scientist" (v1, 2024 → v2, 2025)
v1 generated end-to-end ML papers for ~$15 each. v2 replaced the linear pipeline with Best-First Tree Search (BFTS) over experiment branches and removed reliance on human-authored code templates [11]. The marquee event: Sakana submitted three fully AI-generated papers to an ICLR 2025 workshop ("I Can't Believe It's Not Better"), and one passed peer review — reportedly scoring above the workshop average — making it the first documented AI-generated paper to clear human review [12]. Sakana itself withdrew the paper before publication and flagged it as a process experiment.
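Best-first search over experiment branches can be sketched with a priority queue: always expand the most promising node seen so far, within a fixed evaluation budget. This is a generic illustration and not Sakana's implementation. The node encoding (a pair of integer settings), `toy_expand`, and `toy_score` are assumptions made up for the example.

```python
import heapq
import itertools

def best_first_search(root, expand, score, budget):
    """Expand the highest-scoring frontier node first, within an evaluation budget."""
    counter = itertools.count()                  # tie-breaker so the heap never compares nodes
    root_score = score(root)
    frontier = [(-root_score, next(counter), root)]
    best_node, best_score = root, root_score
    evaluated = 1
    while frontier and evaluated < budget:
        _, _, node = heapq.heappop(frontier)     # heapq is a min-heap, hence negated scores
        for child in expand(node):
            s = score(child)
            evaluated += 1
            if s > best_score:
                best_node, best_score = child, s
            heapq.heappush(frontier, (-s, next(counter), child))
            if evaluated >= budget:
                break
    return best_node, best_score

# Toy experiment space (assumption): node = (log10 learning rate, number of layers).
def toy_expand(node):
    lr_exp, layers = node
    return [(lr_exp + 1, layers), (lr_exp - 1, layers), (lr_exp, layers + 1), (lr_exp, layers - 1)]

def toy_score(node):
    lr_exp, layers = node
    return -((lr_exp + 3) ** 2 + (layers - 4) ** 2)   # best at (-3, 4)

print(best_first_search((0, 1), toy_expand, toy_score, budget=60))
```

In a real pipeline each `score` call is a full experiment run, so the budget dominates cost. A production version would also deduplicate visited nodes and handle branches that crash, which this sketch omits.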
Honest critique. An independent evaluation of v1 found roughly 42% of proposed experiments failed on coding errors, manuscripts carried a median of ~5 (often outdated) citations, and several contained hallucinated numerical results, placeholder text, and missing figures [13]. Even the v2 paper that passed review was later noted to contain hallucinations and overstated novelty. The accepted-at-a-workshop milestone is real but narrow: workshop bar < main-conference bar, and "passed review" ≠ "correct and novel science."
The broader agentic-research wave
A cottage industry of "AI co-scientist" and "research agent" systems now exists (Google's AI co-scientist, FutureHouse, various open-source scaffolds). The pattern is consistent: these systems are strong at ideation and boilerplate, weak at rigorous novelty assessment and at not fooling themselves. The binding constraint is evaluation integrity, not generation throughput.
3. Benchmarks: how good are agents at real ML research?
Because demos are easy to cherry-pick, three 2024–2025 benchmarks try to measure agent ML-research ability against human baselines with verifiable scoring.
| Benchmark | Author | Task | Headline result |
|---|---|---|---|
| MLE-bench | OpenAI (2024) | 75 Kaggle ML-engineering competitions | Best setup (o1-preview + AIDE scaffolding) reaches ≥ bronze-medal level in 16.9% of competitions [14][15] |
| RE-Bench | METR (2024) | 7 open-ended ML R&D environments, vs 61 human experts | At a 2 h budget agents score ~4× humans; at 8 h humans narrowly lead; at 32 h humans score ~2× the top agent [16][17] |
| PaperBench | OpenAI (2025) | Replicate 20 ICML 2024 papers from scratch (8,316 graded sub-tasks) | Best agent (Claude 3.5 Sonnet + scaffold) avg 21.0%; on a 3-paper subset ML PhDs hit 41.4% vs o1's 26.6% [18][19] |
Reading the table. The consistent finding: agents are fast and cheap but plateau. They win at short horizons where speed dominates, and lose at long horizons that reward sustained reasoning, debugging, and judgment [16][18]. None of these benchmarks shows agents matching expert researchers on open-ended work. (Scores rise quickly with each model generation, so treat any specific number as a snapshot, not a ceiling.)
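The "graded sub-tasks" in the PaperBench row suggest how a single percentage can summarize thousands of checks. The sketch below assumes a hierarchical rubric in which leaves are graded pass/fail and each parent takes the weighted average of its children. That structure is an assumption for illustration; the class `RubricNode`, the example rubric, and its weights are invented here and not taken from the benchmark.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RubricNode:
    name: str
    weight: float = 1.0
    passed: Optional[bool] = None                 # set on leaf nodes only
    children: List["RubricNode"] = field(default_factory=list)

def rubric_score(node):
    """Leaf: 1.0 if passed else 0.0. Parent: weighted mean of child scores."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    return sum(child.weight * rubric_score(child) for child in node.children) / total_weight

# Invented example rubric for one paper replication.
paper = RubricNode("replicate paper", children=[
    RubricNode("code runs end to end", weight=1.0, passed=True),
    RubricNode("results match", weight=3.0, children=[
        RubricNode("main table reproduced", passed=False),
        RubricNode("ablation figure reproduced", passed=True),
    ]),
])

print(rubric_score(paper))  # (1*1.0 + 3*0.5) / 4 = 0.625
```

A scheme like this gives partial credit, so a score near 21% can mean many small steps done correctly with the central result still missing. Headline percentages should be read with the rubric in hand.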
4. Neural architecture search / AutoML — mostly subsumed
Classical NAS (RL-controller or evolutionary search over architectures, ca. 2017–2020) is now largely historical. The frontier moved on for two reasons: (1) scaling laws made "bigger Transformer + more data" outperform searched exotic architectures; (2) architecture innovation shifted to human + LLM-assisted component design (attention variants, normalization, mixture-of-experts routing) rather than blind search. AutoML survives in production as hyperparameter / pipeline optimization (Optuna-style) and as the scaffolding inside systems like AIDE that drive the agents in §3, not as a research frontier in its own right. The lesson: automated search pays off only when it has a cheap, faithful objective; for architectures, scaling provided a cheaper path to capability.
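The surviving production use, hyperparameter optimization, reduces to a sample-evaluate-keep-best loop. The sketch below uses plain random search from the standard library as a stand-in; it does not call Optuna, and real tools use smarter samplers and pruning. The search space, the parameter names, and `toy_validation_loss` are assumptions for illustration only.

```python
import math
import random

def sample(rng, low, high, log_scale):
    """Draw one value, uniformly in log10 space when log_scale is set."""
    if log_scale:
        return 10 ** rng.uniform(math.log10(low), math.log10(high))
    return rng.uniform(low, high)

def random_search(objective, space, n_trials=200, seed=0):
    """Minimize objective over the space by independent random trials."""
    rng = random.Random(seed)
    best_params, best_value = None, math.inf
    for _ in range(n_trials):
        params = {name: sample(rng, *spec) for name, spec in space.items()}
        value = objective(params)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value

# Hypothetical search space: name -> (low, high, log_scale).
SPACE = {
    "learning_rate": (1e-5, 1e-1, True),
    "dropout": (0.0, 0.5, False),
}

def toy_validation_loss(params):
    """Stand-in for a training run; minimized at learning_rate=1e-3, dropout=0.2."""
    return (math.log10(params["learning_rate"]) + 3) ** 2 + (params["dropout"] - 0.2) ** 2

best_params, best_loss = random_search(toy_validation_loss, SPACE)
print(best_params, best_loss)
```

The loop works only because `objective` is cheap and faithful relative to what is being tuned. Architecture search over full training runs had neither property.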
5. Self-improvement / model-designs-model (keep grounded)
This is the speculative tier, and precision matters most here.
What is real (2025–2026): LLMs already help design pieces of their own training — proposing data-curation / filtering heuristics, generating synthetic training data, drafting reward-model rubrics, and (via AlphaEvolve §1) optimizing training kernels. These are concrete, bounded contributions where a human still owns the objective and the verification.
What is debated: "recursive self-improvement" (RSI), meaning a model that autonomously rewrites itself into a more capable model, with the gains compounding. Workshops (ICLR 2026) now study RSI seriously, but the empirical loops that exist update prompts, data, or peripheral code under human direction; none has been shown to raise core capability without a human in the loop. Two grounding facts cut against runaway narratives: (a) models collapse when trained recursively on their own ungrounded outputs (documented in Nature, 2024), so self-generated data needs an external signal; (b) every demonstrated loop relies on a verifier or human to prevent drift. The honest position: useful self-improvement is real and bounded; recursive, unbounded self-improvement remains unproven and should be flagged as hype when asserted as imminent.
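Grounding fact (a) has a simple toy analogue. Fit a Gaussian to a small sample, draw the next generation's training data from the fitted model, and repeat: estimation error compounds and the fitted spread shrinks toward zero. Mixing in fresh samples from the true distribution supplies the external signal. This is a minimal simulation under stated assumptions (one-dimensional Gaussian, ten samples per generation), not a reproduction of the cited Nature experiments, and `simulate` is a name chosen here.

```python
import random
import statistics

def simulate(generations=200, n=10, real_fraction=0.0, seed=0):
    """Return the fitted standard deviation after each generation of refitting."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0                  # generation 0 matches the true N(0, 1)
    history = [sigma]
    n_real = round(n * real_fraction)
    for _ in range(generations):
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]             # external signal
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]   # model's own output
        data = real + synthetic
        mu, sigma = statistics.fmean(data), statistics.pstdev(data)
        history.append(sigma)
    return history

collapsed = simulate(real_fraction=0.0)
grounded = simulate(real_fraction=0.5)
print(f"self-trained sigma after 200 generations: {collapsed[-1]:.3g}")
print(f"half-grounded sigma after 200 generations: {grounded[-1]:.3g}")
```

The toy shows only the mechanism: without outside data, each generation can lose information about the tails but never regain it. How fast real models degrade depends on details the simulation does not capture.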
6. Who is betting on this — the lab landscape
A short, skeptical map of the organizations whose entire thesis rests on the §5 self-improvement loop actually scaling. Automated AI research (this whole page) is the methodological core of their strategy — make AI good enough at AI research that progress compounds.
- recursive-superintelligence (RSI) — the purest embodiment of this page's thesis: a reported ~$4.65B-valued startup (Richard Socher; Yuandong Tian / 田渊栋, ex-Meta FAIR RL/reasoning lead; Tim Rocktäschel, ex-DeepMind; ViT co-author Alexey Dosovitskiy) whose entire product is the §5 self-improvement loop — "models that build models," pre-product as of 2026, with NVIDIA and AMD co-investing. The cleanest test case for whether the loop scales.
- safe-superintelligence (SSI) — Ilya Sutskever's "straight shot to safe superintelligence," explicitly no intermediate products; raised at a reported ~$32B valuation with zero shipped product — a pure bet on the research path.
- meta-superintelligence-labs (MSL) — Meta's 2025 superintelligence unit under Alexandr Wang, anchored by a $14.3B Scale AI deal and reported nine-figure talent packages; mission "personal superintelligence."
- Incumbents — Google DeepMind (explicit AGI mission), OpenAI (chartered around superintelligence; its Superalignment team was disbanded in 2024), and xAI pursue the same goal. Per this wiki's big-company rule they get no dedicated single-product page here.
The honest gap: none of these has demonstrated the recursive, unbounded self-improvement the superintelligence thesis requires (see §5). What they demonstrably have is capital, compute, talent, and the automated-AI-research tooling catalogued above. Treat "ASI" as a stated goal and capital allocation, not an achieved or imminent result — the evidence on this page supports bounded, verifier-gated self-improvement and nothing stronger.
Systems table (verification status)
| System | What it discovered / did | Year | Verified? |
|---|---|---|---|
| AlphaTensor | Faster matrix-mult tensor decompositions; 4×4 improvement over Strassen in GF(2) | 2022 | Yes — published Nature; decompositions checkable [1] |
| AlphaDev | Shorter sort-3/4/5 assembly; faster small hashing | 2023 | Yes — merged into LLVM libc++ [4] |
| FunSearch | New cap-set constructions; better online bin-packing heuristics | 2023 | Yes — Nature; constructions verifiable [5] |
| AlphaEvolve | Data-center & kernel optimizations; 50+ math problems; contested 4×4 matrix claim | 2025 | Partly — infra wins verified; "56-year record" framing disputed [9][10] |
| AI Scientist v2 (Sakana) | Full auto-generated ML paper; one cleared an ICLR workshop review | 2025 | Weakly — milestone real, paper later found to contain hallucinations [12][13] |
Open problems
- Novelty vs interpolation. When an LLM-driven search "discovers" something, is it new, or surfacing a known-but-forgotten result (the AlphaEvolve / Waksman question)? Distinguishing the two requires literature grounding these systems are weak at [9][13].
- Evaluation integrity. Results are only as trustworthy as their verifier. Algorithm discovery has cheap exact checkers (its strength); open-ended "AI scientist" output has no such oracle, which is exactly why its quality is contested [13][18].
- Reproducibility & contamination. Benchmark scores can be inflated by pre-training contamination (MLE-bench explicitly studies this) and shift with scaffolding differences; a "score" reported without the scaffold and dataset cutoff is nearly meaningless [14].
- The self-fooling failure mode. Autonomous pipelines optimize for looking successful (passing review, reporting gains) and will hallucinate confirming numbers unless externally grounded [13]. Human-in-the-loop verification, not autonomy, is currently the value driver — see also ai-for-science's "paradigm enhancement vs transition" framing.
Sources
- https://www.nature.com/articles/s41586-022-05172-4 (2026-06-14) — AlphaTensor, Nature
- https://deepmind.google/blog/discovering-novel-algorithms-with-alphatensor/ (2026-06-14)
- https://www.nature.com/articles/s41586-023-06004-9 (2026-06-14) — AlphaDev, Nature
- https://deepmind.google/blog/alphadev-discovers-faster-sorting-algorithms/ (2026-06-14)
- https://www.nature.com/articles/s41586-023-06924-6 (2026-06-14) — FunSearch, Nature
- https://deepmind.google/blog/funsearch-making-new-discoveries-in-mathematical-sciences-using-large-language-models/ (2026-06-14)
- https://arxiv.org/abs/2506.13131 (2026-06-14) — AlphaEvolve paper
- https://deepmind.google/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/ (2026-06-14)
- https://en.wikipedia.org/wiki/AlphaEvolve (2026-06-14) — incl. Waksman/Winograd caveats
- https://arxiv.org/abs/2506.13242 (2026-06-14) — 4×4 in 48 non-complex multiplications (context)
- https://github.com/sakanaai/ai-scientist-v2 (2026-06-14)
- https://sakana.ai/ai-scientist-first-publication/ (2026-06-14)
- https://arxiv.org/abs/2502.14297 (2026-06-14) — independent critique of AI Scientist v1
- https://arxiv.org/abs/2410.07095 (2026-06-14) — MLE-bench
- https://openai.com/index/mle-bench/ (2026-06-14)
- https://arxiv.org/abs/2411.15114 (2026-06-14) — RE-Bench
- https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/ (2026-06-14)
- https://arxiv.org/abs/2504.01848 (2026-06-14) — PaperBench
- https://openai.com/index/paperbench/ (2026-06-14)