AI for Biology — Protein, Genome & Cell
Biology is arguably the scientific domain where AI has moved fastest from demonstration to infrastructure. The reason is structural: life encodes itself in sequences — proteins as strings of amino acids, genomes as strings of nucleotides — and decades of high-throughput sequencing have produced billions of these strings that can be learned self-supervised, exactly the regime in which large neural models excel. This module surveys the AI/ML methods that define the 2025–2026 frontier: structure prediction, generative protein design, sequence/genomic foundation models, and the more contested single-cell models. It is deliberately about the methods and systems, not wet-lab biology per se. See ai-for-science for the cross-domain framing and aifs-chemistry for the closely-coupled molecular-chemistry side.
1. Protein structure prediction
The prediction of 3D protein structure from sequence was the field's breakthrough use case. AlphaFold 2 (2021) used an Evoformer that reasoned jointly over a multiple-sequence alignment (MSA) and a pairwise residue representation, with an SE(3)-aware "structure module" emitting coordinates.
AlphaFold 3 (2024) is a substantial architectural redesign. It replaces the hand-built structure module with a diffusion network operating directly on atom coordinates — conceptually similar to image diffusion, starting from a noise cloud of atoms and denoising over many steps to a final structure [1]. The decisive change is scope: AF3 predicts the joint structure of complexes spanning proteins, nucleic acids, small-molecule ligands, ions and modified residues in one unified framework. It reports far greater accuracy than specialized docking tools for protein–ligand interactions, higher accuracy than nucleic-acid-specific predictors for protein–nucleic-acid interactions, and improved antibody–antigen accuracy over AlphaFold-Multimer v2.3 [1].
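The "noise cloud to structure" idea can be illustrated with a minimal sketch of coordinate noising. It assumes a simple variance-preserving schedule and an oracle that returns the true noise; AF3's actual network, noise schedule and conditioning are described in [1] and are not reproduced here. The function names `add_noise` and `recover_clean` are hypothetical.

```python
import math
import random

def add_noise(coords, alpha_bar, rng):
    """Forward process: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    eps = [[rng.gauss(0.0, 1.0) for _ in atom] for atom in coords]
    noisy = [[a * x + b * e for x, e in zip(atom, noise)]
             for atom, noise in zip(coords, eps)]
    return noisy, eps

def recover_clean(noisy, eps_pred, alpha_bar):
    """Invert the forward process given a predicted noise sample."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [[(x - b * e) / a for x, e in zip(atom, noise)]
            for atom, noise in zip(noisy, eps_pred)]

rng = random.Random(0)
x0 = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0]]  # three toy "atoms"
xt, eps = add_noise(x0, alpha_bar=0.1, rng=rng)           # mostly noise
x0_hat = recover_clean(xt, eps, alpha_bar=0.1)            # oracle: true noise
max_err = max(abs(p - q)
              for atom, atom_hat in zip(x0, x0_hat)
              for p, q in zip(atom, atom_hat))
```

A trained denoiser replaces the oracle with a network that predicts the noise (or the clean coordinates) from the noisy cloud and the conditioning inputs, and sampling repeats the step over a decreasing noise schedule.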
ESMFold (Meta AI, 2022–2023) takes a different route: it folds from a single sequence, with no MSA or templates, by reading structure out of the protein language model ESM-2 (variants reported up to 15B parameters) [2]. It is reported to be roughly an order of magnitude faster than AlphaFold 2 at inference, which enabled the >600M-protein ESM Metagenomic Atlas. Accuracy is competitive on sequences the language model predicts well (low perplexity) and weaker on the rest [2].
Current limits. These remain static structure predictors — conformational dynamics, allostery and folding pathways are out of scope [1]. Independent analyses report that AF3 can hallucinate order in intrinsically disordered regions, with one study reporting a sizable fraction of disorder-residue misalignment against DisProt [9]. Chirality and exotic stoichiometries remain weak spots, and accuracy degrades for sequences lacking evolutionary signal.
2. Protein design / de-novo
Where prediction reads structure from sequence, design runs the problem in reverse and generatively: it invents a new protein for a target function.
- RFdiffusion (Baker Lab) is a denoising diffusion model built on the RoseTTAFold network. It represents each residue as a rigid frame (a Cα coordinate plus an N–Cα–C orientation) and runs an SE(3)-equivariant diffusion: rotating or translating the input rotates or translates the output identically, so the distribution over generated backbones does not depend on global pose [3][4]. It generates protein backbones conditioned on motifs, symmetries or binding targets.
- ProteinMPNN solves the complementary inverse-folding problem: given a backbone, design an amino-acid sequence that will fold into it. The canonical pipeline is RFdiffusion → ProteinMPNN: geometry engine, then sequence designer [4].
- AlphaProteo (DeepMind, 2024) is a family of models targeting the de-novo binder problem directly. On a reported set of seven target proteins it achieved 3-to-300× better binding affinities and higher experimental success than prior methods, generating binders to targets including VEGF-A and a SARS-CoV-2 protein, often after a single round of medium-throughput screening [5].
- Flow-matching designers are the emerging alternative to score-based diffusion: continuous-normalizing-flow training (flow matching) on the SE(3) manifold offers straighter probability paths and faster sampling for backbone generation, and is an active 2025–2026 research line [4].
The unifying idea is equivariant generative modeling on a geometric manifold: proteins live in 3D, so these designers build the symmetries of 3D space into the generative process rather than leaving the model to learn them from data.
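The frame representation and its symmetry property can be sketched in a few lines. The example below builds a per-residue frame from N, Cα and C by Gram-Schmidt orthogonalization and checks that a neighbor's coordinates, expressed in that frame, do not change when the whole structure is rotated and translated. It is an illustrative sketch only: the helper names and toy coordinates are assumptions, and it does not reproduce any model's actual code.

```python
import math

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def unit(u):
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

def residue_frame(n, ca, c):
    """Gram-Schmidt frame: origin at CA, first axis along CA->C, second toward N."""
    e1 = unit(sub(c, ca))
    v2 = sub(n, ca)
    e2 = unit(sub(v2, [dot(v2, e1) * a for a in e1]))
    e3 = cross(e1, e2)
    return ca, (e1, e2, e3)

def local_coords(p, frame):
    """Express point p in the residue's own coordinate system."""
    origin, axes = frame
    return [dot(sub(p, origin), e) for e in axes]

def rigid_move(p, theta, shift):
    """Rotate about the z-axis by theta, then translate by shift."""
    ct, st = math.cos(theta), math.sin(theta)
    rotated = [ct * p[0] - st * p[1], st * p[0] + ct * p[1], p[2]]
    return [a + s for a, s in zip(rotated, shift)]

# Toy backbone atoms (angstrom-like units) and one neighboring atom.
n, ca, c = [-0.53, 1.36, 0.0], [0.0, 0.0, 0.0], [1.52, 0.0, 0.0]
neighbor = [2.0, 1.0, 3.0]
before = local_coords(neighbor, residue_frame(n, ca, c))

theta, shift = 0.7, [5.0, -2.0, 1.0]
n2, ca2, c2, nb2 = (rigid_move(p, theta, shift) for p in (n, ca, c, neighbor))
after = local_coords(nb2, residue_frame(n2, ca2, c2))
```

The frame itself moves with the structure (equivariance), while quantities measured inside it stay fixed (invariance); networks that operate on such local quantities inherit the symmetry by construction.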
3. Sequence & genomic foundation models
This is where "foundation model" is most literal — single self-supervised models pretrained on raw sequence and adapted to many tasks.
- ESM-2 / ESM3. ESM-2 is a masked protein language model whose internal representations encode structure [2]. ESM3 (EvolutionaryScale, 2024) is a multimodal generative successor that reasons jointly over sequence, structure and function, trained (per the lab) on 2.78B proteins. In a widely cited demonstration it generated a novel fluorescent protein at ~58% identity to known GFPs — characterized as equivalent to simulating ~500M years of evolution [6].
- Evo / Evo 2 (Arc Institute; Evo 2 with NVIDIA, 2025) are DNA foundation models built on the StripedHyena architecture (a convolution/attention hybrid for very long context). Evo 2 is reported at 7B and 40B parameters, trained on ~9.3 trillion DNA base pairs across >128,000 species, with context windows up to ~1 million nucleotides. That is enough to cover a small bacterial genome or a multi-gene locus with its regulatory surroundings in one pass, though not a human chromosome, which runs to tens or hundreds of megabases. It reports state-of-the-art zero-shot variant classification, e.g. on BRCA1 [7].
- AlphaGenome (DeepMind, 2025) targets the regulatory genome: it takes up to ~1 Mb of DNA and predicts thousands of functional genomic tracks (expression, chromatin accessibility, histone marks, TF binding, contact maps, splicing) at up to single-base resolution. It is reported to match or exceed the best external models on 24 of 26 variant-effect evaluations and is the only assessed model that jointly predicts all modalities [8].
- Nucleotide Transformer (InstaDeep, Nature Methods 2024) is a family of human/multi-species genomics LMs (reported 50M–2.5B parameters), with the multispecies 2.5B variant the strongest of its cohort across promoter and splicing tasks [9b]. A v3 line extends context toward 1 Mb.
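One common recipe for zero-shot variant scoring with a sequence model is a log-likelihood ratio: score the mutated sequence and the reference under the pretrained model and take the difference. The sketch below illustrates only that scoring logic. As a loudly labeled assumption, a tiny order-2 k-mer count model with add-one smoothing stands in for the foundation model; the corpus, function names and variant are all hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_kmer_model(seqs, k=2):
    """Count (context, next-base) pairs; a toy stand-in for a pretrained DNA LM."""
    counts, ctx_counts = Counter(), Counter()
    for s in seqs:
        for i in range(k, len(s)):
            counts[(s[i - k:i], s[i])] += 1
            ctx_counts[s[i - k:i]] += 1
    return counts, ctx_counts, k

def log_likelihood(seq, model):
    """Sum of log P(base | previous k bases) with add-one smoothing."""
    counts, ctx_counts, k = model
    total = 0.0
    for i in range(k, len(seq)):
        ctx, base = seq[i - k:i], seq[i]
        p = (counts[(ctx, base)] + 1) / (ctx_counts[ctx] + len(ALPHABET))
        total += math.log(p)
    return total

def variant_score(ref, pos, alt, model):
    """log P(mutated) - log P(reference); negative means less likely than ref."""
    mutated = ref[:pos] + alt + ref[pos + 1:]
    return log_likelihood(mutated, model) - log_likelihood(ref, model)

corpus = ["ACGTACGTACGTACGT", "CGTACGTACGTACGTA"]  # toy "pretraining" data
model = train_kmer_model(corpus)
ref = "ACGTACGTACGT"
score = variant_score(ref, pos=5, alt="A", model=model)  # C -> A at index 5
```

With a real model the same difference is computed from the network's token log-probabilities, and no variant labels are used at any point, which is what makes the evaluation zero-shot.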
Methods table
| Model | Architecture | Primary task | Year |
|---|---|---|---|
| AlphaFold 2 | Evoformer + structure module (MSA-based) | Single-chain structure | 2021 |
| AlphaFold 3 | Pairformer + atom diffusion | Multi-molecule complex structure | 2024 |
| ESMFold / ESM-2 | Protein LM → folding head (single-seq) | Fast structure prediction | 2022–23 |
| ESM3 | Multimodal masked/generative LM | Seq+struct+function generation | 2024 |
| RFdiffusion | SE(3)-equivariant diffusion (RoseTTAFold) | Backbone generation | 2023 |
| ProteinMPNN | Message-passing GNN | Inverse folding (sequence design) | 2022 |
| AlphaProteo | Generative binder design | De-novo binders | 2024 |
| Evo 2 | StripedHyena (long-context hybrid) | DNA/RNA/protein generation & prediction | 2025 |
| AlphaGenome | Unified DNA sequence model | Regulatory track + variant effect | 2025 |
| Nucleotide Transformer | Transformer (k-mer / single-base) | Genomic downstream tasks | 2023–24 |
| scGPT / Geneformer | Transformer over gene tokens | Single-cell representation | 2023–24 |
Comparison — structure predictors
| Property | AlphaFold 2 | AlphaFold 3 | ESMFold |
|---|---|---|---|
| Needs MSA? | Yes | Yes (retained) | No (single sequence) |
| Coordinate generator | Structure module | Diffusion (atom-level) | Folding head |
| Multi-molecule complexes | Limited (Multimer) | Native (protein/NA/ligand/ion) | Limited |
| Relative speed | Baseline | Slower (sampling) | ~10× faster (reported) [2] |
| Key weakness | Static; MSA-dependent | Static; disorder hallucination [9] | Lower accuracy on hard seqs |
4. Single-cell / omics foundation models — the honest debate
scGPT and Geneformer apply the LM recipe to single-cell transcriptomics, tokenizing genes/expression and pretraining on tens of millions of cells. The promise is a reusable "virtual cell" embedding for clustering, annotation and perturbation prediction.
The candid 2025 finding is that this promise is not yet established. Multiple benchmarks report that in zero-shot settings these models can be matched or beaten by far simpler baselines — highly-variable-gene selection, scVI, or Harmony batch correction — for tasks like batch integration and cell-type clustering [10]. Proposed explanations: masked-LM pretraining may not yield useful cell-level embeddings, or the models may not have truly learned the pretraining task; notably, larger pretraining data did not reliably help [10]. The takeaway is methodological humility — "foundation model" branding does not by itself beat well-tuned, task-specific baselines, and rigorous baselines are mandatory.
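Part of what makes these results striking is how little machinery the baselines need. As a rough illustration, highly-variable-gene selection can be reduced to ranking genes by their variance across cells. This is a simplified sketch under stated assumptions: production pipelines normalize counts first and typically use mean-adjusted dispersion rather than raw variance, and the matrix, gene names and function name here are hypothetical.

```python
from statistics import pvariance

def top_variable_genes(matrix, gene_names, k):
    """matrix: list of cells, each a list of expression values, one per gene."""
    scored = []
    for j, name in enumerate(gene_names):
        column = [cell[j] for cell in matrix]
        scored.append((pvariance(column), name))
    # Highest variance first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:k]]

cells = [
    [5.0, 0.0, 1.0],
    [5.0, 9.0, 1.2],
    [5.1, 0.0, 0.9],
    [4.9, 8.0, 1.1],
]
genes = ["HOUSEKEEPING", "MARKER", "LOW_VAR"]
selected = top_variable_genes(cells, genes, k=1)
```

Clustering or integration then runs on the selected columns only, which is the kind of baseline any embedding from a pretrained model should be compared against.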
General vs specialized
Biology is the domain closest to having genuine foundation models, and the reason is data-shaped. Protein and DNA sequences are (a) enormous in volume, (b) naturally self-supervisable (mask a residue/base and predict it), and (c) carry deep evolutionary signal that correlates with structure and function. This is the same precondition that made language models work, transplanted into biology — which is why sequence models (ESM, Evo, NT) generalize across tasks far more convincingly than, say, materials or fluid-dynamics models do.
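The self-supervision in point (b) needs no labels, only the sequences themselves. A minimal sketch of building one masked-prediction training example follows; the 15% mask rate is a common convention assumed here, and the mask symbol, function name and example sequence are hypothetical.

```python
import random

MASK = "#"  # placeholder mask token, chosen only for this sketch

def mask_sequence(seq, mask_rate, rng):
    """Return (masked sequence, {position: original residue}) for masked-LM training."""
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    chars = list(seq)
    targets = {}
    for p in positions:
        targets[p] = chars[p]   # what the model must predict
        chars[p] = MASK         # what the model actually sees
    return "".join(chars), targets

rng = random.Random(0)
protein = "MKTAYIAKQRQISFVKSHFSRQ"
masked, targets = mask_sequence(protein, mask_rate=0.15, rng=rng)
```

The model is trained to recover `targets` from `masked`; the same function works unchanged on a nucleotide string, which is why the recipe transfers between protein and DNA models.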
But "foundation" is uneven. Sequence modeling is the mature part; the cell and the regulatory genome are not. AlphaGenome and Evo 2 push context toward the megabase scale, yet integrating multi-omic, dynamic, spatial cell state into one model remains open, and the single-cell results above show that scale alone is not sufficient. The frontier is moving from one molecule toward systems.
Open problems
- Dynamics, not snapshots. Structure predictors output static conformations; ensembles, allostery and folding kinetics are largely unsolved.
- Disorder & hallucination. Intrinsically disordered regions are both biologically important and where models confidently hallucinate order [9].
- Trustworthy generative biology. De-novo design success rates are rising but experimental validation remains the only ground truth; in-silico metrics can be gamed.
- Genuine cell foundation models. Beating simple baselines robustly, and predicting perturbation response, is still open [10].
- Multimodal integration. Unifying sequence, structure, function, regulation and cell state in one model — the "virtual cell" goal — is unrealized.
- Evaluation rigor. The field needs leak-free, baseline-anchored benchmarks; optimistic zero-shot claims have repeatedly failed independent re-testing.
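On the evaluation-rigor point, one standard guard against leakage is to cluster sequences by similarity and hold out whole clusters, so that near-duplicates never sit on both sides of a split. The sketch below shows the idea with a deliberately crude similarity measure (Jaccard overlap of 3-mers) and greedy clustering; dedicated sequence-clustering tools would be used in practice, and the threshold, function names and sequences are assumptions.

```python
def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def greedy_clusters(seqs, threshold=0.5, k=3):
    """Assign each sequence to the first cluster whose representative is similar enough."""
    reps, clusters = [], []
    for name, seq in seqs.items():
        ks = kmers(seq, k)
        for i, rep in enumerate(reps):
            if jaccard(ks, rep) >= threshold:
                clusters[i].append(name)
                break
        else:  # no similar representative found: start a new cluster
            reps.append(ks)
            clusters.append([name])
    return clusters

def split_by_cluster(clusters, n_test_clusters=1):
    """Hold out whole clusters so near-duplicates never straddle train and test."""
    test = [n for c in clusters[-n_test_clusters:] for n in c]
    train = [n for c in clusters[:-n_test_clusters] for n in c]
    return train, test

seqs = {
    "a1": "MKTAYIAKQRQISFVK",
    "a2": "MKTAYIAKQRQISFVR",  # near-duplicate of a1
    "b1": "GSHMLEDPVDAFWQNE",
}
clusters = greedy_clusters(seqs)
train, test = split_by_cluster(clusters)
```

A random split of the same three sequences could place `a1` in training and `a2` in test, which would reward memorization rather than generalization.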
Sources
- [1] AlphaFold 3: Accurate structure prediction of biomolecular interactions with AlphaFold 3, Nature 630:493–500 (2024). https://www.nature.com/articles/s41586-024-07487-w
- [2] ESMFold / ESM-2: Evolutionary-scale prediction of atomic-level protein structure with a language model, Science (2023). https://www.science.org/doi/10.1126/science.ade2574
- [3] RFdiffusion / SE(3)-equivariant diffusion: background on frame-based equivariant protein diffusion. https://arxiv.org/pdf/2302.02277
- [4] RFdiffusion → ProteinMPNN pipeline & flow-matching design: review of generative protein design. https://www.sciencedirect.com/science/article/pii/S0959440X24000216
- [5] AlphaProteo: De novo design of high-affinity protein binders with AlphaProteo (2024). https://arxiv.org/abs/2409.08022
- [6] ESM3: Simulating 500 million years of evolution with a language model, Science (2024). https://www.science.org/doi/10.1126/science.ads0018
- [7] Evo 2: Arc Institute / NVIDIA DNA foundation model announcement (2025). https://arcinstitute.org/news/evo2
- [8] AlphaGenome: Advancing regulatory variant effect prediction with AlphaGenome, Nature (2025). https://www.nature.com/articles/s41586-025-10014-0
- [9] AF3 disorder hallucination: Hallucinations in AlphaFold3 for Intrinsically Disordered Proteins (2025). https://arxiv.org/pdf/2510.15939
- [9b] Nucleotide Transformer: Building and evaluating robust foundation models for human genomics, Nature Methods (2024). https://www.nature.com/articles/s41592-024-02523-z
- [10] Single-cell FM limits: Zero-shot evaluation reveals limitations of single-cell foundation models, Genome Biology (2025). https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03574-x