AI for Operations Research — Learned Optimization & Solvers

Operations Research (OR) is the discipline of mathematical optimization: routing, scheduling, resource allocation, and the broad family of combinatorial optimization (CO) problems. Classical OR has decades of mature machinery — branch-and-bound, cutting planes, metaheuristics, mixed-integer programming (MIP) solvers. The "AI for OR" agenda asks whether machine learning can learn heuristics, branching rules, or whole solution policies that adapt to the structure of a problem distribution, rather than relying on hand-engineered rules. This module surveys the AI/ML angle only: where learning is plugged into the OR pipeline, what is genuinely competitive, and what remains aspirational. See ai-for-science for the broader program and aifs-ml-self for ML that improves ML systems.

The central tension throughout: learned methods are fast and amortize over a distribution, but well-tuned classical solvers remain hard to beat on realistic, large, irregular instances [9]. Read the "Does it beat classical OR?" section before drawing conclusions.

1. Neural combinatorial optimization (NCO)

The earliest and most studied thread treats CO as a sequence-generation or policy problem and trains a neural network to construct solutions.

Pointer Networks (Vinyals et al., 2015) introduced an attention mechanism whose output is a pointer into the input set, enabling variable-length combinatorial outputs (e.g., a permutation of cities for TSP). This made end-to-end neural TSP solvers conceivable.
Attention models for routing — Kool, van Hoof & Welling, "Attention, Learn to Solve Routing Problems!" (ICLR 2019) [1] replaced recurrence with a Transformer-style encoder/decoder and trained with REINFORCE using a greedy rollout baseline. It became the canonical construction-policy baseline for TSP, the Capacitated Vehicle Routing Problem (CVRP), and variants.
POMO — "Policy Optimization with Multiple Optima" (Kwon et al., NeurIPS 2020) [3] exploits the symmetry of CO solutions (a tour has many equivalent starting points / orderings). By rolling out multiple trajectories per instance and using a shared baseline, POMO sharply improved both training stability and final tour quality over the attention model, and remains a strong NCO backbone.

Two broad NCO paradigms:

Constructive (autoregressive) policies — build a solution token by token (POMO, attention model).
Improvement / search heuristics — learn to edit an existing solution (2-opt moves, neural local search), or learn a "heatmap" of promising edges that guides a classical local search such as LKH [9].

Tooling has consolidated: RL4CO (Berto et al., 2023) [2] is a widely used open benchmark/library standardizing environments, baselines (attention model, POMO, and others), and training loops for routing CO.

2. ML inside exact MIP solvers

Rather than replacing the solver, this thread learns components of a branch-and-bound (B&B) MIP solver, preserving optimality guarantees while trying to cut wall-clock time.

Learning to branch — Gasse et al. (NeurIPS 2019) framed variable selection as imitation of the (expensive) strong branching expert, representing the MIP as a bipartite variable–constraint graph and learning a graph convolutional network (GCNN) branching policy. This is the reference "learning to branch" result and seeded a large follow-up literature (hybrid models, node selection, cut selection) [10].
Neural Diving + Neural Branching — Nair et al., "Solving Mixed Integer Programs Using Neural Networks" (DeepMind, 2020) [4]. Neural Diving learns a generative model over integer-variable assignments to produce high-quality partial assignments; the residual MIP over unassigned variables is finished by a base solver (e.g., SCIP). Neural Branching learns a branching policy via imitation. Reported gains were on Google production MIP datasets.
Learning to cut — selecting which cutting planes to add is another learnable decision; the same graph-representation idea has been applied to cut selection and separation [10].

Integration targets are the standard solvers: open-source SCIP (the usual research substrate because its internals are accessible) and commercial Gurobi (closed; ML typically wraps it via callbacks/warm starts rather than modifying internals).

3. Solver acceleration & hybrids

The most pragmatic wins come from ML guiding a classical search rather than owning it end to end.

Learned Large Neighborhood Search (LNS) — "Learning a Large Neighborhood Search Algorithm for Mixed Integer Programs" (DeepMind, 2021) [5]. Neural Diving produces an initial assignment; a learned Neural Neighborhood Selection policy (trained by imitation of a target policy that, with enough compute, picks the neighborhood containing the optimal next assignment) chooses which variables to re-optimize each round, and a MIP solver searches that neighborhood. The paper reports matching or beating baselines on five real-world MIP datasets, with 2× to 37.8× better average primal gap than the best baseline on three datasets at long running times [5]. (Attribute these numbers to that paper's datasets — do not generalize them.)
ML-guided heatmaps + LKH — a learned edge-scoring network seeds or biases the classical LKH local search for TSP, a recurring hybrid pattern [9].
Warm-starting and configuration — predicting good initial solutions, solver parameters, or which heuristic to run, on top of OR-Tools / SCIP / Gurobi.

The common thread: ML supplies priors (where to search, what to fix), and the classical solver supplies correctness and refinement.

4. RL for scheduling, routing, control, and chip design

Reinforcement learning has been applied to sequential OR-style decision problems beyond pure CO benchmarks:

Chip floorplanning / macro placement — Google "AlphaChip" (Nature 2021; re-branded 2024) poses macro placement as an RL problem and reports human-competitive or superior layouts, with layouts claimed to have taped out in multiple TPU generations [6].
- Honest note — reproducibility controversy. This result is contested. External and some internal critiques argue the claimed advantage over classical placers is not reproducible; Google counters that failed reproductions skipped pre-training, used far fewer compute resources, or did not train to convergence [6][7]. As of late 2025, no independent peer-reviewed reproduction confirming the Nature claims had been published, while Google maintains the method is deployed in production [6]. Treat the headline "RL beats classical chip placement" claim as disputed, not settled.
Data-center cooling / control — RL for facility control is frequently cited as an applied OR-flavored success, though published, audited details are thinner than the chip-placement case; treat specific energy-savings figures with caution.
Logistics / dynamic routing — RL construction policies (Sections 1, 5) are the research vehicle here.

5. Decision-focused (end-to-end) learning

Most deployed OR is predict-then-optimize (PtO): a model predicts uncertain parameters (demand, travel times, prices), then a solver optimizes given the predictions. Standard ML trains the predictor to minimize prediction error, which is misaligned with decision quality.

Decision-focused learning (DFL) trains the predictor to minimize downstream regret — the cost of the decisions induced by its predictions [8]. The hard part is differentiating through argmin:

OptNet (Amos & Kolter, 2017) embeds a quadratic program as a differentiable layer, backpropagating via implicit differentiation of the KKT conditions.
CVXPY layers / cvxpylayers (Agrawal et al., 2019) generalize this to disciplined convex programs, so a convex solve can sit inside a network and pass gradients.
Smart "Predict, then Optimize" (SPO/SPO+) and a family of surrogate losses approximate decision loss for linear/MILP objectives where the true gradient is zero almost everywhere [8].

Recent work (2024–2026) focuses on data efficiency and surrogate quality — e.g., decision-focused fine-tuning with limited data [8], surrogate modeling for MILP, and reducing the per-step solver cost that makes DFL expensive to train. The honest caveat: each training step typically requires one or more solver calls, so DFL is computationally heavy and is mostly applied to convex or moderate-size problems.

6. LLMs for optimization

A newer thread uses large language models (see aifs-ml-self) as either the optimizer or the modeler.

LLM-as-optimizer — OPRO, "Large Language Models as Optimizers" (DeepMind, 2023/2024) [9-ref]. The optimization problem is described in natural language; at each step the LLM is shown prior solutions with their objective values and asked to propose better ones. OPRO was demonstrated on small linear regression and TSP instances, then most impactfully on prompt optimization (reporting up to ~8% gain on GSM8K and up to ~50% on Big-Bench Hard tasks over human prompts) [9-ref]. Note: OPRO is a meta-heuristic that does not beat dedicated solvers on real CO — its real strength is black-box / textual objectives.
Agentic OR modeling / NL-to-MILP — using LLMs to translate a natural-language problem statement into a formal MILP/LP model (objective, constraints) callable by Gurobi/SCIP/OR-Tools. This lowers the modeling barrier (the historical bottleneck in OR adoption) rather than improving the solve itself. Reliability — correct constraint extraction, units, infeasibility diagnosis — is the open question, and benchmarks here are young.

7. Methods table

Approach	Problem class	Example system / paper	Maturity
Pointer / attention construction	Routing (TSP, CVRP)	Kool et al. attention model [1]	Mature baseline
Symmetry-aware RL construction	Routing CO	POMO [3]	Mature baseline
Learned local-search / heatmap	TSP, large CO	GNN-guided LKH [9]	Active research
Learning to branch (GNN)	MIP (exact)	Gasse et al. GCNN [10]	Maturing, in research solvers
Neural Diving / Branching	MIP (exact)	Nair et al. (DeepMind) [4]	Research, production datasets
Learned LNS	MIP (heuristic)	DeepMind LNS [5]	Research, strong primal gaps
Differentiable opt layers	Predict-then-optimize	OptNet, cvxpylayers [8]	Mature for convex
Decision-focused / SPO+	Linear/MILP decisions	SPO+, DFL fine-tuning [8]	Active research
LLM-as-optimizer	Black-box / textual	OPRO [9-ref]	Niche (not for hard CO)
LLM NL-to-MILP modeling	Any (modeling layer)	Agentic OR pipelines	Emerging

8. Does it beat classical OR? (the honest section)

The marketing answer is "yes"; the careful answer is "rarely, and only under matched conditions." The most rigorous recent evidence:

FrontierCO (2025) [9] is a real-world / large-scale benchmark built precisely because most NCO progress was measured on small, synthetic instances. Its finding: when classical solvers are also run in fast heuristic mode under matched time budgets, neural solvers remain substantially less effective than LKH-3 and other tuned classical methods on competition-grade and industrial instances [9].
Generalization is the core weakness. Neural CO heuristics trained on one instance distribution (size, structure) degrade on out-of-distribution instances; they close the gap to classical metaheuristics mainly when training and test distributions are aligned [9].
Exact-solver ML is more credible. Learning to branch and learned LNS sit inside a correct solver, so they cannot return wrong answers — the question is only speed, and gains are real but problem-specific [4][5][10].
AlphaChip remains the highest-profile contested "RL beats classical OR" claim [6][7].

Reasonable summary: ML's clearest, defensible wins in OR are as accelerators and priors inside classical solvers (branching, LNS, warm starts) and as a modeling aid (NL-to-MILP). End-to-end neural construction that outright beats a tuned LKH/Concorde/Gurobi on realistic large instances has not been convincingly demonstrated as of mid-2026.

9. Open problems

Generalization across instance sizes & structure. Train on n≈100, deploy on n≈10,000 — current NCO methods degrade sharply; size-agnostic policies are open.
Optimality guarantees. Construction policies give no bound. Coupling learned speed with certificate-producing exact methods (the inside-the-solver route) is the more guarantee-friendly path.
Fair, large-scale evaluation. FrontierCO and similar benchmarks exposed how much prior "SOTA" was synthetic-only; the field needs matched time budgets and industrial instances by default [9].
Training cost of DFL. Differentiable optimization requires solver calls in the loop; scaling DFL to large MILP remains expensive [8].
Deployment & trust. Integrating learned components into commercial solvers (closed-source Gurobi), handling distribution shift in production, and verifying feasibility are practical blockers.
LLM modeling reliability. NL-to-MILP must guarantee correct, feasible, intended models before it can be trusted in operational pipelines.

Sources

https://arxiv.org/abs/1803.08475 (2026-06-14) — Kool, van Hoof, Welling, "Attention, Learn to Solve Routing Problems!" (ICLR 2019).
https://arxiv.org/pdf/2306.17100 (2026-06-14) — Berto et al., "RL4CO: an Extensive RL for Combinatorial Optimization Benchmark."
https://proceedings.neurips.cc/paper/2020/hash/d1e946f4e67db4b362ad23818a6fb78a-Abstract.html (2026-06-14) — POMO and hybrid branching context (NeurIPS 2020).
https://arxiv.org/abs/2012.13349 (2026-06-14) — Nair et al. (DeepMind), "Solving Mixed Integer Programs Using Neural Networks" (Neural Diving / Branching).
https://arxiv.org/pdf/2107.10201 (2026-06-14) — "Learning a Large Neighborhood Search Algorithm for Mixed Integer Programs" (DeepMind, 2021).
https://en.wikipedia.org/wiki/AlphaChip_(controversy) (2026-06-14) — overview of the AlphaChip / macro-placement reproducibility dispute.
https://arxiv.org/html/2411.10053v1 (2026-06-14) — "That Chip Has Sailed: A Critique of Unfounded Skepticism Around AI for Chip Design" (Google response).
https://arxiv.org/pdf/2501.01874 (2026-06-14) — "DFF: Decision-Focused Fine-tuning for Smarter Predict-then-Optimize with Limited Data" (DFL / OptNet / cvxpylayers context).
https://arxiv.org/pdf/2505.16952 (2026-06-14) — "FrontierCO: Real-World and Large-Scale Evaluation of Machine Learning Solvers for Combinatorial Optimization."
https://arxiv.org/pdf/2511.09209 (2026-06-14) — recent learning-for-MILP work summarizing the Gasse et al. learning-to-branch / cut-selection lineage.

[9-ref] https://arxiv.org/abs/2309.03409 (2026-06-14) — Yang et al. (DeepMind), "Large Language Models as Optimizers" (OPRO).