AI Inference Engine Landscape

Overview

AI inference engines are the infrastructure layer that serves trained LLM models to users at scale. The market is rapidly commercializing around two dominant open-source projects: vLLM and SGLang.

Market Structure (2026)

Open-Source Leaders

Project	Stars	Commercial Entity	Valuation	Lead Investor
vLLM	~65K	Inferact	$800M	a16z + Lightspeed
SGLang	~16K	RadixArk	$400M	Accel

Other Players

Engine	Developer	Notes
TensorRT-LLM	NVIDIA	Most optimized for NVIDIA hardware, closed-ish
LMDeploy	Shanghai AI Lab (InternLM)	Strong INT4, TurboMind C++ engine
Xinference	Xorbits (阿里系)	Chinese market, distributed inference
Fireworks AI	Fireworks Inc.	$10B+ valuation, own engine

Local / Single-User Engines

A distinct category from production serving engines — these are designed for running models on consumer hardware:

Engine	Developer	Backend	Notes
ollama	Ollama Inc.	llama.cpp	One-command local serving, model registry, $20M funded 2026-04
llama.cpp	Community (ggerganov)	Pure C++/CUDA/Metal	Maximum HW compatibility, GGUF quantization (2-8 bit), no built-in registry
MLX	Apple	Apple-native Metal	Best perf/watt on M-series Macs, SWA-native
LM Studio	LM Studio Inc.	llama.cpp	GUI + model browser, macOS/Windows

Architectural relationship: Ollama and LM Studio are UX layers on top of llama.cpp; llama.cpp provides the C++ inference backend with GGUF quantization. MLX is a separate Apple-native stack that bypasses llama.cpp entirely on Apple Silicon. For production multi-user serving, vLLM/SGLang are the standard; local engines are for prototyping and single-user use cases.

Sliding Window Attention (SWA) Optimization

Multiple model families now use Sliding Window Attention to reduce KV cache memory pressure during long-text inference:

Mimo-v2.5 (minimax): 60-layer SWA computing only 128-token windows. Long-text prefill computation equivalent to traditional 10-layer global GQA [local: 2026-05-30-summary.md].
Gemma3 (Google): SWA auto-activates in supported engines, transparent to users.
Qwen3 (qwen): Hybrid SWA architecture, user-transparent.

KV Cache memory formula: 2 × L × H_kv × D_h × T × B × bytes — where L = layers, H_kv = KV attention heads, D_h = head dimension, T = sequence length, B = batch size. Critical for deployment sizing of 1T+ MoE models like Kimi K2 (kimi).

Engine support: vLLM, SGLang, llama.cpp/Ollama, and MLX all support SWA models — the optimization is architectural (model-level), not engine-specific. When a model uses SWA, the engine automatically applies the sliding window, requiring no user configuration.

DeepSeek's Strategic Choice

DeepSeek (models V3, R1, V3-0324) chose to contribute optimizations back to vLLM rather than building their own inference engine.

Logic:

DeepSeek is a model company, not infra company — would cost them a team to maintain an engine
vLLM has the largest deployment base — contributing to vLLM = DeepSeek models reach more users
vLLM is hardware-agnostic — DeepSeek benefits regardless of what hardware users have

AI Lab Official Recommendations

Lab	Models	Recommended Engines
DeepSeek	V3, R1, V3-0324	SGLang (Day-0) + vLLM
Meta	Llama 4	vLLM + SGLang + TensorRT-LLM
Google	Gemma 3/4	vLLM
Mistral	Mistral Large 3	vLLM + SGLang
Moonshot	Kimi K2, K2.5	vLLM + SGLang

Key Metrics

Metric	SGLang	vLLM
H100 throughput	~16,200 tok/s	~12,500 tok/s
Multi-GPU scaling	TP + PP + EP	TP + PP
MoE support	Yes (DeepSeek V3/R1)	Yes
FP8 support	Partial	Yes (Hopper)

Business Model

All commercial players follow: Open-source free + Enterprise managed services paid

Services charged: SLA guarantees, dedicated GPU clusters, commercial support, hardware co-development.

Sources

Inferact $150M seed round coverage (Fintool, Pulse2, a16z)
SGLang GitHub: lmsys-org/sglang
DeepSeek official model cards
H100 benchmark data from various inference tests
local: 2026-05-30-summary.md — SWA optimization, KV cache formula, Ollama/vLLM/llama.cpp/MLX landscape
local: 2026-05-31-ai-infrastructure.md — raw research notes