AI Inference Engine Landscape
Related: inferact, radixark, gpu-kernel-optimization, ollama
Overview
AI inference engines are the infrastructure layer that serves trained LLMs to users at scale. The market is rapidly commercializing around two dominant open-source projects: vLLM and SGLang.
Market Structure (2026)
Open-Source Leaders
| Project | Stars | Commercial Entity | Valuation | Lead Investor |
|---|---|---|---|---|
| vLLM | ~65K | Inferact | $800M | a16z + Lightspeed |
| SGLang | ~16K | RadixArk | $400M | Accel |
Other Players
| Engine | Developer | Notes |
|---|---|---|
| TensorRT-LLM | NVIDIA | Most optimized for NVIDIA hardware, closed-ish |
| LMDeploy | Shanghai AI Lab (InternLM) | Strong INT4, TurboMind C++ engine |
| Xinference | Xorbits (Alibaba-affiliated) | Chinese market, distributed inference |
| Fireworks AI | Fireworks Inc. | $10B+ valuation, own engine |
Local / Single-User Engines
A distinct category from production serving engines — these are designed for running models on consumer hardware:
| Engine | Developer | Backend | Notes |
|---|---|---|---|
| ollama | Ollama Inc. | llama.cpp | One-command local serving, model registry, raised $20M (2026-04) |
| llama.cpp | Community (ggerganov) | Pure C++/CUDA/Metal | Maximum HW compatibility, GGUF quantization (2-8 bit), no built-in registry |
| MLX | Apple | Apple-native Metal | Best perf/watt on M-series Macs, SWA-native |
| LM Studio | LM Studio Inc. | llama.cpp | GUI + model browser, macOS/Windows |
Architectural relationship: Ollama and LM Studio are UX layers on top of llama.cpp; llama.cpp provides the C++ inference backend with GGUF quantization. MLX is a separate Apple-native stack that bypasses llama.cpp entirely on Apple Silicon. For production multi-user serving, vLLM/SGLang are the standard; local engines are for prototyping and single-user use cases.
Sliding Window Attention (SWA) Optimization
Multiple model families now use Sliding Window Attention to reduce KV cache memory pressure during long-text inference:
- Mimo-v2.5 (minimax): 60 layers of SWA, each computing over only a 128-token window. Long-text prefill compute is reported as equivalent to a traditional 10-layer global GQA model [local: 2026-05-30-summary.md].
- Gemma3 (Google): SWA auto-activates in supported engines, transparent to users.
- Qwen3 (qwen): Hybrid SWA architecture, user-transparent.
KV Cache memory formula: 2 × L × H_kv × D_h × T × B × bytes, where the leading 2 counts the key and value tensors, L = layers, H_kv = KV attention heads, D_h = head dimension, T = sequence length, B = batch size, and bytes = bytes per element (2 for FP16/BF16). Critical for deployment sizing of 1T+ MoE models like Kimi K2 (kimi).
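A minimal sketch of the formula as a sizing helper. The function name and the example configuration (32 layers, 8 KV heads, head dimension 128, 8K context) are hypothetical, chosen only to make the arithmetic easy to check; they do not describe any specific model listed here.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """KV cache size in bytes: 2 * L * H_kv * D_h * T * B * bytes.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem defaults to 2 (FP16/BF16); this default is an assumption.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical configuration, not taken from any model card.
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=8192, batch_size=1)
print(f"{size / 2**30:.2f} GiB")  # prints "1.00 GiB"
```

Because every term is multiplicative, doubling the batch size or the sequence length doubles the cache, which is why long-context, high-concurrency serving tends to be KV-bound rather than weight-bound.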
Engine support: vLLM, SGLang, llama.cpp/Ollama, and MLX all support SWA models — the optimization is architectural (model-level), not engine-specific. When a model uses SWA, the engine automatically applies the sliding window, requiring no user configuration.
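To illustrate what "applying the sliding window" means, here is a rough sketch in plain Python. It assumes a causal window that includes the current token; real models and engines may define the window boundary differently, and the helper names are hypothetical rather than any engine's API.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Causal (j <= i) and limited to the most recent `window` tokens,
    counting the current token (an assumed convention).
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


def cached_tokens(seq_len, window=None):
    """Tokens one layer must keep in its KV cache.

    A global-attention layer (window=None) keeps every token;
    an SWA layer needs at most `window` of them.
    """
    return seq_len if window is None else min(seq_len, window)


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if ok else "." for ok in row))

# With a 128-token window, a 100,000-token prompt needs only 128 cached tokens per SWA layer.
print(cached_tokens(100_000, window=128))
```

In the KV cache formula, this amounts to replacing T with min(T, window) for the SWA layers, while any global-attention layers in a hybrid model still scale with the full sequence length.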
DeepSeek's Strategic Choice
DeepSeek (models V3, R1, V3-0324) chose to contribute optimizations back to vLLM rather than building their own inference engine.
Logic:
- DeepSeek is a model company, not an infrastructure company; maintaining its own engine would require a dedicated team
- vLLM has the largest deployment base, so contributing to vLLM lets DeepSeek models reach more users
- vLLM is hardware-agnostic, so DeepSeek benefits regardless of what hardware users run
AI Lab Official Recommendations
| Lab | Models | Recommended Engines |
|---|---|---|
| DeepSeek | V3, R1, V3-0324 | SGLang (Day-0) + vLLM |
| Meta | Llama 4 | vLLM + SGLang + TensorRT-LLM |
| Google | Gemma 3/4 | vLLM |
| Mistral | Mistral Large 3 | vLLM + SGLang |
| Moonshot | Kimi K2, K2.5 | vLLM + SGLang |
Key Metrics
| Metric | SGLang | vLLM |
|---|---|---|
| H100 throughput | ~16,200 tok/s | ~12,500 tok/s |
| Multi-GPU scaling | TP + PP + EP | TP + PP |
| MoE support | Yes (DeepSeek V3/R1) | Yes |
| FP8 support | Partial | Yes (Hopper) |
Business Model
All commercial players follow the same model: the open-source engine is free, and enterprise managed services are paid.
Paid services include SLA guarantees, dedicated GPU clusters, commercial support, and hardware co-development.
Sources
- Inferact $150M seed round coverage (Fintool, Pulse2, a16z)
- SGLang GitHub: lmsys-org/sglang
- DeepSeek official model cards
- H100 benchmark data from various inference tests
- local: 2026-05-30-summary.md — SWA optimization, KV cache formula, Ollama/vLLM/llama.cpp/MLX landscape
- local: 2026-05-31-ai-infrastructure.md — raw research notes