
AI Inference Engine Landscape

Related: inferact, radixark, gpu-kernel-optimization, ollama

Overview

AI inference engines are the infrastructure layer that serves trained LLMs to users at scale. The market is rapidly commercializing around two dominant open-source projects: vLLM and SGLang.

Market Structure (2026)

Open-Source Leaders

| Project | Stars | Commercial Entity | Valuation | Lead Investor |
| --- | --- | --- | --- | --- |
| vLLM | ~65K | Inferact | $800M | a16z + Lightspeed |
| SGLang | ~16K | RadixArk | $400M | Accel |

Other Players

| Engine | Developer | Notes |
| --- | --- | --- |
| TensorRT-LLM | NVIDIA | Most optimized for NVIDIA hardware, closed-ish |
| LMDeploy | Shanghai AI Lab (InternLM) | Strong INT4, TurboMind C++ engine |
| Xinference | Xorbits (Alibaba-affiliated) | Chinese market, distributed inference |
| Fireworks AI | Fireworks Inc. | $10B+ valuation, own engine |

Local / Single-User Engines

These engines form a distinct category from production serving engines: they are designed for running models on consumer hardware.

| Engine | Developer | Backend | Notes |
| --- | --- | --- | --- |
| ollama | Ollama Inc. | llama.cpp | One-command local serving, model registry, $20M funded 2026-04 |
| llama.cpp | Community (ggerganov) | Pure C++/CUDA/Metal | Maximum HW compatibility, GGUF quantization (2-8 bit), no built-in registry |
| MLX | Apple | Apple-native Metal | Best perf/watt on M-series Macs, SWA-native |
| LM Studio | LM Studio Inc. | llama.cpp | GUI + model browser, macOS/Windows |

Architectural relationship: Ollama and LM Studio are UX layers on top of llama.cpp; llama.cpp provides the C++ inference backend with GGUF quantization. MLX is a separate Apple-native stack that bypasses llama.cpp entirely on Apple Silicon. For production multi-user serving, vLLM/SGLang are the standard; local engines are for prototyping and single-user use cases.

Sliding Window Attention (SWA) Optimization

Multiple model families now use Sliding Window Attention to reduce KV cache memory pressure during long-text inference:

  • Mimo-v2.5 (minimax): 60 SWA layers, each attending only within a 128-token window. Long-text prefill compute is reported as equivalent to a traditional 10-layer global GQA model [local: 2026-05-30-summary.md].
  • Gemma 3 (Google): SWA auto-activates in supported engines, transparent to users.
  • Qwen3 (qwen): Hybrid SWA architecture, user-transparent.

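KV cache memory formula: 2 × L × H_kv × D_h × T × B × bytes, where the leading 2 counts keys and values, L = layers, H_kv = KV attention heads, D_h = head dimension, T = sequence length, B = batch size, and bytes = bytes per stored element (for example, 2 for FP16/BF16). In SWA layers, T is capped at the window size. This estimate is critical for deployment sizing of 1T+ MoE models like Kimi K2 (kimi).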
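The formula can be evaluated directly when sizing a deployment. The following is a minimal sketch; the function name `kv_cache_bytes` and the example configuration (32 layers, 8 KV heads, head dimension 128) are illustrative assumptions, not values taken from any model discussed here.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """KV cache size in bytes: 2 x L x H_kv x D_h x T x B x bytes.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem defaults to 2, which corresponds to FP16/BF16 storage.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical configuration, chosen only for illustration.
size = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                      seq_len=8192, batch_size=1, bytes_per_elem=2)
print(f"{size / 2**30:.2f} GiB")  # prints "1.00 GiB"
```

Because every factor enters multiplicatively, doubling the batch size or the sequence length doubles the cache, which is why long-context, high-concurrency serving tends to be limited by KV cache memory rather than by weights alone.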

Engine support: vLLM, SGLang, llama.cpp/Ollama, and MLX all support SWA models — the optimization is architectural (model-level), not engine-specific. When a model uses SWA, the engine automatically applies the sliding window, requiring no user configuration.
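To make the mechanism concrete, here is a toy sketch of what "applying the sliding window" means: a causal mask restricted to the last `window` positions, and a per-layer cache that evicts entries older than the window. The names `sliding_window_mask` and `RollingKVCache` are hypothetical, and real engines use tensor-based or paged caches rather than Python containers.

```python
from collections import deque


def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Causal sliding window: j <= i and j > i - window.
    """
    return [[(i - window < j <= i) for j in range(seq_len)]
            for i in range(seq_len)]


class RollingKVCache:
    """Keeps only the most recent `window` (key, value) pairs for one SWA layer."""

    def __init__(self, window):
        self.entries = deque(maxlen=window)

    def append(self, key, value):
        # When the deque is full, the oldest entry is evicted automatically.
        self.entries.append((key, value))

    def __len__(self):
        return len(self.entries)


mask = sliding_window_mask(seq_len=6, window=3)
print(sum(mask[5]))  # position 5 attends to positions 3, 4, 5: prints 3

cache = RollingKVCache(window=128)
for t in range(1000):
    cache.append(f"k{t}", f"v{t}")  # placeholder strings stand in for tensors
print(len(cache))  # prints 128
```

The cache length stays at the window size no matter how many tokens are processed, which is the source of the memory savings in SWA layers. In hybrid architectures, any global-attention layers would still need a full-length cache.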

DeepSeek's Strategic Choice

DeepSeek (models V3, R1, V3-0324) chose to contribute optimizations back to vLLM rather than building their own inference engine.

Logic:

  1. DeepSeek is a model company, not an infrastructure company; maintaining its own engine would tie up a dedicated team.
  2. vLLM has the largest deployment base, so contributing to vLLM means DeepSeek models reach more users.
  3. vLLM is hardware-agnostic, so DeepSeek benefits regardless of which hardware users run.

AI Lab Official Recommendations

| Lab | Models | Recommended Engines |
| --- | --- | --- |
| DeepSeek | V3, R1, V3-0324 | SGLang (Day-0) + vLLM |
| Meta | Llama 4 | vLLM + SGLang + TensorRT-LLM |
| Google | Gemma 3/4 | vLLM |
| Mistral | Mistral Large 3 | vLLM + SGLang |
| Moonshot | Kimi K2, K2.5 | vLLM + SGLang |

Key Metrics

| Metric | SGLang | vLLM |
| --- | --- | --- |
| H100 throughput | ~16,200 tok/s | ~12,500 tok/s |
| Multi-GPU scaling | TP + PP + EP | TP + PP |
| MoE support | Yes (DeepSeek V3/R1) | Yes |
| FP8 support | Partial | Yes (Hopper) |

Business Model

All commercial players follow the same model: a free open-source core plus paid enterprise managed services.

Paid services include SLA guarantees, dedicated GPU clusters, commercial support, and hardware co-development.

Sources

  • Inferact $150M seed round coverage (Fintool, Pulse2, a16z)
  • SGLang GitHub: lmsys-org/sglang
  • DeepSeek official model cards
  • H100 benchmark data from various inference tests
  • local: 2026-05-30-summary.md — SWA optimization, KV cache formula, Ollama/vLLM/llama.cpp/MLX landscape
  • local: 2026-05-31-ai-infrastructure.md — raw research notes
Last compiled: 2026-05-31