deepinfra/sglang
forked from sgl-project/sglang
Description: SGLang is a fast serving framework for large language models and vision language models.
Language: Python
License: Apache-2.0
Stars: 0
Forks: 0
Open issues: 0
Created: 2024-12-31T21:04:14Z
Pushed: 2025-10-14T23:58:37Z
Default branch: main
Fork: yes
Parent repository: sgl-project/sglang
Archived: no
README:
--------------------------------------------------------------------------------
| **Blog** | **Documentation** | **Join Slack** | **Join Bi-Weekly Development Meeting** | **Roadmap** | **Slides** |
News
- [2025/09] 🔥 Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II): 3.8x Prefill, 4.8x Decode Throughput (blog).
- [2025/09] 🔥 SGLang Day 0 Support for DeepSeek-V3.2 with Sparse Attention (blog).
- [2025/08] 🔔 SGLang x AMD SF Meetup on 8/22: Hands-on GPU workshop, tech talks by AMD/xAI/SGLang, and networking (Roadmap, Large-scale EP, Highlights, AITER/MoRI, Wave).
- [2025/08] SGLang provides day-0 support for the OpenAI gpt-oss model (instructions).
- [2025/05] Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs (blog).
- [2025/03] SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine (PyTorch blog).
- [2024/12] v0.4 Release: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs (blog).
More
- [2025/06] SGLang, the high-performance serving infrastructure powering trillions of tokens daily, has been awarded the third batch of the Open Source AI Grant by a16z (a16z blog).
- [2025/06] Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part I): 2.7x Higher Decoding Throughput (blog).
- [2025/03] Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X (AMD blog).
- [2025/02] Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU (AMD blog).
- [2025/01] SGLang provides day one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations (instructions, AMD blog, 10+ other companies).
- [2024/10] The First SGLang Online Meetup (slides).
- [2024/09] v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision (blog).
- [2024/07] v0.2 Release: Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) (blog).
- [2024/02] SGLang enables 3x faster JSON decoding with compressed finite state machine (blog).
- [2024/01] SGLang provides up to 5x faster inference with RadixAttention (blog).
- [2024/01] SGLang powers the serving of the official LLaVA v1.6 release demo (usage).
About
SGLang is a fast serving framework for large language models and vision language models. It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language. The core features include:
- Fast Backend Runtime: Provides efficient serving with RadixAttention for prefix caching, a zero-overhead CPU scheduler, prefill-decode disaggregation, speculative decoding, continuous batching, paged attention, tensor/pipeline/expert/data parallelism, structured outputs, chunked prefill, quantization (FP4/FP8/INT4/AWQ/GPTQ), and multi-LoRA batching.
- Flexible Frontend Language: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
- Extensive Model Support: Supports a wide range of generative models (Llama, Qwen, DeepSeek, Kimi, GPT, Gemma, Mistral, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
- Active Community: SGLang is open-source and backed by an active community with wide industry adoption.
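To make the idea of prefix caching from the feature list concrete, here is a minimal, self-contained sketch. It is a toy illustration only and not SGLang's RadixAttention implementation: it uses a plain token-level trie rather than a compressed radix tree, stores no KV tensors, and has no eviction. The names `TrieNode`, `ToyPrefixCache`, `match_prefix`, and `insert` are hypothetical and chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrieNode:
    # Maps the next token id to the child node that extends this prefix.
    children: Dict[int, "TrieNode"] = field(default_factory=dict)


class ToyPrefixCache:
    """Token-level trie standing in for a KV-cache index (hypothetical)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens of `tokens` are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: List[int]) -> int:
        """Cache a sequence; return how many tokens needed new entries."""
        node, created = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                node.children[tok] = TrieNode()
                created += 1
            node = node.children[tok]
        return created


if __name__ == "__main__":
    cache = ToyPrefixCache()
    system_prompt = [1, 2, 3, 4]           # shared prefix, e.g. a system prompt
    cache.insert(system_prompt + [10, 11])  # first request fills the cache

    request = system_prompt + [20, 21, 22]  # second request shares the prefix
    reused = cache.match_prefix(request)
    print(f"reused {reused} of {len(request)} tokens")
```

In this sketch the second request only needs fresh computation for the three tokens after the shared prefix. A real serving runtime would attach cached attention state to each node and evict entries under memory pressure, which this example deliberately leaves out.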
Getting Started
Benchmark and Performance
Learn more in the release blogs: v0.2 blog, v0.3 blog, v0.4 blog, [Large-scale expert…
Notability: 3.0/10. Routine fork by maintainer.