What does this fork signal mean?

Qwen (Alibaba Cloud) forked QwenLM/vllm (forked from vllm-project/vllm). This fork signal points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo QwenLM/vllm · parent vllm-project/vllm · Fork of vLLM by Qwen, moderate stars. onlylabs links this event to 1 captured evidence page and 1 related fork signal.

Qwen (Alibaba Cloud) Fork: QwenLM/vllm

Captured source

source ↗

GitHub/github.com/QwenLM/vllm

QwenLM/vllm repository metadata

Source ↗

published Jan 25, 2025seen Jun 5captured Jun 11http 200method plain

QwenLM/vllm

Description: A high-throughput and memory-efficient inference and serving engine for LLMs

License: Apache-2.0

Stars: 42

Forks: 17

Open issues: 0

Created: 2025-01-25T12:36:14Z

Pushed: 2025-01-26T13:38:33Z

Default branch: main

Fork: yes

Parent repository: vllm-project/vllm

Archived: no

README:

Easy, fast, and cheap LLM serving for everyone

---

*Latest News* 🔥

[2025/01] We hosted the eighth vLLM meetup with Google Cloud! Please find the meetup slides from vLLM team here.
[2024/12] vLLM joins pytorch ecosystem! Easy, Fast, and Cheap LLM Serving for Everyone!
[2024/11] We hosted the seventh vLLM meetup with Snowflake! Please find the meetup slides from vLLM team here, and Snowflake team here.
[2024/10] We have just created a developer slack (slack.vllm.ai) focusing on coordinating contributions and discussing features. Please feel free to join us there!
[2024/10] Ray Summit 2024 held a special track for vLLM! Please find the opening talk slides from the vLLM team here. Learn more from the talks from other vLLM contributors and users!
[2024/09] We hosted the sixth vLLM meetup with NVIDIA! Please find the meetup slides here.
[2024/07] We hosted the fifth vLLM meetup with AWS! Please find the meetup slides here.
[2024/07] In partnership with Meta, vLLM officially supports Llama 3.1 with FP8 quantization and pipeline parallelism! Please check out our blog post here.
[2024/06] We hosted the fourth vLLM meetup with Cloudflare and BentoML! Please find the meetup slides here.
[2024/04] We hosted the third vLLM meetup with Roblox! Please find the meetup slides here.
[2024/01] We hosted the second vLLM meetup with IBM! Please find the meetup slides here.
[2023/10] We hosted the first vLLM meetup with a16z! Please find the meetup slides here.
[2023/08] We would like to express our sincere gratitude to Andreessen Horowitz (a16z) for providing a generous grant to support the open-source development and research of vLLM.
[2023/06] We officially released vLLM! FastChat-vLLM integration has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.

---

About

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evloved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

State-of-the-art serving throughput
Efficient management of attention key and value memory with **PagedAttention**
Continuous batching of incoming requests
Fast model execution with CUDA/HIP graph
Quantizations: GPTQ, AWQ, INT4, INT8, and FP8.
Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
Speculative decoding
Chunked prefill

Performance benchmark: We include a performance benchmark at the end of our blog post. It compares the performance of vLLM against other LLM serving engines (TensorRT-LLM, SGLang and LMDeploy). The implementation is under [nightly-benchmarks folder](.buildkite/nightly-benchmarks/) and you can reproduce this benchmark using our one-click runnable script.

vLLM is flexible and easy to use with:

Seamless integration with popular Hugging Face models
High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
Tensor parallelism and pipeline parallelism support for distributed inference
Streaming outputs
OpenAI-compatible API server
Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron.
Prefix caching support
Multi-lora support

vLLM seamlessly supports most popular open-source models on HuggingFace, including:

Transformer-like LLMs (e.g., Llama)
Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
Embedding Models (e.g. E5-Mistral)
Multi-modal LLMs (e.g., LLaVA)

Find the full list of supported models here.

Getting Started

Install vLLM with pip or from source:

pip install vllm

Visit our documentation to learn more.

Excerpt shown — open the source for the full document.

Notability

notability 4.0/10

Fork of vLLM by Qwen, moderate stars