mistralai/vllm-release
forked from vllm-project/vllm
Description: A high-throughput and memory-efficient inference and serving engine for LLMs
Language: Python
License: Apache-2.0
Stars: 55
Forks: 14
Open issues: 0
Created: 2023-09-27T12:23:21Z
Pushed: 2023-12-11T08:56:05Z
Default branch: main
Fork: yes
Parent repository: vllm-project/vllm
Archived: yes
README:
Easy, fast, and cheap LLM serving for everyone
| Documentation | Blog | Paper | Discord |
---
*Latest News* 🔥
- [2023/12] Added ROCm support to vLLM.
- [2023/10] We hosted the first vLLM meetup in SF! Please find the meetup slides here.
- [2023/09] We created our Discord server! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
- [2023/09] We released our PagedAttention paper on arXiv!
- [2023/08] We would like to express our sincere gratitude to Andreessen Horowitz (a16z) for providing a generous grant to support the open-source development and research of vLLM.
- [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
- [2023/06] Serving vLLM on any cloud with SkyPilot. Check out a 1-click example to start the vLLM demo, and the blog post for the story behind vLLM development on the clouds.
- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered LMSYS Vicuna and Chatbot Arena since mid-April. Check out our blog post.
---
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels
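The idea behind the PagedAttention bullet above can be illustrated with a toy block allocator. This is a minimal sketch, not vLLM's implementation: the class `ToyPagedKVCache`, its method names, and the block size of 16 are all assumptions made for illustration. It only shows the bookkeeping idea of storing each sequence's key/value cache in fixed-size blocks that are allocated on demand and returned to a shared pool when the request finishes.

```python
class ToyPagedKVCache:
    """Toy bookkeeping for a paged KV cache (hypothetical, not vLLM code)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size                 # tokens per block (assumed value)
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                       # seq_id -> list of block ids
        self.seq_lens = {}                           # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed when the sequence has none yet or its last one is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks left in the pool")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return every block of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = ToyPagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):
    cache.append_token("req-0")
print(len(cache.block_tables["req-0"]))  # 2: 17 tokens span two 16-token blocks
cache.free("req-0")
print(len(cache.free_blocks))            # 4: all blocks are back in the pool
```

Because blocks are handed out one at a time, a sequence wastes at most one partially filled block, which is what lets many requests of different lengths share the same memory pool.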
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA CUDA and AMD ROCm
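As a rough illustration of the OpenAI-compatible API server listed above, the sketch below builds a completions request using only the Python standard library. It assumes a server was started separately (for example with `python -m vllm.entrypoints.openai.api_server --model <model>`) and is listening on `localhost:8000`; the host, port, model name, and the helper names `build_completion_request` and `complete` are assumptions for this example, not values taken from this README.

```python
import json
import urllib.request

# Assumed address; adjust to wherever the server is actually listening.
BASE_URL = "http://localhost:8000/v1"


def build_completion_request(prompt, model="mistralai/Mistral-7B-v0.1",
                             n=1, max_tokens=64, temperature=0.8):
    """Build (but do not send) an OpenAI-style completions request."""
    payload = {
        "model": model,            # should match the model the server was started with
        "prompt": prompt,
        "n": n,                    # n > 1 asks for several samples of the same prompt
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated texts (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return [choice["text"] for choice in body["choices"]]
```

With a server running, `complete("San Francisco is a", n=2)` would return two sampled continuations. Since the endpoint follows the OpenAI request shape, existing OpenAI client code can usually be pointed at it by changing only the base URL.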
vLLM seamlessly supports many Hugging Face models, including the following architectures:
- Aquila & Aquila2 (`BAAI/AquilaChat2-7B`, `BAAI/AquilaChat2-34B`, `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.)
- Baichuan & Baichuan2 (`baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc.)
- BLOOM (`bigscience/bloom`, `bigscience/bloomz`, etc.)
- ChatGLM (`THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, etc.)
- Falcon (`tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.)
- GPT-2 (`gpt2`, `gpt2-xl`, etc.)
- GPT BigCode (`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, etc.)
- GPT-J (`EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.)
- GPT-NeoX (`EleutherAI/gpt-neox-20b`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.)
- InternLM (`internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.)
- LLaMA & LLaMA-2 (`meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
- Mistral (`mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.)
- Mixtral (`mistralai/Mixtral-8x7B-v0.1`, `mistralai/Mixtral-8x7B-Instruct-v0.1`, etc.)
- MPT (`mosaicml/mpt-7b`, `mosaicml/mpt-30b`, etc.)
- OPT (`facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.)
- Phi-1.5 (`microsoft/phi-1_5`, etc.)
- Qwen (`Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.)
- Yi (`01-ai/Yi-6B`, `01-ai/Yi-34B`, etc.)
Install vLLM with pip or from source:
pip install vllm
Getting Started
Visit our documentation to get started.
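For orientation, a minimal offline-inference sketch is shown here. It assumes vLLM is installed and a supported GPU is available; the wrapper `generate_offline`, the default model `facebook/opt-125m`, and the sampling values are choices made for this example rather than anything prescribed by this README. `LLM`, `SamplingParams`, `n`, and `tensor_parallel_size` are vLLM names, but check the documentation for the version you install, since arguments may differ between releases.

```python
def generate_offline(prompts, model="facebook/opt-125m", n=1, tensor_parallel_size=1):
    """Generate completions in-process and return {prompt: [texts]}.

    Hypothetical helper: requires the vllm package and a supported GPU.
    """
    # Imported lazily so this sketch can be loaded on machines without vllm.
    from vllm import LLM, SamplingParams

    # n > 1 requests several samples per prompt (parallel sampling).
    sampling_params = SamplingParams(n=n, temperature=0.8, top_p=0.95, max_tokens=64)

    # tensor_parallel_size > 1 shards the model across that many GPUs.
    llm = LLM(model=model, tensor_parallel_size=tensor_parallel_size)

    results = {}
    for request_output in llm.generate(prompts, sampling_params):
        results[request_output.prompt] = [c.text for c in request_output.outputs]
    return results
```

A call such as `generate_offline(["Hello, my name is", "The capital of France is"], n=2)` would return two sampled continuations per prompt. Any model identifier from a supported architecture can be passed as `model`.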
Contributing
We welcome and value any contributions and collaborations. Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.
Citation
If you use vLLM for your research, please cite our paper:
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}