deepinfra/text-generation-inference
forked from huggingface/text-generation-inference
Captured source
source ↗deepinfra/text-generation-inference
Description: Large Language Model Text Generation Inference
Language: Python
License: Apache-2.0
Stars: 9
Forks: 2
Open issues: 6
Created: 2023-08-09T20:42:01Z
Pushed: 2023-12-15T21:31:46Z
Default branch: main
Fork: yes
Parent repository: huggingface/text-generation-inference
Archived: no
README:
A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFace to power LLMs api-inference widgets.
Note
This is a fork of https://github.com/huggingface/text-generation-inference before the restrictive license change. We will maintain this fork under the Apache 2.0 license. All contribution are welcome.
Table of contents
- [Features](#features)
- [Optimized Architectures](#optimized-architectures)
- [Get Started](#get-started)
- [Docker](#docker)
- [API Documentation](#api-documentation)
- [Using a private or gated model](#using-a-private-or-gated-model)
- [A note on Shared Memory](#a-note-on-shared-memory-shm)
- [Distributed Tracing](#distributed-tracing)
- [Local Install](#local-install)
- [CUDA Kernels](#cuda-kernels)
- [Run Falcon](#run-falcon)
- [Run](#run)
- [Quantization](#quantization)
- [Develop](#develop)
- [Testing](#testing)
- [Other supported hardware](#other-supported-hardware)
Features
- Serve the most popular Large Language Models with a simple launcher
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- Optimized transformers code for inference using flash-attention and Paged Attention on the most popular architectures
- Quantization with bitsandbytes and GPT-Q
- Safetensors weight loading
- Watermarking with A Watermark for Large Language Models
- Logits warper (temperature scaling, top-p, top-k, repetition penalty, more details see transformers.LogitsProcessor)
- Stop sequences
- Log probabilities
- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
Optimized architectures
Other architectures are supported on a best effort basis using:
AutoModelForCausalLM.from_pretrained(, device_map="auto")
or
AutoModelForSeq2SeqLM.from_pretrained(, device_map="auto")
Get started
Docker
The easiest way of getting started is using the official Docker container:
model=tiiuae/falcon-7b-instruct volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9.4 --model-id $model
Note: To use GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 11.8 or higher.
To see all options to serve your models (in the code or in the cli:
text-generation-launcher --help
You can then query the model using either the /generate or /generate_stream routes:
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'curl 127.0.0.1:8080/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'or from Python:
pip install text-generation
from text_generation import Client
client = Client("http://127.0.0.1:8080")
print(client.generate("What is Deep Learning?", max_new_tokens=20).generated_text)
text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=20):
if not response.token.special:
text += response.token.text
print(text)API documentation
You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route. The Swagger UI is also available at: https://deepinfra.github.io/text-generation-inference.
Using a private or gated model
You have the option to utilize the HUGGING_FACE_HUB_TOKEN environment variable for configuring the token employed by text-generation-inference. This allows you to gain access to protected resources.
For example, if you want to serve the gated Llama V2 model variants:
1. Go to https://huggingface.co/settings/tokens 2. Copy your cli READ token 3. Export HUGGING_FACE_HUB_TOKEN=
or with Docker:
model=meta-llama/Llama-2-7b-chat-hf volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run token= docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9.3 --model-id $model
A note on Shared Memory (shm)
`NCCL` is a communication framework used by PyTorch to do distributed training/inference. text-generation-inference make use of NCCL to enable Tensor Parallelism to dramatically speed up inference for large language models.
In order to…
Excerpt shown — open the source for the full document.