Fork · DeepInfra · published Aug 9, 2023

deepinfra/text-generation-inference

forked from huggingface/text-generation-inference

Description: Large Language Model Text Generation Inference

Language: Python

License: Apache-2.0

Stars: 9

Forks: 2

Open issues: 6

Created: 2023-08-09T20:42:01Z

Pushed: 2023-12-15T21:31:46Z

Default branch: main

Fork: yes

Parent repository: huggingface/text-generation-inference

Archived: no

README:

A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFace to power LLMs api-inference widgets.

Note

This is a fork of https://github.com/huggingface/text-generation-inference before the restrictive license change. We will maintain this fork under the Apache 2.0 license. All contributions are welcome.

Table of contents

  • [Features](#features)
  • [Optimized Architectures](#optimized-architectures)
  • [Get Started](#get-started)
  • [Docker](#docker)
  • [API Documentation](#api-documentation)
  • [Using a private or gated model](#using-a-private-or-gated-model)
  • [A note on Shared Memory](#a-note-on-shared-memory-shm)
  • [Distributed Tracing](#distributed-tracing)
  • [Local Install](#local-install)
  • [CUDA Kernels](#cuda-kernels)
  • [Run Falcon](#run-falcon)
  • [Run](#run)
  • [Quantization](#quantization)
  • [Develop](#develop)
  • [Testing](#testing)
  • [Other supported hardware](#other-supported-hardware)

Features

Optimized architectures

Other architectures are supported on a best effort basis using:

AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")

or

AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")
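As a rough illustration of the two fallback calls above, the sketch below picks between them from the model's configuration. It is not the server's own dispatch logic: the helper names `auto_class_name` and `load_fallback_model` are hypothetical, and the code assumes the `transformers` and `accelerate` packages are installed.

```python
def auto_class_name(is_encoder_decoder: bool) -> str:
    """Return the name of the transformers auto class for the fallback path."""
    return "AutoModelForSeq2SeqLM" if is_encoder_decoder else "AutoModelForCausalLM"


def load_fallback_model(model_id: str):
    """Load a model with the generic auto class that matches its config."""
    # Imported lazily so auto_class_name() works without transformers installed.
    import transformers

    config = transformers.AutoConfig.from_pretrained(model_id)
    auto_class = getattr(transformers, auto_class_name(config.is_encoder_decoder))
    # device_map="auto" relies on accelerate to place weights on available devices.
    return auto_class.from_pretrained(model_id, device_map="auto")
```

Calling `load_fallback_model("<model>")` with a real model id downloads the weights, so it is best tried on a small model first.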

Get started

Docker

The easiest way of getting started is using the official Docker container:

model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9.4 --model-id $model

Note: To use GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 11.8 or higher.

To see all options to serve your models (in the code or in the CLI):

text-generation-launcher --help

You can then query the model using either the /generate or /generate_stream routes:

curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'

curl 127.0.0.1:8080/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'

or from Python:

pip install text-generation

from text_generation import Client

client = Client("http://127.0.0.1:8080")
print(client.generate("What is Deep Learning?", max_new_tokens=20).generated_text)

# Stream tokens as they are generated, skipping special tokens
text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=20):
    if not response.token.special:
        text += response.token.text
print(text)
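If installing the client package is not an option, the same two routes can be reached with the Python standard library alone. The following is a minimal sketch, not part of this repository. It assumes that `/generate_stream` answers with server-sent events whose `data:` lines carry JSON with a `token` object holding `text` and `special` fields, mirroring the attributes the client example reads. The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # assumed: the port mapping from the Docker example


def build_request(prompt, max_new_tokens=20, stream=False, base_url=BASE_URL):
    """Build the same POST request that the curl commands send."""
    route = "/generate_stream" if stream else "/generate"
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        base_url + route,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_stream_text(lines):
    """Yield the text of each non-special token from server-sent event lines."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other event fields
        event = json.loads(line[len("data:"):])
        token = event["token"]  # assumed event shape, see the note above
        if not token["special"]:
            yield token["text"]


def stream_text(prompt, max_new_tokens=20):
    """Send a streaming request and return the concatenated token text."""
    request = build_request(prompt, max_new_tokens, stream=True)
    with urllib.request.urlopen(request) as response:
        return "".join(iter_stream_text(response))
```

With a server running, `print(stream_text("What is Deep Learning?"))` plays the same role as the streaming loop in the client example.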

API documentation

You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route. The Swagger UI is also available at: https://deepinfra.github.io/text-generation-inference.

Using a private or gated model

Set the HUGGING_FACE_HUB_TOKEN environment variable to configure the token used by text-generation-inference. This gives the server access to protected resources.

For example, if you want to serve the gated Llama V2 model variants:

1. Go to https://huggingface.co/settings/tokens
2. Copy your cli READ token
3. Export HUGGING_FACE_HUB_TOKEN=<your cli READ token>

or with Docker:

model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9.3 --model-id $model
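When the container is launched from a script, the command above can be assembled as an argument list. The sketch below is one possible approach, with a hypothetical helper name `docker_run_args`. Instead of writing the token into the command line, it uses Docker's `-e NAME` form, which forwards the variable from the calling environment, so it assumes HUGGING_FACE_HUB_TOKEN is already exported.

```python
import subprocess

# Image tag taken from the Docker example above.
DEFAULT_IMAGE = "ghcr.io/huggingface/text-generation-inference:0.9.3"


def docker_run_args(model_id, volume, image=DEFAULT_IMAGE, port=8080, forward_token=True):
    """Build the argument list for the `docker run` command shown above."""
    args = ["docker", "run", "--gpus", "all", "--shm-size", "1g"]
    if forward_token:
        # `-e NAME` without a value copies NAME from the caller's environment,
        # which keeps the token itself out of the process arguments.
        args += ["-e", "HUGGING_FACE_HUB_TOKEN"]
    args += ["-p", f"{port}:80", "-v", f"{volume}:/data", image, "--model-id", model_id]
    return args


def launch(model_id, volume):
    """Start the container and raise if Docker exits with an error."""
    subprocess.run(docker_run_args(model_id, volume), check=True)
```

For example, `launch("meta-llama/Llama-2-7b-chat-hf", "/path/to/data")` starts the gated model once the token is exported in the shell.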

A note on Shared Memory (shm)

`NCCL` is a communication framework used by PyTorch to do distributed training/inference. text-generation-inference makes use of NCCL to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to…
