What does this fork signal mean?

DeepInfra maintains deepinfra/text-generation-inference, a fork of huggingface/text-generation-inference. This fork signal points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo deepinfra/text-generation-inference · parent huggingface/text-generation-inference. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

DeepInfra Fork: deepinfra/text-generation-inference

Captured source

GitHub/github.com/deepinfra/text-generation-inference

deepinfra/text-generation-inference repository metadata

published Aug 9, 2023 · seen 5d · captured 12h · http 200 · method plain

deepinfra/text-generation-inference

Description: Large Language Model Text Generation Inference

Language: Python

License: Apache-2.0

Stars: 9

Forks: 2

Open issues: 6

Created: 2023-08-09T20:42:01Z

Pushed: 2023-12-15T21:31:46Z

Default branch: main

Fork: yes

Parent repository: huggingface/text-generation-inference

Archived: no

README:

A Rust, Python and gRPC server for text generation inference. Used in production at HuggingFace to power LLMs api-inference widgets.

Note

This is a fork of https://github.com/huggingface/text-generation-inference made before the restrictive license change. We will maintain this fork under the Apache 2.0 license. All contributions are welcome.

[Features](#features)
[Optimized Architectures](#optimized-architectures)
[Get Started](#get-started)
[Docker](#docker)
[API Documentation](#api-documentation)
[Using a private or gated model](#using-a-private-or-gated-model)
[A note on Shared Memory](#a-note-on-shared-memory-shm)
[Distributed Tracing](#distributed-tracing)
[Local Install](#local-install)
[CUDA Kernels](#cuda-kernels)
[Run Falcon](#run-falcon)
[Run](#run)
[Quantization](#quantization)
[Develop](#develop)
[Testing](#testing)
[Other supported hardware](#other-supported-hardware)

Features

Serve the most popular Large Language Models with a simple launcher
Tensor Parallelism for faster inference on multiple GPUs
Token streaming using Server-Sent Events (SSE)
Continuous batching of incoming requests for increased total throughput
Optimized transformers code for inference using flash-attention and Paged Attention on the most popular architectures
Quantization with bitsandbytes and GPT-Q
Safetensors weight loading
Watermarking with A Watermark for Large Language Models
Logits warper (temperature scaling, top-p, top-k, repetition penalty; for more details see transformers.LogitsProcessor)
Stop sequences
Log probabilities
Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
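
To make the logits warper item above concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-k and top-p (nucleus) filtering. It is an illustration of the general technique only, not the server's implementation; the function name `warp_logits` and its signature are hypothetical.

```python
import math


def warp_logits(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token_id: probability} after temperature, top-k and top-p filtering.

    Hypothetical helper for illustration; not part of text-generation-inference.
    """
    # Temperature scaling: divide every logit by the temperature.
    scaled = [x / temperature for x in logits]
    # Rank token ids from most to least likely.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    # Top-k: keep only the k highest-scoring tokens.
    if top_k is not None:
        order = order[:top_k]
    # Softmax over the surviving tokens (subtract the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    if top_p is not None:
        cumulative, cutoff = 0.0, len(order)
        for n, p in enumerate(probs, start=1):
            cumulative += p
            if cumulative >= top_p:
                cutoff = n
                break
        order, probs = order[:cutoff], probs[:cutoff]
        # Renormalize so the kept probabilities sum to 1 again.
        total = sum(probs)
        probs = [p / total for p in probs]
    return dict(zip(order, probs))


# Only the two best tokens survive top_k=2.
print(warp_logits([2.0, 1.0, 0.0, -1.0], top_k=2))
```

A sampler would then draw the next token from the returned distribution. Lower temperatures concentrate probability on the top token, while top-k and top-p discard the unlikely tail before sampling.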

Optimized architectures

Other architectures are supported on a best effort basis using:

AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")

AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")

Get started

Docker

The easiest way of getting started is using the official Docker container:

model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9.4 --model-id $model

Note: To use GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 11.8 or higher.

To see all options to serve your models (in the code or in the CLI):

text-generation-launcher --help

You can then query the model using either the /generate or /generate_stream routes:

curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'

curl 127.0.0.1:8080/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'

or from Python:

pip install text-generation

from text_generation import Client

client = Client("http://127.0.0.1:8080")
print(client.generate("What is Deep Learning?", max_new_tokens=20).generated_text)

text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=20):
    if not response.token.special:
        text += response.token.text
print(text)
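
The /generate_stream route can also be consumed without the client library by reading the Server-Sent Events lines directly. The sketch below is a minimal parser under the assumption that each event arrives as a `data:` line carrying a JSON object with a `token` field that has `text` and `special` keys, mirroring the attributes used in the client example above. The helper name `iter_token_texts` and the sample events are hypothetical and hand-written, not captured server output.

```python
import json


def iter_token_texts(lines):
    """Yield the text of each non-special token from a stream of SSE lines.

    Hypothetical helper; assumes one JSON payload per "data:" line.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and any other SSE fields
        event = json.loads(line[len("data:"):])
        token = event.get("token", {})
        if not token.get("special", False):
            yield token.get("text", "")


# Hand-written sample events for illustration only.
sample = [
    b'data:{"token":{"id":1,"text":"Deep","special":false}}',
    b"",
    b'data:{"token":{"id":2,"text":" learning","special":false}}',
    b'data:{"token":{"id":0,"text":"<eos>","special":true}}',
]
print("".join(iter_token_texts(sample)))
```

With the `requests` library, the lines could come from `requests.post(url, json=payload, stream=True).iter_lines()`, where `url` and `payload` match the curl example above.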

API documentation

You can consult the OpenAPI documentation of the text-generation-inference REST API using the /docs route. The Swagger UI is also available at: https://deepinfra.github.io/text-generation-inference.

Using a private or gated model

Set the HUGGING_FACE_HUB_TOKEN environment variable to configure the token used by text-generation-inference. This gives the server access to protected resources.

For example, if you want to serve the gated Llama V2 model variants:

1. Go to https://huggingface.co/settings/tokens
2. Copy your cli READ token
3. Export HUGGING_FACE_HUB_TOKEN=<your cli READ token>

or with Docker:

model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.9.3 --model-id $model
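
When the container is launched from a script, the same command can be assembled programmatically so the token is only passed when it is set. This is a minimal sketch that mirrors the flags of the `docker run` line above; the helper `docker_run_command` and its parameters are hypothetical, and it only builds the argument list without starting a container.

```python
import os
import shlex


def docker_run_command(model_id, image, volume, token=None, port=8080, shm_size="1g"):
    """Build the docker argv for serving a model (hypothetical helper)."""
    args = ["docker", "run", "--gpus", "all", "--shm-size", shm_size]
    if token:
        # Only gated or private models need the Hub token.
        args += ["-e", f"HUGGING_FACE_HUB_TOKEN={token}"]
    args += ["-p", f"{port}:80", "-v", f"{volume}:/data", image, "--model-id", model_id]
    return args


cmd = docker_run_command(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    image="ghcr.io/huggingface/text-generation-inference:0.9.3",
    volume=os.path.join(os.getcwd(), "data"),
    token=os.environ.get("HUGGING_FACE_HUB_TOKEN"),
)
# Print a shell-quoted command line; subprocess.run(cmd) would execute it.
print(shlex.join(cmd))
```

Keeping the arguments as a list avoids shell-quoting mistakes. Note that a token placed on the command line is visible in the host's process list, so treat the printed command as sensitive.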

A note on Shared Memory (shm)

`NCCL` is a communication framework used by PyTorch to do distributed training/inference. text-generation-inference makes use of NCCL to enable Tensor Parallelism to dramatically speed up inference for large language models.

In order to…

Excerpt shown — open the source for the full document.