Fork · Baseten · published Jan 9, 2024

basetenlabs/tensorrtllm_backend

forked from triton-inference-server/tensorrtllm_backend


Description: The Triton TensorRT-LLM Backend

License: Apache-2.0

Stars: 0

Forks: 0

Open issues: 0

Created: 2024-01-09T17:52:34Z

Pushed: 2024-01-11T21:06:34Z

Default branch: main

Fork: yes

Parent repository: triton-inference-server/tensorrtllm_backend

Archived: no

README:

TensorRT-LLM Backend

The Triton backend for TensorRT-LLM. You can learn more about Triton backends in the backend repo. The goal of TensorRT-LLM Backend is to let you serve TensorRT-LLM models with Triton Inference Server. The [inflight_batcher_llm](./inflight_batcher_llm/) directory contains the C++ implementation of the backend supporting inflight batching, paged attention and more.

Where can I ask general questions about Triton and Triton backends? Be sure to read all the information below as well as the general Triton documentation available in the main server repo. If you don't find your answer there you can ask questions on the issues page.

Accessing the TensorRT-LLM Backend

There are several ways to access the TensorRT-LLM Backend.

Before the Triton 23.10 release, please use [Option 2 to build the TensorRT-LLM backend via Docker](#option-2-build-via-docker).
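
The release cutoff above can be expressed as a small shell check. This is only a sketch: the function name `trtllm_access_method` and the labels it prints are made up for illustration, and it assumes release strings of the form `YY.MM`.

```shell
# Hypothetical helper: map a Triton release (YY.MM) to a way of obtaining
# the TensorRT-LLM backend. The labels are illustrative, not official names.
trtllm_access_method() {
    release="$1"
    # "23.10" -> "2310": drop the dot so releases compare as plain integers.
    key="${release%%.*}${release#*.}"
    if [ "$key" -ge 2310 ]; then
        echo "prebuilt-container-or-build.py"
    else
        echo "build-via-docker"
    fi
}

trtllm_access_method 23.09   # prints: build-via-docker
trtllm_access_method 23.10   # prints: prebuilt-container-or-build.py
```

Building via Docker remains available on newer releases as well; the helper only reports which routes exist at all for a given release.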

Run the Pre-built Docker Container

Starting with the Triton 23.10 release, Triton includes a container with the TensorRT-LLM Backend and Python Backend. This container should have everything needed to run a TensorRT-LLM model. You can find this container on the Triton NGC page.
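
As a rough sketch of what using the pre-built container looks like, the snippet below composes the image name and prints the `docker` commands rather than running them. The tag suffix `-trtllm-python-py3` is an assumption about NGC naming around the 23.10 release, so confirm it against the Triton NGC page. The `/workspace` mount point is a placeholder.

```shell
TRITON_RELEASE=23.10
# Assumed tag pattern; check the Triton NGC page for the tags actually published.
TRITON_IMAGE="nvcr.io/nvidia/tritonserver:${TRITON_RELEASE}-trtllm-python-py3"

# Print the commands instead of executing them, so this is safe to run anywhere.
echo "docker pull ${TRITON_IMAGE}"
echo "docker run --rm -it --gpus all --network host -v \$(pwd):/workspace ${TRITON_IMAGE} bash"
```

Copy the printed lines into a shell on a GPU host once the tag has been verified.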

Build the Docker Container

Option 1. Build via the build.py Script in Server Repo

Starting with the Triton 23.10 release, you can follow the steps described in the Building With Docker guide and use the build.py script.

The sample command below builds a Triton Server container with all options enabled. The result is the same TRT-LLM container as the one published on NGC.

BASE_CONTAINER_IMAGE_NAME=nvcr.io/nvidia/tritonserver:23.10-py3-min
TENSORRTLLM_BACKEND_REPO_TAG=release/0.5.0
PYTHON_BACKEND_REPO_TAG=r23.10

# Run the build script. The flags for some features or endpoints can be removed if not needed.
./build.py -v --no-container-interactive --enable-logging --enable-stats --enable-tracing \
--enable-metrics --enable-gpu-metrics --enable-cpu-metrics \
--filesystem=gcs --filesystem=s3 --filesystem=azure_storage \
--endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertex-ai \
--backend=ensemble --enable-gpu --endpoint=http --endpoint=grpc \
--image=base,${BASE_CONTAINER_IMAGE_NAME} \
--backend=tensorrtllm:${TENSORRTLLM_BACKEND_REPO_TAG} \
--backend=python:${PYTHON_BACKEND_REPO_TAG}

BASE_CONTAINER_IMAGE_NAME is the base image used to build the container. By default it is the most recent min image of Triton on NGC that matches the Triton release you are building for. To use a different image, pass it through the --image flag, as in the command above. TENSORRTLLM_BACKEND_REPO_TAG and PYTHON_BACKEND_REPO_TAG are the tags of the TensorRT-LLM backend and Python backend repositories used to build the container. You can also drop features or endpoints you don't need by removing the corresponding flags.
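
To illustrate removing flags, here is a sketch of a trimmed-down invocation that keeps only the HTTP and gRPC endpoints and drops the cloud filesystems, tracing, and CPU metrics. Every flag is taken from the full sample command; which ones to keep is a choice made for this example, not a recommendation. The script prints the command instead of running it.

```shell
BASE_CONTAINER_IMAGE_NAME=nvcr.io/nvidia/tritonserver:23.10-py3-min
TENSORRTLLM_BACKEND_REPO_TAG=release/0.5.0
PYTHON_BACKEND_REPO_TAG=r23.10

# Subset of the flags from the full sample command (illustrative selection).
BUILD_FLAGS="-v --no-container-interactive --enable-logging --enable-stats"
BUILD_FLAGS="${BUILD_FLAGS} --enable-metrics --enable-gpu-metrics --enable-gpu"
BUILD_FLAGS="${BUILD_FLAGS} --endpoint=http --endpoint=grpc --backend=ensemble"
BUILD_FLAGS="${BUILD_FLAGS} --image=base,${BASE_CONTAINER_IMAGE_NAME}"
BUILD_FLAGS="${BUILD_FLAGS} --backend=tensorrtllm:${TENSORRTLLM_BACKEND_REPO_TAG}"
BUILD_FLAGS="${BUILD_FLAGS} --backend=python:${PYTHON_BACKEND_REPO_TAG}"

# Dry run: show the command. Drop the leading `echo` to execute it from a
# checkout of the Triton server repository.
echo ./build.py ${BUILD_FLAGS}
```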

Option 2. Build via Docker

The version of Triton Server used in this build option can be found in the [Dockerfile](./dockerfile/Dockerfile.trt_llm_backend).

# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive

# Use the Dockerfile to build the backend in a container
# For x86_64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
# For aarch64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm --build-arg TORCH_INSTALL_TYPE="src_non_cxx11_abi" -f dockerfile/Dockerfile.trt_llm_backend .
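
The two build commands differ only in one build argument, so a script can choose it from the target architecture. The following is a sketch: `trtllm_docker_build_cmd` is a hypothetical helper name, and it prints the command rather than invoking Docker.

```shell
# Hypothetical helper: print the docker build command for a given architecture.
trtllm_docker_build_cmd() {
    arch="$1"
    extra=""
    case "$arch" in
        x86_64) ;;
        aarch64) extra='--build-arg TORCH_INSTALL_TYPE="src_non_cxx11_abi" ' ;;
        *) echo "unsupported architecture: $arch" >&2; return 1 ;;
    esac
    echo "DOCKER_BUILDKIT=1 docker build -t triton_trt_llm ${extra}-f dockerfile/Dockerfile.trt_llm_backend ."
}

# Show the command for an aarch64 host.
trtllm_docker_build_cmd aarch64
```

On a real machine the argument could come from `uname -m`. Only the two architectures named above are handled; anything else is reported as unsupported.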

Using the TensorRT-LLM Backend

Below is an example of how to serve a TensorRT-LLM model with the Triton TensorRT-LLM Backend in a 4-GPU environment. The example uses the GPT model from the TensorRT-LLM repository.

Prepare TensorRT-LLM engines

You can skip this step if you already have the engines ready. Follow the guide in the TensorRT-LLM repository for more details on how to prepare the engines for deployment.

# Update the submodule TensorRT-LLM repository
git submodule update --init --recursive
git lfs install
git lfs pull

# TensorRT-LLM is required for generating engines. You can skip this step if
# you already have the package installed. If you are generating engines within
# the Triton container, you have to install the TRT-LLM package.
(cd tensorrt_llm &&
bash docker/common/install_cmake.sh &&
export PATH=/usr/local/cmake/bin:$PATH &&
python3 ./scripts/build_wheel.py --trt_root="/usr/local/tensorrt" &&
pip3 install ./build/tensorrt_llm*.whl)

# Go to the tensorrt_llm/examples/gpt directory
cd tensorrt_llm/examples/gpt

# Download weights from HuggingFace Transformers
rm -rf gpt2 && git clone https://huggingface.co/gpt2-medium gpt2
pushd gpt2 && rm pytorch_model.bin model.safetensors && wget -q https://huggingface.co/gpt2-medium/resolve/main/pytorch_model.bin && popd

# Convert weights from HF Transformers to FT format
python3 hf_gpt_convert.py -p 8 -i gpt2 -o ./c-model/gpt2 --tensor-parallelism 4 --storage-type float16

# Build TensorRT engines
python3 build.py --model_dir=./c-model/gpt2/4-gpu/ \
--world_size=4 \
--dtype float16 \
--use_inflight_batching \
--use_gpt_attention_plugin float16 \
--paged_kv_cache \
--use_gemm_plugin float16 \
--remove_input_padding \
--use_layernorm_plugin float16 \
--hidden_act gelu \
--parallel_build \
--output_dir=engines/fp16/4-gpu
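
The GPU count appears three times in the commands above (`--tensor-parallelism 4`, the `4-gpu` directory, and `--world_size=4`), and the three have to agree. One quick sanity check after the build is to count the engine files. The sketch below assumes the build writes one `*.engine` file per rank into the output directory; `count_engines` and `check_engines` are hypothetical helpers, not part of TensorRT-LLM.

```shell
WORLD_SIZE=4
ENGINE_DIR="engines/fp16/${WORLD_SIZE}-gpu"

# Count files ending in .engine directly inside a directory.
count_engines() {
    n=0
    for f in "$1"/*.engine; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Succeed only if the number of engines matches the expected world size.
check_engines() {
    found="$(count_engines "$1")"
    if [ "$found" -eq "$2" ]; then
        echo "ok: $found engine(s) in $1"
    else
        echo "mismatch: expected $2 engine(s) in $1, found $found" >&2
        return 1
    fi
}

check_engines "$ENGINE_DIR" "$WORLD_SIZE" || echo "build the engines first"
```

Run it from `tensorrt_llm/examples/gpt`, the directory where the build command was issued, so that the relative `engines/` path resolves.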

Create the model repository

There are five models in the…
