basetenlabs/tensorrtllm_backend
forked from triton-inference-server/tensorrtllm_backend
Description: The Triton TensorRT-LLM Backend
License: Apache-2.0
Stars: 0
Forks: 0
Open issues: 0
Created: 2024-01-09T17:52:34Z
Pushed: 2024-01-11T21:06:34Z
Default branch: main
Fork: yes
Parent repository: triton-inference-server/tensorrtllm_backend
Archived: no
README:
TensorRT-LLM Backend
The Triton backend for TensorRT-LLM. You can learn more about Triton backends in the backend repo. The goal of TensorRT-LLM Backend is to let you serve TensorRT-LLM models with Triton Inference Server. The [inflight_batcher_llm](./inflight_batcher_llm/) directory contains the C++ implementation of the backend supporting inflight batching, paged attention and more.
Where can I ask general questions about Triton and Triton backends?
Be sure to read all the information below as well as the general Triton documentation available in the main server repo. If you don't find your answer there, you can ask questions on the issues page.
Accessing the TensorRT-LLM Backend
There are several ways to access the TensorRT-LLM Backend.
Before the Triton 23.10 release, please use [Option 2 to build the TensorRT-LLM backend via Docker](#option-2-build-via-docker).
Run the Pre-built Docker Container
Starting with Triton 23.10 release, Triton includes a container with the TensorRT-LLM Backend and Python Backend. This container should have everything to run a TensorRT-LLM model. You can find this container on the Triton NGC page.
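As a minimal sketch of this option, the commands below pull and start the pre-built container. The image tag `23.10-trtllm-python-py3` and the `docker run` options are assumptions that the text above does not state, so confirm the exact tag on the Triton NGC page. The `run` helper and the `DRY_RUN` variable are hypothetical conveniences: by default each command is only printed, and setting `DRY_RUN=0` executes it.

```shell
#!/bin/sh
# Assumed tag; confirm the exact name on the Triton NGC page.
IMAGE="nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3"

# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

run docker pull "$IMAGE"
# --gpus all exposes every GPU to the container; --net host shares the host network.
run docker run --rm -it --gpus all --net host "$IMAGE" bash
```

Printing first makes it easy to review the image name before starting a multi-gigabyte download.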
Build the Docker Container
Option 1. Build via the build.py Script in Server Repo
Starting with Triton 23.10 release, you can follow steps described in the Building With Docker guide and use the build.py script.
A sample command to build a Triton Server container with all options enabled is shown below, which will build the same TRT-LLM container as the one on the NGC.
BASE_CONTAINER_IMAGE_NAME=nvcr.io/nvidia/tritonserver:23.10-py3-min
TENSORRTLLM_BACKEND_REPO_TAG=release/0.5.0
PYTHON_BACKEND_REPO_TAG=r23.10
# Run the build script. The flags for some features or endpoints can be removed if not needed.
./build.py -v --no-container-interactive --enable-logging --enable-stats --enable-tracing \
--enable-metrics --enable-gpu-metrics --enable-cpu-metrics \
--filesystem=gcs --filesystem=s3 --filesystem=azure_storage \
--endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertex-ai \
--backend=ensemble --enable-gpu --endpoint=http --endpoint=grpc \
--image=base,${BASE_CONTAINER_IMAGE_NAME} \
--backend=tensorrtllm:${TENSORRTLLM_BACKEND_REPO_TAG} \
--backend=python:${PYTHON_BACKEND_REPO_TAG}
The BASE_CONTAINER_IMAGE_NAME is the base image that will be used to build the container. By default it is set to the most recent min image of Triton on NGC that matches the Triton release you are building for. You can change it to a different image if needed by setting the --image flag, as in the command above. The TENSORRTLLM_BACKEND_REPO_TAG and PYTHON_BACKEND_REPO_TAG are the tags of the TensorRT-LLM backend and Python backend repositories that will be used to build the container. You can also drop features or endpoints that you don't need by removing the corresponding flags.
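To illustrate removing flags, here is a sketch of a trimmed invocation that keeps only the HTTP and gRPC endpoints and drops tracing, metrics, the cloud filesystems, and the SageMaker and Vertex AI endpoints. Every flag is taken from the full command; which of them your deployment can do without is an assumption to verify. The command is printed for review, not executed, and `BUILD_CMD` is a name introduced here for illustration.

```shell
#!/bin/sh
BASE_CONTAINER_IMAGE_NAME=nvcr.io/nvidia/tritonserver:23.10-py3-min
TENSORRTLLM_BACKEND_REPO_TAG=release/0.5.0
PYTHON_BACKEND_REPO_TAG=r23.10

# Collect the flags as positional parameters so each stays a separate argument.
set -- -v --no-container-interactive --enable-logging --enable-stats \
  --enable-gpu --endpoint=http --endpoint=grpc \
  --backend=ensemble \
  --image=base,"${BASE_CONTAINER_IMAGE_NAME}" \
  --backend=tensorrtllm:"${TENSORRTLLM_BACKEND_REPO_TAG}" \
  --backend=python:"${PYTHON_BACKEND_REPO_TAG}"

BUILD_CMD="./build.py $*"
# Printed for review; run it from the server repo once the flag set looks right.
echo "$BUILD_CMD"
```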
Option 2. Build via Docker
The version of Triton Server used in this build option can be found in the [Dockerfile](./dockerfile/Dockerfile.trt_llm_backend).
# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive

# Use the Dockerfile to build the backend in a container
# For x86_64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
# For aarch64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm --build-arg TORCH_INSTALL_TYPE="src_non_cxx11_abi" -f dockerfile/Dockerfile.trt_llm_backend .
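The two build commands differ only in the `TORCH_INSTALL_TYPE` build argument, so a wrapper could choose between them from the output of `uname -m`. The sketch below assumes that approach. `extra_build_args` is a hypothetical helper, and treating `arm64` the same as `aarch64` is an assumption. The final command is printed, not run.

```shell
#!/bin/sh
# Hypothetical helper: extra docker build arguments for a machine architecture.
extra_build_args() {
  case "$1" in
    aarch64|arm64) echo '--build-arg TORCH_INSTALL_TYPE=src_non_cxx11_abi' ;;
    *) echo '' ;;
  esac
}

ARCH="$(uname -m)"
EXTRA_ARGS="$(extra_build_args "$ARCH")"

# Printed for review; remove the leading `echo` to run the build.
# $EXTRA_ARGS is intentionally unquoted so it splits into separate arguments.
echo DOCKER_BUILDKIT=1 docker build -t triton_trt_llm $EXTRA_ARGS \
  -f dockerfile/Dockerfile.trt_llm_backend .
```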
Using the TensorRT-LLM Backend
Below is an example of how to serve a TensorRT-LLM model with the Triton TensorRT-LLM Backend on a 4-GPU environment. The example uses the GPT model from the TensorRT-LLM repository.
Prepare TensorRT-LLM engines
You can skip this step if you already have the engines ready. Follow the guide in the TensorRT-LLM repository for more details on how to prepare the engines for deployment.
# Update the submodule TensorRT-LLM repository
git submodule update --init --recursive
git lfs install
git lfs pull

# TensorRT-LLM is required for generating engines. You can skip this step if
# you already have the package installed. If you are generating engines within
# the Triton container, you have to install the TRT-LLM package.
(cd tensorrt_llm &&
    bash docker/common/install_cmake.sh &&
    export PATH=/usr/local/cmake/bin:$PATH &&
    python3 ./scripts/build_wheel.py --trt_root="/usr/local/tensorrt" &&
    pip3 install ./build/tensorrt_llm*.whl)

# Go to the tensorrt_llm/examples/gpt directory
cd tensorrt_llm/examples/gpt

# Download weights from HuggingFace Transformers
rm -rf gpt2 && git clone https://huggingface.co/gpt2-medium gpt2
pushd gpt2 && rm pytorch_model.bin model.safetensors && wget -q https://huggingface.co/gpt2-medium/resolve/main/pytorch_model.bin && popd

# Convert weights from HF Transformers to FT format
python3 hf_gpt_convert.py -p 8 -i gpt2 -o ./c-model/gpt2 --tensor-parallelism 4 --storage-type float16

# Build TensorRT engines
python3 build.py --model_dir=./c-model/gpt2/4-gpu/ \
    --world_size=4 \
    --dtype float16 \
    --use_inflight_batching \
    --use_gpt_attention_plugin float16 \
    --paged_kv_cache \
    --use_gemm_plugin float16 \
    --remove_input_padding \
    --use_layernorm_plugin float16 \
    --hidden_act gelu \
    --parallel_build \
    --output_dir=engines/fp16/4-gpu
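After the build finishes, a quick sanity check can catch a partial build before the engines are copied into a model repository. The sketch below assumes the build writes one `*.engine` file per rank into `--output_dir`, which the text above does not state. `count_engines` is a hypothetical helper. Run it from the `tensorrt_llm/examples/gpt` directory.

```shell
#!/bin/sh
# Hypothetical helper: count *.engine files directly inside a directory.
# Prints 0 when the directory is missing or holds no engines.
count_engines() {
  find "$1" -maxdepth 1 -type f -name '*.engine' 2>/dev/null | wc -l | tr -d ' '
}

ENGINE_DIR="engines/fp16/4-gpu"   # --output_dir from the build command
WORLD_SIZE=4                      # --world_size from the build command

found="$(count_engines "$ENGINE_DIR")"
if [ "$found" -eq "$WORLD_SIZE" ]; then
  echo "OK: found $found engine files in $ENGINE_DIR"
else
  echo "Expected $WORLD_SIZE engine files in $ENGINE_DIR, found $found" >&2
fi
```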
Create the model repository
There are five models in the…