Snowflake-Labs/fastertransformer_backend
forked from triton-inference-server/fastertransformer_backend
Language: Python
License: BSD-3-Clause
Stars: 1
Forks: 2
Open issues: 9
Created: 2023-07-19T05:20:24Z
Pushed: 2026-04-06T20:13:39Z
Default branch: corvo
Fork: yes
Parent repository: triton-inference-server/fastertransformer_backend
Archived: no
README:
NOTE: The FasterTransformer backend is currently undergoing restructuring, so it might not work with all versions of Triton.
FasterTransformer Backend
The Triton backend for FasterTransformer. This repository provides a script and recipe to run the highly optimized transformer-based encoder and decoder components, and it is tested and maintained by NVIDIA. FasterTransformer v4.0 supports multi-GPU inference on GPT-3 models, and this backend integrates FasterTransformer into Triton so that large GPT-3 models can be served by Triton. The example below shows how to use the FasterTransformer backend in Triton to run inference on a GPT-3 model with 345M parameters trained by Megatron-LM. In the latest release, the FasterTransformer backend supports multi-node, multi-GPU inference on T5 with Hugging Face models.
Note that this is a research and prototyping tool, not a formal product or maintained framework. Users can learn more about Triton backends in the backend repo. Ask questions or report problems on the issues page of this FasterTransformer_backend repo.
Table Of Contents
- [FasterTransformer Backend](#fastertransformer-backend)
- [Table Of Contents](#table-of-contents)
- [Support matrix](#support-matrix)
- [Introduction](#introduction)
- [Setup](#setup)
- [Prepare docker images](#prepare-docker-images)
- [Rebuilding FasterTransformer backend (optional)](#rebuilding-fastertransformer-backend-optional)
- [NCCL\_LAUNCH\_MODE](#nccl_launch_mode)
- [GPUs Topology](#gpus-topology)
- [Model-Parallism and Triton-Multiple-Model-Instances](#model-parallism-and-triton-multiple-model-instances)
- [Run inter-node (T x P \> GPUs per Node) models](#run-inter-node-t-x-p--gpus-per-node-models)
- [Run intra-node (T x P \:${CONTAINER_VERSION}
docker push :${CONTAINER_VERSION}
#### Rebuilding FasterTransformer backend (optional)

Whenever you need an updated fastertransformer_backend, you can rebuild the docker image. You can also build it manually in an interactive session (e.g., while fixing code on the target node) with:
docker run -it \
    --shm-size=1g --ulimit memlock=-1 \
    -v ${WORKSPACE}:/workspace \
    --name ft_backend_builder \
    ${TRITON_DOCKER_IMAGE} bash
Then, inside the docker container:
rm /opt/tritonserver/lib/cmake/FasterTransformer/ -rf # Remove original library
cd fastertransformer_backend
mkdir build -p && cd build && \
    cmake \
      -D CMAKE_EXPORT_COMPILE_COMMANDS=1 \
      -D CMAKE_BUILD_TYPE=Release \
      -D CMAKE_INSTALL_PREFIX=/opt/tritonserver \
      -D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      .. && \
    make -j"$(grep -c ^processor /proc/cpuinfo)" install
where `${WORKSPACE}` should contain `fastertransformer_backend` directory with code to build.
Then you can commit the changes to a new docker image with:
docker commit ft_backend_builder ${TRITON_DOCKER_IMAGE}
## NCCL_LAUNCH_MODE

In the docker file, `NCCL_LAUNCH_MODE=GROUP` is the default because it is less likely to hang. However, `NCCL_LAUNCH_MODE=PARALLEL` can give better communication performance, so users can try `NCCL_LAUNCH_MODE=PARALLEL` to speed up inference. It can be set in three ways.

In the current environment:
export NCCL_LAUNCH_MODE=PARALLEL
When building the Docker container, by changing the Dockerfile:
ENV NCCL_LAUNCH_MODE=PARALLEL
Or passing environment variable on container start:
docker run -e NCCL_LAUNCH_MODE=PARALLEL ...
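When the server is started from a Python launcher script rather than a shell, the same variable can be placed in the child process environment. The following is a minimal sketch under that assumption; the helper name `env_with_nccl_launch_mode` is hypothetical and not part of this repository, and the command used to start the server is deliberately left out.

```python
import os
from typing import Dict, Mapping, Optional

# The two launch modes discussed above.
VALID_MODES = ("GROUP", "PARALLEL")


def env_with_nccl_launch_mode(
    mode: str, base: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of `base` (default: os.environ) with NCCL_LAUNCH_MODE set.

    Hypothetical helper for illustration; it does not modify the current process.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"NCCL_LAUNCH_MODE must be one of {VALID_MODES}, got {mode!r}")
    env = dict(os.environ if base is None else base)
    env["NCCL_LAUNCH_MODE"] = mode
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` or `subprocess.Popen` when launching the server process, which leaves the parent environment untouched.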
### GPUs Topology

If your current machine/nodes are fully connected through PCIe or even across NUMA nodes, NCCL may perform poorly or even hang because of limited peer-to-peer communication. Run `nvidia-smi topo -m` to check the topology. If you encounter timeouts or hangs, first check the topology and try a DGX V100 or DGX A100 with NVLink connections.

## Model-Parallism and Triton-Multiple-Model-Instances

We use MPI to start single-node/multi-node servers.

- N: Number of MPI Processes/Number of Nodes
- T: Tensor Parallel Size. Default 1
- P: Pipeline Parallel Size. Default 1

Multiple model instances on the same GPUs share the weights, so no redundant weight memory is allocated.

### Run inter-node (T x P > GPUs per Node) models

- `total number of GPUs = num_gpus_per_node x N = T x P`.
- only a single model instance is supported

### Run intra-node (T x P <= GPUs per Node) models

- `total number of visible GPUs must be evenly divisible by T x P`. Note that you can control this by setting `CUDA_VISIBLE_DEVICES`.
- `total number of visible GPUs must be <= T x P x Instance Count`. This avoids unnecessary CUDA memory allocation on unused GPUs.
- multiple model instances can run on the same GPU groups or on different GPU groups. The backend first tries to assign different GPU groups to different model instances. If no unused GPUs remain, multiple model instances are assigned to the same GPU groups. For example, with 8 GPUs and 8 model instances (T = 2, P = 1), the model instances are distributed to GPU groups [0, 1], [2, 3], [4, 5], [6, 7], [0, 1], [2, 3], [4, 5], [6, 7].
- weights are shared among model instances in the same GPU group. In the example above, instance 0 and instance 4 share the same weights, and likewise for the other pairs.

### Specify Multiple Model Instances

Set `count` here to start multiple model instances.
Note that `KIND_CPU` is the only choice here, because the backend needs full control over how multiple model instances are distributed across all the visible GPUs.
instance_group [
  {
    count: 8
    kind: KIND_CPU
  }
]
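The intra-node rules and the 8-GPU example above can be checked with a short Python sketch. This is only a model of the documented behavior, not the backend's actual implementation: the function names are hypothetical, and the round-robin order is inferred from the example (8 GPUs, 8 instances, T = 2, P = 1).

```python
from typing import List


def check_intra_node(visible_gpus: int, t: int, p: int, instance_count: int) -> None:
    """Validate the intra-node rules described above; raise ValueError on violation."""
    if min(visible_gpus, t, p, instance_count) < 1:
        raise ValueError("all arguments must be positive integers")
    group_size = t * p  # number of GPUs used by one model instance
    if group_size > visible_gpus:
        raise ValueError("T x P exceeds the visible GPUs: this is the inter-node case")
    if visible_gpus % group_size != 0:
        raise ValueError("visible GPUs must be evenly divisible by T x P")
    if visible_gpus > group_size * instance_count:
        raise ValueError("visible GPUs must be <= T x P x instance count")


def assign_gpu_groups(
    visible_gpus: int, t: int, p: int, instance_count: int
) -> List[List[int]]:
    """Return the GPU group for each model instance.

    Assumption: groups are contiguous GPU ids and instances are assigned
    round-robin, wrapping around once every group already has an instance.
    """
    check_intra_node(visible_gpus, t, p, instance_count)
    group_size = t * p
    num_groups = visible_gpus // group_size
    groups = [
        list(range(g * group_size, (g + 1) * group_size)) for g in range(num_groups)
    ]
    return [groups[i % num_groups] for i in range(instance_count)]
```

Calling `assign_gpu_groups(8, 2, 1, 8)` reproduces the distribution from the example: instances 0 and 4 both land on GPUs `[0, 1]`, which is the pair that would share weights.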
### Multi-Node Inference

We currently do not support nodes with different numbers of GPUs. We start one MPI process per node. If you need to run on three nodes, launch 3 nodes with one process per node. Remember to change `tensor_para_size` and…