What does this fork signal mean?

Snowflake (Arctic) forked triton-inference-server/fastertransformer_backend as Snowflake-Labs/fastertransformer_backend. This fork signal points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo Snowflake-Labs/fastertransformer_backend · parent triton-inference-server/fastertransformer_backend. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

Snowflake (Arctic) Fork: Snowflake-Labs/fastertransformer_backend

Captured source

GitHub/github.com/Snowflake-Labs/fastertransformer_backend

Snowflake-Labs/fastertransformer_backend repository metadata

Published Jul 19, 2023 · seen 5d · captured 9h · HTTP 200 · method plain

Snowflake-Labs/fastertransformer_backend

Language: Python

License: BSD-3-Clause

Stars: 1

Forks: 2

Open issues: 9

Created: 2023-07-19T05:20:24Z

Pushed: 2026-04-06T20:13:39Z

Default branch: corvo

Fork: yes

Parent repository: triton-inference-server/fastertransformer_backend

Archived: no

README:

NOTE: The FasterTransformer backend is currently undergoing restructuring, so it might not work with all versions of Triton.

FasterTransformer Backend

The Triton backend for FasterTransformer. This repository provides a script and recipe to run the highly optimized transformer-based encoder and decoder components, and it is tested and maintained by NVIDIA. As of v4.0, FasterTransformer supports multi-GPU inference on GPT-3 models. This backend integrates FasterTransformer into Triton so that giant GPT-3 models can be served by Triton. The example below shows how to use the FasterTransformer backend in Triton to run inference on a GPT-3 model with 345M parameters trained by Megatron-LM. In the latest release, the FasterTransformer backend supports multi-node, multi-GPU inference on T5 with Hugging Face models.

Note that this is a research and prototyping tool, not a formal product or maintained framework. Users can learn more about Triton backends in the backend repo. Ask questions or report problems on the issues page of this FasterTransformer_backend repo.

[FasterTransformer Backend](#fastertransformer-backend)
[Table Of Contents](#table-of-contents)
[Support matrix](#support-matrix)
[Introduction](#introduction)
[Setup](#setup)
[Prepare docker images](#prepare-docker-images)
[Rebuilding FasterTransformer backend (optional)](#rebuilding-fastertransformer-backend-optional)
[NCCL\_LAUNCH\_MODE](#nccl_launch_mode)
[GPUs Topology](#gpus-topology)
[Model-Parallism and Triton-Multiple-Model-Instances](#model-parallism-and-triton-multiple-model-instances)
[Run inter-node (T x P \> GPUs per Node) models](#run-inter-node-t-x-p--gpus-per-node-models)
[Run intra-node (T x P \:${CONTAINER_VERSION}

docker push :${CONTAINER_VERSION}

#### Rebuilding FasterTransformer backend (optional)

Whenever you need an updated fastertransformer_backend, you can rebuild the Docker image.

Alternatively, you can build it manually in an interactive session (for example, while fixing code on the target node) with:

docker run -it \
    --shm-size=1g --ulimit memlock=-1 \
    -v ${WORKSPACE}:/workspace \
    --name ft_backend_builder \
    ${TRITON_DOCKER_IMAGE} bash

Then, inside the Docker container:

rm /opt/tritonserver/lib/cmake/FasterTransformer/ -rf # Remove original library
cd fastertransformer_backend
mkdir build -p && cd build && \
    cmake \
      -D CMAKE_EXPORT_COMPILE_COMMANDS=1 \
      -D CMAKE_BUILD_TYPE=Release \
      -D CMAKE_INSTALL_PREFIX=/opt/tritonserver \
      -D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      .. && \
    make -j"$(grep -c ^processor /proc/cpuinfo)" install

Here, `${WORKSPACE}` should contain the `fastertransformer_backend` directory with the code to build.

You can then commit the changes to a new Docker image with:

docker commit ft_backend_builder ${TRITON_DOCKER_IMAGE}

## NCCL_LAUNCH_MODE

In the Dockerfile, `NCCL_LAUNCH_MODE=GROUP` is the default because it is less likely to hang. However, `NCCL_LAUNCH_MODE=PARALLEL` can give better communication performance, so you can try `NCCL_LAUNCH_MODE=PARALLEL` to speed things up.

To set it in the current environment:

export NCCL_LAUNCH_MODE=PARALLEL

To set it when building the Docker container, add this line to the Dockerfile:

ENV NCCL_LAUNCH_MODE=PARALLEL

Or pass the environment variable when starting the container:

docker run -e NCCL_LAUNCH_MODE=PARALLEL ...

### GPUs Topology

If the GPUs in your machine or nodes are connected entirely through PCIe, or even across NUMA nodes, NCCL may perform poorly or even hang because peer-to-peer communication is limited. You can run `nvidia-smi topo -m` to check the topology.

If you encounter timeouts or hangs, first check the topology, then try a DGX V100 or DGX A100 with NVLink-connected GPUs.

## Model-Parallism and Triton-Multiple-Model-Instances
We use MPI to start single-node and multi-node servers.

- N: Number of MPI Processes/Number of Nodes
- T: Tensor Parallel Size. Default 1
- P: Pipeline Parallel Size. Default 1

Multiple model instances on the same GPUs share their weights, so no redundant weight memory is allocated.

### Run inter-node (T x P > GPUs per Node) models
- `total number of GPUs = num_gpus_per_node x N = T x P`.
- Only a single model instance is supported.
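
The inter-node rule above can be checked with a small helper before launching. This is a minimal sketch, not part of the backend: the function name `check_inter_node_layout` and its parameter names are hypothetical, chosen only to mirror N, T, and P as defined above.

```python
def check_inter_node_layout(num_nodes, gpus_per_node,
                            tensor_para_size=1, pipeline_para_size=1):
    """Return the total GPU count if the layout satisfies the inter-node rule.

    Hypothetical helper for illustration only; it is not part of the backend.
    """
    gpus_per_model = tensor_para_size * pipeline_para_size  # T x P
    if gpus_per_model <= gpus_per_node:
        # T x P <= GPUs per node is the intra-node case, which has other rules.
        raise ValueError("T x P fits within one node; this is an intra-node layout")
    total_gpus = gpus_per_node * num_nodes  # num_gpus_per_node x N
    if total_gpus != gpus_per_model:
        raise ValueError(
            f"num_gpus_per_node x N = {total_gpus}, but T x P = {gpus_per_model}"
        )
    return total_gpus


# Two nodes with 8 GPUs each, T = 8 and P = 2: 16 GPUs on both sides of the rule.
print(check_inter_node_layout(num_nodes=2, gpus_per_node=8,
                              tensor_para_size=8, pipeline_para_size=2))  # 16
```

Raising early on a mismatch makes a bad T, P, or node count visible before any server process starts.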

### Run intra-node (T x P <= GPUs per Node) models
- `total number of visible GPUs must be evenly divisible by T x P`. Note that you can control this by setting `CUDA_VISIBLE_DEVICES`.
- `total number of visible GPUs must be <= T x P x Instance Count`. This avoids unnecessary CUDA memory allocation on unused GPUs.
- Multiple model instances can run on the same GPU group or on different GPU groups.

The backend first tries to assign a different GPU group to each model instance. If no free GPUs remain, multiple model instances are assigned to the same GPU group.

For example, with 8 GPUs and 8 model instances (T = 2, P = 1), the model instances are distributed to GPU groups [0, 1], [2, 3], [4, 5], [6, 7], [0, 1], [2, 3], [4, 5], [6, 7].
- Weights are shared among model instances in the same GPU group. In the example above, instance 0 and instance 4 share the same weights, and likewise for the other pairs.
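
The assignment described above can be reproduced with a short round-robin sketch. It assumes GPU groups are contiguous blocks of T x P device indices and that instances wrap around once every group is taken. That matches the example, but it is not the backend's actual code. The function name `assign_gpu_groups` and its parameters are hypothetical.

```python
def assign_gpu_groups(num_visible_gpus, instance_count,
                      tensor_para_size=1, pipeline_para_size=1):
    """Map each model instance to a GPU group, wrapping around when groups run out.

    Hypothetical illustration of the intra-node rules; not the backend's code.
    """
    group_size = tensor_para_size * pipeline_para_size  # T x P
    if num_visible_gpus % group_size != 0:
        raise ValueError("visible GPUs must be evenly divisible by T x P")
    if num_visible_gpus > group_size * instance_count:
        raise ValueError("visible GPUs must be <= T x P x instance count")
    num_groups = num_visible_gpus // group_size
    # Assumption: each group is a contiguous block of device indices.
    groups = [list(range(g * group_size, (g + 1) * group_size))
              for g in range(num_groups)]
    # Instance i takes group i modulo the number of groups.
    return [groups[i % num_groups] for i in range(instance_count)]


# 8 GPUs, 8 instances, T = 2, P = 1, as in the example above.
for instance, group in enumerate(assign_gpu_groups(8, 8, tensor_para_size=2)):
    print(instance, group)
```

Instances that receive the same group (here 0 and 4, 1 and 5, and so on) are the ones that would share weights.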

### Specify Multiple Model Instances

Set `count` here to start multiple model instances. Note `KIND_CPU` is the only choice here as the backend needs to take full control of how to distribute multiple model instances to all the visible GPUs.

instance_group [
  {
    count: 8
    kind: KIND_CPU
  }
]
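
To choose `count`, it can help to compute the smallest value that satisfies `visible GPUs <= T x P x Instance Count` and then emit the stanza. The following is a minimal sketch under that reading of the intra-node rule. Both function names are hypothetical, and the rendered text simply mirrors the stanza above.

```python
def min_instance_count(num_visible_gpus, tensor_para_size=1, pipeline_para_size=1):
    """Smallest count with visible GPUs <= T x P x count (hypothetical helper)."""
    group_size = tensor_para_size * pipeline_para_size  # T x P
    if num_visible_gpus % group_size != 0:
        raise ValueError("visible GPUs must be evenly divisible by T x P")
    return num_visible_gpus // group_size


def render_instance_group(count):
    """Render an instance_group stanza; KIND_CPU is the only kind used here."""
    return (
        "instance_group [\n"
        "  {\n"
        f"    count: {count}\n"
        "    kind: KIND_CPU\n"
        "  }\n"
        "]\n"
    )


# 8 visible GPUs with T = 2, P = 1 need at least 4 instances to use every GPU.
count = min_instance_count(8, tensor_para_size=2)
print(render_instance_group(count))
```

A larger `count`, such as the 8 used above, is also valid. The extra instances then share GPU groups and weights.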

### Multi-Node Inference

Nodes with different numbers of GPUs are not currently supported.

We start one MPI process per node. For example, to run on three nodes, launch three nodes with one process each.
Remember to change `tensor_para_size` and…

Excerpt shown — open the source for the full document.
