What does this fork signal mean?

Together AI forked togethercomputer/FT_Bloomchat (forked from NVIDIA/FasterTransformer). This fork signal points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo togethercomputer/FT_Bloomchat · parent NVIDIA/FasterTransformer. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

Together AI Fork: togethercomputer/FT_Bloomchat

Captured source

GitHub · github.com/togethercomputer/FT_Bloomchat

togethercomputer/FT_Bloomchat repository metadata

published Jan 30, 2023 · seen 5d · captured 12h · http 200 · method plain

togethercomputer/FT_Bloomchat

Description: Transformer related optimization, including BERT, GPT

Language: C++

License: Apache-2.0

Stars: 1

Forks: 1

Open issues: 2

Created: 2023-01-30T15:28:50Z

Pushed: 2023-05-24T11:45:43Z

Default branch: main

Fork: yes

Parent repository: NVIDIA/FasterTransformer

Archived: no

README:

FasterTransformer

This repository provides a script and recipe to run the highly optimized transformer-based encoder and decoder components, and it is tested and maintained by NVIDIA.

- [FasterTransformer](#fastertransformer)
- [Table Of Contents](#table-of-contents)
- [Model overview](#model-overview)
- [Support matrix](#support-matrix)
- [Advanced](#advanced)
- [Global Environment](#global-environment)
- [Performance](#performance)
- [BERT base performance](#bert-base-performance)
- [BERT base performances of FasterTransformer new features](#bert-base-performances-of-fastertransformer-new-features)
- [BERT base performance on TensorFlow](#bert-base-performance-on-tensorflow)
- [BERT base performance on PyTorch](#bert-base-performance-on-pytorch)
- [Decoding and Decoder performance](#decoding-and-decoder-performance)
- [Decoder and Decoding end-to-end translation performance on TensorFlow](#decoder-and-decoding-end-to-end-translation-performance-on-tensorflow)
- [Decoder and Decoding end-to-end translation performance on PyTorch](#decoder-and-decoding-end-to-end-translation-performance-on-pytorch)
- [GPT performance](#gpt-performance)
- [Release notes](#release-notes)
- [Changelog](#changelog)
- [Known issues](#known-issues)

Model overview

In NLP, the encoder and decoder are two important components, and the transformer layer has become a popular architecture for both. FasterTransformer implements a highly optimized transformer layer for both the encoder and the decoder for inference. On Volta, Turing and Ampere GPUs, the computing power of Tensor Cores is used automatically when the precision of the data and weights is FP16.

FasterTransformer is built on top of CUDA, cuBLAS, cuBLASLt and C++. We provide at least one API for each of the following frameworks: TensorFlow, PyTorch and Triton backend. Users can integrate FasterTransformer into these frameworks directly. For the supported frameworks, we also provide example code that demonstrates usage and shows the performance on these frameworks.

Support matrix

| Models | Framework | FP16 | INT8 (after Turing) | Sparsity (after Ampere) | Tensor parallel | Pipeline parallel | FP8 (after Hopper) |
| ---------------- | -------------- | ---- | ------------------- | ----------------------- | --------------- | ----------------- | ------------------ |
| BERT | TensorFlow | Yes | Yes | - | - | - | - |
| BERT | PyTorch | Yes | Yes | Yes | Yes | Yes | - |
| BERT | Triton backend | Yes | - | - | Yes | Yes | - |
| BERT | C++ | Yes | Yes | - | - | - | Yes |
| XLNet | C++ | Yes | - | - | - | - | - |
| Encoder | TensorFlow | Yes | Yes | - | - | - | - |
| Encoder | PyTorch | Yes | Yes | Yes | - | - | - |
| Decoder | TensorFlow | Yes | - | - | - | - | - |
| Decoder | PyTorch | Yes | - | - | - | - | - |
| Decoding | TensorFlow | Yes | - | - | - | - | - |
| Decoding | PyTorch | Yes | - | - | - | - | - |
| GPT | TensorFlow | Yes | - | - | - | - | - |
| GPT/OPT | PyTorch | Yes | - | - | Yes | Yes | Yes |
| GPT/OPT | Triton backend | Yes | - | - | Yes | Yes | - |
| GPT-MoE | PyTorch | Yes | - | - | Yes | Yes | - |
| BLOOM | PyTorch | Yes | - | - | Yes | Yes | - |
| BLOOM | Triton backend | Yes | - | - | Yes | Yes | - |
| GPT-J | Triton backend | Yes | - | - | Yes | Yes | - |
| Longformer | PyTorch | Yes | - | - | - | - | - |
| T5/UL2 | PyTorch | Yes | - | - | Yes | Yes | - |
| T5 | TensorFlow 2 | Yes | - | - | - | - | - |
| T5/UL2 | Triton backend | Yes | - | - | Yes | Yes | - |
| T5 | TensorRT | Yes | - | - | Yes | Yes | - |
| T5-MoE | PyTorch | Yes | - | - | Yes | Yes | - |
| Swin Transformer | PyTorch | Yes | Yes | - | - | - | - |
| Swin Transformer | TensorRT | Yes | Yes | - | - | - | - |
| ViT | PyTorch | Yes | Yes | - | - | - | - |
| ViT | TensorRT | Yes | Yes | - | - | - | - |
| GPT-NeoX | PyTorch | Yes | - | - | Yes | Yes | - |
| GPT-NeoX | Triton backend | Yes | - | - | Yes | Yes | - |
| BART/mBART | PyTorch | Yes | - | - | Yes | Yes | - |
| WeNet | C++ | Yes | - | - | - | - | - |
| DeBERTa | TensorFlow 2 | Yes | - | - | On-going | On-going | - |
| DeBERTa | PyTorch | Yes | - | - | On-going | On-going | - |
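
As a reading aid, a few rows of the matrix above can be encoded as data and queried programmatically. The following is a minimal C++ sketch and not part of FasterTransformer: the `SupportRow` struct and the `support_rows` and `supports_tensor_parallel` helpers are hypothetical names chosen for illustration, and only a handful of rows are transcribed.

```cpp
#include <string>
#include <vector>

// Hypothetical record mirroring a subset of the support-matrix columns.
struct SupportRow {
    std::string model;
    std::string framework;
    bool fp16;
    bool int8;
    bool tensor_parallel;
    bool pipeline_parallel;
};

// A few rows transcribed from the support matrix (deliberately not exhaustive).
const std::vector<SupportRow>& support_rows() {
    static const std::vector<SupportRow> rows = {
        {"BERT", "TensorFlow", true, true, false, false},
        {"BERT", "PyTorch", true, true, true, true},
        {"BLOOM", "PyTorch", true, false, true, true},
        {"BLOOM", "Triton backend", true, false, true, true},
        {"XLNet", "C++", true, false, false, false},
    };
    return rows;
}

// Returns true only if the (model, framework) pair is transcribed above
// and its "Tensor parallel" column reads "Yes".
bool supports_tensor_parallel(const std::string& model, const std::string& framework) {
    for (const SupportRow& row : support_rows()) {
        if (row.model == model && row.framework == framework) {
            return row.tensor_parallel;
        }
    }
    return false;  // pairs that were not transcribed are reported as unsupported
}
```

With these rows, `supports_tensor_parallel("BLOOM", "PyTorch")` is true and `supports_tensor_parallel("BERT", "TensorFlow")` is false, matching the table.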

Note that FasterTransformer supports all of the models above in C++, because all of the source code is written in C++.

More details on specific models are in the xxx_guide.md files under [docs/](docs), where xxx is the model name. Some common questions and their answers are in [docs/QAList.md](docs/QAList.md). Note that the Encoder and BERT models are similar, so both are explained together in bert_guide.md.
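
The naming pattern described above can be sketched as a small helper. This is an illustrative C++ example, not code from the repository: `guide_path` is a hypothetical function, it assumes the model name is already in the lowercase form used in the file name, and it does not check that the file exists.

```cpp
#include <string>

// Builds the documentation path for a model following the xxx_guide.md pattern.
// Assumption: `model_name` is the lowercase name used in the file name, e.g. "bert".
std::string guide_path(const std::string& model_name) {
    // Encoder is documented together with BERT in bert_guide.md.
    const std::string key = (model_name == "encoder") ? "bert" : model_name;
    return "docs/" + key + "_guide.md";
}
```

For example, `guide_path("bert")` and `guide_path("encoder")` both yield `docs/bert_guide.md`.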

Advanced

The following code lists the directory structure of FasterTransformer:

/src/fastertransformer: source code of FasterTransformer
|--/cutlass_extensions: Implementation of cutlass gemm/kernels.
|--/kernels: CUDA kernels for different models/layers and operations, like addBiasResidual.
|--/layers: Implementation of layer modules, like attention layer, ffn layer.
|--/models: Implementation of different models, like BERT, GPT.
|--/tensorrt_plugin: encapsulates FasterTransformer into a TensorRT plugin.
|--/tf_op: custom TensorFlow OP implementation
|--/th_op: custom PyTorch OP implementation
|--/triton_backend: custom Triton backend implementation
|--/utils: Contains common CUDA utils, like cublasMMWrapper, memory_utils
/examples: C++, TensorFlow and PyTorch interface examples
|--/cpp: C++ interface examples
|--/pytorch: PyTorch OP examples
|--/tensorflow: TensorFlow OP examples
|--/tensorrt: TensorRT examples
/docs: Documents to explain the details of implementation of different models, and show the benchmark
/benchmark: Contains the scripts to run the benchmarks of different models
/tests: Unit tests
/templates: Documents to explain how to add a new model/example into FasterTransformer repo

Note that many folders contain sub-folders that separate the different models. Quantization tools have been moved to examples, for instance examples/tensorflow/bert/bert-quantization/ and examples/pytorch/bert/bert-quantization-sparsity/.

Global Environment

FasterTransformer provides some convenient environment variables for debugging and testing.

1.…
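
The excerpt truncates the list of variables, so the following only illustrates the general mechanism by which a C++ program can read such a setting. It is a minimal sketch under stated assumptions: `env_or` is a hypothetical helper and `EXAMPLE_LOG_LEVEL` is a made-up variable name, neither taken from FasterTransformer.

```cpp
#include <cstdlib>
#include <string>

// Returns the value of the environment variable `name`,
// or `fallback` when the variable is not set.
std::string env_or(const char* name, const std::string& fallback) {
    const char* value = std::getenv(name);
    return value != nullptr ? std::string(value) : fallback;
}
```

A caller might write `env_or("EXAMPLE_LOG_LEVEL", "INFO")` to pick up a level set in the shell before launching the program, falling back to `INFO` otherwise.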

Excerpt shown — open the source for the full document.