togethercomputer/FT_Bloomchat
forked from NVIDIA/FasterTransformer
Description: Transformer related optimization, including BERT, GPT
Language: C++
License: Apache-2.0
Stars: 1
Forks: 1
Open issues: 2
Created: 2023-01-30T15:28:50Z
Pushed: 2023-05-24T11:45:43Z
Default branch: main
Fork: yes
Parent repository: NVIDIA/FasterTransformer
Archived: no
README:
FasterTransformer
This repository provides a script and recipe to run the highly optimized transformer-based encoder and decoder components, and it is tested and maintained by NVIDIA.
Table Of Contents
- [FasterTransformer](#fastertransformer)
- [Table Of Contents](#table-of-contents)
- [Model overview](#model-overview)
- [Support matrix](#support-matrix)
- [Advanced](#advanced)
- [Global Environment](#global-environment)
- [Performance](#performance)
- [BERT base performance](#bert-base-performance)
- [BERT base performances of FasterTransformer new features](#bert-base-performances-of-fastertransformer-new-features)
- [BERT base performance on TensorFlow](#bert-base-performance-on-tensorflow)
- [BERT base performance on PyTorch](#bert-base-performance-on-pytorch)
- [Decoding and Decoder performance](#decoding-and-decoder-performance)
- [Decoder and Decoding end-to-end translation performance on TensorFlow](#decoder-and-decoding-end-to-end-translation-performance-on-tensorflow)
- [Decoder and Decoding end-to-end translation performance on PyTorch](#decoder-and-decoding-end-to-end-translation-performance-on-pytorch)
- [GPT performance](#gpt-performance)
- [Release notes](#release-notes)
- [Changelog](#changelog)
- [Known issues](#known-issues)
Model overview
In NLP, the encoder and decoder are two important components, and the transformer layer has become a popular architecture for both. FasterTransformer implements a highly optimized transformer layer for both the encoder and the decoder for inference. On Volta, Turing and Ampere GPUs, the computing power of Tensor Cores is used automatically when the precision of the data and weights is FP16.
FasterTransformer is built on top of CUDA, cuBLAS, cuBLASLt and C++. We provide at least one API for each of the following frameworks: TensorFlow, PyTorch and Triton backend. Users can integrate FasterTransformer into these frameworks directly. For the supported frameworks, we also provide example code that demonstrates usage and shows the performance on each framework.
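The Tensor Core condition described above can be written down as a small predicate. This is only a sketch of the README's statement, not FasterTransformer code: the function names are hypothetical, and the compute-capability table is an assumption added here to identify the three named architectures. Other architectures and precisions may also use Tensor Cores; the sketch deliberately says nothing about them.

```python
# Compute capabilities (major, minor) of the architectures named in the README.
# This mapping is an illustrative assumption, not part of FasterTransformer.
ARCH_BY_COMPUTE_CAPABILITY = {
    (7, 0): "Volta",
    (7, 2): "Volta",
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 7): "Ampere",
}


def architecture(compute_capability):
    """Return the architecture name, or None if it is not one the README lists."""
    return ARCH_BY_COMPUTE_CAPABILITY.get(tuple(compute_capability))


def readme_says_tensor_cores(compute_capability, data_dtype, weight_dtype):
    """True when the README's stated condition holds:
    a Volta/Turing/Ampere GPU with both data and weights in FP16."""
    return (
        architecture(compute_capability) is not None
        and data_dtype == "fp16"
        and weight_dtype == "fp16"
    )
```

With PyTorch installed, the tuple returned by `torch.cuda.get_device_capability()` could be passed as `compute_capability`.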
Support matrix
| Models           | Framework      | FP16 | INT8 (after Turing) | Sparsity (after Ampere) | Tensor parallel | Pipeline parallel | FP8 (after Hopper) |
| ---------------- | -------------- | ---- | ------------------- | ----------------------- | --------------- | ----------------- | ------------------ |
| BERT             | TensorFlow     | Yes  | Yes                 | -                       | -               | -                 | -                  |
| BERT             | PyTorch        | Yes  | Yes                 | Yes                     | Yes             | Yes               | -                  |
| BERT             | Triton backend | Yes  | -                   | -                       | Yes             | Yes               | -                  |
| BERT             | C++            | Yes  | Yes                 | -                       | -               | -                 | Yes                |
| XLNet            | C++            | Yes  | -                   | -                       | -               | -                 | -                  |
| Encoder          | TensorFlow     | Yes  | Yes                 | -                       | -               | -                 | -                  |
| Encoder          | PyTorch        | Yes  | Yes                 | Yes                     | -               | -                 | -                  |
| Decoder          | TensorFlow     | Yes  | -                   | -                       | -               | -                 | -                  |
| Decoder          | PyTorch        | Yes  | -                   | -                       | -               | -                 | -                  |
| Decoding         | TensorFlow     | Yes  | -                   | -                       | -               | -                 | -                  |
| Decoding         | PyTorch        | Yes  | -                   | -                       | -               | -                 | -                  |
| GPT              | TensorFlow     | Yes  | -                   | -                       | -               | -                 | -                  |
| GPT/OPT          | PyTorch        | Yes  | -                   | -                       | Yes             | Yes               | Yes                |
| GPT/OPT          | Triton backend | Yes  | -                   | -                       | Yes             | Yes               | -                  |
| GPT-MoE          | PyTorch        | Yes  | -                   | -                       | Yes             | Yes               | -                  |
| BLOOM            | PyTorch        | Yes  | -                   | -                       | Yes             | Yes               | -                  |
| BLOOM            | Triton backend | Yes  | -                   | -                       | Yes             | Yes               | -                  |
| GPT-J            | Triton backend | Yes  | -                   | -                       | Yes             | Yes               | -                  |
| Longformer       | PyTorch        | Yes  | -                   | -                       | -               | -                 | -                  |
| T5/UL2           | PyTorch        | Yes  | -                   | -                       | Yes             | Yes               | -                  |
| T5               | TensorFlow 2   | Yes  | -                   | -                       | -               | -                 | -                  |
| T5/UL2           | Triton backend | Yes  | -                   | -                       | Yes             | Yes               | -                  |
| T5               | TensorRT       | Yes  | -                   | -                       | Yes             | Yes               | -                  |
| T5-MoE           | PyTorch        | Yes  | -                   | -                       | Yes             | Yes               | -                  |
| Swin Transformer | PyTorch        | Yes  | Yes                 | -                       | -               | -                 | -                  |
| Swin Transformer | TensorRT       | Yes  | Yes                 | -                       | -               | -                 | -                  |
| ViT              | PyTorch        | Yes  | Yes                 | -                       | -               | -                 | -                  |
| ViT              | TensorRT       | Yes  | Yes                 | -                       | -               | -                 | -                  |
| GPT-NeoX         | PyTorch        | Yes  | -                   | -                       | Yes             | Yes               | -                  |
| GPT-NeoX         | Triton backend | Yes  | -                   | -                       | Yes             | Yes               | -                  |
| BART/mBART       | PyTorch        | Yes  | -                   | -                       | Yes             | Yes               | -                  |
| WeNet            | C++            | Yes  | -                   | -                       | -               | -                 | -                  |
| DeBERTa          | TensorFlow 2   | Yes  | -                   | -                       | On-going        | On-going          | -                  |
| DeBERTa          | PyTorch        | Yes  | -                   | -                       | On-going        | On-going          | -                  |
- Note that FasterTransformer supports all of the models above in C++, because all of the source code is written in C++.
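When choosing a framework for a given model, it can help to query the support matrix programmatically. The sketch below transcribes only a few rows (BERT, GPT/OPT and BLOOM) into a dictionary. The `Support` type and the `frameworks_with` helper are hypothetical names introduced for illustration and are not part of FasterTransformer.

```python
from typing import NamedTuple


class Support(NamedTuple):
    fp16: bool
    int8: bool
    sparsity: bool
    tensor_parallel: bool
    pipeline_parallel: bool
    fp8: bool


# Partial transcription of the support matrix: (model, framework) -> features.
SUPPORT_MATRIX = {
    ("BERT", "TensorFlow"): Support(True, True, False, False, False, False),
    ("BERT", "PyTorch"): Support(True, True, True, True, True, False),
    ("BERT", "Triton backend"): Support(True, False, False, True, True, False),
    ("BERT", "C++"): Support(True, True, False, False, False, True),
    ("GPT/OPT", "PyTorch"): Support(True, False, False, True, True, True),
    ("GPT/OPT", "Triton backend"): Support(True, False, False, True, True, False),
    ("BLOOM", "PyTorch"): Support(True, False, False, True, True, False),
    ("BLOOM", "Triton backend"): Support(True, False, False, True, True, False),
}


def frameworks_with(model, feature):
    """List the frameworks in which `model` supports `feature` (a Support field name)."""
    return [
        framework
        for (name, framework), support in SUPPORT_MATRIX.items()
        if name == model and getattr(support, feature)
    ]
```

For example, `frameworks_with("BLOOM", "tensor_parallel")` returns both BLOOM rows, while `frameworks_with("GPT/OPT", "fp8")` returns only the PyTorch row.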
Details of specific models are in xxx_guide.md under [docs/](docs), where xxx is the model name. Common questions and their answers are in [docs/QAList.md](docs/QAList.md). Because the Encoder and BERT models are similar, both are explained together in bert_guide.md.
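The naming rule for the guides can be sketched as a small helper. `guide_path` is a hypothetical function, and lowercasing the model name is an assumption; the README only states the `xxx_guide.md` pattern and that Encoder shares `bert_guide.md`. Names containing spaces or slashes (for example "Swin Transformer" or "T5/UL2") are rejected rather than guessed.

```python
# Models documented in another model's guide, per the README.
SHARED_GUIDES = {"encoder": "bert"}


def guide_path(model_name):
    """Return the presumed path of a model's guide under docs/.

    Assumes the file name is the lowercased model name plus "_guide.md".
    """
    key = model_name.strip().lower()
    if not key or " " in key or "/" in key:
        raise ValueError(f"cannot derive a guide name from {model_name!r}")
    key = SHARED_GUIDES.get(key, key)
    return f"docs/{key}_guide.md"
```

Check the result against the actual contents of docs/ before relying on it.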
Advanced
The following code lists the directory structure of FasterTransformer:
/src/fastertransformer: source code of FasterTransformer
    |--/cutlass_extensions: Implementation of cutlass gemm/kernels.
    |--/kernels: CUDA kernels for different models/layers and operations, like addBiasResidual.
    |--/layers: Implementation of layer modules, like attention layer, ffn layer.
    |--/models: Implementation of different models, like BERT, GPT.
    |--/tensorrt_plugin: encapsulate FasterTransformer into TensorRT plugin.
    |--/tf_op: custom TensorFlow OP implementation
    |--/th_op: custom PyTorch OP implementation
    |--/triton_backend: custom triton backend implementation
    |--/utils: Contains common cuda utils, like cublasMMWrapper, memory_utils
/examples: C++, TensorFlow and PyTorch interface examples
    |--/cpp: C++ interface examples
    |--/pytorch: PyTorch OP examples
    |--/tensorflow: TensorFlow OP examples
    |--/tensorrt: TensorRT examples
/docs: Documents to explain the details of implementation of different models, and show the benchmark
/benchmark: Contains the scripts to run the benchmarks of different models
/tests: Unit tests
/templates: Documents to explain how to add a new model/example into FasterTransformer repo
Note that many folders contain sub-folders that separate the different models. Quantization tools have been moved to examples, such as examples/tensorflow/bert/bert-quantization/ and examples/pytorch/bert/bert-quantization-sparsity/.
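To confirm that a local checkout matches the layout described above, a short script can report which of the listed directories are absent. This is a minimal sketch: `EXPECTED_DIRS` is transcribed from the directory listing in this README, and `missing_dirs` is a hypothetical helper, not a tool shipped with the repository.

```python
from pathlib import Path

# Directories named in the README's layout, relative to the repository root.
EXPECTED_DIRS = [
    "src/fastertransformer",
    "examples/cpp",
    "examples/pytorch",
    "examples/tensorflow",
    "examples/tensorrt",
    "docs",
    "benchmark",
    "tests",
    "templates",
]


def missing_dirs(repo_root):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

Calling `missing_dirs(".")` from the repository root should return an empty list if the checkout follows the documented layout.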
Global Environment
FasterTransformer provides some convenient environment variables for debugging and testing.
1.…