sambanova/bloomchat
Python
Captured source
source ↗sambanova/bloomchat
Description: This repo contains the data preparation, tokenization, training and inference code for BLOOMChat. BLOOMChat is a 176 billion parameter multilingual chat model based on BLOOM.
Language: Python
License: NOASSERTION
Stars: 583
Forks: 52
Open issues: 0
Created: 2023-05-16T22:51:12Z
Pushed: 2023-10-10T20:53:21Z
Default branch: main
Fork: no
Archived: no
README: 
BLOOMChat Training Repo
Overview
This repo contains the data preparation, tokenization, training and inference code for BLOOMChat-176B-v1. BLOOMChat is a 176 billion parameter multilingual chat model. It is instruction tuned from BLOOM (176B) on assistant-style conversation datasets and supports conversation, question answering and generative answers in multiple languages.
We trained BLOOMChat on SambaNova DataScale systems using SambaNova’s unique Reconfigurable Dataflow Architecture The training data used to train BLOOMChat originated from OIG dataset from OpenChatKit, Dolly 2.0, and OASST1.
Basic Information
- Blog Post: Link
- Discord: Link
- HF Hosting: Chat with me!
- BLOOMChat-176B-v1: Link
Training Procedure
Environment setup
- Clone SambaNova's Generation Data Preparation repo
- Create a virtual environment
- Set up environment using the above repo's instructions
- Run this command
pip install datasets
Data Preprocessing
Further preprocessing had been done on the original datasets. You can find the relevant code under [data prep](data_prep).
To run these files: 1. cd data_prep 2. python prepare_oa_dolly.py 3. python subsample_openchatkit.py
After running these commands there should be 2 files under the data_prep directory:
oasst1_dolly.jsonlbloom_ock_100K_each.jsonl
NOTE these files are referenced in the tokenization code, so when running the tokenization scripts they need to be done within tokenization_prep otherwise the file paths will need to be changed.
DISCLAIMER: The OIG dataset preprocessed for BLOOMChat might not be 100% reproducible using OIG dataset from OpenChatKit. We will update soon about the other steps necessary to reproduce the dataset.
Dataset Tokenization
The next step after preprocessing the data is to tokenize the data using SambaNova's Generation Data Preparation repo. The scripts that utilize this public repository can be found under [tokenization prep](tokenization_prep).
Follow the instructions on how to set up Generation Data Preparation.
For the bash scripts they take in one argument which is the absolute path to the generation data preparation repo.
Example of running the script:
1. cd tokenization_prep 2. bash tokenization_oa_dolly.sh /home/.../generative_data_prep 3. bash tokenization_openchatkit.sh /home/.../generative_data_prep
After running these scripts there should be 2 directories under the tokenization_prep directory:
oasst1_dolly_outbloom_ock_100k_out
These directories contain two directories:
hdf5splits
The splits directory contains the original text, the hdf5 directory contains the tokenized text which will be fed into the model.
Training
As our models were trained on SambaNova's in-house Reconfigurable Dataflow Unit (RDU), our scripts will not work for training on GPU; however, we want to give better insight and transparency on the hyper-parameters that were used to train this model. The scripts that were ran on the RDU can be found under [training](training). The model is ran directly from HuggingFace, using a built-in wrapper provided by SambaFlow, our SDK for the RDU. For those interested in running models on RDUs, please feel free to get in touch.
NOTE: BLOOMChat is a two step process:
1. First train a model using OIG dataset from OpenChatKit 2. Second train the above model using Dolly 2.0 and OASST1 (Final Model)
Training Data
Quick Start Inference on GPU
First create a python virtual environment for these packages
pipenv --python 3.9 pipenv sync
TODO: Please add instructions on how to install deepspeed as needed for the bloom-inference.
Now let's git clone the huggingface/transformers-bloom-inference repo.
git clone https://github.com/huggingface/transformers-bloom-inference.git cd transformers-bloom-inference/
And then you need to modify two files in this transformers-bloom-inference repo:
- Modifying
inference_server/models/hf_accelerate.py - This is because for our testing of this repo we used 4 80GB A100 GPUs and would run into memory issues
- Modifying
inference_server/cli.py - This is because the model was trained using specific human, bot tags
- Trailing spaces may lead to subpar performance
Modifications for inference_server/models/hf_accelerate.py:
diff --git a/inference_server/models/hf_accelerate.py…
Excerpt shown — open the source for the full document.