CPM-Ant: The First Big Model in CPM-Live Project
By OpenBMB | Medium
Dec 29, 2022
Previous columns covered OpenBMB's open-source community and core tools in detail. Building on these tools, we've launched CPM-Live, a live training project for open-source big models.
CPM-Ant is the first milestone of the CPM-Live project. Although the CPM-Ant report was released as early as September, we had not yet covered it on Medium, so we are taking this opportunity to review CPM-Ant with you.
CPM-Ant is our first 10B-parameter pre-trained language model. Its training took 68 days and was completed on August 5, 2022. CPM-Ant offers a practical scheme for training, tuning, compressing, running inference with, and applying big models, which we hope will serve as a useful reference for different readers.
Overview
CPM-Ant is an open-source Chinese pre-trained language model (PLM) with 10B parameters. It is also the first milestone of the live training process of CPM-Live. The training process is cost-effective and environment-friendly. CPM-Ant also achieves promising results with delta tuning on the CUGE benchmark. Besides the full model, we also provide various compressed versions to meet the requirements of different hardware configurations. The code, log files, and checkpoints of CPM-Ant are available under an open license. More specifically, CPM-Ant is:
- Efficient: BMTrain enables us to take full advantage of distributed computing power to train big models efficiently. Training CPM-Ant took 68 days and cost 430K RMB, which is much lower than the cost of existing model training practices. The greenhouse gas (GHG) emissions of training CPM-Ant are about 4,872 kg CO2e, while the emissions of training T5-11B are 46.7 t CO2e [1].
- Effective: OpenDelta enables us to adapt CPM-Ant to downstream tasks through delta tuning. In our experiments, by tuning only 6.3 million parameters, CPM-Ant achieves the best performance on 3 of the 6 CUGE tasks, outperforming baselines that tune all parameters (CPM2 with 11B parameters and Yuan 1.0 with 245B parameters).
- Economical: BMCook & BMInf enable us to drive CPM-Ant with limited computing resources. Based on BMInf, we can efficiently perform big model inference on a single GPU (even a consumer-level GPU such as a GTX 1060) instead of a computing cluster. To make the deployment of CPM-Ant more economical, we use BMCook to further compress the original 10B CPM-Ant into multiple versions. These compressed checkpoints (7B, 3B, 1B, 300M) can meet the requirements of various low-resource scenarios.
- Easy-to-Use: Both the original 10B model and its compressed versions can be loaded and run with only a few lines of code. We will integrate CPM-Ant into ModelCenter soon, making further development on our model easier.
- Egalitarian: The training process of CPM-Ant is completely open. We have released all code, log files, and final checkpoints, and all of them are publicly available. CPM-Ant also adopts an open license that allows commercial use.
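To put the figures in the list above in proportion, here is a small back-of-the-envelope calculation. It uses only the numbers quoted in this article; the variable names are ours, and 10B is taken as the nominal parameter count.

```python
# Figures quoted in the list above.
cpm_ant_emissions_kg = 4872        # CPM-Ant training, kg CO2e
t5_11b_emissions_kg = 46.7 * 1000  # T5-11B training, 46.7 t CO2e converted to kg
tuned_params = 6.3e6               # parameters updated by delta tuning
total_params = 10e9                # nominal size of CPM-Ant (10B)

emissions_ratio = t5_11b_emissions_kg / cpm_ant_emissions_kg
tuned_fraction = tuned_params / total_params

print(f"T5-11B emits about {emissions_ratio:.1f}x the CO2e of CPM-Ant")
print(f"Delta tuning updates about {tuned_fraction:.3%} of the parameters")
```

In other words, the reported emissions are roughly an order of magnitude lower than those reported for T5-11B, and delta tuning touches well under 0.1% of the model's parameters.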
Model Architecture
CPM-Ant is built on our previous work [2,3]. Here we briefly introduce the architecture of CPM-Ant. For more details on CPM-Ant, please refer to our GitHub repository.
Pre-training Objectives
CPM-Ant uses text generation and blank infilling as its pre-training objectives. As shown in the figures below, both text generation and blank infilling are autoregressive. To build self-supervised data for the two objectives, we adopt two masking strategies: masking the tail of the input for text generation, and randomly masking the input for blank infilling. The masking rate follows a uniform distribution U(0,1). For each sample, we choose the random masking strategy (blank infilling) with 50% probability and the tail masking strategy (text generation) with the remaining 50% probability.
Example of text generation.
Example of blank infilling
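The sampling procedure described above can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the actual CPM-Ant data pipeline: the `<mask>` token, the function name `make_sample`, and the choice to mask each token individually (rather than merging adjacent masked tokens into spans) are all hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not CPM-Ant's real vocabulary


def make_sample(tokens, rng):
    """Return (objective, model_input, targets) for one training sample."""
    rate = rng.random()                         # masking rate drawn from U(0, 1)
    n_mask = max(1, round(rate * len(tokens)))  # mask at least one token
    if rng.random() < 0.5:
        # Tail masking for text generation: hide the last n_mask tokens.
        masked = set(range(len(tokens) - n_mask, len(tokens)))
        objective = "text_generation"
    else:
        # Random masking for blank infilling: hide n_mask random positions.
        masked = set(rng.sample(range(len(tokens)), n_mask))
        objective = "blank_infilling"
    model_input = [MASK if i in masked else t for i, t in enumerate(tokens)]
    targets = [tokens[i] for i in sorted(masked)]
    return objective, model_input, targets


rng = random.Random(0)
objective, model_input, targets = make_sample(["CPM", "-", "Ant", "is", "open"], rng)
print(objective, model_input, targets)
```

In both cases the masked tokens become the prediction targets, which is why both objectives can be trained autoregressively.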
Pre-trained Soft Prompts
In CPM-Ant, we introduce pre-trained soft prompts to switch the generation mode. We set separate objective-specific soft prompts for text generation and for blank infilling. Each soft prompt consists of several learnable embeddings. During pre-training, these soft prompts are added to the input and stimulate objective-specific knowledge to process the input. When adapting CPM-Ant to downstream tasks, only the task-related soft prompts are used for tuning. For more details on pre-trained soft prompts, please refer to [4].
Example of adding soft prompts for text generation
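As a rough sketch of the idea, the snippet below keeps one small table of prompt vectors per objective and attaches the matching table to a sequence of token embeddings. The sizes, the random initialization, and the decision to prepend the prompts (rather than place them elsewhere in the input) are assumptions made for illustration; in the real model these vectors are learned during pre-training.

```python
import random

HIDDEN = 4      # toy embedding width (assumption)
PROMPT_LEN = 2  # toy number of prompt vectors per objective (assumption)
rng = random.Random(0)


def random_vectors(n):
    """Create n small random vectors standing in for learnable embeddings."""
    return [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN)] for _ in range(n)]


# One prompt table per pre-training objective.
soft_prompts = {
    "text_generation": random_vectors(PROMPT_LEN),
    "blank_infilling": random_vectors(PROMPT_LEN),
}


def add_soft_prompt(objective, token_embeddings):
    """Prepend the objective-specific prompt vectors to the token embeddings."""
    return soft_prompts[objective] + token_embeddings


tokens = random_vectors(3)  # stand-in for three embedded input tokens
model_input = add_soft_prompt("text_generation", tokens)
print(len(model_input))     # 2 prompt vectors + 3 token vectors
```

Switching the generation mode then amounts to choosing which prompt table is attached to the input.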
Unified Modeling Architecture
Since we want our CPM-Ant to be general enough for various downstream tasks, we use a unified architecture to simultaneously encode contexts and generate tokens, by modifying attention masks to control the generation process, instead of applying the original encoder-decoder architecture of Transformer:
[Equation: attention computation in which the attention mask M is applied via the Hadamard product ⊙]
where M is the attention mask and ⊙ is the Hadamard product. Similar unified encoder architectures have demonstrated their effectiveness and simplicity in preliminary work [5,6].
Unified modeling architecture of Transformer
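To illustrate how a single Transformer stack can both encode context and generate tokens, here is a minimal sketch of one possible mask layout. The layout (context tokens see each other bidirectionally, generated tokens see the context plus earlier generated tokens) is our assumption in the spirit of unified architectures, and `build_unified_mask` and `hadamard` are hypothetical helper names, not functions from the CPM-Ant code base.

```python
def build_unified_mask(n_context, n_generate):
    """M[i][j] == 1 means token i may attend to token j."""
    n = n_context + n_generate
    return [
        [1 if (j < n_context or j <= i) else 0 for j in range(n)]
        for i in range(n)
    ]


def hadamard(a, b):
    """Element-wise (Hadamard) product of two equally shaped matrices."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


mask = build_unified_mask(n_context=2, n_generate=2)
for row in mask:
    print(row)
# [1, 1, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]

# Toy uniform attention weights; the mask zeroes out disallowed positions.
weights = [[0.25] * 4 for _ in range(4)]
masked_weights = hadamard(weights, mask)
```

Changing only the mask is enough to move between encoder-like and decoder-like behavior, which is what lets one set of parameters serve both roles.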
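To further stabilize training, we adopt the Pre-LN residual structure, in which layer normalization is applied to the input of each sublayer before the sublayer output is added back through the residual connection.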
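The generic Pre-LN pattern can be written as y = x + F(LayerNorm(x)), where F is the attention or feed-forward sublayer. The sketch below shows that pattern on a single vector. It is the textbook form, not necessarily the exact normalization used in CPM-Ant, and it omits the learnable gain and bias for brevity.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_residual(x, sublayer):
    """Pre-LN block: normalize first, apply the sublayer, then add the residual."""
    return [xi + fi for xi, fi in zip(x, sublayer(layer_norm(x)))]


x = [1.0, 2.0, 3.0]
y = pre_ln_residual(x, lambda h: [2.0 * v for v in h])  # toy sublayer: scale by 2
print(y)
```

Because the residual path carries x through unnormalized, the signal passes from layer to layer without being rescaled, which is the usual argument for the stability of Pre-LN in deep models.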
Multi-segment Mechanism & Relative Position Bias
We divide the input to CPM-Ant into several segments, each of which carries a specific kind of information. Specifically, we design separate segments for soft prompts, data for blank infilling, and data for text generation. Previous work [7] shows that a strategy encoding only absolute positions cannot capture the relative distances between tokens, so we rely on a relative position bias and use a multi-segment mechanism to organize the data for CPM-Ant. More specifically, for the i-th token, we assign an additional position id p_i and segment id s_i. With the position ids and segment ids, we compute the relative position bias as follows,
where B is the bias matrix used in the attention layer, and f_{s_i,s_j}(·) maps the relative distance between tokens…
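A minimal sketch of a segment-aware relative position bias is given below. It assumes the bias takes the form B[i][j] = f_{s_i,s_j}(p_i - p_j); that form, the sign convention, and the hand-written toy functions are our assumptions for illustration, since the original equation is not reproduced here and a real model would learn these mappings.

```python
def relative_position_bias(position_ids, segment_ids, bias_fns):
    """Build B where B[i][j] = f_{s_i, s_j}(p_i - p_j).

    bias_fns maps a (segment of i, segment of j) pair to a function of the
    relative distance between the two tokens.
    """
    n = len(position_ids)
    return [
        [
            bias_fns[(segment_ids[i], segment_ids[j])](position_ids[i] - position_ids[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy setup: two segments of two tokens each.
position_ids = [0, 1, 0, 1]
segment_ids = [0, 0, 1, 1]


def within_segment(d):
    return -0.1 * abs(d)  # nearer tokens in the same segment get a larger bias


def across_segments(d):
    return -1.0           # constant bias between different segments


bias_fns = {
    (0, 0): within_segment,
    (1, 1): within_segment,
    (0, 1): across_segments,
    (1, 0): across_segments,
}

B = relative_position_bias(position_ids, segment_ids, bias_fns)
for row in B:
    print(row)
```

Keying the mapping on the segment pair lets tokens within the same segment be treated differently from tokens in different segments, even when their position ids coincide.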