OpenBMB/BMPrinciples
Description: A collection of phenomenons observed during the scaling of big foundation models, which may be developed into consensus, principles, or laws in the future
License: MIT
Stars: 285
Forks: 20
Open issues: 1
Created: 2023-05-10T13:24:03Z
Pushed: 2023-08-13T09:08:36Z
Default branch: main
Fork: no
Archived: no
README:
BM-Principles
🌟 The big models have proven their potential to lead to artificial general intelligence. However 😕, due to their rapid development, people have not fully grasped the principles of understanding and training big models. Therefore, in order to learn about big models together, we have decided to collect new phenomena observed on the big models and summarize them in this repository 📚 in the form of short entries. We hope this collection of phenomena observed during the scaling of big models may form future consensuses, principles, or patterns 📝.
The repository focuses on two aspects:
- How: How to train powerful big models? 🚀
- What: What properties are interesting for big models? 🤔
The repo is currently far from exhaustive. Let's work together to improve it! 💪
How: how to train a powerful big model.
1. Scaling of Computation
1. Training loss decreases predictably.
- Training loss can be written as a smooth function of model parameters and computation.
> Scaling Laws for Neural Language Models
> Scaling Laws for Autoregressive Generative Modeling
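For reference, the power-law forms reported in Scaling Laws for Neural Language Models can be written as below. The notation is that paper's rather than this README's: $N$ is the number of non-embedding parameters, $D$ the number of training tokens, $C_{\min}$ the optimally allocated training compute, and $N_c$, $D_c$, $C_c$, $\alpha_N$, $\alpha_D$, $\alpha_C$ are fitted constants. Each form is stated for the regime where the other two quantities are not the bottleneck.

$$
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) = \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C}
$$

The exponents fitted in that paper are small (roughly $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, $\alpha_C \approx 0.050$), so the curves are straight lines on a log-log plot and the loss falls slowly but predictably. The values depend on the data and tokenizer and should be read as indicative only.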
2. Compute-optimal language model.
- Given a fixed computational budget, if we train an excessively large model, we can only iterate for a very limited number of steps. On the other hand, if we train a model that is too small, the limit of the loss will not be as good as that of a larger model. Therefore, there exists an *optimal model size*, *optimal training compute*, and *optimal tokens*.
- From previous experience, the optimal number of training tokens is roughly $20N$, where $N$ is the number of model parameters.
> Training Compute-Optimal Large Language Models
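The rule of thumb above can be turned into a small calculator. This is a minimal sketch, not the procedure from the cited paper: it assumes the common approximation $C \approx 6ND$ for training FLOPs (not stated in this README) together with the $D \approx 20N$ rule quoted above. The function names are hypothetical.

```python
import math


def rule_of_thumb_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Tokens suggested by the ~20 tokens-per-parameter rule of thumb."""
    return tokens_per_param * n_params


def compute_optimal_allocation(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (parameters, tokens).

    Assumption: C ~= 6 * N * D. Combined with D = tokens_per_param * N this
    gives N = sqrt(C / (6 * tokens_per_param)).
    """
    if flops_budget <= 0:
        raise ValueError("flops_budget must be positive")
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    return n_params, rule_of_thumb_tokens(n_params, tokens_per_param)


if __name__ == "__main__":
    budget = 5.76e23  # hypothetical FLOP budget
    n, d = compute_optimal_allocation(budget)
    print(f"params ~ {n:.3e}, tokens ~ {d:.3e}")
```

Under the same rule, a 7B-parameter model corresponds to about 140B tokens and a 13B-parameter model to about 260B tokens.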
3. LLMs do not converge at the compute-optimal number of tokens.
- An LLM may continue to improve its loss after it has seen the compute-optimal number of tokens.
- From the training loss of LLaMA-7B and LLaMA-13B, we can see that the loss continues to improve after 140B and 260B tokens, respectively.
> LLaMA: Open and Efficient Foundation Language Models
2. Optimal Hyperparameters.
1. The best batch size is a function of loss.
- To reach a certain loss, a large batch size requires more computation, while a small batch size requires more training steps (i.e., more time). The best batch size is a trade-off between the two.
- Each diagonal line formed by the points represents a training process. The horizontal axis represents the training steps, the vertical axis represents the number of processed tokens, and the color depth represents the loss. The optimal batch size can be considered as the inflection point of each contour line of loss.
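The trade-off between steps and processed tokens can be sketched with the simple model from An Empirical Model of Large-Batch Training, which the scaling-laws paper also uses. This is an illustration under that model's assumptions, not a result from this repository: `s_min` is the number of steps needed with a very large batch, `e_min` is the number of tokens needed with a very small batch, and their ratio acts as a critical batch size. All numbers are hypothetical.

```python
def steps_and_tokens(batch_tokens: float, s_min: float, e_min: float):
    """Steps and total tokens needed to reach one fixed target loss.

    Assumed model: (S / s_min - 1) * (E / e_min - 1) = 1 with E = B * S,
    which gives a critical batch size b_crit = e_min / s_min.
    """
    if batch_tokens <= 0 or s_min <= 0 or e_min <= 0:
        raise ValueError("all arguments must be positive")
    b_crit = e_min / s_min
    steps = s_min * (1.0 + b_crit / batch_tokens)
    tokens = e_min * (1.0 + batch_tokens / b_crit)
    return steps, tokens


if __name__ == "__main__":
    s_min, e_min = 1e4, 2e10  # hypothetical values for one target loss
    for batch in (2.5e5, 2e6, 1.6e7):
        steps, tokens = steps_and_tokens(batch, s_min, e_min)
        print(f"batch={batch:.1e}: steps={steps:.3e}, tokens={tokens:.3e}")
```

In this model, a batch far below the critical size costs many extra steps for little token savings, and a batch far above it costs many extra tokens for little time savings. Because `s_min` and `e_min` depend on the target loss, so does the critical batch size.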
> Scaling Laws for Neural Language Models
2. A large batch size allows a large learning rate.
- Generally, a larger batch size allows a larger learning rate, and a larger learning rate leads to faster convergence.
> Don't Decay the Learning Rate, Increase the Batch Size
3. The cosine scheduler is prevalent.
- The cosine scheduler is the most widely used one, and it performs better than Noam with the same peak learning rate. Noam's learning rate decreases more sharply.
- Below is our experiment for CPM.
4. The cosine learning rate's period should be set to the end step.
- From 2.3, you might wonder whether keeping the learning rate high is good for training. It is not.
- When you want to train for $N$ steps, it is best to set the period of the scheduler to $N$, neither larger nor smaller.
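The two schedulers discussed in 2.3 and 2.4 can be sketched as follows. This is a minimal illustration, not the code used for the CPM experiment: the linear warmup, the decay to zero, and the rescaling of Noam so that both schedules share the same peak are assumptions made here for comparison, and the function names are hypothetical.

```python
import math


def cosine_lr(step: int, peak_lr: float, warmup_steps: int, period: int) -> float:
    """Linear warmup to peak_lr, then cosine decay that reaches 0 at step == period."""
    if period <= warmup_steps:
        raise ValueError("period must be larger than warmup_steps")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Progress through the decay phase, clamped so the rate stays at 0 after `period`.
    progress = min(1.0, (step - warmup_steps) / (period - warmup_steps))
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def noam_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Noam schedule rescaled so that its maximum is peak_lr at step == warmup_steps."""
    step = max(step, 1)
    return peak_lr * min(step / warmup_steps, math.sqrt(warmup_steps / step))


if __name__ == "__main__":
    total_steps = 100_000  # hypothetical training length N
    for period in (total_steps // 2, total_steps, 2 * total_steps):
        final = cosine_lr(total_steps, 1e-3, 1_000, period)
        print(f"period={period}: learning rate at the last step = {final:.2e}")
```

With `period` equal to the training length, the learning rate reaches its minimum exactly at the last step. A longer period ends training while the rate is still high, and a shorter one spends the remaining steps at the minimum. Shortly after warmup the Noam rate has already dropped by its inverse square root factor, while the cosine rate is still close to the peak.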
3. Predictable Scaling.
1. Pass rate on HumanEval can be predicted with 1/10000 compute.
- It is important to forecast a model's ability before it is trained. OpenAI's GPT-4 report presented the first version of predictable scaling: it estimates the HumanEval pass rate of the final model from runs that use a small fraction of its compute.
- Currently, there is no other public result for predicting the downstream metrics for large models.
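The idea of forecasting a downstream metric from small runs can be sketched as a power-law fit. The GPT-4 technical report describes a fit of the form $-\mathbb{E}[\log(\text{pass rate}(C))] = \alpha \cdot C^{-k}$. The sketch below fits $\alpha$ and $k$ by least squares in log-log space and extrapolates to a larger compute budget. All data points are synthetic and exist only to exercise the code. The function names are hypothetical.

```python
import math


def fit_power_law(compute, neg_log_pass):
    """Fit neg_log_pass ~= alpha * compute**(-k) by linear regression on logs."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in neg_log_pass]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (alpha, k)


def predict_pass_rate(compute: float, alpha: float, k: float) -> float:
    """exp(-alpha * C**(-k)): a geometric-mean pass rate under the fitted law."""
    return math.exp(-alpha * compute ** (-k))


if __name__ == "__main__":
    # Synthetic small-scale runs generated from a known law (not real measurements).
    small_compute = [1e18, 1e19, 1e20]
    observed = [50.0 * c ** (-0.1) for c in small_compute]
    alpha, k = fit_power_law(small_compute, observed)
    target = 1e24  # 10,000x the largest synthetic run
    print(f"alpha={alpha:.3f}, k={k:.3f}, predicted pass rate={predict_pass_rate(target, alpha, k):.3f}")
```

Real measurements are noisy, so the quality of such an extrapolation depends on how well the small runs match the methodology of the large one.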
4. Model Architecture
1. Architectures in a diverse range have a similar pre-training loss.
> Scaling Laws for Neural Language Models
2. For downstream metrics, we prefer a deep-narrow architecture.
> Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
3. There is no consensus on normalization yet, but pre-norm has been more popular recently.
- Here we list the normalization techniques of publicly known models.
| Model | Normalization |
| ---- | ---- |
| Llama | Pre-norm |
| GLM | PostNorm + DeepNorm |
| Pythia | PostNorm |
| BLOOM | PreNorm |
| StarCoder | PreNorm |

> DeepNet: Scaling Transformers to 1,000 Layers
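The difference between the normalization placements in the table can be sketched on a single residual block. This is a toy illustration on plain Python lists, assuming a layer norm without learnable gain and bias. `sublayer` stands in for attention or a feed-forward layer, and `alpha` is the residual scaling constant from DeepNet, left as a parameter here.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Pre-norm: x + f(LN(x)). The residual path is left unnormalized."""
    out = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, out)]


def post_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Post-norm: LN(x + f(x)). The sum itself is normalized."""
    out = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, out)])


def deepnorm_block(x: Vector, sublayer: Callable[[Vector], Vector], alpha: float) -> Vector:
    """DeepNorm: LN(alpha * x + f(x)), i.e. post-norm with an up-weighted residual."""
    out = sublayer(x)
    return layer_norm([alpha * a + b for a, b in zip(x, out)])


if __name__ == "__main__":
    toy_sublayer = lambda v: [0.5 * t for t in v]  # hypothetical stand-in for a real layer
    x = [1.0, 2.0, 3.0, 4.0]
    print("pre-norm :", pre_norm_block(x, toy_sublayer))
    print("post-norm:", post_norm_block(x, toy_sublayer))
    print("deepnorm :", deepnorm_block(x, toy_sublayer, alpha=2.0))
```

In the pre-norm variant the input passes through the block unchanged on the residual path, whereas in the post-norm variants every block output is renormalized.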
5. Data Mixture
1. Diversity improves zero-shot generalization.
- Diverse cross-domain pretraining data combining web crawls with curated high-quality sources significantly improves zero-shot generalization over pretraining datasets constructed from Common Crawl only.
> What Language Model to Train if You Have One Million GPU Hours?
2. Data proportion is important.
- Re-mixing the domains of the Pile boosts convergence speed and performance.
> DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
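To make "data proportion" concrete, the sketch below shows how a set of domain weights translates into the share of training examples drawn from each domain. It does not implement DoReMi, which learns the weights with a small proxy model. It only illustrates what changing the weights does at sampling time. The domain names and weights are hypothetical.

```python
import random
from collections import Counter
from typing import Dict


def sample_domains(weights: Dict[str, float], n_samples: int, seed: int = 0) -> Counter:
    """Draw n_samples domain labels with probability proportional to the weights."""
    if n_samples < 0 or any(w < 0 for w in weights.values()):
        raise ValueError("n_samples and all weights must be non-negative")
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    return Counter(rng.choices(names, weights=probs, k=n_samples))


if __name__ == "__main__":
    # Hypothetical mixture; a re-mixed dataset would simply use different weights.
    mixture = {"web": 0.6, "code": 0.3, "books": 0.1}
    counts = sample_domains(mixture, n_samples=100_000)
    for name, weight in mixture.items():
        print(f"{name}: target {weight:.2f}, sampled {counts[name] / 100_000:.3f}")
```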
3. Code might contribute to reasoning ability.
- It is widely believed that pre-training on code leads to strong reasoning capability, but there is currently no quantitative verification.
> How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources
What: what properties are interesting for large models?
1. Emergent ability
1. Emergent abilities are observed in models of around 50B parameters or larger.
> Emergent Abilities of Large Language Models
2. Popular methods only work on large models.
- Prompt tuning and delta tuning work well for models larger than 1B parameters.
- In-context learning and chain-of-thought reasoning work for larger models.
>…