zai-org/GLM-130B
Description: GLM-130B: An Open Bilingual Pre-Trained Model (ICLR 2023)
Language: Python
License: Apache-2.0
Stars: 7651
Forks: 603
Open issues: 124
Created: 2022-08-03T20:21:58Z
Pushed: 2023-07-25T09:01:49Z
Default branch: main
Fork: no
Archived: no
README:
🌐 Blog • ⏬ Download Model • 🪧 Demo • ✉️ Email • 📃 Paper [ICLR 2023]
💬 Google Group (Updates) or Wechat Group or Slack channel (Discussions)
GLM-130B: An Open Bilingual Pre-Trained Model
GLM-130B is an open bilingual (English & Chinese) bidirectional dense model with 130 billion parameters, pre-trained using the General Language Model (GLM) algorithm. It is designed to support inference with all 130B parameters on **a single A100 (40G * 8) or V100 (32G * 8) server. With INT4 quantization, the hardware requirements can be further reduced to a single server with 4 * RTX 3090 (24G) with almost no performance degradation**. As of July 3rd, 2022, GLM-130B has been trained on over 400 billion text tokens (200B each for Chinese and English), and it has the following unique features:
- Bilingual: supports both English and Chinese.
- Performance (EN): better than GPT-3 175B (+4.0%), OPT-175B (+5.5%), and BLOOM-176B (+13.0%) on LAMBADA and slightly better than GPT-3 175B (+0.9%) on MMLU.
- Performance (CN): significantly better than ERNIE TITAN 3.0 260B on 7 zero-shot CLUE datasets (+24.26%) and 5 zero-shot FewCLUE datasets (+12.75%).
- Fast Inference: supports fast inference on both SAT and FasterTransformer (up to 2.5X faster) with a single A100 server.
- Reproducibility: all results (30+ tasks) can be easily reproduced with open-sourced code and model checkpoints.
- Cross-Platform: supports training and inference on NVIDIA, Hygon DCU, Ascend 910, and Sunway (Will be released soon).
This repository mainly focuses on the evaluation of GLM-130B. If you find our work and our open-source efforts useful, please ⭐️ the repository to encourage our future development! :)
News
- [2023.06.25] We release ChatGLM2-6B, an updated version of ChatGLM-6B that introduces Stronger Performance (MMLU (+23%), CEval (+33%), GSM8K (+571%), BBH (+60%)), Longer Context (from 2K in ChatGLM-6B to 32K, trained with a context length of 8K during the dialogue alignment), and More Efficient Inference (42% faster under the official implementation; the dialogue length supported by 6G of GPU memory has increased from 1K to 8K). For more details, please refer to ChatGLM2-6B.
- [2023.06.14] We release the research WebGLM, which enables efficient and accurate web-enhanced question answering. All code and data are released!
- [2023.03.14] We are happy to introduce ChatGLM, a bilingual dialogue language model based on GLM-130B, and its open-sourced version ChatGLM-6B which can be run under only 6GB GPU memory!
- [2023.01.21] GLM-130B has been accepted to ICLR 2023!
- [2022.10.06] Our paper for GLM-130B is out!
- [2022.08.24] We are proud to publish the quantized version for GLM-130B. While preserving the activation precision as FP16, the model weights can be quantized to as low as INT4 with almost no degradation of performance, further reducing the hardware requirements of the GLM-130B to **a single server with 4 * RTX 3090 (24G)**! See [Quantization of GLM-130B](docs/quantization.md) for details.
For smaller models, see the monolingual GLMs (English: 10B/2B/515M/410M/335M/110M, Chinese: 10B/335M) and a 1B multilingual GLM (104 languages).
Getting Started
Environment Setup
Hardware
| Hardware        | GPU Memory (per GPU) | Quantization | Weight Offload |
| --------------- | -------------------- | ------------ | -------------- |
| 8 * A100        | 40 GB                | No           | No             |
| 8 * V100        | 32 GB                | No           | Yes (BMInf)    |
| 8 * V100        | 32 GB                | INT8         | No             |
| 8 * RTX 3090    | 24 GB                | INT8         | No             |
| 4 * RTX 3090    | 24 GB                | INT4         | No             |
| 8 * RTX 2080 Ti | 11 GB                | INT4         | No             |
It is recommended to use an A100 (40G * 8) server, as all reported GLM-130B evaluation results (~30 tasks) can be reproduced on a single A100 server in about half a day. With INT8/INT4 quantization, efficient inference on **a single server with 4 * RTX 3090 (24G)** is possible; see [Quantization of GLM-130B](docs/quantization.md) for details. By combining quantization and weight offloading, GLM-130B can also run inference on servers with even less GPU memory; see [Low-Resource Inference](docs/low-resource-inference.md) for details.
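The hardware table can be sanity-checked with simple arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes decimal gigabytes, counts the weights alone (activations and runtime overhead are ignored), and the helper names `weight_gb` and `fits` are illustrative, not part of this repository.

```python
# Back-of-the-envelope check: do the weights fit in aggregate GPU memory?
# Assumptions: decimal gigabytes, weights only (activations and overhead ignored).

PARAMETERS = 130e9  # 130 billion parameters, as stated in the README

BYTES_PER_WEIGHT = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_gb(precision: str) -> float:
    """Approximate size of the model weights in GB at the given precision."""
    return PARAMETERS * BYTES_PER_WEIGHT[precision] / 1e9


def fits(num_gpus: int, gb_per_gpu: float, precision: str) -> bool:
    """True if the weights alone fit in the combined memory of all GPUs."""
    return weight_gb(precision) <= num_gpus * gb_per_gpu


# (GPU count, GB per GPU, weight precision), one entry per row of the hardware table
CONFIGS = [
    (8, 40, "FP16"),  # 8 * A100
    (8, 32, "FP16"),  # 8 * V100, no quantization
    (8, 32, "INT8"),  # 8 * V100
    (8, 24, "INT8"),  # 8 * RTX 3090
    (4, 24, "INT4"),  # 4 * RTX 3090
    (8, 11, "INT4"),  # 8 * RTX 2080 Ti
]

for num_gpus, gb_per_gpu, precision in CONFIGS:
    total = num_gpus * gb_per_gpu
    verdict = "fits" if fits(num_gpus, gb_per_gpu, precision) else "does not fit"
    print(f"{num_gpus} x {gb_per_gpu} GB = {total} GB; "
          f"{precision} weights = {weight_gb(precision):.0f} GB: {verdict}")
```

Under these assumptions, the only configuration whose weights exceed the aggregate GPU memory is 8 * V100 without quantization (260 GB of FP16 weights against 256 GB), which is consistent with that row being the one that relies on weight offloading.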
Software
The GLM-130B code is built on top of SAT. We recommend using Miniconda to manage your environment and installing additional dependencies via `pip install -r requirements.txt`. Here are the recommended environment configurations:
- Python 3.9+ / CUDA 11+ / PyTorch 1.10+ / DeepSpeed 0.6+ / Apex (installation with CUDA and C++ extensions is required, see [here](https://github.com/NVIDIA/apex/#linux))
- SwissArmyTransformer>=0.2.11 is required for quantization
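As a quick pre-flight check, the version requirements above could be verified with a short script. This is a minimal sketch, not a tool shipped with the repository: it assumes the packages are installed under the distribution names `torch`, `deepspeed`, and `SwissArmyTransformer`, uses a deliberately simple version parser, and does not check CUDA or Apex.

```python
import re
import sys
from importlib import metadata

# Minimum versions taken from the list above; distribution names are assumptions.
MINIMUMS = {"torch": "1.10", "deepspeed": "0.6", "SwissArmyTransformer": "0.2.11"}


def parse_version(text):
    """Turn '1.10.0+cu113' into (1, 10, 0); anything after the numeric prefix is ignored."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment():
    """Return a dict mapping each requirement to True (satisfied) or False."""
    report = {"python": sys.version_info >= (3, 9)}
    for package, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = False  # not installed at all
        else:
            report[package] = meets_minimum(installed, minimum)
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

Numeric tuple comparison is used because plain string comparison would wrongly rank `1.9` above `1.10`.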
Model weights
Download the GLM-130B model checkpoint from here and make sure all 60 chunks are downloaded completely. Then use the following commands to merge them into a single archive file and extract it:
cat glm-130b-sat.tar.part_* > glm-130b-sat.tar
tar xvf glm-130b-sat.tar
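Where a shell is not convenient, the same merge-and-extract step can be sketched in Python. This is an illustrative equivalent, not a script from the repository: the function name `merge_and_extract` is hypothetical, and it assumes the chunk suffixes sort lexicographically in the correct order (as the shell glob `part_*` also relies on). It refuses to merge unless the expected number of chunks is present.

```python
import shutil
import tarfile
from pathlib import Path


def merge_and_extract(part_dir, archive_name="glm-130b-sat.tar",
                      expected_parts=60, extract_to=None):
    """Concatenate `<archive_name>.part_*` chunks into one tar file and extract it."""
    part_dir = Path(part_dir)
    # Sorting mirrors the order in which the shell expands `part_*`.
    parts = sorted(part_dir.glob(f"{archive_name}.part_*"))
    if len(parts) != expected_parts:
        raise FileNotFoundError(
            f"expected {expected_parts} chunks, found {len(parts)}")

    archive = part_dir / archive_name
    with open(archive, "wb") as merged:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, merged)  # streamed copy, low memory use

    target = Path(extract_to) if extract_to else part_dir
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support safer extraction filters.
            tar.extractall(target, filter="data")
        else:
            tar.extractall(target)
    return target
```

Note that the merged archive is written next to the chunks, so roughly twice the checkpoint size in free disk space is needed before the chunks are deleted.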
Set `CHECKPOINT_PATH` in `configs/model_glm_130b.sh` to the path of the extracted folder. Since the checkpoint is up to 260G, it is recommended to use an SSD or RAM disk to reduce checkpoint loading time. The checkpoint we distribute is in 8-way tensor parallel, so a conversion script is also provided in case you need to change the tensor parallel dimension:
python tools/convert_tp.py \
    --input-folder \
    --output-folder \
    --target-tp
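When the conversion is driven from Python, for instance as part of a setup script, the command can be assembled programmatically. The sketch below only builds and runs the command line shown above. The helper names are hypothetical, the paths and the target value `4` in the usage comment are placeholder examples rather than values taken from the repository, and the set of tensor-parallel sizes the script accepts is not verified here.

```python
import subprocess
import sys


def build_convert_command(input_folder, output_folder, target_tp,
                          script="tools/convert_tp.py"):
    """Build the argument list for the tensor-parallel conversion script."""
    target_tp = int(target_tp)
    if target_tp < 1:
        raise ValueError("target_tp must be a positive integer")
    return [
        sys.executable, script,
        "--input-folder", str(input_folder),
        "--output-folder", str(output_folder),
        "--target-tp", str(target_tp),
    ]


def convert_checkpoint(input_folder, output_folder, target_tp):
    """Run the conversion from the repository root; raises if the script fails."""
    subprocess.run(
        build_convert_command(input_folder, output_folder, target_tp),
        check=True,
    )


# Example (hypothetical paths):
# convert_checkpoint("/data/glm-130b-sat", "/data/glm-130b-sat-tp4", 4)
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems with paths that contain spaces.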