What does this model signal mean?

Cerebras published cerebras/Cerebras-GPT-2.7B. This model signal is evidence of what shipped on model infrastructure and how the release is positioned. High-signal details: license apache-2.0 · 458 HF downloads · Routine model release, low traction (524 downloads).. onlylabs links this event to 1 captured evidence page and 6 related model signals.

Cerebras Model: cerebras/Cerebras-GPT-2.7B

Captured source

source ↗

Hugging Face/huggingface.co/cerebras/Cerebras-GPT-2.7B

cerebras/Cerebras-GPT-2.7B model card

Source ↗

published Mar 20, 2023seen Jun 6captured Jun 11http 200method plaintask text-generationlicense apache-2.0library transformersdownloads 458likes 47

Cerebras-GPT 2.7B

Check out our Blog Post and arXiv paper!

Model Description

The Cerebras-GPT family is released to facilitate research into LLM scaling laws using open architectures and data sets and demonstrate the simplicity of and scalability of training LLMs on the Cerebras software and hardware stack. All Cerebras-GPT models are available on Hugging Face.

The family includes 111M, 256M, 590M, 1.3B, 2.7B, 6.7B, and 13B models.

All models in the Cerebras-GPT family have been trained in accordance with Chinchilla scaling laws (20 tokens per model parameter) which is compute-optimal.

These models were trained on the Andromeda AI supercomputer comprised of 16 CS-2 wafer scale systems. Cerebras' weight streaming technology simplifies the training of LLMs by disaggregating compute from model storage. This allowed for efficient scaling of training across nodes using simple data parallelism.

Cerebras systems for pre-training and fine tuning are available in the cloud via the Cerebras Model Studio. Cerebras CS-2 compatible checkpoints are available in Cerebras Model Zoo.

Model Details

Developed by: Cerebras Systems
License: Apache 2.0
Model type: Transformer-based Language Model
Architecture: GPT-3 style architecture
Data set: The Pile
Tokenizer: Byte Pair Encoding
Vocabulary Size: 50257
Sequence Length: 2048
Optimizer: AdamW, (β1, β2) = (0.9, 0.95), adam_eps = 1e−8 (1e−9 for larger models)
Positional Encoding: Learned
Language: English
Learn more: Dense Scaling Laws Paper for training procedure, config files, and details on how to use.

Contact: To ask questions about Cerebras-GPT models, join the Cerebras Discord.

This is the standard parameterization version of Cerebras-GPT with 2.7B parameters

Related models: Cerebras-GPT Models

| Model | Parameters | Layers | d_model | Heads | d_head | d_ffn | LR | BS (seq) | BS (tokens) | |---------------|------------|--------|---------|-------|--------|--------|----------|----------|----------------| | Cerebras-GPT | 111M | 10 | 768 | 12 | 64 | 3072 | 6.0E-04 | 120 | 246K | | Cerebras-GPT | 256M | 14 | 1088 | 17 | 64 | 4352 | 6.0E-04 | 264 | 541K | | Cerebras-GPT | 590M | 18 | 1536 | 12 | 128 | 6144 | 2.0E-04 | 264 | 541K | | Cerebras-GPT | 1.3B | 24 | 2048 | 16 | 128 | 8192 | 2.0E-04 | 528 | 1.08M | | Cerebras-GPT | 2.7B | 32 | 2560 | 32 | 80 | 10240 | 2.0E-04 | 528 | 1.08M | | Cerebras-GPT | 6.7B | 32 | 4096 | 32 | 128 | 16384 | 1.2E-04 | 1040 | 2.13M | | Cerebras-GPT | 13B | 40 | 5120 | 40 | 128 | 20480 | 1.2E-04 | 720 → 1080 | 1.47M → 2.21M |

Quickstart

This model can be easily loaded using the AutoModelForCausalLM functionality:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("cerebras/Cerebras-GPT-2.7B")
model = AutoModelForCausalLM.from_pretrained("cerebras/Cerebras-GPT-2.7B")

text = "Generative AI is "

And can be used with Hugging Face Pipelines

from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
generated_text = pipe(text, max_length=50, do_sample=False, no_repeat_ngram_size=2)[0]
print(generated_text['generated_text'])

or with model.generate()

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5,
max_new_tokens=50, early_stopping=True,
no_repeat_ngram_size=2)
text_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(text_output[0])

Training data

Cerebras-GPT is trained using the Pile dataset from EleutherAI. See the Pile paper for a more detailed breakdown of data sources and methodology. The Pile was cleaned using the ftfy library to normalize the text, then filtered using scripts provided by Eleuther.

We tokenized the data using byte-pair encoding using the GPT-2 vocabulary. Our tokenized version of the Pile has 371B tokens. We include more details about the training dataset preprocessing in Appendix A.1 of our paper.

Recent works find significant duplicate data present in the Pile. Eleuther’s Pythia applies a deduplication process to reduce replicated data, decreasing the Pile dataset size. Pythia was trained on both the standard dataset and deduplicated dataset to characterize the impact. Our models are trained on the standard Pile without deduplication, which may present an opportunity for further improvement with the deduplicated data set.

Training procedure

We use the GPT-3 style model architecture. All of our layers use full attention as opposed to the GPT-3 style sparse banded attention. The model shapes were selected to either follow aspect ratio 80 or are the same shape as GPT-3 models. Learning rate warmed up for 375M tokens (1500 steps for 111M and 256M models) and 10x cosine decayed. No dropout was used and weight decay was set to 0.1. All models are trained with MSL of 2048.

All models were trained to Chinchilla point: 20 tokens per model parameter. Number of steps was chosen based on optimal batch size (varied by model) and fixed sequence length (2048).See Training Table, below, for details.

Model Params | Sequence Length | Batch Size | Number of Steps | Tokens | Tokens per Parameter | Flops ------------ | -------------- | ---------- | --------------- | ------ | -------------------- | ----- 111M | 2048 | 120 | 9037 | 2.22E+09 | 20 | 2.6E+18 256M | 2048 | 264 | 9468 | 5.12E+09 | 20 | 1.3E+19 590M | 2048 | 264 | 21836 | 1.18E+10 | 20 | 6.1E+19 1.3B | 2048 | 528 | 24334 | 2.63E+10 | 20 | 2.8E+20 2.7B | 2048 | 528 | 49041 | 5.30E+10 | 20 | 1.1E+21 6.7B | 2048 | 1040 | 62522 | 1.33E+11 | 20 | 6.3E+21 13B | 2048 | 720 | 174335 | 2.57E+11 | 20 | 2.3E+22

Evaluations

We trained models from smallest to largest and fit a power law as we went along. The power law was helpful for extrapolating the validation loss of the next largest model we trained and provided confidence about whether the...

Excerpt shown — open the source for the full document.

Notability

notability 4.0/10

Routine model release, low traction (524 downloads).