ModelSarvam AISarvam AIpublished Aug 13, 2024seen 5d

sarvamai/sarvam-1-v0.5

Open original ↗

Captured source

source ↗
published Aug 13, 2024seen 5dcaptured 13hhttp 200method plaintask text-generationlicense otherlibrary transformersparams 2.5Bdownloads 1.5klikes 101

The fully-trained version of this model is now available at https://huggingface.co/sarvamai/sarvam-1

Update (Aug 15, 2024): You can now get started with text completions and supervised finetuning using this notebook on Google colab!

This is an early checkpoint of sarvam-2b, a small, yet powerful language model pre-trained from scratch on 2 trillion tokens. It is trained to be good at 10 Indic languages + English. Officially, the Indic languages supported are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

The final checkpoint of sarvam-2b will be released soon, and it will be trained on a data mixture of 4 trillion tokens: containing equal parts English (2T) and Indic (2T) tokens.

The current checkpoint has not undergone any post-training. You can see the capabilities of the current checkpoint in this video.

The model was trained with NVIDIA NeMo™ Framework on the Yotta Shakti Cloud using HGX H100 systems.

Getting started:

from transformers import pipeline
pipe = pipeline(model='sarvamai/sarvam-2b-v0.5', device=0)
pipe('भारत के प्रथम प्रधानमंत्री', max_new_tokens=15, temperature=0.1, repetition_penalty=1.2)[0]['generated_text']
# 'भारत के प्रथम प्रधानमंत्री जवाहरलाल नेहरू थे।\n\n'

Tokenizer

sarvam-2b's tokenizer is built to be efficient for Indic languages and has an average fertility score of ~2 which is significantly lower than other models.

Here is a comparison of fertility scores between sarvam-2b and other popular models. | |Sarvam-2B|Llama-3.1|Gemma-2|GPT-4o| |--------|------|---------|-------|------| |ben_Beng|2.07 |8.02 |3.72 |2.34 | |eng_Latn|1.43 |1.24 |1.23 |1.23 | |guj_Gujr|1.81 |9.97 |3.9 |2.3 | |hin_Deva|1.4 |2.67 |1.96 |1.65 | |kan_Knda|2.37 |14.95 |5.55 |3.29 | |mal_Mlym|2.85 |16.26 |5.88 |3.52 | |mar_Deva|1.77 |3.99 |3.2 |2.56 | |ory_Orya|2.35 |16.84 |6.87 |6.83 | |pan_Guru|1.68 |8.19 |3.37 |2.72 | |tam_Taml|2.17 |12.39 |4.19 |3.17 | |tel_Telu|2.14 |13.3 |4.57 |3.06 | |Average |2.08 |9.34 |4.01 |3.00 |

More technical details like evaluations and benchmarking will be posted soon.

Notability

notability 6.0/10

New model release with moderate downloads