togethercomputer/Llama-2-7B-32K-Instruct
Language: Python
License: Apache-2.0
Stars: 84
Forks: 5
Open issues: 3
Created: 2023-08-14T21:18:51Z
Pushed: 2023-08-18T15:59:11Z
Default branch: main
Fork: no
Archived: no
README:
Building Llama-2-7B-32K-Instruct Using Together API
In our blog post, we released the Llama-2-7B-32K-Instruct model, fine-tuned using the Together API. In this repo, we share the complete recipe. We encourage you to try out the Together API and give us feedback! The fine-tuning process is carried out in four simple steps: Distill, Train, Test, and Deploy.
(Step I) - Distill
Llama-2-7B-32K-Instruct is fine-tuned over a combination of two data sources:
1. 19K single- and multi-round conversations built from human instructions and [Llama-2-70B-Chat](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) outputs. We collected the dataset following the distillation paradigm used by Alpaca, Vicuna, WizardLM, and Orca: producing instruction data by querying a powerful LLM (in this case, Llama-2-70B-Chat). The complete dataset is also released here.
2. Long-context Summarization and Long-context QA. We follow the recipe of Llama-2-7B-32K, and train our model with the BookSum dataset and Multi-document Question Answering (MQA).
The final data mixture used for model finetuning is: 19K instruction (50%) + BookSum (25%) + MQA (25%).
To gather the instruction data from Llama-2-70B-Chat, we first use the Together API to query the model. Given an instruction such as
instruction = "Create a table about national parks in the US"
we can query the inference API using:
import requests

# endpoint and api_key (your Together API key) must be defined before this call
res = requests.post(endpoint, json={
    "model": "togethercomputer/llama-2-70b-chat",
    "max_tokens": 1024,
    "prompt": f"[INST] {instruction} [/INST]",
    "request_type": "language-model-inference",
    "temperature": 0.7,
    "top_p": 0.7,
    "top_k": 50,
    "repetition_penalty": 1,
    "stop": [
        "[INST]"
    ],
    "safety_model": "",
    "repetitive_penalty": 1
}, headers={
    "Authorization": "Bearer " + api_key,
})

where "[INST] {instruction} [/INST]" is the prompt format of Llama-2-70B-Chat. We then parse out the result with
response = res.json()["output"]["choices"][0]["text"]
In this case, it returns
---
Sure, here's a table about national parks in the US:
| National Park | Location | Established | Area (acres) | Notable Features |
| --- | --- | --- | --- | --- |
| Yellowstone | Wyoming, Montana, Idaho | 1872 | 2,219,790 | Geysers, hot springs, wildlife |
| Grand Canyon | Arizona | 1919 | 1,218,375 | Colorado River, canyon, scenic views |
| Yosemite | California | 1890 | 747,956 | Granite cliffs, waterfalls, giant sequoias |
| Zion | Utah | 1919 | 146,597 | Canyons, sandstone cliffs, unique rock formations |
| Great Smoky Mountains | North Carolina, Tennessee | 1926 | 522,426 | Mountains, forests, waterfalls, wildlife |
| Rocky Mountain | Colorado | 1915 | 265,795 | Mountains, alpine lakes, glaciers, wildlife |
| Acadia | Maine | 1916 | 33,255 | Coastline, mountains, forests, wildlife |
| Olympic | Washington | 1938 | 922,650 | Rainforests, mountains, coastline, wildlife |
| Grand Teton | Wyoming | 1929 | 310,044 | Mountains, glaciers, lakes, wildlife |
| Great Basin | Nevada | 1986 | 5,000 | 5,000-year-old bristlecone pine trees, limestone caverns |
| Arches | Utah | 1971 | 7,200 | Over 2,000 natural arches, sandstone formations |
| Bryce Canyon | Utah | 1928 | 35,835 | Hoodoos, amphitheater-shaped park, scenic views |
| Canyonlands | Utah | 1964 | 337,598 | Canyons, mesas, buttes, desert landscapes |
| Death Valley | California, Nevada | 1994 | 3,373,063 | Badwater Basin, salt flats, sand dunes, unique geology |
| Denali | Alaska | 1917 | 4,740,911 | Mount Denali, glaciers, wildlife, dog sledding |
| Everglades | Florida | 1935 | 1,508,537 | Mangrove forests, sawgrass marshes, diverse wildlife |
| Glacier | Montana | 1910 | 1,012,837 | Glaciers, alpine lakes, mountains, wildlife |
| Glacier Bay | Alaska | 1925 | 3,223,373 | Fjords, glaciers, mountains, wildlife |
Note: This table lists some of the most well-known national parks in the US, but there are many others that are also worth visiting. The area of each park is approximate and may vary slightly depending on the source.
---
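The request and parsing steps above can be wrapped in two small helpers. This is only a sketch: `build_payload` and `extract_text` are hypothetical names that do not come from the repo, the payload copies a subset of the fields shown earlier, and the response shape is assumed to match the `["output"]["choices"][0]["text"]` path used above.

```python
def build_payload(instruction: str, max_tokens: int = 1024) -> dict:
    """Build the JSON body for one single-turn request (subset of the fields shown above)."""
    return {
        "model": "togethercomputer/llama-2-70b-chat",
        "max_tokens": max_tokens,
        # Llama-2 chat prompt format: the instruction is wrapped in [INST] ... [/INST]
        "prompt": f"[INST] {instruction} [/INST]",
        "request_type": "language-model-inference",
        "temperature": 0.7,
        "top_p": 0.7,
        "top_k": 50,
        "repetition_penalty": 1,
        # Stop as soon as the model starts writing a new instruction on its own
        "stop": ["[INST]"],
    }


def extract_text(response_json: dict) -> str:
    """Return the generated text from a decoded response, or raise a readable error."""
    try:
        return response_json["output"]["choices"][0]["text"].strip()
    except (KeyError, IndexError, TypeError) as exc:
        raise ValueError(f"unexpected response shape: {response_json!r}") from exc
```

With these, the call becomes `requests.post(endpoint, json=build_payload(instruction), headers=...)` followed by `extract_text(res.json())`, so a malformed response fails with a clear message instead of a bare `KeyError`.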
To build Llama-2-7B-32K-Instruct, we collect instructions from 19K human inputs extracted from ShareGPT-90K (only using human inputs, not ChatGPT outputs). The actual script handles multi-turn conversations and also supports restarting and caching via a SQLite3 database. You can find the full script here, in just 122 lines!
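To illustrate how multi-turn handling and SQLite-backed caching might fit together, here is a minimal sketch. It is not the repo's script: the table layout, the function names (`open_cache`, `cached_query`, `distill_conversation`), and the choice to key the cache on the full prompt are all assumptions made for this example. `query_fn` stands in for whatever function sends a prompt to the inference API and returns the generated text.

```python
import sqlite3
from typing import Callable, List


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache database; pass a file path to survive restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (prompt TEXT PRIMARY KEY, answer TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def cached_query(conn: sqlite3.Connection, prompt: str,
                 query_fn: Callable[[str], str]) -> str:
    """Return the stored answer for this prompt, calling query_fn only on a cache miss."""
    row = conn.execute("SELECT answer FROM cache WHERE prompt = ?", (prompt,)).fetchone()
    if row is not None:
        return row[0]
    answer = query_fn(prompt)
    conn.execute(
        "INSERT OR REPLACE INTO cache (prompt, answer) VALUES (?, ?)", (prompt, answer)
    )
    conn.commit()  # commit per answer so an interrupted run loses at most one query
    return answer


def distill_conversation(conn: sqlite3.Connection, human_turns: List[str],
                         query_fn: Callable[[str], str]) -> str:
    """Replay the human turns in order, feeding the growing history back as the prompt."""
    text = ""
    for turn in human_turns:
        prompt = f"{text}[INST] {turn} [/INST]"
        answer = cached_query(conn, prompt, query_fn)
        text = f"{prompt} {answer.strip()} "
    return text.strip()
```

Because each prompt contains the whole conversation so far, rerunning the script after a crash replays finished turns from the cache and only queries the model for turns it has not yet answered.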
The output of this step is a jsonl file, each line corresponding to one conversation:
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}

Finally, we perform stratified sampling over the three data sources with ratios 19K instruction (50%) + BookSum (25%) + MQA (25%), and concatenate the result into a single instructions.jsonl file.
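One plausible way to realize the 50% / 25% / 25% mixture is sketched below. This is an assumption about the procedure, not the repo's code: `mix_sources` and `write_jsonl` are hypothetical helpers, the target `total` is a free parameter, and a source that is too small for its share is sampled with replacement.

```python
import json
import random
from typing import Dict, List


def mix_sources(sources: Dict[str, List[dict]], ratios: Dict[str, float],
                total: int, seed: int = 0) -> List[dict]:
    """Draw round(total * ratio) records from each source, then shuffle the result."""
    rng = random.Random(seed)  # fixed seed keeps the mixture reproducible
    mixed: List[dict] = []
    for name, ratio in ratios.items():
        pool = sources[name]
        k = round(total * ratio)
        if k <= len(pool):
            picked = rng.sample(pool, k)  # without replacement
        else:
            # Assumption: upsample with replacement when the source is too small
            picked = [rng.choice(pool) for _ in range(k)]
        mixed.extend(picked)
    rng.shuffle(mixed)
    return mixed


def write_jsonl(records: List[dict], path: str) -> None:
    """Write one JSON object per line, matching the {"text": ...} format above."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

For example, `mix_sources(sources, {"instruction": 0.5, "booksum": 0.25, "mqa": 0.25}, total=len(sources["instruction"]) * 2)` would keep every instruction record once and size the other two sources to match, after which `write_jsonl(mixed, "instructions.jsonl")` produces the training file.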
(Step II) - Train
The second step is to fine-tune the Llama-2-7B-32K model using the instruction data we just collected. First, upload the dataset using [Together…
Excerpt shown — open the source for the full document.