OpenBMB (MiniCPM) | published Jun 17, 2023

Official Tutorial — How to Fine-tune CPM-Bee on Basic Tasks

By OpenBMB | Medium

After being open-sourced on May 27th, CPM-Bee, a Chinese-English bilingual foundation model with ten billion parameters, received a warm response on GitHub and quickly climbed to 4th place on GitHub Trending repositories and 3rd place among trending Python repositories.

CPM-Bee is a foundation model, and we open-sourced it with the core purpose of providing broader support for various NLP application scenarios, so that everyone can freely adapt it to different tasks. We incorporated special designs during pre-training for more stable outputs. Fine-tuning CPM-Bee on suitable data enables it to produce higher-quality, more informative content.

Many friends in our community have been looking forward to a formal, detailed tutorial on how to fine-tune the model. So here is our CPM-Bee Basic Fine-tuning Tutorial!

CPM-Bee Data Format Introduction

The CPM-Bee foundation model can address various natural language processing tasks through a unified generation approach. CPM-Bee adopts a special multitask pretraining mode, where all data is managed using a single dictionary. We can design key-value pairs in the dictionary to express the tasks we want the model to perform, while reserving an <ans> field to store the model's output answer. Note that the <ans> field is required. The basic format is as follows:

{"some_key": "...", "<ans>": ""}
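As a minimal sketch of this dictionary format in Python: the helper name make_record is hypothetical, and the answer key is assumed to be the literal string "<ans>". A record can be assembled and serialized as one JSON line like this:

```python
import json

ANSWER_KEY = "<ans>"  # assumption: the reserved field that holds the model's answer


def make_record(fields, answer=""):
    """Build one record: the task fields plus the required answer slot."""
    if ANSWER_KEY in fields:
        raise ValueError("pass the answer through the 'answer' argument")
    record = dict(fields)
    # Empty string at inference time, the gold answer when fine-tuning.
    record[ANSWER_KEY] = answer
    return record


record = make_record(
    {"input": "It's a nice day,", "prompt": "continue the text for 100 words"}
)
# ensure_ascii=False keeps Chinese text human-readable in the output.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Writing one such line per example gives a JSON Lines file, which is a common way to store this kind of data. Whether your own pipeline expects exactly that layout is something to verify.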

Although the input data format can be arbitrary, we highly recommend using the following reference formats when performing inference with CPM-Bee, since the model saw only a limited set of data formats during pretraining.

1. Text Generation

# Text Generation
{"input": "It's a nice day,", "prompt": "continue the text for 100 words", "<ans>": ""}

The prompt field is used to provide hints and specify tasks for the model. Although this field is not mandatory, we recommend using a well-crafted prompt to better guide the model. The key name prompt can also be replaced with words such as "hint," "task," "target," or their equivalents in other languages. Note that the prompt often includes control instructions such as "continue the text for xxx words," "translate from Chinese to English," or "generate a summary for this paragraph."

2. Translation

# Translation
{"input": "今天天气不错,", "prompt": "中翻英", "<ans>": ""}

CPM-Bee currently supports translation between Chinese and English. In the example above, the input means "The weather is nice today," and the prompt "中翻英" means "Chinese to English." Common prompt options for translation tasks include "Translate from Chinese to English," "Translate from English to Chinese," "Translate the Chinese paragraph to English," "Translate English text to Chinese," and their variations.

3. Q&A

# Q&A (prompt "问答" means "Q&A"; the question asks "How is the weather today?")
{"input": "今天天气不错,", "prompt": "问答", "question": "今天天气怎么样", "<ans>": ""}

4. Multiple Choice

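# Multiple Choice (prompt "选择题" means "multiple choice"; the options are "good" and "bad")
{"input": "今天天气不错,", "prompt": "选择题", "question": "今天天气怎么样", "options": {"<option_0>": "好", "<option_1>": "坏"}, "<ans>": ""}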

The options key can be replaced with equivalents such as "answers," "candidates," or "choice."
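The multiple-choice layout can be produced programmatically. The following is a hedged sketch: make_choice_record is a hypothetical helper, and the "<option_i>" and "<ans>" key names are assumptions about how CPM-Bee labels candidates and the answer slot.

```python
def make_choice_record(text, question, candidates, prompt="选择题"):
    """Build a multiple-choice record with one numbered key per candidate."""
    # Assumption: candidates are keyed "<option_0>", "<option_1>", ...
    options = {f"<option_{i}>": c for i, c in enumerate(candidates)}
    return {
        "input": text,
        "prompt": prompt,  # default means "multiple choice"
        "question": question,
        "options": options,
        "<ans>": "",  # left empty for the model to fill in
    }


choice_record = make_choice_record(
    "今天天气不错,", "今天天气怎么样", ["好", "坏"]
)
print(choice_record["options"])
```

Numbering the keys from the candidate list keeps the option labels consistent no matter how many candidates a question has.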

5. Named Entity Recognition

# NER
{"input": "Xiaonan, who works at the Ministry of Justice, said that the weather in Beijing is nice today.", "<ans>": {"Person": "", "Location": "", "Organization": ""}}
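In the NER format the answer slot is itself a dictionary with one empty entry per entity type. A small sketch of building that template follows, where make_ner_record is a hypothetical helper and the "<ans>" key name is an assumption:

```python
def make_ner_record(text, entity_types):
    """Build an NER record whose answer slot has one empty entry per type."""
    # Assumption: the nested answer dictionary lives under the "<ans>" key.
    return {"input": text, "<ans>": {etype: "" for etype in entity_types}}


ner_record = make_ner_record(
    "Xiaonan, who works at the Ministry of Justice, said that the weather "
    "in Beijing is nice today.",
    ["Person", "Location", "Organization"],
)
print(ner_record["<ans>"])
```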

These are the data formats for some common tasks. The fields used in the examples are not strictly fixed, and you may substitute semantically similar wording. For example, you can replace "Translate from Chinese to English" with "Translate this passage into English". During fine-tuning, you are also free to design the data format according to your needs. For instance, if you want to fine-tune a dialogue model, you can construct the data as follows:

{"input": "User: Hello, I would like to ask about tomorrow's weather.\nAI: Hello! Tomorrow's weather will vary depending on the city you are in. Please let me know your city.\nUser: I'm in Beijing.\nAI:", "<ans>": " Tomorrow's weather in Beijing is forecasted to be cloudy to partly cloudy, with a high temperature of 26°C and a low temperature of 18°C."}

You can also construct the data in a different form, as shown below:

{"input": "Hello, I would like to ask about tomorrow's weather.\nHello! Tomorrow's weather will vary depending on the city you are in. Please let me know your city.\nI'm in Beijing.\n", "<ans>": " Tomorrow's weather in Beijing is forecasted to be cloudy to partly cloudy, with a high temperature of 26°C and a low temperature of 18°C."}

In short, you are free to define your own data format for fine-tuning.
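As an illustration of the dialogue layout, the sketch below flattens a list of turns into a single input string that ends with an open "AI:" cue. The helper name make_dialogue_record, the "User"/"AI" speaker labels, and the "<ans>" key are assumptions chosen to mirror the first dialogue example, not a fixed CPM-Bee API.

```python
def make_dialogue_record(history, reply=""):
    """Flatten (speaker, utterance) turns into one dialogue training record.

    The history is expected to end with a user turn, and the reply is the
    AI response the model should learn to generate.
    """
    lines = [f"{speaker}: {utterance}" for speaker, utterance in history]
    lines.append("AI:")  # open cue for the response to be generated
    return {"input": "\n".join(lines), "<ans>": reply}


history = [
    ("User", "Hello, I would like to ask about tomorrow's weather."),
    ("AI", "Hello! Please let me know your city."),
    ("User", "I'm in Beijing."),
]
dialogue_record = make_dialogue_record(
    history, reply=" Tomorrow's weather in Beijing is forecasted to be cloudy."
)
print(dialogue_record["input"])
```

A multi-turn conversation can be expanded into several records this way, one per AI reply, each carrying the preceding turns as context.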

CPM-Bee Fine-tuning Process

In this tutorial, we use a sequence-to-sequence task as an example to demonstrate how to fine-tune the CPM-Bee foundation model. The task presents a sentence in vernacular Chinese and asks the model to select the matching line of classical poetry from several candidates. First, prepare the raw data in the following format:

{"target": "3", "input": "[Translate]昏暗的灯熄灭了又被重新点亮。[0]渔灯灭复明[1]残灯灭又然[2]残灯暗复明[3]残灯灭又明[Answer]"}
  • Place the data under the path src/ccpm_example/raw_data/.
  • Prepare the model's checkpoint and place it at the path src/ckpts/pytorch_model.bin. You can download the weights from this link.

Then navigate to the working directory:

$ cd src

To reformat the data, run data_reformat.py. Here we are converting the raw data into the recommended format described above. In your own experiments, you can choose the format you want and write your own data_reformat.py script.

$ python data_reformat.py

Data format after adjustment:

{"input": "昏暗的灯熄灭了又被重新点亮。", "options": {"<option_0>": "渔灯灭复明", "<option_1>": "残灯灭又然", "<option_2>": "残灯暗复明", "<option_3>": "残灯灭又明"}, "question": "这段话形容了哪句诗的意境?", "<ans>": "<option_3>"}
  • Place the data under the path src/ccpm_example/bee_data/.
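To make the conversion concrete, here is a hedged sketch of what a reformatting script might do for a single raw record. It is not the actual data_reformat.py from the repository. The function name reformat_record is hypothetical, and the "<option_i>" and "<ans>" keys, as well as storing the answer as the matching option key, are assumptions.

```python
import json
import re

# "Which line of poetry does this passage describe?"
QUESTION = "这段话形容了哪句诗的意境?"
PREFIX, SUFFIX = "[Translate]", "[Answer]"


def reformat_record(raw):
    """Convert one raw record into the multiple-choice reference format."""
    body = raw["input"]
    if not (body.startswith(PREFIX) and body.endswith(SUFFIX)):
        raise ValueError("unexpected raw input layout")
    body = body[len(PREFIX):-len(SUFFIX)]
    # Splitting on the "[0]", "[1]", ... markers gives the sentence first,
    # followed by the candidate lines in order.
    text, *candidates = re.split(r"\[\d+\]", body)
    options = {f"<option_{i}>": c for i, c in enumerate(candidates)}
    answer_key = f"<option_{raw['target']}>"
    if answer_key not in options:
        raise ValueError("target does not match any candidate")
    return {
        "input": text,
        "options": options,
        "question": QUESTION,
        "<ans>": answer_key,
    }


raw = {
    "target": "3",
    "input": "[Translate]昏暗的灯熄灭了又被重新点亮。[0]渔灯灭复明"
             "[1]残灯灭又然[2]残灯暗复明[3]残灯灭又明[Answer]",
}
bee = reformat_record(raw)
print(json.dumps(bee, ensure_ascii=False))
```

A full script would apply this function to every line of the files under the raw data directory and write the results to the output directory. The file names and the loop are left out here because the excerpt does not specify them.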

Please note that this format is a reference format. When fine-tuning, you are free to design your own data format. You can omit the prompt field as long as the provided data contains all the necessary information. However, we generally recommend identifying the input text field as input/document/doc. If…
