OpenBMB (MiniCPM), published Jul 15, 2023

Speak and draw! VisCPM:SOTA Open-source Chinese Multimodal Large Model


The Tsinghua University NLP Laboratory, Modelbest, and Zhihu recently open-sourced a series of multimodal large models named VisCPM. Evaluation results indicate that VisCPM achieves the best performance among Chinese open-source multimodal models.

VisCPM is a family of open-source large multimodal models that supports multimodal conversation (the VisCPM-Chat model) and text-to-image generation (the VisCPM-Paint model) in both Chinese and English, achieving state-of-the-art performance among Chinese open-source multimodal models. VisCPM is built on CPM-Bee, a large language model with 10 billion parameters, and fuses it with a visual encoder (Q-Former) and a visual decoder (Diffusion-UNet) to support visual inputs and outputs. Thanks to the strong bilingual capability of CPM-Bee, VisCPM can be pretrained on English multimodal data only and still generalize to promising Chinese multimodal capabilities.


👐 Open-source Usage: VisCPM is free to use for personal and research purposes. By open-sourcing the VisCPM model family, we hope to promote the development of the open-source community around large multimodal models and related research.

🌟 Image and text generation coverage: VisCPM models provide relatively comprehensive image-text multimodal support, covering both multimodal conversation (image-to-text generation) and text-to-image generation.

💫 Excellent bilingual performance: Thanks to the excellent bilingual capability of the base language model CPM-Bee, VisCPM achieves outstanding results in both bilingual multimodal conversation and text-to-image generation.

VisCPM 🔗: https://github.com/OpenBMB/VisCPM

VisCPM-Chat: supports bilingual multimodal conversations with images

VisCPM-Chat supports multimodal conversations about images in both Chinese and English. The model uses Q-Former as the visual encoder and CPM-Bee (10B) as the base LLM. The visual and language models are combined and optimized with a language modeling training objective. Training consists of two stages: pretraining and instruction tuning.

  • Pretraining: VisCPM-Chat is pretrained on approximately 100M high-quality English text-image pairs. The data sources include CC3M, CC12M, COCO, Visual Genome, Laion, etc. In this stage, the language model parameters remain fixed, and only the parameters of the Q-Former are updated, which enables efficient alignment of vision and language representations.
  • Instruction Tuning: We use the LLaVA-150K dataset, which contains English multimodal instruction-following data, and mix it with the corresponding translated Chinese data to fine-tune the model and align its multimodal capabilities with user intents. In this stage, we update all model parameters to improve the data efficiency of instruction tuning. Interestingly, we observe that even when fine-tuned on English instruction data only, the model comprehends Chinese questions well but can only respond in English. This indicates that the model's multimodal capabilities generalize well across languages. By incorporating a small amount of translated Chinese data during the instruction tuning stage, we can align the model’s response language with the language of the user’s question.
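The two stages above differ mainly in which parameters are updated and which data is used. The following is a minimal sketch of that schedule, not the actual VisCPM training code: the component names `q_former` and `llm` and the data labels are hypothetical placeholders chosen for illustration, and the real model may contain further components that are not modeled here.

```python
from dataclasses import dataclass

# Hypothetical component names; the article names only the Q-Former and the CPM-Bee LLM.
COMPONENTS = ("q_former", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # components whose parameters are updated in this stage
    data: tuple           # informal labels for the data used (assumed names)


STAGES = (
    # Stage 1: the language model stays fixed, only the Q-Former is updated.
    Stage("pretraining", frozenset({"q_former"}), ("english_text_image_pairs",)),
    # Stage 2: all parameters are updated on mixed English and translated Chinese data.
    Stage(
        "instruction_tuning",
        frozenset({"q_former", "llm"}),
        ("llava_150k_en", "llava_150k_zh_translated"),
    ),
)


def requires_grad(stage: Stage) -> dict:
    """Map each component to whether it receives updates in the given stage."""
    return {component: component in stage.trainable for component in COMPONENTS}


for stage in STAGES:
    print(stage.name, requires_grad(stage), stage.data)
```

In a real framework the same mapping would typically be applied by toggling gradient tracking on each parameter group before a stage starts.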

We evaluate the model on the standard LLaVA English test set and on a Chinese test set translated from it. The benchmark examines the model’s performance in conversation, detailed description, and complex reasoning, and uses GPT-4 for scoring. VisCPM-Chat achieves the best average performance in Chinese multimodal capabilities, excelling in conversation and complex reasoning, while also demonstrating good English multimodal capabilities.
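To make the notion of "average performance" concrete, here is a small sketch of how per-question scores could be aggregated into per-category means and an overall average. It assumes the average is an unweighted mean over the three categories, which the article does not state. The function name `summarize` and the numbers in the usage example are placeholders, not reported results.

```python
from statistics import mean

CATEGORIES = ("conversation", "detailed_description", "complex_reasoning")


def summarize(scores):
    """Aggregate (category, score) pairs into per-category means and their average.

    Assumption: the overall figure is an unweighted mean of the category means.
    """
    summary = {
        category: mean(score for cat, score in scores if cat == category)
        for category in CATEGORIES
    }
    summary["average"] = mean(summary[category] for category in CATEGORIES)
    return summary


# Placeholder judge scores for illustration only.
example_scores = [
    ("conversation", 8.0),
    ("conversation", 6.0),
    ("detailed_description", 7.0),
    ("complex_reasoning", 9.0),
]
print(summarize(example_scores))
```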


We provide two versions of the model, namely VisCPM-Chat-balance and VisCPM-Chat-zhplus. The former has a balanced ability in both English and Chinese, while the latter has a stronger emphasis on Chinese proficiency. Both models use the same data during the instruction tuning stage. VisCPM-Chat-zhplus additionally incorporates 20M cleaned native Chinese text-image pairs and 120M translated text-image pairs in Chinese during the pretraining stage.


Here are demonstrations of multimodal conversations with VisCPM-Chat:


VisCPM-Paint: supports bilingual text-to-image generation

VisCPM-Paint supports bilingual text-to-image generation. The model uses CPM-Bee as the text encoder and UNet as the image decoder, and fuses the vision and language models using the diffusion model training objective. During training, the parameters of the language model remain fixed. The visual decoder is initialized with the parameters of Stable Diffusion 2.1 and is fused with the language model by gradually unfreezing key bridging parameters. The model is trained on the LAION 2B English text-image pair dataset.
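The idea of gradually unfreezing bridging parameters while the language model stays fixed can be sketched as a step-based schedule. This is an illustration under stated assumptions, not the VisCPM-Paint training recipe: the article does not say which parameters count as "bridging" or when they are unfrozen, so the group names and step thresholds below are hypothetical.

```python
# Hypothetical schedule: (training step at which a group becomes trainable, group name).
UNFREEZE_SCHEDULE = (
    (0, "bridge_group_a"),
    (1000, "bridge_group_b"),
    (5000, "bridge_group_c"),
)

# The language model is never updated, per the description above.
ALWAYS_FROZEN = ("language_model",)


def trainable_groups(step: int) -> list:
    """Return the groups that have been unfrozen by the given training step."""
    return [name for start, name in UNFREEZE_SCHEDULE if step >= start]


def is_trainable(group: str, step: int) -> bool:
    """A group is trainable if it is not permanently frozen and its start step has passed."""
    if group in ALWAYS_FROZEN:
        return False
    return group in trainable_groups(step)


for step in (0, 1000, 5000):
    print(step, trainable_groups(step))
```

Unfreezing in stages keeps most of the pretrained decoder stable early on while the newly connected text encoder and decoder adapt to each other.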

Similar to VisCPM-Chat, we found that, thanks to the bilingual capability of CPM-Bee, VisCPM-Paint achieves good Chinese text-to-image generation when trained only on English text-image pairs, surpassing the performance of Chinese open-source models. Incorporating an additional 20M cleaned native Chinese text-image pairs and 120M translated text-image pairs in Chinese further improves the model’s Chinese text-to-image generation ability.

Like VisCPM-Chat, VisCPM-Paint comes in two versions: balance and zhplus. We sample 30,000 images from the standard image generation test set MSCOCO and compute the commonly used FID (Fréchet Inception Distance) metric to assess the quality of the generated images.
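For readers unfamiliar with FID: it fits a Gaussian to the Inception features of real images and another to those of generated images, then computes FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)), where lower is better. The sketch below is a simplified special case that assumes diagonal covariances so it needs no matrix square root. It is not the evaluation code used for VisCPM-Paint, and the function name `fid_diagonal` is made up for this example.

```python
import math


def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """Frechet distance between two Gaussians with diagonal covariances.

    mu_*  : per-dimension feature means (real / generated)
    var_* : per-dimension feature variances (real / generated)
    With diagonal covariances the trace term reduces to a per-dimension sum.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(
        vr + vg - 2.0 * math.sqrt(vr * vg) for vr, vg in zip(var_r, var_g)
    )
    return mean_term + cov_term


# Identical statistics give a distance of zero.
print(fid_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))
# A shifted mean and a larger variance both increase the distance.
print(fid_diagonal([0.0], [1.0], [3.0], [4.0]))
```

A full implementation would estimate full covariance matrices from Inception features and use a matrix square root.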


The following are demonstrations of images generated by VisCPM-Paint:


VisCPM offers different model versions with…
