Qwen (Alibaba Cloud) · published Nov 14, 2022

OFA: Towards Building a One-For-All Model




November 14, 2022 · 9 min · 1876 words · Qwen Team

2022 is a year of generalist models! With the bloom of multimodal pretraining, especially unified models, we have witnessed the opportunity to build a generalist model that can process tasks in a single modality or across multiple modalities. Thus, we propose OFA [1], namely One-For-All, a unified multimodal pretrained model that brings understanding and generation tasks across modalities into a single framework. We pretrain OFA with instruction-based multitask pretraining, which endows it with multiple capabilities. We have open-sourced both the pretrained and finetuned models, hoping this pioneering work can help accelerate the development of generalist models.

Paper · GitHub · ModelScope · Demo

Background

Multimodal pretraining has been developing rapidly ever since BERT [2] was transferred to cross-modal representation learning. Representative studies include UNITER [3], ViLBERT [4], etc. These studies directly incorporate the Transformer-based BERT [2] into a single-stream or dual-stream framework for multimodal pretraining, and transform the image into a sequence of object features that is concatenated with the word embeddings as the input to the Transformer.

Later in 2021, with the rise of the Vision Transformer [5], methods emerged that dropped object-level features, which depend on complex preprocessing pipelines such as Faster R-CNN [6]. Examples include the simple ViLT [7], based on patch projection, and the CLIP-based [8] CLIP-ViL [9]. A subsequent milestone was SimVLM [10], which leverages the T5/BART approach for multimodal pretraining and achieves new SoTA on many tasks. This progress should be regarded as the foundation of the unified multimodal pretrained models of 2022, including our OFA, Unified-IO [11], Flamingo [12], BEiT-3 [13], etc.

Method

What OFA wants to achieve is the unification of tasks, modalities, and architecture.
We posit that a unified model should have three properties: it should be task agnostic, modality agnostic, and task comprehensive. “Task agnostic” means that the unified model should be able to accept new tasks without modifying its architecture or training method. “Modality agnostic” means that the unified model should accept inputs of different modalities without knowing what they are and without complex, modality-specific preprocessing. “Task comprehensiveness” means that the unified model should learn as many tasks as possible, so that it can transfer to unseen tasks by composing existing capabilities. Thus, we propose three types of unification for OFA, namely the unification of modalities, architecture, and tasks. Let’s go through them one by one.

For the unification of modalities, one key issue is the tokenization, that is, the discretization, of inputs of different modalities. Without discretization, generation would require other solutions such as diffusion models [14]. There is no need to change the tokenization for text, but images and bounding boxes need to be discretized. Owing to the success of vector quantization [15] [16] and of text-to-image generation with Transformers [17] [18], images can be represented with VQ tokens. Inspired by pix2seq [19], bounding boxes can also be discretized into bins.

For the architecture, we choose the general-purpose Transformer encoder-decoder, owing to its successful use in unified NLP models such as T5 [20]. Note that to feed images into the Transformer, we use the first three blocks of ResNet. We also modify the Transformer design by incorporating NormFormer [21] to improve training stability and transfer performance.

Multitask learning is the key innovation of OFA. Specifically, we pretrain the model with 8 tasks, including 5 vision-language tasks, 2 vision tasks, and 1 language task.
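The bin-based discretization of bounding boxes mentioned above can be illustrated with a minimal sketch. This is not OFA's actual implementation: the bin count (`NUM_BINS = 1000`), the function names, and the `<bin_k>` token spelling are all assumptions chosen for illustration. The idea is simply to normalize each coordinate by the image size, round it to the nearest of a fixed number of bins, and treat each bin index as a vocabulary token.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels

# Assumption: the bin count is illustrative; the post does not state it.
NUM_BINS = 1000


def quantize_box(box: Box, width: float, height: float,
                 num_bins: int = NUM_BINS) -> Tuple[int, int, int, int]:
    """Map pixel coordinates to integer bin indices in [0, num_bins - 1]."""
    scale = num_bins - 1

    def to_bin(value: float, size: float) -> int:
        frac = min(max(value / size, 0.0), 1.0)  # clamp to the image area
        return int(round(frac * scale))

    x0, y0, x1, y1 = box
    return (to_bin(x0, width), to_bin(y0, height),
            to_bin(x1, width), to_bin(y1, height))


def dequantize_box(bins: Tuple[int, int, int, int], width: float,
                   height: float, num_bins: int = NUM_BINS) -> Box:
    """Map bin indices back to approximate pixel coordinates."""
    scale = num_bins - 1
    bx0, by0, bx1, by1 = bins
    return (bx0 / scale * width, by0 / scale * height,
            bx1 / scale * width, by1 / scale * height)


def box_tokens(bins: Tuple[int, int, int, int]) -> List[str]:
    """Spell bin indices as tokens (hypothetical token format)."""
    return [f"<bin_{b}>" for b in bins]


if __name__ == "__main__":
    bins = quantize_box((48.0, 60.0, 320.5, 411.0), width=640, height=480)
    print(box_tokens(bins))
    print(dequantize_box(bins, width=640, height=480))
```

With this scheme a box becomes four extra tokens in the same sequence as the text, so a sequence-to-sequence model can read or emit locations without a dedicated detection head. The cost is a rounding error of at most half a bin per coordinate, which shrinks as the number of bins grows.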
The vision-language tasks are visual grounding, grounded captioning, visual question answering, image-text matching, and image captioning. The vision tasks are detection and image infilling. The language task is text infilling. To help the model differentiate tasks, we insert an instruction, which is simply a piece of text describing the task. We therefore expect the model to perform zero-shot generation when given a new instruction that describes an unseen task.

To make this research as reproducible as possible, our pretraining relies on public datasets. We therefore expect that researchers following this work can reproduce our results with our open-sourced code. We have released OFA models in 5 sizes: OFA-Tiny (33M), OFA-Medium (93M), OFA-Base (180M), OFA-Large (470M), and OFA-Huge (930M). See the table below for more statistics.

Experiments

We have conducted experiments on multiple cross-modal and unimodal tasks. On vision-language understanding, we test the models on VQA and SNLI-VE. We find that the huge-size model achieves performance comparable to the 80B-parameter Flamingo and the 2B-parameter CoCa, which is pretrained on 5B image-text pairs. Furthermore, we achieve the best performance on visual entailment.

For vision-language generation, we focus on classic image captioning, where OFA achieves SoTA performance under both cross-entropy optimization and CIDEr optimization. We have also recast visual grounding as a generation task, and we find that even the base-size OFA outperforms the previous SoTA, and that scaling up the model size consistently improves performance. This shows the significance of unifying modalities and tasks.

Additionally, we test OFA on text-to-image generation, as we believe that the image infilling task in pretraining endows it with the capability to generate image codes.
We show that OFA achieves a low FID score in the evaluation, and that further finetuning on a larger dataset significantly boosts its performance. See cases below.

As for the unimodal tasks, we evaluate OFA on the GLUE benchmark for NLU, Gigaword summarization for NLG, and ImageNet classification for vision understanding. We show that OFA can be competitive with both RoBERTa and DeBERTa, and the previous multimodal…
