Image GPT | OpenAI

June 17, 2020

Image GPT

Read paper View code ICML 2020 Paper (V1)

Illustration: Ben Barry

We find that, just as a large transformer model trained on language can generate coherent text, the exact same model trained on pixel sequences can generate coherent image completions and samples. By establishing a correlation between sample quality and image classification accuracy, we show that our best generative model also contains features competitive with top convolutional nets in the unsupervised setting.

Introduction

Unsupervised and self-supervised learning,[1] or learning without human-labeled data, is a longstanding challenge of machine learning. Recently, it has seen incredible success in language, as transformer[2] models like BERT,[3] GPT‑2,[4] RoBERTa,[5] T5,[6] and other variants[7, 8, 9, 10] have achieved top performance on a wide array of language tasks. However, the same broad class of models has not been successful in producing strong features for image classification.[11] Our work aims to understand and bridge this gap.

Transformer models like BERT and GPT‑2 are domain agnostic, meaning that they can be directly applied to 1-D sequences of any form. When we train GPT‑2 on images unrolled into long sequences of pixels, a model we call iGPT, we find that it appears to understand 2-D image characteristics such as object appearance and category. This is evidenced by the diverse range of coherent image samples it generates, even without the guidance of human-provided labels. As further evidence, features from the model achieve state-of-the-art performance on a number of classification datasets and near state-of-the-art unsupervised accuracy[A] on ImageNet.
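To make the "unrolled into long sequences of pixels" step concrete, here is a minimal sketch. It assumes raster order (left to right, top to bottom) and pixels that are already integer tokens; the function names and the toy image are illustrative, not taken from the iGPT code.

```python
def unroll_image(image):
    """Flatten a 2-D grid of pixel tokens into a 1-D sequence, row by row."""
    return [pixel for row in image for pixel in row]


def reroll_sequence(sequence, width):
    """Inverse of unroll_image: rebuild rows of the given width."""
    if width <= 0 or len(sequence) % width != 0:
        raise ValueError("sequence length must be a multiple of width")
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


# Hypothetical 3x3 image whose pixels are already integer tokens.
toy_image = [
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
]
toy_sequence = unroll_image(toy_image)
print(toy_sequence)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Once flattened, the sequence carries no explicit row or column information, so any 2-D structure has to be inferred by the model from the data.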

| Evaluation | Dataset | Our Result | Best non-iGPT Result |
|---|---|---|---|
| Logistic regression on learned features (linear probe) | CIFAR-10 | 96.3 iGPT‑L 32x32 w/ 1536 features | 95.3 SimCLR[12] w/ 8192 features |
| Logistic regression on learned features (linear probe) | CIFAR-100 | 82.8 iGPT‑L 32x32 w/ 1536 features | 80.2 SimCLR w/ 8192 features |
| Logistic regression on learned features (linear probe) | STL-10 | 95.5 iGPT‑L 32x32 w/ 1536 features | 94.2 AMDIM[13] w/ 8192 features |
| Logistic regression on learned features (linear probe) | ImageNet | 72.0 iGPT‑XL[a] 64x64 w/ 15360 features | 76.5 SimCLR w/ 8192 features |
| Full fine-tune | CIFAR-10 | 99.0 iGPT‑L 32x32, trained on ImageNet | 99.0[b] GPipe,[14] trained on ImageNet |
| Full fine-tune | ImageNet 32x32 | 66.3 iGPT‑L 32x32 | 70.2 Isometric Nets[15] |

a. We only show ImageNet linear probe accuracy for iGPT‑XL since other experiments did not finish before we needed to transition to different supercomputing facilities.
b. BiT-L, trained on JFT (300M images with 18K classes), achieved a result of 99.3.
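The linear probe rows above report logistic regression trained on features from a frozen model. The sketch below shows the idea in a deliberately reduced form: binary rather than multi-class, plain gradient descent, and hand-made 2-D feature vectors standing in for real model activations. All names and values are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, lr=0.5, epochs=500):
    """Fit binary logistic regression on fixed (frozen) feature vectors."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 if the logit is positive, else class 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Hypothetical frozen features; in a real probe these come from the model.
toy_features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
toy_labels = [1, 1, 0, 0]
probe_w, probe_b = train_linear_probe(toy_features, toy_labels)
print([predict(probe_w, probe_b, x) for x in toy_features])
```

Only the probe's weights are updated; the features never change. That is what separates a linear probe from the full fine-tune rows, where the whole model is trained.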

To highlight the potential of generative[17, 18] sequence modeling[19, 20, 21, 22] as a general-purpose unsupervised learning algorithm, we deliberately use the same transformer architecture as GPT‑2 in language. As a consequence, we require significantly more compute in order to produce features competitive with those from top unsupervised convolutional nets.[13, 23, 24, 25, 12] However, our results suggest that when faced with a new domain where the correct model priors are unknown, a large GPT‑2 can learn excellent features without the need for domain-specific[26, 27, 28] architectural design choices.

From language GPT to image GPT

In language, unsupervised learning algorithms that rely on word prediction (like GPT‑2 and BERT) have been extremely successful, achieving top performance on a wide array of language tasks. One possible reason for this success is that instances of downstream language tasks appear naturally in text: questions are often followed by answers (which could help with question-answering) and passages are often followed by summaries (which could help with summarization). In contrast, sequences of pixels do not clearly contain labels for the images they belong to.

Even without this explicit supervision, there is still a reason why GPT‑2 on images might work: a sufficiently large transformer trained on next pixel prediction might eventually learn to generate diverse[B] samples with clearly recognizable objects. Once it learns to do so, an idea known as “Analysis by Synthesis”[29, 30, C] suggests that the model will also know about object categories. Many early generative models[31, 32, 33, 34, 35, 36] were motivated by this idea, and more recently, BigBiGAN[37] was an example which produced encouraging samples and features. In our work, we first show that better generative models achieve stronger classification performance. Then, through optimizing GPT‑2 for generative capabilities, we achieve top-level classification performance in many settings, providing further evidence for analysis by synthesis.
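Next pixel prediction can be read as minimizing the average negative log-likelihood of each pixel given the pixels before it. The sketch below computes that quantity for a toy sequence under two stand-in predictors. Both predictors, the 4-value vocabulary, and the sequence are hypothetical illustrations, not the trained model.

```python
import math


def autoregressive_nll(sequence, next_token_probs):
    """Average negative log-likelihood of a sequence, one token at a time.

    next_token_probs(prefix) must return a list of probabilities over the
    vocabulary for the token that follows the given prefix.
    """
    total = 0.0
    for i, token in enumerate(sequence):
        probs = next_token_probs(sequence[:i])
        total -= math.log(probs[token])
    return total / len(sequence)


def uniform_model(prefix, vocab_size=4):
    """Baseline that ignores the prefix entirely."""
    return [1.0 / vocab_size] * vocab_size


def repeat_model(prefix, vocab_size=4, confidence=0.7):
    """Hypothetical predictor: expects the previous pixel value to repeat."""
    if not prefix:
        return [1.0 / vocab_size] * vocab_size
    rest = (1.0 - confidence) / (vocab_size - 1)
    return [confidence if t == prefix[-1] else rest for t in range(vocab_size)]


toy_pixels = [0, 0, 0, 1, 1, 1, 2, 2]
uniform_loss = autoregressive_nll(toy_pixels, uniform_model)
repeat_loss = autoregressive_nll(toy_pixels, repeat_model)
print(f"uniform: {uniform_loss:.3f}  repeat: {repeat_loss:.3f}")
```

The predictor that exploits local structure gets the lower loss. This is the sense in which a better generative model has captured more about the images.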

Towards general unsupervised learning

Generative sequence modeling is a universal unsupervised learning algorithm: since all data types can be represented as sequences of bytes, a transformer can be directly applied to any data type without additional engineering. Our work tests the power of this generality by directly applying the architecture used to train GPT‑2 on natural language to image generation. We deliberately chose to forgo hand-coding any image-specific knowledge in the form of convolutions[38] or techniques like relative attention,[39] sparse attention,[40] and 2-D position embeddings.[27]
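As a small illustration of the "sequences of bytes" point, the sketch below maps both text and raw pixel values into integer tokens in the range 0 to 255. The helper name is hypothetical, and byte-level tokenization is used here only to show the idea; it is not claimed to be the tokenization used for iGPT.

```python
def to_byte_tokens(data):
    """Represent text or raw bytes as a list of integers in [0, 255]."""
    if isinstance(data, str):
        data = data.encode("utf-8")
    return list(data)


text_tokens = to_byte_tokens("GPT")
pixel_tokens = to_byte_tokens(bytes([255, 128, 0]))  # one orange RGB pixel
print(text_tokens)   # [71, 80, 84]
print(pixel_tokens)  # [255, 128, 0]
```

Both outputs are plain integer sequences over the same vocabulary, so the same sequence model could consume either without modality-specific code.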

As a consequence of its generality, our method requires significantly more compute to achieve competitive performance in the unsupervised setting. Indeed, contrastive methods[41, 42, 43, 44, 45, 13, 23, 24, 25, 12] are still the most computationally efficient methods for producing high-quality features from images. However, in showing that an unsupervised transformer model is competitive with the best unsupervised convolutional nets,[24, 25, 12] we provide evidence that it is possible to trade off hand-coded domain knowledge for compute. In new domains,[46, 47] where there isn’t much knowledge to hand-code, scaling compute seems an appropriate technique to test.

Approach

We train iGPT‑S, iGPT‑M, and iGPT‑L, transformers containing 76M, 455M, and 1.4B parameters respectively, on ImageNet. We also train iGPT‑XL,[D] a 6.8 billion parameter transformer, on a mix of ImageNet and images from the web. Due to the large computational cost of modeling long sequences with dense attention, we train at the low resolutions of 32x32, 48x48, and 64x64.
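A rough back-of-the-envelope calculation shows why resolution matters so much with dense attention: the number of query-key pairs grows with the square of the sequence length. The sketch assumes one token per pixel, which this excerpt does not state; `tokens_per_pixel` is a hypothetical knob (set it to 3 to model one token per RGB channel).

```python
def sequence_length(side, tokens_per_pixel=1):
    """Tokens in a square image of the given side length."""
    return side * side * tokens_per_pixel


def attention_pairs(length):
    """Query-key pairs scored by one dense self-attention layer."""
    return length * length


for side in (32, 48, 64):
    n = sequence_length(side)
    print(f"{side}x{side}: {n} tokens, {attention_pairs(n):,} attention pairs")
# 32x32: 1024 tokens, 1,048,576 attention pairs
# 48x48: 2304 tokens, 5,308,416 attention pairs
# 64x64: 4096 tokens, 16,777,216 attention pairs
```

Doubling the side length from 32 to 64 quadruples the sequence length and multiplies the attention pairs by 16, whichever value of `tokens_per_pixel` is used.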

While it is tempting to work at even lower resolutions to further reduce compute cost, prior work has demonstrated that human performance on image classification begins to drop rapidly below these sizes.[48]…
