Generative models
Captured source
source ↗Generative models | OpenAI
June 16, 2016
Generative models
Illustration: Ludwig Pettersson
Loading…
Share
This post describes four projects that share a common theme of enhancing or using generative models, a branch of unsupervised learning techniques in machine learning. In addition to describing our work, this post will tell you a bit more about generative models: what they are, why they are important, and where they might be going.
One of our core aspirations at OpenAI is to develop algorithms and techniques that endow computers with an understanding of our world.
It’s easy to forget just how much you know about the world: you understand that it is made up of 3D environments, objects that move, collide, interact; people who walk, talk, and think; animals who graze, fly, run, or bark; monitors that display information encoded in language about the weather, who won a basketball game, or what happened in 1970.
This tremendous amount of information is out there and to a large extent easily accessible—either in the physical world of atoms or the digital world of bits. The only tricky part is to develop models and algorithms that can analyze and understand this treasure trove of data.
Generative models are one of the most promising approaches towards this goal. To train a generative model we first collect a large amount of data in some domain (e.g., think millions of images, sentences, or sounds, etc.) and then train a model to generate data like it. The intuition behind this approach follows a famous quote from Richard Feynman:
> “What I cannot create, I do not understand.”
Richard Feynman
The trick is that the neural networks we use as generative models have a number of parameters significantly smaller than the amount of data we train them on, so the models are forced to discover and efficiently internalize the essence of the data in order to generate it.
Generative models have many short-term applications. But in the long run, they hold the potential to automatically learn the natural features of a dataset, whether categories or dimensions or something else entirely.
Generating images
Let’s make this more concrete with an example. Suppose we have some large collection of images, such as the 1.2 million images in the ImageNet dataset (but keep in mind that this could eventually be a large collection of images or videos from the internet or robots). If we resize each image to have width and height of 256 (as is commonly done), our dataset is one large1,200,000x256x256x3(about 200GB) block of pixels. Here are a few example images from this dataset:
These images are examples of what our visual world looks like and we refer to these as “samples from the true data distribution”. We now construct our generative model which we would like to train to generate images like this from scratch. Concretely, a generative model in this case could be one large neural network that outputs images and we refer to these as “samples from the model”.
DCGAN
One such recent model is the DCGAN network from Radford et al. (shown below). This network takes as input 100 random numbers drawn from a uniform distribution(we refer to these as a code, or latent variables, in red) and outputs an image (in this case 64x64x3 images on the right, in green). As the code is changed incrementally, the generated images do too—this shows the model has learned features to describe how the world looks, rather than just memorizing some examples.
The network (in yellow) is made up of standard convolutional neural network components, such as deconvolutional layers(reverse of convolutional layers), fully connected layers, etc.:
DCGAN is initialized with random weights, so a random code plugged into the network would generate a completely random image. However, as you might imagine, the network has millions of parameters that we can tweak, and the goal is to find a setting of these parameters that makes samples generated from random codes look like the training data. Or to put it another way, we want the model distribution to match the true data distribution in the space of images.
Training a generative model
Suppose that we used a newly-initialized network to generate 200 images, each time starting with a different random code. The question is: how should we adjust the network’s parameters to encourage it to produce slightly more believable samples in the future? Notice that we’re not in a simple supervised setting and don’t have any explicit desired targets for our 200 generated images; we merely want them to look real. One clever approach around this problem is to follow the Generative Adversarial Network (GAN) approach. Here we introduce a second discriminator network (usually a standard convolutional neural network) that tries to classify if an input image is real or generated. For instance, we could feed the 200 generated images and 200 real images into the discriminator and train it as a standard classifier to distinguish between the two sources. But in addition to that—and here’s the trick—we can also backpropagate through both the discriminator and the generator to find how we should change the generator’s parameters to make its 200 samples slightly more confusing for the discriminator. These two networks are therefore locked in a battle: the discriminator is trying to distinguish real images from fake images and the generator is trying to create images that make the discriminator think they are real. In the end, the generator network is outputting images that are indistinguishable from real images for the discriminator.
There are a few other approaches to matching these distributions which we will discuss briefly below. But before we get there below are two animations that show samples from a generative model to give you a visual sense for the training process.In both cases the samples from the generator start out noisy and chaotic, and over time converge to have more plausible image statistics:
VAE learning to generate images (log time)
GAN learning to generate images (linear time)
This is exciting—these neural networks are learning what the visual world looks like! These models usually have only about 100 million parameters, so a network trained on ImageNet has to (lossily) compress 200GB of pixel data into 100MB of weights. This incentivizes it to discover the most salient features of the data: for example, it will likely learn that pixels nearby are likely to…
Excerpt shown — open the source for the full document.