Scaleway, published Nov 3, 2020

VAE: giving your Autoencoder the power of imagination

Scale • Olga Petrova • 03/11/20 • 13 min read

In this article we will take a detailed look at the Variational Autoencoder: a generative model that is based on its more commonplace sibling, the Autoencoder (which we will devote some time to below as well). Stay tuned for the PyTorch implementation in the next post!

Before there were GANs, there were VAEs. (And before there were VAEs, there were AEs. Do not even get me started on the VAE-GANs; let us first decipher all these acronyms before they get out of hand!)

AE: Autoencoder (1991)
VAE: Variational Autoencoder (2013)
GAN: Generative Adversarial Network (2014)

Now let's go through them one at a time.

Autoencoder: How do you describe a face?

An autoencoder is an unsupervised machine learning algorithm that aims to obtain a low(er)-dimensional representation of your input. This might not sound like a big deal, but the idea is actually quite deep if you think about it. First of all, what is a representation? It is how you choose to describe something, in a way that works well enough for your purposes. For instance, let's say you witnessed an armed robbery committed by this lady over here:

https://thispersondoesnotexist.com

With the incident being something out of the ordinary in your daily life, the image of the cheerful robber is deeply ingrained in your memory. On the bright side, the police are happy to hear it, and the forensic artist asks you to describe the suspect for their sketch. How would you go about it?

Even if you had the photo above at your disposal, your description probably would not start with "Well, if I was to produce a 1024 by 1024 pixel portrait of her, the value of the pixel in the lower left corner would be..." You don't actually need 1024 x 1024 numerical values to characterise someone's face – not to mention that this type of word picture does not come naturally to us humans. Instead, you might say that she had wavy reddish hair, a round face, a medium-sized nose, and green eyes. The forensic artist would then use their knowledge of what a human face looks like, plus your description, to produce a sketch of the lady robber. The resulting digital sketch would indeed be made up of pixels, but that is not the representation that you and the artist used between the two of you in the process.

An artist can recreate an image of a person from a verbal description

What is a face to a neural network?

An artificial neural network trained on people's portraits develops its own ideas on what constitutes a human face. To a computer at large, the photo above is nothing more than 1024 x 1024 (x 3 due to the three color channels per pixel) integer values. However, a trained artificial neural network can describe it using certain characteristics instead – not unlike how we would go about it. These characteristics will be encoded numerically, and can therefore be combined together to form a one-dimensional array of numbers, i.e. a vector. The vector's dimensionality will vary depending on the problem at hand, but it will generally be much lower than the dimensions of the original input: for our 1024x1024 portraits, we can probably get away with several hundred (say, 512) for most practical purposes. These 512 characteristics may (and, in all probability, will) be different from the ones we would come up with ourselves, e.g. assigning one variable to the distance between the person's eyes, another to the person's skin tone, etc. But they will make as much sense to the trained network as saying "wavy reddish hair" does to us.
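To put numbers on that reduction, here is a back-of-the-envelope sketch in Python. The 512-element latent size is the illustrative figure from the paragraph above rather than a requirement, and the variable names are made up for this example.

```python
# Raw input: a 1024 x 1024 RGB portrait, one integer per color channel per pixel.
height, width, channels = 1024, 1024, 3
input_values = height * width * channels

# Latent size: the illustrative "several hundred (say, 512)" from the text.
latent_values = 512

# How many raw values each latent value has to stand in for.
reduction_factor = input_values / latent_values

print(f"values in the raw image:     {input_values:,}")
print(f"values in the latent vector: {latent_values}")
print(f"reduction factor:            {reduction_factor:,.0f}x")
```

Under these assumptions the network has to summarise a little over three million numbers with a few hundred, which is why the learned characteristics need to be far more abstract than individual pixels.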

How do we arrive at such a description in practice? So far we have been discussing images. Although the concept of autoencoders is by no means limited to computer vision, images are probably the easiest data format to illustrate an example model architecture with. A typical image autoencoder has two parts: an encoder, consisting of several convolutional layers, and a decoder, made up of deconvolutional layers. Since the input first shrinks in size as it goes through the encoder, and then expands back to its original shape inside the decoder, the last layer of the encoder, whose output has the lowest dimensionality, is often called the bottleneck.
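The shrink-then-expand behaviour can be traced with the standard output-size formulas for convolutions and transposed ("deconvolutional") convolutions. The sketch below is only an illustration: the 64-pixel input, the four layers on each side, and the kernel/stride/padding values are hypothetical choices, not the architecture used in this article, and only the spatial size is tracked (the number of channels, which usually grows inside the encoder, is ignored).

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding, output_padding=0):
    """Spatial output size of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Hypothetical settings: each layer halves (encoder) or doubles (decoder) the size.
KERNEL, STRIDE, PADDING, NUM_LAYERS = 4, 2, 1, 4

size = 64  # side length of a hypothetical square input image
encoder_sizes = [size]
for _ in range(NUM_LAYERS):
    size = conv_out(size, KERNEL, STRIDE, PADDING)
    encoder_sizes.append(size)

# The decoder starts from the bottleneck and mirrors the encoder.
decoder_sizes = [size]
for _ in range(NUM_LAYERS):
    size = deconv_out(size, KERNEL, STRIDE, PADDING)
    decoder_sizes.append(size)

print("encoder:", " -> ".join(map(str, encoder_sizes)))
print("decoder:", " -> ".join(map(str, decoder_sizes)))
```

With these settings the side length goes 64, 32, 16, 8, 4 through the encoder and back up to 64 through the decoder; the 4 x 4 feature map in the middle plays the role of the bottleneck.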

The two parts often are, but do not have to be, symmetric in their architecture. The encoder is a convolutional network that is trained to recognise features of the input images: the deeper the layer, the more abstract the features. What we are after is the output of the bottleneck layer, which is the neural network's description of the input. This low-dimensional representation is sometimes called the latent representation (in NLP, you may also find it referred to as a context vector for a phrase).

During training, the autoencoder is supplied with images from the training set, which it is expected to recreate once they pass through the bottleneck layer. The idea is that in order to do that, what comes out of the bottleneck should contain enough information for the network to be able to generate the original image, or something close enough to it.

An autoencoder 1) takes an image, 2) analyses it via the convolutional encoder, 3) arrives at the latent representation of the image, and 4) generates, via the deconvolutional decoder, 5) the output image. The training objective is to minimize the difference between the input and output images.
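The same five steps can be run end to end on a deliberately tiny stand-in. The sketch below is not a convolutional autoencoder: it is a toy linear one in plain Python, with 2-D points instead of images, a single number as the latent representation, and the mean squared difference as the (assumed) measure of how far the output is from the input. All names, the initial weights, and the learning rate are hypothetical.

```python
# Toy data: 2-D points that all lie on the line x2 = 2 * x1,
# so a single number per point is enough to describe them.
data = [(t, 2.0 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]

w = [0.5, 0.5]  # encoder weights: 2 inputs -> 1 latent value
v = [0.5, 0.5]  # decoder weights: 1 latent value -> 2 outputs
learning_rate = 0.05


def encode(x):
    return w[0] * x[0] + w[1] * x[1]


def decode(z):
    return (v[0] * z, v[1] * z)


def reconstruction_loss():
    """Mean over the data of the squared input/output difference."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x))
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)


losses = [reconstruction_loss()]
for step in range(500):
    grad_w, grad_v = [0.0, 0.0], [0.0, 0.0]
    for x in data:
        z = encode(x)                       # steps 1-3: input -> latent value
        x_hat = decode(z)                   # steps 4-5: latent value -> output
        err = (x_hat[0] - x[0], x_hat[1] - x[1])
        dz = 2.0 * (err[0] * v[0] + err[1] * v[1])  # d(loss)/d(z) for this sample
        for i in range(2):
            grad_v[i] += 2.0 * err[i] * z / len(data)
            grad_w[i] += dz * x[i] / len(data)
    # Plain gradient descent on both the encoder and the decoder.
    for i in range(2):
        w[i] -= learning_rate * grad_w[i]
        v[i] -= learning_rate * grad_v[i]
    losses.append(reconstruction_loss())

print(f"loss before training: {losses[0]:.4f}")
print(f"loss after training:  {losses[-1]:.6f}")
print("reconstruction of (1, 2):", decode(encode((1.0, 2.0))))
```

Because the toy data really is one-dimensional, the reconstruction error can be driven close to zero even though every point is squeezed through a single number. Real images are not that simple, which is where the deeper convolutional encoder and decoder come in.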

The training objective is enforced through the use of a pixelwise loss function: typically the sum of either the absolute values of the differences, or the squares of the differences, between each pixel's ground truth value and that which has been generated by the network.
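As a minimal sketch of those two options, the functions below compute the sum of absolute differences (often called the L1 loss) and the sum of squared differences (L2 loss) over a flattened image. The function names and the 2 x 2 example "image" are made up for illustration; in practice a deep learning framework's built-in losses would be used instead.

```python
def l1_loss(target, output):
    """Sum of absolute per-pixel differences."""
    return sum(abs(t - o) for t, o in zip(target, output))


def l2_loss(target, output):
    """Sum of squared per-pixel differences."""
    return sum((t - o) ** 2 for t, o in zip(target, output))


# A made-up 2 x 2 grayscale image, flattened into a list of pixel values in [0, 1].
ground_truth = [0.0, 0.5, 1.0, 0.25]
generated = [0.0, 0.25, 0.5, 0.25]

print("L1 loss:", l1_loss(ground_truth, generated))  # 0.25 + 0.5 = 0.75
print("L2 loss:", l2_loss(ground_truth, generated))  # 0.0625 + 0.25 = 0.3125
```

Squaring makes large per-pixel errors count disproportionately more than small ones, whereas the absolute value weighs all errors in proportion to their size.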

What do we use autoencoders for?

Autoencoders have a variety of applications in and out of the computer vision field. The most obvious one is the data compression mechanism that they provide: data can be compressed via the encoder network, and restored from its latent form via the decoder. Additionally, since the goal of the autoencoder is to learn the most "useful" features of the data, it can also serve to remove noise from the inputs. To facilitate training for this task, we can add synthetic noise to the inputs that are fed into the encoder, and compare the decoded outputs to the original noise-free images. (The same approach can be applied to the problem of image colorisation for instance – although at that point, it would not make…
