OpenAI Writing: Jukebox

openai.com/index/jukebox

Jukebox | OpenAI

April 30, 2020

Jukebox

Illustration: Ben Barry

We’re introducing Jukebox, a neural net that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles. We’re releasing the model weights and code, along with a tool to explore the generated samples.

Curated samples

Provided with genre, artist, and lyrics as input, Jukebox outputs a new music sample produced from scratch. Below, we show some of our favorite samples.

Motivation and prior work

Automatic music generation dates back more than half a century.1, 2, 3, 4 A prominent approach is to generate music symbolically in the form of a piano roll, which specifies the timing, pitch, velocity, and instrument of each note to be played. This has led to impressive results such as Bach chorales,5, 6 polyphonic music with multiple instruments,7, 8, 9 and minute-long musical pieces.10, 11, 12

But symbolic generators have limitations: they cannot capture human voices or many of the more subtle timbres, dynamics, and expressivity that are essential to music. A different approach is to model music directly as raw audio.13, 14, 15, 16 Generating music at the audio level is challenging since the sequences are very long.17 A typical 4-minute song at CD quality (44.1 kHz, 16-bit) has over 10 million timesteps. For comparison, GPT-2 had 1,000 timesteps and OpenAI Five took tens of thousands of timesteps per game. Thus, to learn the high-level semantics of music, a model would have to deal with extremely long-range dependencies.
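The "over 10 million timesteps" figure can be checked with quick arithmetic. This is a minimal sketch that assumes the standard CD sampling rate of 44,100 Hz and counts one timestep per sample of a single channel; the function name is illustrative, not from the Jukebox code.

```python
CD_SAMPLE_RATE_HZ = 44_100  # standard CD sampling rate (assumption of this sketch)


def audio_timesteps(minutes: float, sample_rate_hz: int = CD_SAMPLE_RATE_HZ) -> int:
    """Number of single-channel samples in a clip of the given length."""
    return int(minutes * 60 * sample_rate_hz)


song_steps = audio_timesteps(4)
print(song_steps)          # 10584000
print(song_steps // 1000)  # 10584, i.e. about ten thousand 1,000-step contexts
```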

One way of addressing the long input problem is to use an autoencoder that compresses raw audio to a lower-dimensional space by discarding some of the perceptually irrelevant bits of information. We can then train a model to generate audio in this compressed space, and upsample back to the raw audio space.25, 17

We chose to work on music because we want to continue to push the boundaries of generative models. Our previous work on MuseNet explored synthesizing music based on large amounts of MIDI data. Now in raw audio, our models must learn to tackle high diversity as well as very long-range structure, and the raw audio domain is particularly unforgiving of errors in short-, medium-, or long-term timing.

Approach

Compressing music to discrete codes

Jukebox’s autoencoder model compresses audio to a discrete space, using a quantization-based approach called VQ-VAE.25 Hierarchical VQ-VAEs17 can generate short instrumental pieces from a few sets of instruments; however, they suffer from hierarchy collapse due to the use of successive encoders coupled with autoregressive decoders. A simplified variant,26 called VQ-VAE-2, avoids these issues by using feedforward encoders and decoders only, and it shows impressive results at generating high-fidelity images.
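The quantization step at the heart of a VQ-VAE can be illustrated in a few lines: each continuous encoder output is replaced by the index of its nearest codebook vector. The sketch below is a toy, pure-Python illustration with a hand-written three-entry codebook; the names `quantize` and `squared_distance` are hypothetical and it is not the Jukebox implementation.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: List[Sequence[float]],
             codebook: List[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


# Toy data: three 2-D codebook vectors and three encoder outputs.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
latents = [(0.1, -0.2), (0.9, 1.3), (-0.8, 1.7)]
codes = quantize(latents, codebook)
print(codes)  # [0, 1, 2]
```

The resulting list of integer codes is the discrete sequence that a prior model can later be trained on.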

We draw inspiration from VQ-VAE-2 and apply their approach to music. We modify their architecture as follows:

To alleviate codebook collapse common to VQ-VAE models, we use random restarts where we randomly reset a codebook vector to one of the encoded hidden states whenever its usage falls below a threshold.
To maximize the use of the upper levels, we use separate decoders and independently reconstruct the input from the codes of each level.
To allow the model to reconstruct higher frequencies easily, we add a spectral loss27, 28 that penalizes the norm of the difference of input and reconstructed spectrograms.
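The first modification in the list above, random restarts, might look roughly like the following. This is a hedged sketch: the function name, the usage statistic, and the threshold value are all assumptions made for illustration, and the actual Jukebox code may track usage and pick replacements differently.

```python
import random
from typing import List, Sequence


def restart_dead_codes(codebook: List[Sequence[float]],
                       usage: Sequence[float],
                       encoder_outputs: List[Sequence[float]],
                       threshold: float,
                       rng: random.Random) -> List[List[float]]:
    """Return a copy of the codebook in which every vector whose usage is
    below `threshold` is reset to a randomly chosen encoder output."""
    new_codebook = [list(v) for v in codebook]
    for k, count in enumerate(usage):
        if count < threshold:
            new_codebook[k] = list(rng.choice(encoder_outputs))
    return new_codebook


# Toy example: the middle code is never used, so it gets reset.
old_codebook = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0)]
usage_counts = [10.0, 0.0, 5.0]           # hypothetical per-code usage
hidden_states = [(0.2, 0.1), (0.9, 1.1)]  # hypothetical encoder outputs
new_codebook = restart_dead_codes(
    old_codebook, usage_counts, hidden_states, threshold=1.0, rng=random.Random(0)
)
print(new_codebook)
```

Resetting an unused vector to a real encoder output places it where data actually lies, so it has a chance of being selected on subsequent steps.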

We use three levels in our VQ-VAE, shown below, which compress the 44 kHz raw audio by 8x, 32x, and 128x, respectively, with a codebook size of 2048 for each level. This downsampling loses much of the audio detail, and the reconstructions sound noticeably noisier at the more heavily compressed levels. However, they retain essential information about the pitch, timbre, and volume of the audio.
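To get a feel for how aggressive these compression factors are, one can convert them into code rates. The sketch below assumes a 44,100 Hz sampling rate, treats each code from a 2048-entry codebook as log2(2048) = 11 bits, and compares against 16-bit mono audio; the level names follow the article's later description of the 128x level as the top level.

```python
import math

SAMPLE_RATE_HZ = 44_100  # assumed exact sampling rate
CODEBOOK_SIZE = 2048
HOP_LENGTHS = {"bottom": 8, "middle": 32, "top": 128}  # samples per code

bits_per_code = math.log2(CODEBOOK_SIZE)  # 11.0 bits for 2048 entries


def codes_per_second(hop: int) -> float:
    """How many discrete codes represent one second of audio."""
    return SAMPLE_RATE_HZ / hop


raw_kbps = SAMPLE_RATE_HZ * 16 / 1000  # 16-bit mono reference: 705.6 kbit/s
print(f"raw 16-bit mono: {raw_kbps:.1f} kbit/s")
for level, hop in HOP_LENGTHS.items():
    rate = codes_per_second(hop)
    print(f"{level}: {rate:.1f} codes/s, {rate * bits_per_code / 1000:.1f} kbit/s")
```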

Generating codes using transformers

Next, we train the prior models, whose goal is to learn the distribution of music codes encoded by the VQ-VAE and to generate music in this compressed discrete space. Like the VQ-VAE, we have three levels of priors: a top-level prior that generates the most compressed codes, and two upsampling priors that generate less compressed codes conditioned on the codes from the level above.

The top-level prior models the long-range structure of music, and samples decoded from this level have lower audio quality but capture high-level semantics like singing and melodies. The middle and bottom upsampling priors add local musical structures like timbre, significantly improving the audio quality.

We train these as autoregressive models using a simplified variant of Sparse Transformers.29, 30 Each of these models has 72 layers of factorized self-attention on a context of 8192 codes, which corresponds to approximately 24 seconds, 6 seconds, and 1.5 seconds of raw audio at the top, middle and bottom levels, respectively.
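The context durations quoted above follow directly from the context length and the compression factor of each level. A minimal check, assuming a sampling rate of exactly 44,100 Hz:

```python
SAMPLE_RATE_HZ = 44_100  # assumed exact sampling rate
CONTEXT_CODES = 8192     # codes per transformer context
COMPRESSION = {"top": 128, "middle": 32, "bottom": 8}  # audio samples per code


def context_seconds(factor: int) -> float:
    """Seconds of raw audio covered by one context at a given compression factor."""
    return CONTEXT_CODES * factor / SAMPLE_RATE_HZ


for level, factor in COMPRESSION.items():
    print(f"{level}: {context_seconds(factor):.2f} s")
# top: 23.78 s
# middle: 5.94 s
# bottom: 1.49 s
```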

Once all of the priors are trained, we can generate codes from the top level, upsample them using the upsamplers, and decode them back to the raw audio space using the VQ-VAE decoder to sample novel songs.
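The overall sampling procedure can be sketched as a three-stage pipeline. Everything below is a stand-in: `toy_prior` draws codes uniformly at random instead of running a trained transformer, and `toy_decoder` simply repeats a scaled code value instead of running a VQ-VAE decoder. Only the data flow (top codes, then two upsampling stages, then decoding) mirrors the description above.

```python
import random
from typing import Callable, List, Optional

Codes = List[int]
Prior = Callable[[int, Optional[Codes], random.Random], Codes]


def toy_prior(codebook_size: int = 2048) -> Prior:
    """Stand-in for a trained prior: draws codes uniformly at random."""
    def sample(n_codes: int, upper_codes: Optional[Codes], rng: random.Random) -> Codes:
        # A real prior would sample autoregressively, conditioned on upper_codes.
        return [rng.randrange(codebook_size) for _ in range(n_codes)]
    return sample


def toy_decoder(codes: Codes, hop: int) -> List[float]:
    """Stand-in for the VQ-VAE decoder: emits `hop` audio samples per code."""
    return [code / 2048.0 for code in codes for _ in range(hop)]


def generate(n_top_codes: int, top: Prior, middle: Prior, bottom: Prior,
             rng: random.Random) -> List[float]:
    top_codes = top(n_top_codes, None, rng)               # 128x level
    mid_codes = middle(4 * n_top_codes, top_codes, rng)   # 32x level: 4 codes per top code
    low_codes = bottom(16 * n_top_codes, mid_codes, rng)  # 8x level: 4 codes per middle code
    return toy_decoder(low_codes, hop=8)                  # back to raw audio samples


audio = generate(10, toy_prior(), toy_prior(), toy_prior(), random.Random(0))
print(len(audio))  # 1280 samples, i.e. 10 top-level codes * 128
```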

Dataset

To train this model, we crawled the web to curate a new dataset of 1.2 million songs (600,000 of which are in English), paired with the corresponding lyrics and metadata from LyricWiki⁠. The metadata includes artist, album genre, and year of the songs, along with common moods or playlist keywords associated with each song. We train on 32-bit, 44.1 kHz raw audio, and perform data augmentation by randomly downmixing the right and left channels to produce mono audio.
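The text does not spell out how the random downmix is performed. One plausible reading, shown here purely as an assumption, is to mix the two channels with a random weight so that the model sees slightly different mono versions of the same stereo clip; `random_downmix` is a hypothetical helper, not a function from the Jukebox code.

```python
import random
from typing import List, Sequence


def random_downmix(left: Sequence[float], right: Sequence[float],
                   rng: random.Random) -> List[float]:
    """Mix stereo to mono as w * left + (1 - w) * right.

    The uniform random weight w in [0, 1) is an assumption of this sketch.
    """
    w = rng.random()
    return [w * l + (1.0 - w) * r for l, r in zip(left, right)]


rng = random.Random(0)
left_channel = [0.5, -0.25, 0.0]
right_channel = [0.1, 0.25, 0.4]
mono = random_downmix(left_channel, right_channel, rng)
print(mono)
```

Because the weights sum to one, each mono sample always lies between the corresponding left and right samples.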

Artist and genre conditioning

The top-level transformer is trained on the task of predicting compressed audio tokens. We can provide additional information, such as the artist and genre for each song. This has two advantages: first, it reduces the entropy of the audio prediction, so the model is able to achieve better quality in any particular style; second, at generation time, we are able to steer the model to generate in a style of our choosing.
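The first advantage is an instance of a standard information-theoretic fact: conditioning cannot increase entropy. Writing z for the audio tokens and c for the conditioning information (artist and genre), with both symbols introduced here only for illustration:

```latex
H(z \mid c) \;=\; H(z) - I(z; c) \;\le\; H(z),
\qquad \text{since } I(z; c) \ge 0 .
```

The more informative the labels are about the audio, the larger the mutual information I(z; c) and the greater the reduction in uncertainty the prior has to model.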

The t-SNE31 plot below shows how the model learns, in an unsupervised way, to cluster similar artists and genres close together. It also makes some surprising associations, such as placing Jennifer Lopez so close to Dolly Parton!

Lyrics conditioning

In addition to conditioning on artist and genre, we can provide more context at training time by conditioning the model on the lyrics for a song. A…
