Painting with words: a history of text-to-image AI

Replicate Blog

Posted August 22, 2023 by jakedahn

sketch of internet is a series of tubes by Leonardo Da Vinci, VQGAN+CLIP, SD1.5, SD2.1, SDXL.

With the recent release of Stable Diffusion XL fine-tuning on Replicate, and today being the 1-year anniversary of Stable Diffusion, now feels like the perfect opportunity to take a step back and reflect on how text-to-image AI has improved over the last few years.

We’ve seen AI-generated images ascend from incomprehensible piles of eyeballs and noise to high-quality artistic images that are sometimes indistinguishable from the brush strokes of a painter or the detail-oriented rendering of an illustrator.

In this post, we’ll take a whirlwind tour of the evolution of text-to-image AI, to get a sense of how far we’ve come over the last few years, from early GAN experiments to the latest diffusion models.

Before Diving In

To celebrate the 1-year anniversary of Stable Diffusion, we’ve updated our free text-to-image AI playground tool with the latest Stable Diffusion XL 1.0 model.

Zoo is an open source web app for comparing text-to-image generation models side by side. For example, you can see how Stable Diffusion and other text-to-image AI models have improved over time by running the same prompt across multiple models at once. Zoo includes Stable Diffusion 1.5, Stable Diffusion 2.1, Stable Diffusion XL 1.0, Kandinsky 2.2, DALL·E 2, DeepFloyd IF, and Material Diffusion.

Replicate Zoo: text-to-image playground, where you can compare text-to-image AI models side-by-side.

Contents

Below is a list of the models we’ll be showing off in this post. You can click on any of the links below to jump to that section.

CLIP + DALL·E

Advadnoun’s DeepDaze

Advadnoun’s The BigSleep

VQGAN+CLIP

Pixray

DALL·E 2

DALL·E Mini

Stable Diffusion 1

Stable Diffusion 2

Stable Diffusion XL (SDXL)

Fine-tuning

CLIP + DALL·E

The text-to-image generative AI scene as we know it started to take off back in January 2021, after OpenAI published their CLIP model.

CLIP is an open source model from OpenAI that’s trained on captioned images collected from the web. It can classify images, and it projects both images and text into the same embedding space. This means that it has a semantic understanding of what is happening in a given image. For example, if you feed CLIP a photo of a banana, the photo’s embedding would sit close to the embedding of the text “yellow banana”.
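To make "closely related in the embedding space" concrete, here is a minimal sketch of how closeness is usually measured: cosine similarity between vectors. The embeddings below are hypothetical four-dimensional values invented for illustration. They are not CLIP outputs; real CLIP embeddings have hundreds of dimensions and come from the model's image and text encoders. The function names are also illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-in for the embedding of a banana photo (not real CLIP output).
image_embedding = [0.9, 0.1, 0.8, 0.0]

# Hypothetical stand-ins for text embeddings of three captions.
text_embeddings = {
    "yellow banana": [0.8, 0.2, 0.9, 0.1],
    "red sports car": [0.1, 0.9, 0.0, 0.7],
    "a bowl of fruit": [0.6, 0.3, 0.5, 0.2],
}

def rank_captions(image_vec, captions):
    """Return caption texts sorted from most to least similar to the image vector."""
    scored = [(cosine_similarity(image_vec, vec), text) for text, vec in captions.items()]
    return [text for _, text in sorted(scored, reverse=True)]

ranking = rank_captions(image_embedding, text_embeddings)
print(ranking)
```

With these made-up vectors, "yellow banana" ranks first and "red sports car" last, which is the behavior the banana example describes.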

This sort of multi-modal understanding of images and text is an important foundational element of text-to-image AI, because it can be used to nudge text-to-image AI generations to look like the given text prompt. For more details on how CLIP can be used as part of an image generation model, I suggest reading Jay Alammar’s excellent blog post: The Illustrated Stable Diffusion.
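The "nudging" idea can be sketched as gradient ascent on a similarity score. The toy below is a heavy simplification and an assumption-laden illustration, not how any particular notebook or model implements it: the "image" is just a vector that already lives in the embedding space, and the text embedding is a hypothetical value. In a real CLIP-guided setup, the gradient would flow back through CLIP's image encoder into a generator's parameters or latents.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def similarity_gradient(x, t):
    """Analytic gradient of cosine_similarity(x, t) with respect to x."""
    norm_x = math.sqrt(dot(x, x))
    norm_t = math.sqrt(dot(t, t))
    xt = dot(x, t)
    return [
        t_i / (norm_x * norm_t) - xt * x_i / (norm_x ** 3 * norm_t)
        for x_i, t_i in zip(x, t)
    ]

def guide(x, text_embedding, steps=200, step_size=0.1):
    """Repeatedly nudge x in the direction that increases similarity to the text."""
    x = list(x)
    for _ in range(steps):
        grad = similarity_gradient(x, text_embedding)
        x = [x_i + step_size * g_i for x_i, g_i in zip(x, grad)]
    return x

# Hypothetical values: a stand-in text embedding and a poorly matching starting "image".
text_embedding = [0.8, 0.2, 0.9, 0.1]
start = [0.1, 0.9, 0.2, 0.7]

before = cosine_similarity(start, text_embedding)
guided = guide(start, text_embedding)
after = cosine_similarity(guided, text_embedding)
print(f"similarity before: {before:.3f}, after: {after:.3f}")
```

The step count and step size are arbitrary choices for this toy. The point is only that a similarity score between an image and a prompt gives an optimizer a direction to move in.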

OpenAI also shared a paper detailing their approach for using CLIP to build a text-to-image model, named DALL·E.

While DALL·E was never fully open sourced, the paper and approach inspired a few open source implementations that would go on to shape the text-to-image AI scene as we know it.

Advadnoun’s DeepDaze

The first open source experiment in text-to-image AI was released by advadnoun in January 2021, just a few days after the DALL·E paper was released.

@advadnoun shared a Colab notebook, eloquently named Deep Daze. It combined OpenAI’s CLIP model with the SIREN model to create imagery that was almost legible. In the images below you can see the very beginnings of a resemblance to the prompt, but the images are all very abstract, never converging on realism or legible subjects.

Here are some of the first images generated with DeepDaze. My favorite is the poplars at sunset, which looks like it could almost be an abstract impressionistic painting.

poplars at sunset - January 10, 2021 - @advadnoun

unnamed anime person - January 10, 2021 - @advadnoun

Birdhouse that looks like a chair - January 20, 2021 - @JasonCobill

a woman in a green dress dancing in a medieval castle - January 10, 2021 - @MasterScrat

Advadnoun’s The BigSleep

About one week later, @advadnoun shared another Colab notebook, named The BigSleep. This new notebook combined the CLIP model with the BigGAN model.

The BigSleep represented a clear improvement toward creating legible scenes, but the images were still frequently difficult to comprehend, full of weird artifacts and errors.

My favorite of these images is A scene with vibrant colors: the clouds are realistic, and the vibrant colors look like fall foliage.

The Great Pyramids were turned into prisms by a wizard - January 17, 2021 - Wiskkey

A scene with vibrant colors - January 17, 2021 - Wiskkey

image from The Big Sleep notebook - January 17, 2021 - @advadnoun

a black cat sleeping on top of a red clock - January 17, 2021 - Wiskkey

VQGAN+CLIP

In April 2021, @RiversHaveWings shared a series of Colab notebooks that combined VQGAN and CLIP. A paper published later included several interesting examples and a full description of how VQGAN was combined with CLIP.

VQGAN+CLIP felt like a major step forward in terms of recreating an artistic look and feel. You’ll notice that the images below are starting to resemble their prompts, and artistic textures like brush strokes and pencil marks are beginning to appear. VQGAN+CLIP was used to create the first AI-generated images that left me speechless.

👉🏼 Run VQGAN+CLIP on Replicate

Here are a few of my favorite VQGAN+CLIP images:

Robots are s*** at art, VQGAN+CLIP, July 2021 - Sylvie

the Tower of Babel by J.M.W. Turner, VQGAN+CLIP, April 2021 - K Crowson, S Biderman et al.

A colored pencil drawing of a waterfall, VQGAN+CLIP, April 2021 - K Crowson, S Biderman et al.

sketch of internet is a series of tubes by Leonardo Da Vinci, VQGAN+CLIP, December 2021 - @anotherjesse

spaceship pencil, VQGAN+CLIP, December 2021 - @anotherjesse

Pixray

Pixray was an important image generation model in the history of Replicate. Originally released in June 2021, Pixray became the first text-to-image model on Replicate that reached tens of thousands of runs by early 2022. Today it’s been run a total of 1.3 million times.

Dribnet was the first Replicate user to formally request that we build an…
