Fork · Baseten · published Feb 14, 2023

basetenlabs/stablediffusion

forked from nlile/stablediffusion

Description: High-Resolution Image Synthesis with Latent Diffusion Models

License: MIT

Stars: 0

Forks: 0

Open issues: 0

Created: 2023-02-14T10:51:53Z

Pushed: 2023-02-21T17:03:00Z

Default branch: main

Fork: yes

Parent repository: nlile/stablediffusion

Archived: no

README:

Stable Diffusion Version 2

![t2i](assets/stable-samples/txt2img/768/merged-0006.png) ![t2i](assets/stable-samples/txt2img/768/merged-0002.png) ![t2i](assets/stable-samples/txt2img/768/merged-0005.png)

This repository contains Stable Diffusion models trained from scratch and will be continuously updated with new checkpoints. The following list provides an overview of all currently available models. More coming soon.

News

December 7, 2022

*Version 2.1*

  • New stable diffusion model (_Stable Diffusion 2.1-v_, HuggingFace) at 768x768 resolution and (_Stable Diffusion 2.1-base_, HuggingFace) at 512x512 resolution, both with the same number of parameters and architecture as 2.0 and fine-tuned from 2.0 on the LAION-5B dataset with a less restrictive NSFW filter.

By default, the attention operation of the model is evaluated at full precision when xformers is not installed. To enable fp16 (which can cause numerical instabilities with the vanilla attention module on the v2.1 model), run your script with `ATTN_PRECISION=fp16 python <thescript.py>`.
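
A minimal sketch of how a script might read this setting, assuming the variable is looked up from the process environment and treated as full precision when unset. The helper names and the `fp32` default label are illustrative assumptions, not taken from this README.

```python
import os


def attention_precision(env=None):
    """Return the requested attention precision label.

    `env` is any mapping shaped like os.environ. It is a parameter only
    so the helper can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    # Assumption: an unset variable means full precision ("fp32").
    return env.get("ATTN_PRECISION", "fp32")


def upcast_attention(env=None):
    """True when attention inputs should be cast to float32 first."""
    return attention_precision(env) == "fp32"
```

With this sketch, launching a script with `ATTN_PRECISION=fp16` set would make `upcast_attention()` return `False`, and leaving the variable unset keeps the full-precision path.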

November 24, 2022

*Version 2.0*

  • New stable diffusion model (_Stable Diffusion 2.0-v_) at 768x768 resolution. Same number of parameters in the U-Net as 1.5, but uses OpenCLIP-ViT/H as the text encoder and is trained from scratch. _SD 2.0-v_ is a so-called v-prediction model.
  • The above model is finetuned from _SD 2.0-base_, which was trained as a standard noise-prediction model on 512x512 images and is also made available.
  • Added a [x4 upscaling latent text-guided diffusion model](#image-upscaling-with-stable-diffusion).
  • New [depth-guided stable diffusion model](#depth-conditional-stable-diffusion), finetuned from _SD 2.0-base_. The model is conditioned on monocular depth estimates inferred via MiDaS and can be used for structure-preserving img2img and shape-conditional synthesis.

![d2i](assets/stable-samples/depth2img/depth2img01.png)

  • A [text-guided inpainting model](#image-inpainting-with-stable-diffusion), finetuned from _SD 2.0-base_.
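
The difference between the noise-prediction and v-prediction objectives mentioned in the list above can be sketched numerically. This assumes the commonly used variance-preserving parameterization, where a noisy sample is `x_t = alpha * x0 + sigma * eps` with `alpha**2 + sigma**2 == 1` and the v target is `alpha * eps - sigma * x0`. The README does not spell out this formula, and the function names are hypothetical.

```python
import math


def v_target(x0, eps, alpha, sigma):
    """Training target of a v-prediction model (assumed parameterization)."""
    return alpha * eps - sigma * x0


def recover_from_v(x_t, v, alpha, sigma):
    """Invert a v prediction back into the clean sample and the noise."""
    x0 = alpha * x_t - sigma * v
    eps = sigma * x_t + alpha * v
    return x0, eps


# Scalar toy example; real models apply the same algebra element-wise
# to latent tensors.
alpha, sigma = math.cos(0.3), math.sin(0.3)  # satisfies alpha^2 + sigma^2 = 1
x0, eps = 0.8, -1.2
x_t = alpha * x0 + sigma * eps               # forward noising step
v = v_target(x0, eps, alpha, sigma)
x0_rec, eps_rec = recover_from_v(x_t, v, alpha, sigma)
```

A noise-prediction model would be trained to output `eps` directly; a v-prediction model outputs `v`, from which both `x0` and `eps` can be recovered as shown.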

We follow the original repository and provide basic inference scripts to sample from the models.

________________

*The original Stable Diffusion model was created in a collaboration with CompVis and RunwayML and builds upon the work:*

**High-Resolution Image Synthesis with Latent Diffusion Models**

Robin Rombach\*, Andreas Blattmann\*, Dominik Lorenz, Patrick Esser, Björn Ommer

_CVPR '22 Oral | GitHub | arXiv | Project page_

and [many others](#shout-outs).

Stable Diffusion is a latent text-to-image diffusion model.

________________________________

Requirements

You can update an existing latent diffusion environment by running

conda install pytorch==1.12.1 torchvision==0.13.1 -c pytorch
pip install transformers==4.19.2 diffusers invisible-watermark
pip install -e .
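
After updating the environment, a quick check that the pinned versions are the ones Python actually sees can save debugging time. The following is a sketch, not part of the repository; it assumes the conda package `pytorch` is exposed under the distribution name `torch`, and the helper names are hypothetical.

```python
from importlib import metadata

# Pins copied from the install commands above. Assumption: the conda
# package "pytorch" is visible to Python under the name "torch".
PINS = {"torch": "1.12.1", "torchvision": "0.13.1", "transformers": "4.19.2"}


def installed_version(name):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins=PINS, lookup=installed_version):
    """Map each package whose version differs to (wanted, found)."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        # Compare only the part before a local tag such as "+cu113".
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Calling `find_mismatches()` in the activated environment returns an empty dict when all three pins match.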

xformers efficient attention

For more efficiency and speed on GPUs, we highly recommend installing the xformers library.

Tested on A100 with CUDA 11.4. Installation requires a reasonably recent version of nvcc and gcc/g++, which can be obtained, for example, via

export CUDA_HOME=/usr/local/cuda-11.4
conda install -c nvidia/label/cuda-11.4.0 cuda-nvcc
conda install -c conda-forge gcc
conda install -c conda-forge gxx_linux-64==9.5.0

Then, run the following (compiling takes up to 30 min).

cd ..
git clone https://github.com/facebookresearch/xformers.git
cd xformers
git submodule update --init --recursive
pip install -r requirements.txt
pip install -e .
cd ../stablediffusion

Upon successful installation, the code will automatically default to memory efficient attention for the self- and cross-attention layers in the U-Net and autoencoder.
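
One way such an automatic fallback can work is to probe for the package and choose the attention path from the result. This is a sketch of the general pattern only; the function names and the backend labels are assumptions and may not match how the repository implements the switch.

```python
import importlib.util


def xformers_available(find_spec=importlib.util.find_spec):
    """Report whether the xformers package can be imported.

    `find_spec` is injectable so the check can be tested without
    installing or uninstalling anything.
    """
    return find_spec("xformers") is not None


def select_attention(available):
    """Pick an attention backend label (labels are illustrative)."""
    return "memory-efficient" if available else "vanilla"


backend = select_attention(xformers_available())
```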

General Disclaimer

Stable Diffusion models are general text-to-image diffusion models and therefore mirror biases and (mis-)conceptions that are present in their training data. Although efforts were made to reduce the inclusion of explicit pornographic material, we do not recommend using the provided weights for services or products without additional safety mechanisms and considerations. The weights are research artifacts and should be treated as such. Details on the training procedure and data, as well as the intended use of the model can be found in the corresponding model card. The weights are available via the StabilityAI organization at Hugging Face under the [CreativeML Open RAIL++-M License](LICENSE-MODEL).

Stable Diffusion v2

Stable Diffusion v2 refers to a specific configuration of the model architecture that uses a downsampling-factor 8 autoencoder with an 865M UNet and OpenCLIP ViT-H/14 text encoder for the diffusion model. The _SD 2-v_ model produces 768x768 px outputs.
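
The downsampling factor determines the size of the latent grid the U-Net operates on. The arithmetic can be sketched as below, assuming a 4-channel latent space (the channel count is not stated in this section and is an assumption here); `latent_shape` is a hypothetical helper.

```python
def latent_shape(height, width, factor=8, channels=4):
    """Return (channels, latent_height, latent_width) for an image size.

    factor=8 is the autoencoder downsampling factor named above;
    channels=4 is an assumption about the latent space.
    """
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the factor")
    return (channels, height // factor, width // factor)


shape_768 = latent_shape(768, 768)  # (4, 96, 96)
shape_512 = latent_shape(512, 512)  # (4, 64, 64)
```

Under these assumptions, a 768x768 output corresponds to a 96x96 latent grid, which is 64 times fewer spatial positions than the pixel image.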

Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 DDIM sampling steps show the relative improvements of the checkpoints:

![sd evaluation results](assets/model-variants.jpg)
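
The guidance scale in these evaluations controls how far the conditional prediction is pushed away from the unconditional one. A minimal sketch of the standard classifier-free guidance combination follows; the README does not give the formula, so treat the exact form and the function name as assumptions. Plain lists stand in for noise tensors.

```python
# Guidance scales listed in the evaluation above.
SCALES = [1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]


def guided_noise(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional predictions element-wise.

    scale == 1.0 reproduces the conditional prediction; larger values
    extrapolate further in the direction of the text condition.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions for a two-element "latent".
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
sweep = {s: guided_noise(eps_uncond, eps_cond, s) for s in SCALES}
```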

Text-to-Image…