basetenlabs/stablediffusion
forked from nlile/stablediffusion
Description: High-Resolution Image Synthesis with Latent Diffusion Models
License: MIT
Stars: 0
Forks: 0
Open issues: 0
Created: 2023-02-14T10:51:53Z
Pushed: 2023-02-21T17:03:00Z
Default branch: main
Fork: yes
Parent repository: nlile/stablediffusion
Archived: no
README:
Stable Diffusion Version 2
  
This repository contains Stable Diffusion models trained from scratch and will be continuously updated with new checkpoints. The following list provides an overview of all currently available models. More coming soon.
News
December 7, 2022
*Version 2.1*
- New stable diffusion models (_Stable Diffusion 2.1-v_, HuggingFace) at 768x768 resolution and (_Stable Diffusion 2.1-base_, HuggingFace) at 512x512 resolution. Both have the same number of parameters and the same architecture as 2.0 and were fine-tuned from 2.0 on a less restrictively NSFW-filtered version of the LAION-5B dataset.
By default, the attention operation of the model is evaluated at full precision when xformers is not installed. To enable fp16 (which can cause numerical instabilities with the vanilla attention module on the v2.1 model), set the environment variable ATTN_PRECISION=fp16 when launching your script with python.
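As a rough illustration of the setting above, the following sketch reads the variable and shows the limited range of half precision, which is one plausible contributor to the instabilities mentioned. The function names `attention_precision` and `fits_in_fp16` and the `"fp32"` fallback label are assumptions made for this example, not the repository's actual code.

```python
import os
import struct

def attention_precision(env=None):
    """Return the requested attention precision.

    Assumption: "fp32" is used as the label for the full-precision default.
    """
    env = os.environ if env is None else env
    return env.get("ATTN_PRECISION", "fp32")

def fits_in_fp16(value):
    """Return True if `value` can be stored as a finite half-precision float."""
    try:
        struct.pack("<e", value)  # "e" is the IEEE 754 binary16 format
        return True
    except (OverflowError, struct.error):
        return False

# Half precision tops out at 65504, so a large attention logit cannot be stored.
print(attention_precision({"ATTN_PRECISION": "fp16"}))
print(fits_in_fp16(65504.0), fits_in_fp16(70000.0))
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test without touching the real environment.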
November 24, 2022
*Version 2.0*
- New stable diffusion model (_Stable Diffusion 2.0-v_) at 768x768 resolution. Same number of parameters in the U-Net as 1.5, but uses OpenCLIP-ViT/H as the text encoder and is trained from scratch. _SD 2.0-v_ is a so-called v-prediction model.
- The above model is finetuned from _SD 2.0-base_, which was trained as a standard noise-prediction model on 512x512 images and is also made available.
- Added a [x4 upscaling latent text-guided diffusion model](#image-upscaling-with-stable-diffusion).
- New [depth-guided stable diffusion model](#depth-conditional-stable-diffusion), finetuned from _SD 2.0-base_. The model is conditioned on monocular depth estimates inferred via MiDaS and can be used for structure-preserving img2img and shape-conditional synthesis.

- A [text-guided inpainting model](#image-inpainting-with-stable-diffusion), finetuned from _SD 2.0-base_.
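The difference between a noise-prediction and a v-prediction model can be sketched numerically. The README does not spell out the parameterization, so the example assumes the commonly used definition v = alpha * eps - sigma * x0 with alpha^2 + sigma^2 = 1, applied to a noisy sample x_t = alpha * x0 + sigma * eps. The helper names `to_v` and `from_v` are hypothetical.

```python
import math

def to_v(x0, eps, alpha, sigma):
    """Build the v target from a clean sample x0 and noise eps (assumed definition)."""
    return alpha * eps - sigma * x0

def from_v(x_t, v, alpha, sigma):
    """Recover (x0, eps) from a noisy sample x_t and a v value."""
    x0 = alpha * x_t - sigma * v
    eps = sigma * x_t + alpha * v
    return x0, eps

# Scalar example; real models apply the same arithmetic element-wise to tensors.
alpha, sigma = math.cos(0.3), math.sin(0.3)   # satisfies alpha^2 + sigma^2 = 1
x0, eps = 0.8, -1.2
x_t = alpha * x0 + sigma * eps
v = to_v(x0, eps, alpha, sigma)
x0_rec, eps_rec = from_v(x_t, v, alpha, sigma)
```

Under this assumption, either prediction target carries the same information: given x_t and v, both the clean sample and the noise can be recovered.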
We follow the original repository and provide basic inference scripts to sample from the models.
*The original Stable Diffusion model was created in a collaboration with CompVis and RunwayML and builds upon the work:*
**High-Resolution Image Synthesis with Latent Diffusion Models**
Robin Rombach\*, Andreas Blattmann\*, Dominik Lorenz, Patrick Esser, Björn Ommer
_CVPR '22 Oral | GitHub | arXiv | Project page_
and [many others](#shout-outs).
Stable Diffusion is a latent text-to-image diffusion model.
Requirements
You can update an existing latent diffusion environment by running
    conda install pytorch==1.12.1 torchvision==0.13.1 -c pytorch
    pip install transformers==4.19.2 diffusers invisible-watermark
    pip install -e .
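After updating the environment, it can be useful to confirm that the pinned versions were actually installed. The sketch below uses the standard library's `importlib.metadata`; the helper `check_pins` and the `PINS` table are hypothetical, and the table assumes the conda package `pytorch` is visible to Python under the distribution name `torch`.

```python
from importlib import metadata

# Pinned versions taken from the install commands above.
PINS = {"torch": "1.12.1", "torchvision": "0.13.1", "transformers": "4.19.2"}

def check_pins(pins, lookup=metadata.version):
    """Return {name: found_version_or_None} for every package that misses its pin."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore local build tags such as "+cu113" when comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = found
    return mismatches

if __name__ == "__main__":
    print(check_pins(PINS) or "all pinned versions match")
```

The `lookup` argument exists only so the check can be exercised without the packages being installed.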
xformers efficient attention
For more efficiency and speed on GPUs, we highly recommend installing the xformers library.
Tested on A100 with CUDA 11.4. Installation needs a somewhat recent version of nvcc and gcc/g++, which can be obtained, e.g., via
    export CUDA_HOME=/usr/local/cuda-11.4
    conda install -c nvidia/label/cuda-11.4.0 cuda-nvcc
    conda install -c conda-forge gcc
    conda install -c conda-forge gxx_linux-64==9.5.0
Then, run the following (compiling takes up to 30 min).
    cd ..
    git clone https://github.com/facebookresearch/xformers.git
    cd xformers
    git submodule update --init --recursive
    pip install -r requirements.txt
    pip install -e .
    cd ../stablediffusion
Upon successful installation, the code will automatically default to memory efficient attention for the self- and cross-attention layers in the U-Net and autoencoder.
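The automatic fallback described above can be pictured as an availability check at import time. This is a minimal sketch of that pattern, not the repository's implementation; the function names and the backend labels `"memory_efficient"` and `"vanilla"` are assumptions chosen for illustration.

```python
import importlib.util

def xformers_available():
    """Return True if the xformers package can be found in this environment."""
    return importlib.util.find_spec("xformers") is not None

def pick_attention_backend(has_xformers=None):
    """Choose an attention backend label based on whether xformers is present."""
    if has_xformers is None:
        has_xformers = xformers_available()
    return "memory_efficient" if has_xformers else "vanilla"

print(pick_attention_backend())
```

Using `find_spec` avoids importing the library just to test for its presence, which keeps the check cheap.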
General Disclaimer
Stable Diffusion models are general text-to-image diffusion models and therefore mirror biases and (mis-)conceptions that are present in their training data. Although efforts were made to reduce the inclusion of explicit pornographic material, we do not recommend using the provided weights for services or products without additional safety mechanisms and considerations. The weights are research artifacts and should be treated as such. Details on the training procedure and data, as well as the intended use of the model can be found in the corresponding model card. The weights are available via the StabilityAI organization at Hugging Face under the [CreativeML Open RAIL++-M License](LICENSE-MODEL).
Stable Diffusion v2
Stable Diffusion v2 refers to a specific configuration of the model architecture that uses a downsampling-factor 8 autoencoder with an 865M UNet and OpenCLIP ViT-H/14 text encoder for the diffusion model. The _SD 2-v_ model produces 768x768 px outputs.
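To make the downsampling factor concrete, the sketch below computes the spatial size of the latent that the diffusion model operates on. The helper `latent_shape` is hypothetical, and the default of 4 latent channels is an assumption that is not stated in this section.

```python
def latent_shape(height, width, factor=8, channels=4):
    """Return (channels, height // factor, width // factor) for an image size.

    `channels=4` is an assumed default; `factor=8` comes from the text above.
    """
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (channels, height // factor, width // factor)

print(latent_shape(768, 768))  # latent size for the 768x768 SD 2-v outputs
print(latent_shape(512, 512))  # latent size for the 512x512 base model
```

With a factor of 8, a 768x768 image corresponds to a 96x96 latent grid and a 512x512 image to a 64x64 grid.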
Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 DDIM sampling steps show the relative improvements of the checkpoints:
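For readers unfamiliar with the guidance scale, the sketch below shows the usual classifier-free guidance combination of an unconditional and a text-conditional noise prediction. The README does not give the formula, so this assumes the standard form eps = eps_uncond + scale * (eps_cond - eps_uncond); the function name `guided_noise` is hypothetical.

```python
# Guidance scales listed in the evaluation above.
SCALES = [1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

def guided_noise(eps_uncond, eps_cond, scale):
    """Combine two noise predictions element-wise (assumed standard CFG formula)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy predictions; in practice these are tensors produced by the U-Net.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
for s in SCALES:
    print(s, guided_noise(eps_uncond, eps_cond, s))
```

A scale of 1.0 reproduces the conditional prediction, and larger scales push the result further away from the unconditional one.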

Text-to-Image…