How to get the best results from Stable Diffusion 3
Replicate Blog
Posted June 18, 2024 by fofr
Stability AI recently released the weights for Stable Diffusion 3 Medium, a 2 billion parameter text-to-image model that excels at photorealism, typography, and prompt following.
You can run the official Stable Diffusion 3 model on Replicate, and it is available for commercial use. We have also open-sourced our Diffusers and ComfyUI implementations (read our guide to ComfyUI).
In this blog post we’ll show you how to use Stable Diffusion 3 (SD3) to get the best images, including how to prompt SD3, which is a bit different from previous Stable Diffusion models.
To help you experiment, we’ve created an SD3 explorer model that exposes all of the settings we discuss here.
SD3 has very good adherence to long, descriptive prompts. Try it out yourself in our SD3 explorer model.
Picking an SD3 version
Stability AI have packaged up SD3 Medium in different ways to make sure it can run on as many devices as possible.
SD3 uses three different text encoders. (The text encoder is the part that takes your prompt and puts it into a format the model can understand.) One of these new text encoders is really big – meaning it uses a lot of memory. If you’re looking at the SD3 Hugging Face weights, you’ll see four options with different text encoder configurations. You should choose which one to use based on your available VRAM.
sd3_medium_incl_clips_t5xxlfp8.safetensors
This file contains the model weights, the two CLIP text encoders, and the large T5-XXL model in a compressed fp8 format. We recommend these weights for simplicity and best results.
sd3_medium_incl_clips_t5xxlfp16.safetensors
The same as sd3_medium_incl_clips_t5xxlfp8.safetensors, except the T5 part isn’t compressed as much. By using fp16 instead of fp8, you’ll get a slight improvement in your image quality. This improvement comes at the cost of higher memory usage.
sd3_medium_incl_clips.safetensors
This version does away with the T5 element altogether. It includes the weights with just the two CLIP text encoders. This is a good option if you do not have much VRAM, but your results might be very different from the full version. You might notice that this version doesn’t follow your prompts as closely, and it may also reduce the quality of text in images.
sd3_medium.safetensors
This model is just the base weights without any text encoders. If you use these weights, make sure you’re loading the text encoders separately. Stability AI have provided an example ComfyUI workflow for this.
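If you load SD3 through Diffusers rather than ComfyUI, the same trade-off can be sketched as below. This is a minimal sketch, not the setup used in this post: it assumes the `diffusers` and `torch` packages, a CUDA GPU, and access to the `stabilityai/stable-diffusion-3-medium-diffusers` repository on Hugging Face. The helper names `text_encoder_overrides` and `load_pipeline` are hypothetical. Passing `text_encoder_3=None` and `tokenizer_3=None` skips T5, which roughly corresponds to the CLIP-only checkpoint described above.

```python
from typing import Any, Dict

# Assumed Hugging Face repo id for the Diffusers-format SD3 Medium weights.
MODEL_ID = "stabilityai/stable-diffusion-3-medium-diffusers"


def text_encoder_overrides(use_t5: bool) -> Dict[str, Any]:
    """Return extra from_pretrained() kwargs for the chosen encoder setup."""
    if use_t5:
        # Load all three text encoders (two CLIP models plus T5-XXL).
        return {}
    # Skip T5-XXL to save memory, similar to sd3_medium_incl_clips.safetensors.
    return {"text_encoder_3": None, "tokenizer_3": None}


def load_pipeline(use_t5: bool = True):
    """Load SD3 Medium with or without T5 (needs torch, diffusers and a GPU)."""
    # Imported lazily so the helper above works without these packages.
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,
        **text_encoder_overrides(use_t5),
    )
    return pipe.to("cuda")
```

Calling `load_pipeline(use_t5=False)` should use noticeably less memory, with the prompt-following and text-rendering caveats described above.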
Prompting
The big change in usage in SD3 is prompting. You can now pass in very long and descriptive prompts and get back images with very good prompt adherence. You’re no longer constrained by the 77-token limit of the CLIP text encoder.
Results for the same prompt in SD3 (left) vs. SDXL, showing SD3’s advantages in long prompts and correctly rendering text.
Prompt: The cover of a 1970s hardback children’s storybook with a black and white illustration of a small white baby bird perched atop the head of a friendly old hound dog. The dog is lying flat with its chin on the floor. The dog’s ears are long and droopy, and its eyes are looking upward at the small bird perched atop its head. The little white bird is looking down expectantly at the dog. The book’s title is ‘Are You My Boss?’ set in a white serif font, and the cover is in a cool blue and green color palette.
Your prompt can now be as long as 10,000 characters, or more than 1,500 words. In practice, you won’t need that sort of length, but it is clear we should no longer worry about prompt length.
For very long prompts, at the moment, it’s hard to say what will and will not make it into the image. It isn’t clear which parts of a prompt the model will pay attention to. But the longer and more complex the prompt, the more likely something will be missing.
Do not use negative prompts
SD3 was not trained with negative prompts. Negative prompting does not work as you expect it to with SD3. If you’ve already experimented with SD3, you may have noticed that when you give a negative prompt, your image does change, but the change isn’t a meaningful one. Your negative prompt will not remove the elements you don’t want; instead, it will introduce noise to your conditioning and simply vary your output, kind of like using a different seed.
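If you are reusing settings from an older Stable Diffusion workflow, one simple precaution is to drop the negative prompt before calling SD3. The sketch below is only an illustration: the `sd3_settings` helper and the shape of the settings dictionary are hypothetical, not part of any library.

```python
from typing import Any, Dict


def sd3_settings(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Copy generation settings, dropping any negative prompt.

    SD3 was not trained with negative prompts, so the key is removed
    rather than passed on. Describe what you want in the prompt instead.
    """
    cleaned = dict(settings)  # leave the caller's dictionary untouched
    cleaned.pop("negative_prompt", None)
    return cleaned


# Settings carried over from an older workflow (illustrative values).
old_settings = {"prompt": "a photo of a cat", "negative_prompt": "blurry"}
new_settings = sd3_settings(old_settings)
```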
Prompting techniques
Now that we’re allowed longer prompts, you can use plain English sentences and grammar to describe the image you want. You can still use comma-separated keywords like before, but if you’re aiming for something specific, it pays to be descriptive and explicit with your prompts. This level of prompting is now similar to the way you would prompt Midjourney version 6 and DALL·E 3.
When you are describing an element of an image, try to make your language unambiguous to prevent those descriptions from also applying to other parts of the image.
These are examples of long and descriptive prompts that show good prompt adherence in SD3:
a man and woman are standing together against a backdrop, the backdrop is divided equally in half down the middle, left side is red, right side is gold, the woman is wearing a t-shirt with a yoda motif, she has a long skirt with birds on it, the man is wearing a three piece purple suit, he has spiky blue hair (see example)
a man wearing 1980s red and blue paper 3D glasses is sitting on a motorcycle, it is parked in a supermarket parking lot, midday sun, he is wearing a Slipknot t-shirt and has black pants and cowboy boots (see example)
a close-up half-portrait photo of a woman wearing a sleek blue and white summer dress with a monstera plant motif, has square white glasses, green braided hair, she is on a pebble beach in Brighton UK, very early in the morning, twilight sunrise (see example)
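One way to keep descriptions unambiguous is to write each detail as its own clause that names the subject it belongs to, then join the clauses into a single prompt. The sketch below rebuilds the first example prompt that way. The `build_prompt` helper is hypothetical, and the 10,000-character check simply reuses the limit quoted earlier in this post.

```python
from typing import List

# Prompt length limit quoted earlier in this post.
MAX_PROMPT_CHARS = 10_000


def build_prompt(scene: str, details: List[str]) -> str:
    """Join a scene and explicit per-subject details into one prompt."""
    # Drop empty clauses and stray whitespace before joining.
    parts = [scene.strip()] + [d.strip() for d in details if d.strip()]
    prompt = ", ".join(parts)
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt is {len(prompt)} characters, over the limit")
    return prompt


prompt = build_prompt(
    "a man and woman are standing together against a backdrop",
    [
        "the backdrop is divided equally in half down the middle",
        "left side is red, right side is gold",
        "the woman is wearing a t-shirt with a yoda motif",
        "she has a long skirt with birds on it",
        "the man is wearing a three piece purple suit",
        "he has spiky blue hair",
    ],
)
```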
Different prompts for each text encoder
Now that we have three text encoders, we can technically pass in different prompts to each of them. For example, you could try passing the general style and theme of an image to the CLIP text encoders, and the detailed subject to the T5 part. In our experimentation, we haven’t found any special techniques yet, but we’re still trying.
Here’s an example where we pass different prompts to the CLIP and T5 encoders.
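In Diffusers, the SD3 pipeline accepts `prompt`, `prompt_2` and `prompt_3` arguments, which feed the two CLIP encoders and the T5 encoder respectively. The sketch below shows the style-versus-subject split described above. It is not the linked example: the helper names are hypothetical, and it assumes `diffusers`, `torch`, a CUDA GPU and access to the `stabilityai/stable-diffusion-3-medium-diffusers` repository.

```python
from typing import Dict

# Assumed Hugging Face repo id for the Diffusers-format SD3 Medium weights.
MODEL_ID = "stabilityai/stable-diffusion-3-medium-diffusers"


def split_prompts(style: str, subject: str) -> Dict[str, str]:
    """Send the style to both CLIP encoders and the subject to T5."""
    return {
        "prompt": style,      # first CLIP text encoder
        "prompt_2": style,    # second CLIP text encoder
        "prompt_3": subject,  # T5 text encoder
    }


def generate(style: str, subject: str):
    """Generate one image (needs torch, diffusers and a GPU)."""
    # Imported lazily so split_prompts() works without these packages.
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16
    ).to("cuda")
    return pipe(**split_prompts(style, subject)).images[0]
```

For example, `generate("a 1970s children's book illustration", "a small white bird perched on the head of an old hound dog")` would return a PIL image. As noted above, it is not yet clear whether splitting prompts this way gives better results than a single prompt.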
Settings
There are many settings, some new, that you can use to change image outputs in…