
How to get the best results from Stable Diffusion 3

Posted by @fofr on June 18, 2024

Stability AI recently released the weights for Stable Diffusion 3 Medium, a 2 billion parameter text-to-image model that excels at photorealism, typography, and prompt following.

You can run the official Stable Diffusion 3 model on Replicate, and it is available for commercial use. We have also open-sourced our Diffusers and ComfyUI implementations (read our guide to ComfyUI).

In this blog post we’ll show you how to use Stable Diffusion 3 (SD3) to get the best images, including how to prompt SD3, which is a bit different from previous Stable Diffusion models.

To help you experiment, we’ve created an SD3 explorer model that exposes all of the settings we discuss here.

Screenshot of fofr's SD3 explorer with a long prompt and the resulting generated image
SD3 has very good adherence to long, descriptive prompts. Try it out yourself in our SD3 explorer model.

Picking an SD3 version

Stability AI have packaged up SD3 Medium in different ways to make sure it can run on as many devices as possible.

SD3 uses three different text encoders. (The text encoder is the part that takes your prompt and puts it into a format the model can understand). One of these new text encoders is really big – meaning it uses a lot of memory. If you’re looking at the SD3 Hugging Face weights, you’ll see four options with different text encoder configurations. You should choose which one to use based on your available VRAM.

sd3_medium_incl_clips_t5xxlfp8.safetensors

This file contains the model weights, the two CLIP text encoders, and the large T5-XXL text encoder in a compressed fp8 format. We recommend these weights for simplicity and the best results.

sd3_medium_incl_clips_t5xxlfp16.safetensors

The same as sd3_medium_incl_clips_t5xxlfp8.safetensors, except the T5 part isn’t compressed as much. By using fp16 instead of fp8, you’ll get a slight improvement in your image quality. This improvement comes at the cost of higher memory usage.

sd3_medium_incl_clips.safetensors

This version does away with the T5 element altogether. It includes the weights with just the two CLIP text encoders. This is a good option if you do not have much VRAM, but your results might be very different from the full version. You might notice that this version doesn’t follow your prompts as closely, and it may also reduce the quality of text in images.

sd3_medium.safetensors

This model is just the base weights without any text encoders. If you use these weights, make sure you’re loading the text encoders separately. Stability AI have provided an example ComfyUI workflow for this.
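To make the trade-offs concrete, here is a small helper that maps available VRAM to one of the four checkpoints above. The VRAM thresholds are illustrative assumptions, not official requirements from Stability AI — benchmark on your own hardware before relying on them.

```python
# Rough guide for picking an SD3 Medium checkpoint based on available VRAM.
# The gigabyte thresholds below are illustrative assumptions, not official figures.

def pick_sd3_weights(vram_gb: float) -> str:
    if vram_gb >= 20:
        # Plenty of memory: uncompressed (fp16) T5-XXL for the best quality.
        return "sd3_medium_incl_clips_t5xxlfp16.safetensors"
    if vram_gb >= 12:
        # Recommended default: T5-XXL compressed to fp8.
        return "sd3_medium_incl_clips_t5xxlfp8.safetensors"
    if vram_gb >= 8:
        # Drops T5 entirely; weaker prompt following and text rendering.
        return "sd3_medium_incl_clips.safetensors"
    # Base weights only; load the text encoders separately.
    return "sd3_medium.safetensors"

print(pick_sd3_weights(24))  # sd3_medium_incl_clips_t5xxlfp16.safetensors
```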

Prompting

The big change in usage in SD3 is prompting. You can now pass in very long and descriptive prompts and get back images with very good prompt adherence. You’re no longer limited to the 77-token limit of the CLIP text encoder.

Comparison of SD3 and SDXL output
Results for the same prompt in SD3 (left) vs. SDXL, showing SD3's advantages in long prompts and correctly rendering text. Prompt: The cover of a 1970s hardback children's storybook with a black and white illustration of a small white baby bird perched atop the head of a friendly old hound dog. The dog is lying flat with its chin on the floor. The dog's ears are long and droopy, and its eyes are looking upward at the small bird perched atop its head. The little white bird is looking down expectantly at the dog. The book's title is 'Are You My Boss?' set in a white serif font, and the cover is in a cool blue and green color palette

Your prompt can now go as long as 10,000 characters, or more than 1,500 words. In practice, you won’t need that sort of length, but it is clear we should no longer worry about prompt length.

For very long prompts, at the moment, it’s hard to say what will and will not make it into the image. It isn’t clear which parts of a prompt the model will pay attention to. But the longer and more complex the prompt, the more likely something will be missing.

Do not use negative prompts

SD3 was not trained with negative prompts. Negative prompting does not work as you might expect with SD3. If you've already experimented with SD3, you may have noticed that when you give a negative prompt, your image does change, but the change isn't a meaningful one. Your negative prompt will not remove the elements you don't want; instead, it will introduce noise into your conditioning and simply vary your output, a bit like using a different seed.

Prompting techniques

Now that we’re allowed longer prompts, you can use plain English sentences and grammar to describe the image you want. You can still use comma-separated keywords like before, but if you’re aiming for something specific, it pays to be descriptive and explicit with your prompts. This level of prompting is now similar to the way you would prompt Midjourney version 6 and DALL·E 3.

When you are describing an element of an image, try to make your language unambiguous to prevent those descriptions from also applying to other parts of the image.

These are examples of long and descriptive prompts that show good prompt adherence in SD3:

a man and woman are standing together against a backdrop, the backdrop is divided equally in half down the middle, left side is red, right side is gold, the woman is wearing a t-shirt with a yoda motif, she has a long skirt with birds on it, the man is wearing a three piece purple suit, he has spiky blue hair (see example)

a man wearing 1980s red and blue paper 3D glasses is sitting on a motorcycle, it is parked in a supermarket parking lot, midday sun, he is wearing a Slipknot t-shirt and has black pants and cowboy boots (see example)

a close-up half-portrait photo of a woman wearing a sleek blue and white summer dress with a monstera plant motif, has square white glasses, green braided hair, she is on a pebble beach in Brighton UK, very early in the morning, twilight sunrise (see example)

Different prompts for each text encoder

Now that we have three text encoders, we can technically pass in different prompts to each of them. For example, you could try passing the general style and theme of an image to the CLIP text encoders, and the detailed subject to the T5 part. In our experimentation, we haven’t found any special techniques yet, but we’re still trying.

Here’s an example where we pass different prompts to the CLIP and T5 encoders.
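If you are using Diffusers, the `StableDiffusion3Pipeline` call accepts separate `prompt`, `prompt_2` (the two CLIP encoders) and `prompt_3` (T5) arguments, so you can try this split yourself. The sketch below defers its heavy imports into the function because generation needs a GPU and a multi-gigabyte weights download; the model id is the official (gated) Diffusers repo, and the fp16 dtype is an assumption about your hardware.

```python
# Sketch: pass a different prompt to the CLIP encoders (`prompt`, `prompt_2`)
# and to the T5 encoder (`prompt_3`) using Diffusers.
# Requires a GPU, the `diffusers` package, and access to the gated SD3 weights,
# so the imports are deferred until the function is actually called.

def generate_split_prompt(style: str, subject: str):
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3-medium-diffusers",
        torch_dtype=torch.float16,
    ).to("cuda")
    # CLIP encoders get the broad style; T5 gets the detailed subject.
    return pipe(prompt=style, prompt_2=style, prompt_3=subject).images[0]
```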

Settings

There are many settings, some new, that you can use to change image outputs in SD3. We recommend some good defaults below, but you should experiment to find your own preferences.

In summary, you should start your experimentation from these settings (we’ll discuss them more in detail below):

  • 28 steps
  • 3.5 to 4.5 CFG
  • dpmpp_2m sampler with the sgm_uniform scheduler
  • 3.0 shift
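As a starting point in code, here is how those defaults look as a Diffusers call. Note that the dpmpp_2m/sgm_uniform sampler names are ComfyUI settings; the Diffusers pipeline ships with its own flow-matching scheduler, so this sketch only sets steps and CFG. The model id is the official (gated) repo, and running it requires a GPU, hence the deferred imports.

```python
# Recommended starting settings from this guide, expressed as Diffusers arguments.
RECOMMENDED = {
    "num_inference_steps": 28,  # steps
    "guidance_scale": 4.0,      # CFG, within the recommended 3.5-4.5 range
}

def generate(prompt: str):
    # Requires a GPU, the `diffusers` package, and access to the gated SD3
    # weights, so the imports are deferred until the function is called.
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3-medium-diffusers",
        torch_dtype=torch.float16,
    ).to("cuda")
    return pipe(prompt, **RECOMMENDED).images[0]
```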

Width and height

Much like SDXL, SD3 gives the best outputs at around 1 megapixel. Resolutions must be divisible by 64. We recommend the following widths and heights for these common aspect ratios:

  • 1:1 - 1024 x 1024 (Square images)
  • 16:9 - 1344 x 768 (Cinematic and widescreen)
  • 21:9 - 1536 x 640 (Cinematic)
  • 3:2 - 1216 x 832 (Landscape aspect ratio)
  • 2:3 - 832 x 1216 (Portrait aspect ratio)
  • 5:4 - 1088 x 896 (Landscape aspect ratio)
  • 4:5 - 896 x 1088 (Portrait aspect ratio)
  • 9:16 - 768 x 1344 (Long vertical images)
  • 9:21 - 640 x 1536 (Very tall images)

If you’ve previously used Stable Diffusion 1.5 and SDXL at resolutions larger than they were trained, you might be familiar with the strange outputs they give – distorted images, multiple heads, repeating elements, and so on. (You can see some of these in our previous SDXL guide.) This does not happen with SD3. In SD3, if you go bigger than the expected resolution, you’ll have a reasonable image in the middle and strange repeating artifacts around the edges (here’s a prediction example showing an image that’s too large). Similarly, if you go too small, your image will be harshly cropped (here’s a prediction example showing a cropped image that’s too small).

Number of steps

This setting is the number of denoising steps the model uses when generating an image. In SDXL, this value was typically around 20, and Lightning models use 4 steps. The number of steps is the main factor determining how long your image takes to generate: more steps give a better image; fewer steps give a faster one.

For SD3, we recommend 28 steps. This number gives sharp images with an interesting foreground and background and few VAE artifacts (visible noise patterns you might see in generated images), and it doesn’t take too long.

The effect of increasing steps

The way steps affect image quality is different from previous Stable Diffusion models. We are used to steps improving quality iteratively up to a certain point, where the effect levels off and images remain almost static. But with SD3, as you increase steps, you'll notice something different.

SD3 can usually get to an OK-looking image in about 8 to 10 steps (here’s an example prediction at 10 steps), albeit with VAE noise artifacts and parts of the image that aren’t coherent. This is also dependent on prompt and seed. As the steps increase you get more coherent and interesting images. The sweet spot is around 26 to 36.

You will also find that images and their subjects can sometimes change quite dramatically at different step values. For example, for a vague prompt of a person, you could find your subject changes age, gender or ethnicity as steps increase. Compare these two outputs: one at 10 steps, and another – with the same settings and seed – at 32 steps.

Guidance scale

The guidance scale (or CFG, classifier-free guidance) tells the model how similar the output should be to the prompt. For SD3, you need to use lower values than SD 1.5 and SDXL.

We recommend somewhere between 3.5 and 4.5. If your outputs look “burnt,” like they have too much contrast, lower the CFG (here’s an example of a burnt image where the CFG is too high).

It’s also worth pointing out that the lower your CFG, the more similar your outputs will be across the different text encoder options (in other words, whether you use the T5 text encoder in fp8, fp16 or not at all). So if you’re using a very low CFG, you could do away with the large T5 encoder without affecting the image quality much. As an example, compare these two outputs that use the same seed and a CFG of 1.5: this is the output with fp16, which is very similar to the CLIP-only output.

Sampler and scheduler

Different tools use different labels for these, but essentially this is the algorithm the model will use to manage noise. Different algorithms give different images.

For SD3 we recommend using the dpmpp_2m sampler with the sgm_uniform scheduler in ComfyUI. Use dpm++ 2M in Automatic1111. Euler can also give good results.

Some samplers and schedulers simply do not work with SD3 – notably the ancestral and sde samplers and the popular SDXL noise scheduler, karras.

Shift

Shift is a new parameter in SD3 that you can modify. It represents the timestep scheduling shift, where higher shift values are better at managing noise in higher resolutions. Essentially, noise is handled better and you get nicer-looking images when using a shift. You can read more about the theory behind timestep schedule shifting in the SD3 research paper.

3.0 is the recommended default value for shift based on a human preference evaluation, but you can of course change it. In ComfyUI, you can find the value on the “ModelSamplingSD3” node, and in Diffusers you can pass in a shift parameter to the FlowMatchEulerDiscreteScheduler.
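In Diffusers, that looks like the sketch below: swap the pipeline's scheduler for a `FlowMatchEulerDiscreteScheduler` built from the existing config with a new `shift` value. As before, the model id and GPU assumptions mean the heavy imports are deferred; treat this as a sketch rather than a drop-in script.

```python
# Sketch: setting the timestep-schedule shift when using SD3 with Diffusers.
# Requires a GPU, the `diffusers` package, and access to the gated SD3 weights,
# so the imports are deferred until the function is called.

def make_pipeline_with_shift(shift: float = 3.0):
    import torch
    from diffusers import FlowMatchEulerDiscreteScheduler, StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3-medium-diffusers",
        torch_dtype=torch.float16,
    )
    # Rebuild the scheduler from its existing config, overriding only `shift`.
    pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
        pipe.scheduler.config, shift=shift
    )
    return pipe.to("cuda")
```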

A shift value of 6.0 scored well in the human evaluation and is worth trying. If you use lower values like 2.0 or 1.5, you can get a more raw and “less processed” looking image, which works well for certain prompts.

Conclusion

Have fun experimenting with Stable Diffusion 3 using these tips! For more on working with SD3, check out our recent blog posts: