
Character Consistency in Stable Diffusion


UPDATED: 07/01
– Changed templates so it’s easier to scale to 512 or 768

– Changed the ImageSplitter script to make it more user-friendly and added a GitHub link to it

– Added section on facial expressions

– Added information on scaling in img2img and settings I’ve had success with

– Removed the hires fix approach; it was more of a quick-and-dirty example and led to some confusion.

– Made a series of clarifications


One of the big questions that comes up often in regards to Stable Diffusion is how to achieve character consistency when we want to create more than a single image. Facial characteristics are the most important, followed by body shape, clothing style, setting, etc. In the AI image/video world this is the sought-after holy grail as of now (mid-’23).


One route to achieve this goal is through the use of a LoRA (Low-Rank Adaptation), a method of training that inserts weights into an existing AI model’s layers to bias it towards a defined outcome. But most LoRAs are trained on real-life people (famous actors’ images, personal photographs) and styles, not on AI-generated persona output. So the question becomes: what if I want to create a character based on the output of the model itself, so that I have a 100% unique fabricated persona to develop a character around, using my own character description? How do I achieve that?


I see this breaking down into 3 stages:



The concept here is to iterate LoRAs a few times to achieve the model we want. The first iteration will focus on refining facial features, and the second on body features, which then leads to a final ‘person’ LoRA of our designed concept humanoid.


In this post we’ll focus on facial features, the most ‘recognizable’ element of a person. To achieve this we need to create a character sheet: a set of 15 images that are close enough in likeness to each other that they represent a single character. A character sheet is the primary foundation for consistent character output. Getting the facial features right comes first, then refining body features.


Achieving this took some experimentation, and fundamentally, getting there requires an iterative approach using a LoRA model. How to build, refine, and include the rest of the character will be covered in part II of the series, where I will use kohya_ss to train the LoRA for the outcome we need.

Now on to the character sheet development, which should be fairly straightforward.

SETUP:

In this initial phase we’ll focus on facial features. First, there are two assets that I’ve created that will be needed so that Stable Diffusion can output a good quality character sheet: (1) for ControlNet OpenPose, providing 15 views of a face; (2) for ControlNet Lineart, to guide SD to keep renderings in a specific box/space. To download these, just right click and ‘save as…’


Note: Templates updated 7/1 and are now 1328×800


Once you’ve downloaded these we want to bring them into Stable Diffusion text2image using the following Controlnet settings.


The lineart is there to give the AI better guidance on how we want the sheet segmented. I found that using yellow rather than 100% black provided better segmentation than pure black/white. For this line mask we’ll use the below settings:


The first one will configure OpenPose to give us the facial directions we need for the character sheet. The second image, using Lineart, will provide ‘guideline’ borders for Stable Diffusion to draw within. Remember we’re dealing with AI, so outputs are going to vary, but with the above we tell it what we want and give it the best guardrails to get as close as possible to the outcome we desire.
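If you prefer to drive this from code rather than the webui panels, here is a rough sketch of the same two-unit ControlNet setup through AUTOMATIC1111’s optional HTTP API (webui launched with --api). This is not part of the original workflow: the file names are placeholders, the ControlNet model names depend on your install, and the unit field names vary between ControlNet extension versions, so treat it as an outline only.

```python
import base64
import requests  # pip install requests

def b64(path):
    # ControlNet unit images are passed base64-encoded in the payload
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

payload = {
    "prompt": "(a character sheet of a woman from different angles "
              "with a grey background:1.4), auburn hair, eyes open, ...",
    "negative_prompt": "easynegative, canvas frame, ...",
    "width": 1328,   # template dimensions; both divisible by 8
    "height": 800,
    "alwayson_scripts": {
        "ControlNet": {
            "args": [
                {   # Unit 0: the OpenPose template (15 face views)
                    "image": b64("openpose_template.png"),
                    "module": "none",  # template is already a pose map
                    "model": "control_v11p_sd15_openpose",  # install-dependent
                },
                {   # Unit 1: the yellow lineart mask (segmentation guardrails)
                    "image": b64("lineart_template.png"),
                    "module": "none",
                    "model": "control_v11p_sd15_lineart",   # install-dependent
                },
            ]
        }
    },
}

resp = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
resp.raise_for_status()
```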


The sizes of the OpenPose sheet and the mask are set up specifically for the greatest rendering accuracy from Stable Diffusion. A not-well-known fact is that Stable Diffusion output dimensions HAVE to be divisible by 8. These sheets are set up with 8-pixel separators and 256×256 images. This was the best compromise I could find between quality and size given VRAM demands.
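To make that concrete: 15 views arranged five across and three down, with 256×256 tiles and 8-pixel separators between tiles and along the outer edges, gives exactly the 1328×800 template size, and both dimensions satisfy the divisible-by-8 rule:

```python
TILE, GAP, COLS, ROWS = 256, 8, 5, 3  # 5x3 grid = 15 face views

# separators sit between tiles and along the outer edges
width  = COLS * TILE + (COLS + 1) * GAP   # 5*256 + 6*8 = 1328
height = ROWS * TILE + (ROWS + 1) * GAP   # 3*256 + 4*8 = 800

assert width % 8 == 0 and height % 8 == 0  # SD's divisibility requirement
print(width, height)  # -> 1328 800
```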


TXT2IMG SETTINGS:

Prompt
(a character sheet of a woman from different angles with a grey background:1.4) , auburn hair, eyes open, cinematic lighting, Hyperrealism, depth of field, photography, ultra highres, photorealistic, 8k, hyperrealism, studio lighting, photography


Negative Prompt
easynegative, canvasframe, canvas frame, eyes shut, wink, blurry, hands, closed eyes, (easynegative), ((((ugly)))), (((duplicate))), ((morbid)), ((mutilated)), out of frame, extra fingers, mutated hands, ((poorly drawn hands)), ((poorly drawn face)), ((bad art)), blurry, (((mutation))), (((deformed))), blurry, ((bad anatomy)), (((bad proportions))), ((extra limbs)), cloned face, (((disfigured))), gross proportions, (malformed limbs), ((missing arms)), ((missing legs)), ((floating limbs)), ((disconnected limbs)), ((malformed hands)), ((missing fingers)), worst quality, ((disappearing arms)), ((disappearing legs)), (((extra arms))), (((extra legs))), (fused fingers), (too many fingers), (((long neck))), canvas frame, ((worst quality)), ((low quality)), lowres, sig, signature, watermark, username, bad, immature, cartoon, anime, 3d, painting, b&w


Panel settings:


Scaling with img2img

When it’s done we’ll click ‘send to img2img’. For upscaling in img2img you can use any upscaler you’re comfortable with; I use Ultimate SD Upscale, which can be installed via the extensions tab (ultimate-upscale-for-automatic1111). Here are my settings. For the noise level, I recommend something in the range of 0.4–0.6; you’ll need to experiment, but it will help remove deformations, especially on the side angles. A higher noise level also means your character will change somewhat.


Here’s the primary panel settings I use for img2img:

For ControlNet it’s pretty simple and straightforward:


Lastly, the Ultimate SD Upscale options:

With this setup, we can click ‘Generate’ and wait… 🙂


The result:


A few other output examples using slightly different prompts and models:

Model: disneyPixarCartoon_v10

(a character sheet of a woman from different angles with a grey background:1.4), auburn hair, eyes open, cinematic lighting, Hyperrealism, depth of field, photography, ultra highres, photorealistic, 8k, hyperrealism, studio lighting, photography


Model: henmixReal_v40
(a character sheet of a beautiful woman from different angles with a grey background:1.4) , blonde hair, eyes open, cinematic lighting, Hyperrealism, depth of field, photography, ultra highres, photorealistic, 8k, hyperrealism, studio lighting, photography


Model: darkSushiMixMix_225D

(a character sheet of a beautiful woman from different angles with a grey background:1.4) , black hair, eyes open, cinematic lighting, Hyperrealism, depth of field, photography, ultra highres, 8k, studio lighting


Model: reliberate_v10
(a character sheet of a woman from different angles with a grey background:1.4) , black hair, eyes open, cinematic lighting, Hyperrealism, depth of field, photography, ultra highres, photorealistic, 8k, hyperrealism, studio lighting, photography


Expressions

Since we want to build on this work and make sure our character has flexibility in expression, I’d recommend creating a few alternate panels, defining the expression in the prompt, for example smiling, sad, or angry; there are actually quite a few options. For instance, using the prompt:


(a character sheet of a woman smiling from different angles with a grey background:1.4) , auburn hair, eyes open, cinematic lighting, Hyperrealism, depth of field, photography, ultra highres, photorealistic, 8k, hyperrealism, studio lighting, photography


Leave all the other settings the same.
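If you want to batch out several expression sheets, the only thing that changes is the expression word in the prompt. A trivial sketch (the expression list here is just an example set, not from the original post):

```python
BASE = ("(a character sheet of a woman {expr} from different angles "
        "with a grey background:1.4), auburn hair, eyes open, "
        "cinematic lighting, Hyperrealism, depth of field, photography, "
        "ultra highres, photorealistic, 8k, hyperrealism, "
        "studio lighting, photography")

# Example expressions; extend with whatever range your character needs.
for expr in ["smiling", "sad", "angry", "surprised", "laughing"]:
    print(BASE.format(expr=expr))  # paste into txt2img; leave other settings alone
```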


SPLITTING THE IMAGES (UPDATED SCRIPT 7/1):


Once you have the character sheets and the output you want, the next step is splitting up each master image. To achieve this I have a simple Python script (v1.1) – you can grab it via this GitHub link.


The script is very simple at this point; it requires Pillow (PIL) for image handling (pip install Pillow) and will split the master image up. Expect some frame edges to show on the images: since SD is AI and output is unpredictable, it won’t be pixel perfect, but you should get ‘close enough’ outputs from the splitting for the next phase, creating a LoRA from what we have.
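The v1.1 script on GitHub is the authoritative version; as a minimal sketch of the underlying crop logic, assuming the 1328×800 template geometry described earlier (5×3 grid of 256×256 tiles with 8 px separators), it looks roughly like this. The scale parameter is an illustrative addition here to handle sheets that were upscaled in img2img:

```python
from PIL import Image  # pip install Pillow

TILE, GAP, COLS, ROWS = 256, 8, 5, 3

def split_sheet(path, out_prefix="face", scale=1):
    """Crop each face tile out of a character-sheet image.

    scale: 1 for a raw 1328x800 sheet, 2 if it was upscaled 2x, etc.
    """
    sheet = Image.open(path)
    n = 0
    for row in range(ROWS):
        for col in range(COLS):
            x0 = (GAP + col * (TILE + GAP)) * scale
            y0 = (GAP + row * (TILE + GAP)) * scale
            box = (x0, y0, x0 + TILE * scale, y0 + TILE * scale)
            sheet.crop(box).save(f"{out_prefix}_{n:02d}.png")
            n += 1

split_sheet("character_sheet.png", scale=2)  # e.g. a 2x-upscaled sheet
```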


Conclusion

That gets us through creating the baseline for a character sheet. From here we’ll select a subset of the split images. Up next: creating the LoRA so we can instantiate a consistent character in our Stable Diffusion outputs. Since I’m getting a lot of questions daily about this process, I will most likely make other addendums, and there’s a slight chance this becomes a 3-part series so I can cover another area prior to training.


9 Responses

  1. spring says:

    What’s the purpose of generating on a grid instead of processing each pose individually within a batch?

    • From a Stable Diffusion perspective, creating a character sheet minimizes RNG differences in facial features. Getting the character sheet down, splitting the images, and then upscaling in batch will provide more consistent results for final output. Doing each pose separately leaves more up to RNG and creates more spurious outputs, providing inconsistent data to a LoRA for fine-tuning.

  2. 柳树 says:

    thanks a lot, Dave Packer,

  3. Arno says:

    Thanks a lot, great write-up

  4. Leyline says:

    Please keep up the good work. Consistency for generated characters is so hard to maintain.

  5. Nicolaas Grobler says:

    This guide is so well done!

  6. Richard Wall says:

    Not sure if this will help anyone else, but with the image dimensions being 1328×800 I found SD needed 16GB of VRAM.

    Adjusting the prompt generation size to 1280×768 while leaving everything else the same got the requirement down to 12GB, without any major loss in output quality on my 3060.

    Thanks for the guide, great method for keeping high quality and consistent character generation.

    • Thanks for the comment, Richard. I’ve been doing a ton of testing since I wrote this post to try to reduce the footprint without compromising the ultimate output, which is why I’m delaying the LoRA part a bit; I want it to be achievable for most of the community, and I know larger images are VRAM-taxing. My plan is to switch to a 3×3 grid, which would be 800×800 (768×768 of images, with 32 px for borders); I can also cut the borders in half. I’ll be testing that today.
