Stable Diffusion Fine-Tuning Summary

Stable Diffusion

Model variants:

SD

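SD is a latent-based diffusion model that injects a text condition into the UNet to generate images from text. Its core comes from the Latent Diffusion work. Conventional diffusion models are pixel-based generative models, whereas Latent Diffusion is latent-based: it first uses an autoencoder to compress the image into latent space, then uses a diffusion model to generate the image latents, and finally passes those latents through the autoencoder's decoder to obtain the generated image.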
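
To make the compression step concrete, here is a minimal sketch of the tensor shapes involved. It assumes the commonly used SD autoencoder settings (spatial downsampling factor 8 and 4 latent channels), which are not stated in the text above, and `latent_shape` / `compression_ratio` are hypothetical helper names.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return (channels, h, w) of the latent for an RGB image of the given size.

    downsample=8 and latent_channels=4 are assumed values, not taken from the text.
    """
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, downsample=8, latent_channels=4):
    """How many times fewer values the latent holds than the RGB pixel tensor."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (3 * height * width) / (c * h * w)


print(latent_shape(512, 512))       # (4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions the diffusion model works on a tensor 48 times smaller than the pixel image, which is the main reason latent diffusion is cheaper than pixel-space diffusion.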

SD2

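Compared with SD 1.x, SD 2.0 changes mainly in two areas: model architecture and training data.

On the architecture side, SD 1.x uses OpenAI's CLIP ViT-L/14 as its text encoder, with 123.65M parameters. SD 2.0 uses a larger text encoder: the CLIP ViT-H/14 model trained with OpenCLIP on the laion-2b dataset, with 354.03M parameters, roughly 3 times the size of the original text encoder.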

SDXL

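Stable Diffusion XL (SDXL) is a powerful text-to-image model that iterates on previous Stable Diffusion models in three key ways:

1. The UNet is 3 times larger, and SDXL combines a second text encoder (OpenCLIP ViT-bigG/14) with the original text encoder, significantly increasing the parameter count.

2. It introduces size and crop conditioning, to keep training data from being discarded and to give more control over how generated images are cropped.

3. It introduces a two-stage model process: the base model (which can also run standalone) generates an image that serves as input to the refiner model, which adds extra high-quality detail.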

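SDXL LCM (Latent Consistency Model)

The SDXL Latent Consistency Model (LCM), proposed in "Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference", speeds up image generation by reducing the number of sampling steps required. It distills the original SDXL model into a version that needs far fewer steps (4 to 8 instead of 25 to 50) to generate an image. This is particularly useful for applications that need fast generation without a large loss in quality. Note that the distilled model keeps the SDXL architecture, so the speedup comes from the lower step count rather than from a smaller network.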

SDXL Distilled

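SDXL Distilled refers to versions of SDXL that have been distilled for a specific purpose. For example, the Segmind Stable Diffusion model (SSD-1B) is a distilled version of SDXL that is 50% smaller and 60% faster while retaining high-quality text-to-image generation. It is especially useful where speed is critical but image quality cannot be compromised.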

SDXL Turbo

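SDXL Turbo is a newer version of SDXL 1.0 developed for real-time synthesis, meaning it can generate images very quickly. This is enabled by a new training method called Adversarial Diffusion Distillation (ADD). It relies on a lossy autoencoder, so some information is lost when images are encoded and decoded, but this trade-off is part of what lets images be generated faster.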

Installing accelerate

Install with pip

pip install accelerate

Configuration

accelerate config

In which compute environment are you running? (choose This machine when training on a local server)
This machine
Which type of machine are you using? (choose multi-GPU for one machine with several GPUs; choose the first option for a single GPU)
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1    (number of machines used for training)
Do you wish to optimize your script with torch dynamo?[yes/NO]:
Do you want to use DeepSpeed? [yes/NO]:
Do you want to use FullyShardedDataParallel? [yes/NO]:
Do you want to use Megatron-LM ? [yes/NO]:
How many GPU(s) should be used for distributed training? [1]:2    (number of GPUs to use)
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:all    (use all GPUs for training)
Do you wish to use FP16 or BF16 (mixed precision)?
fp16    (precision type used for training)
accelerate configuration saved at /root/.cache/huggingface/accelerate/default_config.yaml    (where the config file is saved; it can be edited)

View the configuration

accelerate env

Installing diffusers

Note: install the latest version from source, otherwise the example training scripts fail their version check.

git clone https://github.com/huggingface/diffusers
cd diffusers
pip install -e .

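Data preparation

We need to filter out images that are low-resolution, poor-quality (for example, a 768*768 image smaller than 100 KB), corrupted, or unrelated to the task. Next, remove any watermarks, distracting text, and similar artifacts from the data. After that, annotation can begin.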
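
The resolution and file-size screen described above can be sketched as a small filter. This is only an illustration: `ImageInfo` and `keep_image` are hypothetical names, the thresholds simply mirror the 768*768 / 100 KB example, and in practice the width and height would come from an image library and the size from the file system.

```python
from dataclasses import dataclass


@dataclass
class ImageInfo:
    path: str
    width: int
    height: int
    size_bytes: int


def keep_image(info, min_side=768, min_bytes=100 * 1024):
    """Return True if the image passes the resolution and file-size checks."""
    if min(info.width, info.height) < min_side:
        return False  # resolution too low
    if info.size_bytes < min_bytes:
        return False  # very small file for this resolution, likely over-compressed
    return True


# Hypothetical candidates; real values would be read from the image files.
candidates = [
    ImageInfo("a.png", 1024, 1024, 900_000),
    ImageInfo("b.png", 512, 512, 300_000),   # too small in pixels
    ImageInfo("c.jpg", 768, 768, 60_000),    # 768*768 but under 100 KB
]

kept = [c.path for c in candidates if keep_image(c)]
print(kept)  # ['a.png']
```

Corrupted files, off-topic images, and watermarks still need separate checks; this filter only covers the two measurable criteria.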

Annotation can be automatic or manual. Automatic annotation mainly relies on models such as BLIP and Waifu Diffusion 1.4; manual annotation relies on human annotators.

BLIP

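BLIP supports image captioning, open-ended visual question answering, multimodal and unimodal feature extraction, and image-text matching.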

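Data notes:

For a character subject, 10-20 high-quality images are generally needed; for an art style, 100-200 high-quality images; for an abstract concept, at least 200 images.

Whether the subject is a character, an art style, or an abstract concept, make sure the dataset is diverse (for example, for a cat-girl character: variety in pose, angle, and full-body versus half-body framing).

Every image must meet our own aesthetic and quality standards.

Model notes:

1. The choice of base model is critical: an SDXL LoRA inherits much of its underlying capability and basic concepts from the base model. The base model's strengths must also match the subject being trained, whether a character, an art style, or an abstract concept. To train an anime-style (2D) LoRA, choose an anime base model; to train a photorealistic (3D) LoRA, choose a photorealistic base model; and so on.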

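Model files with the .safetensors suffix use the safetensors format, which stores only tensors and does not execute code when loaded. Files with the .ckpt suffix are pickled PyTorch checkpoints, which can execute arbitrary code when loaded, so load them only from sources you trust.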

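Model optimization

1. Pruning: a pruned model (tagged pruned) generalizes well and takes less storage.

2. EMA: EMA (exponential moving average) is a common technique for optimizing neural networks. It smooths parameter updates, reduces fluctuation and oscillation during training, and improves the model's robustness and generalization.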
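
A minimal sketch of the EMA idea, using plain Python numbers in place of real model tensors. The function name `ema_update` and the decay value are illustrative assumptions, not taken from any particular training script.

```python
def ema_update(shadow, params, decay=0.999):
    """Blend the current parameters into the running average, in place."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow


params = {"w": 0.0}
shadow = dict(params)  # the EMA copy starts equal to the weights

for step in range(1, 6):
    # A deliberately noisy weight that jumps between +1 and -1 each step.
    params["w"] = 1.0 if step % 2 else -1.0
    ema_update(shadow, params, decay=0.9)
    print(step, params["w"], round(shadow["w"], 4))
```

The raw weight swings across the full range on every step while the EMA copy stays close to zero, which is the smoothing effect described above. At inference time the EMA copy is the one typically used.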

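Fine-tuning methods

The mainstream methods for training Stable Diffusion models today are:

Full FineTune

Full-parameter training; the data takes the form of images plus captions.

DreamBooth:

DreamBooth is a training technique that updates the entire diffusion model by training on just a few images of a subject or style. It works by associating a special word in the prompt with the example images.

Textual Inversion:

Textual Inversion is a training technique for personalizing an image generation model with a few example images of what you want it to learn. It works by learning and updating text embeddings (the new embedding is tied to a special word that you must use in the prompt) to match the example images you provide.

LoRA

LoRA (Low-Rank Adaptation of Large Language Models) is a popular lightweight training technique that significantly reduces the number of trainable parameters. It works by inserting a small number of new weights into the model and training only those. This makes LoRA training faster and more memory-efficient, and it produces smaller model weights (a few hundred MB) that are easier to store and share. LoRA can also be combined with other training techniques such as DreamBooth to speed up training.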
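
To see why LoRA cuts the trainable parameter count so sharply, compare one full weight matrix with its low-rank update. The sketch assumes the standard LoRA factorization, where a frozen weight W of shape (d_out, d_in) gets an update B·A with B of shape (d_out, r) and A of shape (r, d_in). The layer size and rank below are hypothetical values chosen for illustration.

```python
def full_params(d_out, d_in):
    """Trainable values when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable values for the low-rank pair: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


d_out, d_in, rank = 1024, 1024, 4  # hypothetical layer size and rank

print(full_params(d_out, d_in))                                   # 1048576
print(lora_params(d_out, d_in, rank))                             # 8192
print(full_params(d_out, d_in) / lora_params(d_out, d_in, rank))  # 128.0
```

For this single layer, LoRA trains 128 times fewer values than full fine-tuning; the saved file contains only these small matrices, which is why LoRA weights are so much smaller than a full checkpoint.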

DreamBooth fine-tuning

Prepare the data:

https://huggingface.co/datasets/diffusers/dog-example

Training script:

export MODEL_NAME="stable-diffusion-2"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="path-to-save-model"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=768 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=400

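train_dreambooth.py: the script is located at https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py

pretrained_model_name_or_path: path to the model

instance_data_dir: location of the training images

output_dir: where the fine-tuned model is saved

instance_prompt: a prompt containing a rare token (sks here) that identifies the new subject. DreamBooth can additionally use the Stable Diffusion model itself to generate images of the existing related class as prior knowledge, and include a prior preservation loss between the original class and the new instance during training, which keeps the new instance's image features from leaking into other generations.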
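
The prior preservation idea can be sketched as a weighted sum of two losses: one on the new instance images and one on class images generated by the original model. This is a simplified illustration with plain lists standing in for noise predictions; `dreambooth_loss` is a hypothetical name. In the diffusers script the behavior is, as far as I know, opt-in through `--with_prior_preservation` and weighted by `--prior_loss_weight`, and the command above does not enable it.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_loss(instance_pred, instance_target,
                    class_pred, class_target, prior_loss_weight=1.0):
    """Instance loss plus a weighted loss on generated class images."""
    instance_loss = mse(instance_pred, instance_target)  # learn the new subject
    prior_loss = mse(class_pred, class_target)           # keep the generic class intact
    return instance_loss + prior_loss_weight * prior_loss


# Toy numbers only; real inputs are predicted and target noise tensors.
loss = dreambooth_loss(
    instance_pred=[0.5, 1.0], instance_target=[0.0, 1.0],
    class_pred=[1.0, 1.0], class_target=[1.0, 3.0],
    prior_loss_weight=0.5,
)
print(loss)  # 1.125
```

The second term penalizes the model for drifting on ordinary "dog" images while it learns "sks dog", which is what limits leakage into other generations.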

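resolution: image size; it should match the model being trained

train_batch_size: batch size per device for each forward pass

gradient_accumulation_steps:

gradient_accumulation_steps works around insufficient GPU memory by accumulating gradients.
Suppose the original batch_size is 6 and there are 24 samples in total.
Then the number of parameter updates is 24/6 = 4.

If GPU memory cannot hold a batch of 6 and the batch is reduced to 3, the number of parameter updates becomes 24/3 = 8.

With gradient_accumulation_steps=2, the effective batch is still 6, but each forward and backward pass processes a batch of 3 (6/2 = 3). Gradients from two such passes are accumulated before the optimizer steps, so the number of parameter updates stays at 24/3/2 = 4. In other words, loss.backward() computes gradients on every pass as usual, and the weights are updated only once every gradient_accumulation_steps passes.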
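
The arithmetic above, and the claim that accumulated gradients match the large-batch gradient, can be checked with a small pure-Python sketch. The one-parameter model y = w * x and the data points are hypothetical; a real training loop would do the same thing with tensors and an optimizer.

```python
def optimizer_updates(num_samples, micro_batch_size, accumulation_steps=1):
    """Number of optimizer steps in one pass over the data."""
    micro_batches = num_samples // micro_batch_size
    return micro_batches // accumulation_steps


print(optimizer_updates(24, 6))     # 4
print(optimizer_updates(24, 3))     # 8
print(optimizer_updates(24, 3, 2))  # 4


def grad(w, batch):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0), (5.0, 9.5), (6.0, 12.5)]
w = 0.0

full = grad(w, data)  # one batch of 6

accumulation_steps = 2
accumulated = 0.0
for micro in (data[:3], data[3:]):  # two micro-batches of 3
    # Scale each micro-batch gradient so the sum equals the full-batch mean.
    accumulated += grad(w, micro) / accumulation_steps

print(abs(full - accumulated) < 1e-9)  # True
```

Dividing each micro-batch gradient by the number of accumulation steps is what makes the accumulated result equal to the gradient of the full batch of 6.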

learning_rate: learning rate

lr_scheduler: learning-rate schedule

lr_warmup_steps: number of warmup steps

max_train_steps: total number of training steps

Inference:

from diffusers import StableDiffusionPipeline
import torch

# Directory written by the training run (the script's --output_dir)
model_id = "stable_finetine/path-to-save-model"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Include the rare token "sks" from instance_prompt so the fine-tuned subject is generated
prompt = "A photo of sks dog in a bucket"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]

image.save("dog-bucket2.png")

Model conversion script:

python convert_diffusers_to_original_stable_diffusion.py --model_path path-to-save-model --checkpoint_path dreambooth_dog.safetensors --use_safetensors

convert_diffusers_to_original_stable_diffusion.py: the script is located at https://github.com/huggingface/diffusers/blob/main/scripts/convert_diffusers_to_original_stable_diffusion.py

model_path: the model produced by DreamBooth training

checkpoint_path: output file name of your choice

DreamBooth + LoRA fine-tuning

Training script:

export MODEL_NAME="stable-diffusion-2"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="path-to-save-lora-model"

accelerate launch train_dreambooth_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=768 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=100 \
  --learning_rate=1e-4 \
  --report_to="wandb" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of sks dog in a bucket" \
  --validation_epochs=50 \
  --seed="0" \
  --mixed_precision "no"

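mixed_precision: the default is fp16, which raises the error: ValueError: Attempting to unscale FP16 gradients.

Change it to no, which means fp32.

train_dreambooth_lora.py: this script includes model conversion, so no separate conversion step is needed

Inference script: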

from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch

pipe = DiffusionPipeline.from_pretrained("stable-diffusion-2", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")
# pipe.unet.load_attn_procs("path-to-save-lora-model")
pipe.load_lora_weights("path-to-save-lora-model")
image = pipe("A picture of a sks dog in a bucket", num_inference_steps=25).images[0]
image.save("dog-bucket-lora.png")

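Notes:

The original documentation used:

pipe.unet.load_attn_procs("path-to-save-lora-model")

However, inference after loading this way showed no visible effect, which suggests the weights were never actually loaded. The maintainers later added a new method, which in testing does take effect:

pipe.load_lora_weights("path-to-save-lora-model")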

Full FineTune

Data format:

folder/train/metadata.jsonl
folder/train/0001.png
folder/train/0002.png
folder/train/0003.png

metadata.jsonl

{"file_name": "0001.png", "additional_feature": "This is a first value of a text feature you added to your images"}
{"file_name": "0002.png", "additional_feature": "This is a second value of a text feature you added to your images"}
{"file_name": "0003.png", "additional_feature": "This is a third value of a text feature you added to your images"}
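
A metadata.jsonl file in this layout can be generated with the standard library. The sketch below is illustrative: `write_metadata` is a hypothetical helper, the captions are made up, and the caption key defaults to "text" on the assumption that the training script is told to read that column (as with `--caption_column="text"` in the LoRA command later in this post). If you keep a key such as additional_feature, the caption column passed to the script would need to match it.

```python
import json
import os
import tempfile


def write_metadata(train_dir, captions, caption_key="text"):
    """Write one JSON object per image to <train_dir>/metadata.jsonl."""
    path = os.path.join(train_dir, "metadata.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for file_name in sorted(captions):
            record = {"file_name": file_name, caption_key: captions[file_name]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


# Hypothetical captions, e.g. produced by an automatic captioner and then reviewed.
captions = {
    "0002.png": "a dog sitting in a bucket",
    "0001.png": "a photo of a dog on grass",
}

with tempfile.TemporaryDirectory() as folder:
    train_dir = os.path.join(folder, "train")
    os.makedirs(train_dir)
    metadata_path = write_metadata(train_dir, captions)
    with open(metadata_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]

print(records)
```

Each line is a standalone JSON object whose file_name is relative to the folder that holds metadata.jsonl, matching the folder/train layout shown above.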

Or use a ready-made dataset from Hugging Face:

pokemon-blip-captions
├── data
│   └── train-00000-of-00001-566cc9b19d7203f8.parquet
└── dataset_infos.json

Training script:

export MODEL_NAME="stable-diffusion-2"
export DATASET_NAME="pokemon-blip-captions"

accelerate launch --mixed_precision="fp16"  train_full_finetune.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --use_ema \
  --resolution=768 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model"

Inference script:

import torch
from diffusers import StableDiffusionPipeline

model_path = "sd-pokemon-model"
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe.to("cuda")

image = pipe(prompt="a drawing of a pokemon stuffed animal", num_inference_steps=50).images[0]
image.save("yoda-pokemon.png")

LoRA fine-tuning

Data format:

Same as Full FineTune

Training script:

export MODEL_NAME="stable-diffusion-2"
export DATASET_NAME="pokemon-blip-captions"

accelerate launch --mixed_precision="no" train_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME --caption_column="text" \
  --resolution=768 --random_flip \
  --train_batch_size=2 \
  --num_train_epochs=100 --checkpointing_steps=5000 \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42 \
  --output_dir="sd-pokemon-model-lora" \
  --validation_prompt="cute dragon creature"

Inference script:

from diffusers import StableDiffusionPipeline
import torch

model_path = "sd-pokemon-model-lora/checkpoint-10000"
pipe = StableDiffusionPipeline.from_pretrained("stable-diffusion-2", torch_dtype=torch.float16)
pipe.load_lora_weights(model_path)
pipe.to("cuda")

prompt = "A pokemon with green eyes and red legs."
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("pokemon.png")