AIGC: The Text-to-Image Model Stable Diffusion

1 Introduction to Stable Diffusion

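Stable Diffusion is a text-to-image model developed jointly by CompVis, Stability AI, and LAION. It was trained on a large number of 512x512 image-text pairs from a subset of LAION-5B. We only need to enter a piece of text and Stable Diffusion quickly turns it into an image; we can likewise supply an image or video and process it together with a text prompt.

The release of Stable Diffusion is a milestone in the development of AI image generation: it gave the general public a usable, high-performance model. The generated images are of very high quality, the model runs quickly, and its compute and memory requirements are relatively low. A generated image is shown below: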

Stable Diffusion Demo: demo

1.1 Components of Stable Diffusion

Stable Diffusion is not a single monolithic model; it is made up of several components and models.

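Text-understanding component: converts the text into a numeric representation that captures the ideas in the text.

Image generator: works in two stages, the image information creator and the image decoder.

The image information creator runs for multiple steps to generate image information. This is the steps parameter in Stable Diffusion interfaces and libraries, which usually defaults to 50 or 100. The image information creator works entirely in the image information space (the latent space), which makes it faster than diffusion models that work in pixel space.

The image decoder paints the picture from the information it receives from the image information creator. It runs only once, at the very end, to produce the final image.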

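The figure above is a flow diagram of Stable Diffusion. It contains the three components described above, each with its own neural network.

Text-understanding component: ClipText is the text encoder. It takes 77 tokens as input and outputs 77 token embedding vectors, each with 768 dimensions.

Image information creator: UNet + Scheduler, which processes the information step by step in the latent space. It takes the text embedding vectors and a starting multi-dimensional array made of noise as input, and outputs a processed information array.

Image decoder: an autoencoder decoder that paints the final image from the processed information array. It takes the processed information array of shape $4 \times 64 \times 64$ as input and outputs an image of shape $3 \times 512 \times 512$.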
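The shapes quoted above can be tied together with a little arithmetic. The sketch below is illustrative only: the function names are made up for this example, and the defaults of 4 latent channels and a downsampling factor of 8 are assumptions inferred from the 4 × 64 × 64 latent and the 512 × 512 output.

```python
from typing import Tuple

# Shape of the text encoder output quoted in the text:
# 77 token embeddings with 768 dimensions each.
TEXT_EMBEDDING_SHAPE = (77, 768)


def latent_shape(height: int, width: int, channels: int = 4, factor: int = 8) -> Tuple[int, int, int]:
    """Latent array shape for an image of the given pixel size.

    channels=4 and factor=8 are assumptions inferred from the text.
    """
    return (channels, height // factor, width // factor)


def image_shape(latent: Tuple[int, int, int], factor: int = 8) -> Tuple[int, int, int]:
    """Pixel-space RGB shape the decoder would produce from a latent."""
    _, h, w = latent
    return (3, h * factor, w * factor)


latent = latent_shape(512, 512)
print("latent:", latent)
print("image:", image_shape(latent))
```

Because the creator works on the small latent array rather than on the full image, each step touches far fewer values than a pixel-space model would.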

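1.2 What Is Diffusion

As described above, the "image information creator" component takes the text embedding vectors and a starting multi-dimensional array made of noise as input, and outputs the information array that the image decoder uses to paint the final image. Diffusion is the process that takes place inside the pink "image information creator" component in the figure below.

Diffusion proceeds step by step, and each step adds more relevant information. Each step operates on an input latents array and produces another latents array that better resembles the input text and all the visual information the model picked up from the images it was trained on. The figure below feeds the latents array produced at each step into the image decoder to visualize what information each step adds. The diffusion in the figure runs for 50 iterations; as the number of steps increases, the image decoded from the latents array becomes progressively clearer.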

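1.3 How Diffusion Works

The central idea of generating images with diffusion models builds on the powerful computer vision models that already exist. Given a large enough dataset, such a model can learn complex operations.

Suppose we have a photo and some randomly generated noise. We randomly choose one noise sample and add it to the image, which gives us one training example. Repeating this produces a large number of training examples that form a training set, and this dataset is used to train the noise predictor (UNet). Once training finishes, we have a high-performance noise predictor that creates images when run in a particular configuration.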
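The construction of the training set can be sketched in a few lines. This is a deliberately simplified illustration, not the noise schedule Stable Diffusion actually uses: the flat list standing in for an image, the `make_training_example` helper, and the linear `noise_level / max_level` scaling are all assumptions made for this example.

```python
import random
from typing import Dict, List


def make_training_example(image: List[float], rng: random.Random, max_level: int = 10) -> Dict[str, object]:
    """Build one (noisy image, noise level) -> noise training example."""
    noise_level = rng.randint(1, max_level)      # how much noise to add
    amount = noise_level / max_level             # hypothetical linear scaling
    noise = [amount * rng.gauss(0.0, 1.0) for _ in image]
    noisy_image = [pixel + n for pixel, n in zip(image, noise)]
    # The predictor would see noisy_image and noise_level and learn to output target_noise.
    return {"noisy_image": noisy_image, "noise_level": noise_level, "target_noise": noise}


rng = random.Random(0)
photo = [0.2, 0.5, 0.9, 0.4]                     # a tiny stand-in "image"
dataset = [make_training_example(photo, rng) for _ in range(1000)]
print(len(dataset), "training examples built from one photo")
```

Each example pairs a noisy input with the noise that was added, which is the quantity the noise predictor is trained to recover.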

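1.4 Painting Images by Removing Noise

Training on the noise dataset built as described above yields a noise predictor. Given a noisy image, the noise predictor produces a noise image, that is, its estimate of the noise. If we subtract this predicted noise from the image, we get an image that is as close as possible to the samples the model was trained on. "Close" here means close in distribution: for example, the sky is usually blue and people have two eyes. The style of the generated images tends toward the styles present in the training samples.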
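The subtract-the-predicted-noise loop can be sketched as follows. This is a toy under stated assumptions: `toy_predictor` is a hypothetical stand-in for a trained UNet that "knows" a single clean pattern, and the plain subtraction ignores the scaling a real scheduler applies at each step.

```python
from typing import Callable, List

Latents = List[float]


def denoise(latents: Latents, predict_noise: Callable[[Latents, int], Latents], steps: int = 50) -> Latents:
    """Repeatedly subtract the predicted noise from the latents."""
    for step in range(steps):
        predicted = predict_noise(latents, step)
        latents = [value - noise for value, noise in zip(latents, predicted)]
    return latents


# Hypothetical stand-in for a trained noise predictor: it reports half of the
# current deviation from the one pattern it has "learned" as noise.
LEARNED_PATTERN = [0.1, 0.8, 0.3]


def toy_predictor(latents: Latents, step: int) -> Latents:
    return [0.5 * (value - clean) for value, clean in zip(latents, LEARNED_PATTERN)]


start = [2.0, -1.5, 0.7]             # starting array made of "noise"
result = denoise(start, toy_predictor, steps=50)
print([round(v, 6) for v in result])
```

Starting from arbitrary values, the loop ends up at the pattern the predictor has learned, which mirrors how generated images drift toward the distribution of the training samples.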

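1.5 Adding Text Information to the Image Generator

The diffusion-based image generation described so far does not involve any text. However, the image generator takes both the text embedding vectors and a starting multi-dimensional array made of noise as input, so the noise predictor is adjusted to take the text into account. Training on a large amount of data then yields the image generator. The chosen text encoder together with the trained image generator makes up the complete Stable Diffusion model: given a few descriptive sentences, the model can generate a corresponding picture.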

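2 Setting Up the Runtime Environment

2.1 Installing the conda Environment

For details on preparing the conda environment, see: Anaconda

2.2 Preparing the Runtime Environment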

git clone https://github.com/CompVis/stable-diffusion.git

cd stable-diffusion

conda env create -f environment.yaml

conda activate ldm

pip install diffusers==0.12.1

2.3 Downloading the Models

(1) Download the model file "sd-v1-4.ckpt"

Model address: model

When the download completes, run the following commands

mkdir -p models/ldm/stable-diffusion-v1/

mv sd-v1-4.ckpt model.ckpt

mv model.ckpt models/ldm/stable-diffusion-v1/

(2) Download the checkpoint_liberty_with_aug.pth model

Model address: model

When the download completes, place the model in the cache folder

mv checkpoint_liberty_with_aug.pth ~/.cache/torch/hub/checkpoints/

(3) Download the clip-vit-large-patch14 model

Model address: model

The model files to download are as follows:

Create the storage directory for the model

mkdir -p openai/clip-vit-large-patch14

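When the download completes, move the downloaded files into the directory above.

(4) Download the safety_checker model

Model address: model

The model files to download are as follows:

Create the storage directory for the model files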

mkdir -p CompVis/stable-diffusion-safety-checker

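When the download completes, move the downloaded files into the directory above.

Move the preprocessor_config.json from step (3) into the current model directory: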

mv openai/clip-vit-large-patch14/preprocessor_config.json CompVis/stable-diffusion-safety-checker/
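Before running the scripts, it can help to confirm that the files from steps (1) to (4) ended up where the commands above put them. The following is a minimal sketch; `missing_model_files` is a hypothetical helper, and it assumes the `mkdir` and `mv` commands were run from the root of the cloned stable-diffusion repository. Only paths named in this section are checked, since the individual files of the clip-vit-large-patch14 download are not listed here.

```python
from pathlib import Path
from typing import List


def missing_model_files(repo_root: Path, torch_cache: Path) -> List[Path]:
    """Return the expected model paths from section 2.3 that do not exist."""
    expected = [
        repo_root / "models" / "ldm" / "stable-diffusion-v1" / "model.ckpt",
        torch_cache / "checkpoint_liberty_with_aug.pth",
        repo_root / "openai" / "clip-vit-large-patch14",
        repo_root / "CompVis" / "stable-diffusion-safety-checker" / "preprocessor_config.json",
    ]
    return [path for path in expected if not path.exists()]


if __name__ == "__main__":
    # Assumption: the current directory is the repository root.
    cache = Path.home() / ".cache" / "torch" / "hub" / "checkpoints"
    for path in missing_model_files(Path.cwd(), cache):
        print("missing:", path)
```

An empty result means every path named in this section is present; it does not verify that the downloaded files themselves are complete.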

3 Running Results

3.1 Running Text-to-Image

python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms

Result of the run

txt2img.py arguments

usage: txt2img.py [-h] [--prompt [PROMPT]] [--outdir [OUTDIR]] [--skip_grid] [--skip_save] [--ddim_steps DDIM_STEPS] [--plms] [--laion400m] [--fixed_code] [--ddim_eta DDIM_ETA]
                  [--n_iter N_ITER] [--H H] [--W W] [--C C] [--f F] [--n_samples N_SAMPLES] [--n_rows N_ROWS] [--scale SCALE] [--from-file FROM_FILE] [--config CONFIG] [--ckpt CKPT]
                  [--seed SEED] [--precision {full,autocast}]

optional arguments:
  -h, --help            show this help message and exit
  --prompt [PROMPT]     the prompt to render
  --outdir [OUTDIR]     dir to write results to
  --skip_grid           do not save a grid, only individual samples. Helpful when evaluating lots of samples
  --skip_save           do not save individual samples. For speed measurements.
  --ddim_steps DDIM_STEPS
                        number of ddim sampling steps
  --plms                use plms sampling
  --laion400m           uses the LAION400M model
  --fixed_code          if enabled, uses the same starting code across samples
  --ddim_eta DDIM_ETA   ddim eta (eta=0.0 corresponds to deterministic sampling
  --n_iter N_ITER       sample this often
  --H H                 image height, in pixel space
  --W W                 image width, in pixel space
  --C C                 latent channels
  --f F                 downsampling factor
  --n_samples N_SAMPLES
                        how many samples to produce for each given prompt. A.k.a. batch size
  --n_rows N_ROWS       rows in the grid (default: n_samples)
  --scale SCALE         unconditional guidance scale: eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))
  --from-file FROM_FILE
                        if specified, load prompts from this file
  --config CONFIG       path to config which constructs model
  --ckpt CKPT           path to checkpoint of model
  --seed SEED           the seed (for reproducible sampling)
  --precision {full,autocast}
                        evaluate at this precision
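The `--scale` help text above gives the guidance formula `eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))`. The sketch below simply evaluates that formula on plain lists; `guided_noise` is an illustrative name, and the two input lists stand in for the noise predictions made with an empty prompt and with the real prompt.

```python
from typing import List


def guided_noise(eps_empty: List[float], eps_cond: List[float], scale: float) -> List[float]:
    """Combine unconditional and text-conditioned noise predictions.

    Implements eps = eps_empty + scale * (eps_cond - eps_empty), element-wise.
    """
    return [e + scale * (c - e) for e, c in zip(eps_empty, eps_cond)]


eps_empty = [0.0, 1.0]   # prediction with an empty prompt (made-up numbers)
eps_cond = [1.0, 3.0]    # prediction with the text prompt (made-up numbers)

print(guided_noise(eps_empty, eps_cond, 0.0))   # ignores the prompt
print(guided_noise(eps_empty, eps_cond, 1.0))   # plain conditional prediction
print(guided_noise(eps_empty, eps_cond, 7.5))   # pushes further toward the prompt
```

A scale of 0 reproduces the unconditional prediction and a scale of 1 reproduces the conditional one; larger values extrapolate past the conditional prediction, in the direction the prompt suggests.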

3.2 Running Image-to-Image

Run the following command:

python scripts/img2img.py --prompt "A fantasy landscape, trending on artstation" --init-img assets/stable-samples/img2img/mountains-1.png --strength 0.8

4 Troubleshooting

4.1 Resolving the SAFE_WEIGHTS_NAME Error

Running txt2img produces the following error:

(ldm) [root@localhost stable-diffusion]# python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms 
Traceback (most recent call last):
  File "scripts/txt2img.py", line 22, in <module>
    from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/diffusers/__init__.py", line 29, in <module>
    from .pipelines import OnnxRuntimeModel
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/diffusers/pipelines/__init__.py", line 19, in <module>
    from .dance_diffusion import DanceDiffusionPipeline
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/diffusers/pipelines/dance_diffusion/__init__.py", line 1, in <module>
    from .pipeline_dance_diffusion import DanceDiffusionPipeline
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/diffusers/pipelines/dance_diffusion/pipeline_dance_diffusion.py", line 21, in <module>
    from ..pipeline_utils import AudioPipelineOutput, DiffusionPipeline
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/diffusers/pipelines/pipeline_utils.py", line 67, in <module>
    from transformers.utils import SAFE_WEIGHTS_NAME as TRANSFORMERS_SAFE_WEIGHTS_NAME
ImportError: cannot import name 'SAFE_WEIGHTS_NAME' from 'transformers.utils' (/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/utils/__init__.py)

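The traceback shows diffusers importing a name that the installed transformers does not provide. Changing the diffusers version resolves it, with the following command: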

pip install diffusers==0.12.1
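To confirm which versions are actually active in the `ldm` environment before and after the change, the installed package metadata can be queried from Python. This is a small sketch using the standard library's `importlib.metadata`; `installed_version` is a helper name chosen for this example.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The two packages involved in the import error above.
for name in ("diffusers", "transformers"):
    print(name, installed_version(name) or "not installed")
```

After the `pip install` above, the diffusers line should report 0.12.1.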

4.2 Fixing the Failure to Connect to huggingface.co

 python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms 
Traceback (most recent call last):
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/feature_extraction_utils.py", line 403, in get_feature_extractor_dict
    resolved_feature_extractor_file = cached_path(
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/utils/hub.py", line 282, in cached_path
    output_path = get_from_cache(
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/utils/hub.py", line 545, in get_from_cache
    raise ValueError(
ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scripts/txt2img.py", line 28, in <module>
    safety_feature_extractor = AutoFeatureExtractor.from_pretrained(safety_model_id)
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/models/auto/feature_extraction_auto.py", line 270, in from_pretrained
    config_dict, _ = FeatureExtractionMixin.get_feature_extractor_dict(pretrained_model_name_or_path, **kwargs)
  File "/root/anaconda3/envs/ldm/lib/python3.8/site-packages/transformers/feature_extraction_utils.py", line 436, in get_feature_extractor_dict
    raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like CompVis/stable-diffusion-safety-checker is not the path to a directory containing a preprocessor_config.json file.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

Solution:

Download the models to the local machine; see section 2.3 for the procedure.
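The error message also points to the library's offline mode. As a hedged complement to the local download, the sketch below sets the offline environment variables before any Hugging Face library is imported. `enable_offline_mode` is a hypothetical helper, and whether the library versions installed in this environment honor both variables is an assumption to verify against the offline-mode documentation linked in the error.

```python
import os
from typing import MutableMapping, Optional


def enable_offline_mode(env: Optional[MutableMapping[str, str]] = None) -> MutableMapping[str, str]:
    """Set the offline-mode variables; call this before importing transformers."""
    target = os.environ if env is None else env
    target["TRANSFORMERS_OFFLINE"] = "1"   # offline switch read by transformers
    target["HF_HUB_OFFLINE"] = "1"         # offline switch read by huggingface_hub
    return target


if __name__ == "__main__":
    enable_offline_mode()
    print({k: os.environ[k] for k in ("TRANSFORMERS_OFFLINE", "HF_HUB_OFFLINE")})
```

With the model directories from section 2.3 in place, the relative paths `CompVis/stable-diffusion-safety-checker` and `openai/clip-vit-large-patch14` resolve to local folders when the scripts are run from the repository root, so no connection should be needed.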