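Stable Diffusion text-to-image: a code walkthrough

A walkthrough of the text-to-image code path when running Stable Diffusion with diffusers.
It follows only the simplest invocation shown below and skips the handling of some parameters.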

import torch
from diffusers import DiffusionPipeline

# MODEL_PATH is a placeholder for the location of a Stable Diffusion checkpoint
pipeline = DiffusionPipeline.from_pretrained(MODEL_PATH, torch_dtype=torch.float16)
pipeline.to("cuda")
img = pipeline("An image of a squirrel in Picasso style", num_inference_steps=10).images[0]
img.save("result.jpg")

0. Define height and width

If they are not passed in, both default to the UNet sample size multiplied by the VAE scale factor: 64 * 8 = 512.
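
As a rough sketch of that default, the arithmetic can be written out in plain Python. The values below are assumptions standing in for what diffusers reads from the loaded model configs (a UNet sample size of 64 and a four-block VAE, as in Stable Diffusion v1.x), and `default_size` is a hypothetical helper, not a diffusers function.

```python
# Assumed values; in diffusers they come from the loaded model configs.
UNET_SAMPLE_SIZE = 64                          # stands in for unet.config.sample_size
VAE_BLOCK_OUT_CHANNELS = (128, 256, 512, 512)  # stands in for vae.config.block_out_channels

# The VAE downsamples between consecutive blocks, so 4 blocks give 2 ** 3 = 8.
VAE_SCALE_FACTOR = 2 ** (len(VAE_BLOCK_OUT_CHANNELS) - 1)


def default_size(height=None, width=None):
    """Fall back to sample_size * vae_scale_factor when a side is not given."""
    height = height or UNET_SAMPLE_SIZE * VAE_SCALE_FACTOR
    width = width or UNET_SAMPLE_SIZE * VAE_SCALE_FACTOR
    return height, width


print(default_size())     # (512, 512)
print(default_size(768))  # (768, 512)
```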

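1. Check the input arguments

Routine validation checks.

2. Define batch_size

batch_size is derived from prompt or prompt_embeds. With the invocation above it is 1; if several prompts are passed in one call, it equals the number of prompts.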

if prompt is not None and isinstance(prompt, str):
    batch_size = 1
elif prompt is not None and isinstance(prompt, list):
    batch_size = len(prompt)
else:
    batch_size = prompt_embeds.shape[0]

# Multiple prompts
# num_images_per_prompt controls how many images are generated for each prompt
prompt = ["An image of a squirrel in Picasso style", "Astronaut in a jungle, cold color palette"]
images = pipeline(prompt, num_images_per_prompt=1, num_inference_steps=10).images
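
To make the relationship between the two numbers concrete, here is a small sketch. `resolve_batch_size` is a hypothetical helper that mirrors the branch above, with `prompt_embeds_shape` as an assumed stand-in for `prompt_embeds.shape`; the number of images produced is `batch_size * num_images_per_prompt`.

```python
def resolve_batch_size(prompt=None, prompt_embeds_shape=None):
    """Mirror the pipeline's branch: str -> 1, list -> len, else embeds batch dim."""
    if isinstance(prompt, str):
        return 1
    if isinstance(prompt, list):
        return len(prompt)
    return prompt_embeds_shape[0]


prompts = [
    "An image of a squirrel in Picasso style",
    "Astronaut in a jungle, cold color palette",
]
num_images_per_prompt = 3

batch_size = resolve_batch_size(prompts)
total_images = batch_size * num_images_per_prompt
print(batch_size, total_images)  # 2 6
```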

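3. Encode the input prompt

By default the prompt is tokenized with CLIPTokenizer, producing ids of shape (1, 77). CLIP allows up to 75 text tokens plus two special tokens marking the start and end, '<|startoftext|>' and '<|endoftext|>', so the maximum length is 77; shorter prompts are padded to that length.
The ids are then encoded by the text encoder of openai/clip-vit-large-patch14. Its embedding dimension is 768, so the resulting prompt embedding has shape (1, 77, 768).
If no negative_prompt is passed, it defaults to the empty string "", which is still tokenized and encoded.
The negative prompt embedding therefore also has shape (1, 77, 768).
Classifier-free guidance (CFG, do_classifier_free_guidance) is on by default, because the default guidance_scale is above 1. Rather than running the UNet twice per step, the negative and positive prompt embeddings are concatenated into a single batch.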

prompt_embeds, negative_prompt_embeds = self.encode_prompt(
    prompt,
    device,
    num_images_per_prompt,
    self.do_classifier_free_guidance,
    negative_prompt,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    lora_scale=lora_scale,
    clip_skip=self.clip_skip,
)
if self.do_classifier_free_guidance:
    prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])  # (2, 77, 768)
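
The 77-token length can be illustrated without loading a tokenizer. The sketch below is a simplified stand-in for padding a prompt to the maximum length, not CLIPTokenizer itself: `pad_prompt_ids` is a hypothetical helper, the content ids are made up, and the ids 49406 and 49407 are assumed to be the start and end tokens of the CLIP vocabulary (with the end token reused for padding).

```python
BOS_ID = 49406   # assumed id of <|startoftext|>
EOS_ID = 49407   # assumed id of <|endoftext|>, also used here as the pad token
MAX_LENGTH = 77  # 75 content tokens + BOS + EOS


def pad_prompt_ids(content_ids, max_length=MAX_LENGTH):
    """Wrap content ids in BOS/EOS, truncating to 75 and padding to max_length."""
    content = list(content_ids)[: max_length - 2]
    ids = [BOS_ID] + content + [EOS_ID]
    return ids + [EOS_ID] * (max_length - len(ids))


short = pad_prompt_ids([320, 1125, 539])   # made-up content ids
long = pad_prompt_ids(range(1000, 1200))   # 200 ids, truncated to 75
empty = pad_prompt_ids([])                 # like the default negative prompt ""

print(len(short), len(long), len(empty))   # 77 77 77

# With CFG, the negative and positive embeddings are stacked along the batch axis.
embed_shape = (1, MAX_LENGTH, 768)
cfg_shape = (2 * embed_shape[0],) + embed_shape[1:]
print(cfg_shape)                           # (2, 77, 768)
```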

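4. Prepare timesteps

The timesteps are computed by whichever scheduler is in use. Stable Diffusion defaults to PNDMScheduler; with num_inference_steps=10 the schedule covers 10 inference steps. (The timesteps tensor itself can be slightly longer: PNDMScheduler with skip_prk_steps enabled, as in the Stable Diffusion v1 configs, repeats one timestep.)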

timesteps, num_inference_steps = retrieve_timesteps(
    self.scheduler, num_inference_steps, device, timesteps, sigmas
)
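
For intuition about what such a schedule looks like, the sketch below spaces the inference steps evenly over the training timesteps. It is a simplified illustration, not the scheduler's code: `leading_timesteps` is a hypothetical helper, and the 1000 training timesteps and offset of 1 are assumed to match the Stable Diffusion v1 scheduler config. A real scheduler may adjust the list further.

```python
def leading_timesteps(num_inference_steps, num_train_timesteps=1000, steps_offset=1):
    """Evenly spaced timesteps, returned from most noisy to least noisy."""
    step_ratio = num_train_timesteps // num_inference_steps
    ascending = [i * step_ratio + steps_offset for i in range(num_inference_steps)]
    return ascending[::-1]


timesteps = leading_timesteps(10)
print(timesteps)       # [901, 801, 701, 601, 501, 401, 301, 201, 101, 1]
print(len(timesteps))  # 10
```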

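5. Prepare latents

Most of the computation in SD happens in latent space, which speeds things up. A loose intuition (not strictly accurate) is that the model works on a small image and enlarges it at the end.
unet.config.in_channels is 4, and the latent height and width are the height and width arguments integer-divided by the VAE scale factor, 512 // 8 = 64, so the latents have shape (1, 4, 64, 64).
The latents are sampled with torch.randn.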

num_channels_latents = self.unet.config.in_channels
latents = self.prepare_latents(
    batch_size * num_images_per_prompt,
    num_channels_latents,
    height,
    width,
    prompt_embeds.dtype,
    device,
    generator,
    latents,
)
# Inside prepare_latents (via the randn_tensor helper), the initial noise is sampled with:
latents = torch.randn(shape, generator=generator, device=rand_device, dtype=dtype, layout=layout).to(device)
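
The shape arithmetic can be checked with a few lines of plain Python. `latent_shape` is a hypothetical helper written for this walkthrough, and the scale factor of 8 is the value assumed throughout for Stable Diffusion v1.x.

```python
def latent_shape(batch_size, num_images_per_prompt, num_channels, height, width,
                 vae_scale_factor=8):
    """Shape of the initial noise tensor in latent space."""
    return (
        batch_size * num_images_per_prompt,
        num_channels,
        height // vae_scale_factor,
        width // vae_scale_factor,
    )


shape = latent_shape(1, 1, 4, 512, 512)
print(shape)  # (1, 4, 64, 64)

# The latent tensor holds far fewer values than the decoded RGB image.
latent_values = 4 * 64 * 64
pixel_values = 3 * 512 * 512
print(pixel_values // latent_values)  # 48
```

The 48x reduction in the number of values is a large part of why denoising in latent space is cheaper than denoising pixels directly.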

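6. Other parameter handling

Omitted.

7. Reverse diffusion: removing the noise

With CFG enabled by default, the latents are duplicated as well and passed to the UNet together with the concatenated prompt_embeds to predict the noise. The prediction therefore also has two halves: the unconditional (negative_prompt) noise and the conditional (prompt) noise.
This is where CFG takes effect: the larger the guidance scale, the more the prompt influences the final predicted noise, and hence the generated image.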

noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_text - noise_pred_uncond)
# predicted noise = unconditional noise + guidance_scale * (conditional noise - unconditional noise)
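
A tiny worked example shows how the guidance scale moves the result. This is only a sketch on plain Python lists with made-up numbers; `apply_cfg` is a hypothetical helper that applies the same formula element by element.

```python
def apply_cfg(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and conditional predictions element by element."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


noise_uncond = [0.0, 1.0, -2.0]
noise_text = [1.0, 1.0, 0.0]

print(apply_cfg(noise_uncond, noise_text, 0.0))  # [0.0, 1.0, -2.0]  unconditional only
print(apply_cfg(noise_uncond, noise_text, 1.0))  # [1.0, 1.0, 0.0]   conditional only
print(apply_cfg(noise_uncond, noise_text, 7.5))  # [7.5, 1.0, 13.0]  pushed past the conditional
```

Where the two predictions agree (the middle element), the scale has no effect; where they differ, a scale above 1 extrapolates beyond the conditional prediction.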

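The scheduler's algorithm then uses this prediction to compute the previous, less noisy latents (one denoising step), which replace the current latents.
The loop runs over all timesteps (10 inference steps here) and yields the final latents.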

with self.progress_bar(total=num_inference_steps) as progress_bar:
    for i, t in enumerate(timesteps):
        if self.interrupt:
            continue
        # expand the latents if we are doing classifier free guidance
        latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents
        latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

        # predict the noise residual
        noise_pred = self.unet(
            latent_model_input,
            t,
            encoder_hidden_states=prompt_embeds,
            timestep_cond=timestep_cond,
            cross_attention_kwargs=self.cross_attention_kwargs,
            added_cond_kwargs=added_cond_kwargs,
            return_dict=False,
        )[0]
        # perform guidance
        if self.do_classifier_free_guidance:
            noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
            noise_pred = noise_pred_uncond + self.guidance_scale * (noise_pred_text - noise_pred_uncond)
        # compute the previous noisy sample x_t -> x_t-1
        latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]

8. VAE decoding

Finally, the VAE decoder turns the latents into an image.

image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[0]
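
The decoder output is not yet a displayable image. As a hedged sketch of the usual follow-up, assuming the decoded values lie roughly in [-1, 1], each value is shifted to [0, 1], clamped, and scaled to an 8-bit pixel. `to_uint8` is a hypothetical helper operating on a single float; it only illustrates the kind of conversion applied to the whole tensor before the image is returned.

```python
def to_uint8(value):
    """Map a decoder output value from [-1, 1] to an integer pixel in [0, 255]."""
    unit = min(max(value / 2 + 0.5, 0.0), 1.0)  # denormalize and clamp to [0, 1]
    return int(round(unit * 255))


print([to_uint8(v) for v in (-1.0, 0.0, 1.0, 1.7)])  # [0, 128, 255, 255]
```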

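Summary

### Summary: running Stable Diffusion text-to-image with Diffusers
#### Overview of the steps
The `diffusers` library reduces text-to-image generation with Stable Diffusion to three phases: preprocessing the inputs, model inference, and post-processing into the final image. The steps in detail: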
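1. **Loading and initializing the model**:
- Load the pretrained Stable Diffusion model with `DiffusionPipeline.from_pretrained()`, pointing it at the model path (`MODEL_PATH`).
- Move the model to the GPU with `.to("cuda")` to speed up computation.
2. **Setting the input parameters**:
- Validate and define the required parameters: image height (`height`), width (`width`), and batch size (`batch_size`).
- The batch size defaults to 1; when several prompts are passed, it is the number of prompts.
3. **Processing the input prompt**:
- Tokenize the prompt with `CLIPTokenizer` and encode it with the CLIP text encoder to produce the prompt embeddings.
- When classifier-free guidance (CFG) is enabled, the negative prompt goes through the same steps, and the negative and positive prompt embeddings are concatenated.
4. **Preparing timesteps and latents**:
- Compute the timesteps for the requested number of inference steps (`num_inference_steps`).
- Sample the initial random latent tensor in latent space; the image is generated from these latents.
5. **Reverse diffusion**:
- Loop over the timesteps, using the U-Net to remove noise from the latents step by step.
- When classifier-free guidance is enabled, combine the unconditional and conditional noise predictions to control how strongly the prompt steers the image.
- After all iterations, the latents encode the final image.
6. **VAE decoding**:
- Decode the denoised latents into an image with the variational autoencoder (VAE), recovering a viewable image from the latent representation.
7. **Output and saving**:
- Save the generated image to the chosen file path (for example "result.jpg").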
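#### Key techniques
- **Classifier-Free Guidance**: guides generation without a separate classifier by combining the unconditional and conditional noise predictions. The guidance scale controls how strongly the prompt influences the image.
- **U-Net**: the core prediction network of the diffusion model, responsible for removing noise step by step during reverse diffusion.
- **VAE decoder**: a trained VAE that converts the latent representation into actual image data.
- **Latent-space computation**: most of the computation runs in latent space at a smaller spatial size; the VAE decoder then maps the result to image resolution, which speeds up generation.
These steps show how Stable Diffusion's text-to-image generation is implemented under the hood, combining a transformer text encoder, a U-Net, and a variational autoencoder.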