The Principles Behind Stable Diffusion (Latent Diffusion Models)

Preface

First blog post of 2023. Happy New Year, everyone~

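This time let's look at the principles behind Stable Diffusion, namely the paper High-Resolution Image Synthesis with Latent Diffusion Models.
The works covered in earlier posts only operate at $256 \times 256$ pixels (images are resized to this size before being fed to the model), or even lower.
Latent Diffusion Models, however, reach $512 \times 512$, and the generated quality is better as well.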

As with the earlier posts, this one analyzes the work from two angles, the paper and the code. This post will keep being updated…

DDPM: Principles and Code Walkthrough
IDDPM: Principles and Code Walkthrough
DDIM: Principles and Code (Denoising diffusion implicit models)
Classifier Guided Diffusion

Theory

Abstract

(1) In the abstract, the authors point out that earlier diffusion models can also reach SOTA, but at the cost of enormous compute.
“However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations.”

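(2) So the authors came up with a workaround, which is also where the "latent" in the model's name comes from: instead of working on raw pixels, let the diffusion model learn in a latent space (roughly, the space of a feature map).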
“we apply them in the latent space of powerful pretrained autoencoders.”
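Concretely, the image is passed through an encoder (which can be a CNN) to obtain a feature map, the standard diffusion process is run on that feature map, and finally a decoder maps the result back to pixel space.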
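To get a feel for the savings, the small calculation below compares how many values the diffusion model has to process per image in pixel space versus latent space. The downsampling factor of 8 and the 4 latent channels are assumptions borrowed from the commonly used Stable Diffusion v1 configuration, not something stated above, and the helper name `num_values` is made up for illustration.

```python
def num_values(height, width, channels):
    """Number of scalar values in a tensor of shape (channels, height, width)."""
    return height * width * channels

# Pixel space: a 512x512 RGB image.
pixel_values = num_values(512, 512, 3)

# Latent space, ASSUMING a downsampling factor f = 8 and 4 latent channels.
f, latent_channels = 8, 4
latent_values = num_values(512 // f, 512 // f, latent_channels)

ratio = pixel_values / latent_values
print(pixel_values, latent_values, ratio)  # 786432 16384 48.0
```

Under these assumptions, every denoising step touches about 48 times fewer values than it would in pixel space.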

(3) The advantages are clear:
Our latent diffusion models (LDMs) achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including text-to-image synthesis, unconditional image generation and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.

Introduction

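(1) In the introduction, the authors observe that learning in likelihood-based models can be roughly divided into two stages. One is perceptual, covering image texture and fine detail; the other is semantic, for example a handsome man turning into a beautiful woman.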
As with any likelihood-based model, learning can be roughly divided into two stages: First is a perceptual compression stage which removes high-frequency details but still learns little semantic variation. In the second stage, the actual generative model learns the semantic and conceptual composition of the data (semantic compression).

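So the authors first look for a perceptually equivalent representation, giving up a little texture fidelity in exchange for the ability to generate high-resolution images ($512 \times 512$).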

“Compared to pixel-based diffusion approaches, we also significantly decrease inference costs.”

Method

(1) The image is passed through an encoder to obtain the feature $z$, i.e.
$z = E(x)$

What happens in between is ordinary DDPM, except that what gets denoised is $z$ rather than $x$.
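Written out, the training objective keeps the usual noise-prediction form of DDPM, with the latent taking the place of the image. The block below follows the unconditional objective in the LDM paper, written with this post's notation $E$ for the encoder; $z_t$ denotes the noised version of $z = E(x)$ at step $t$.

```latex
L_{LDM} := \mathbb{E}_{E(x),\, \epsilon \sim \mathcal{N}(0, 1),\, t}
\left[ \left\| \epsilon - \epsilon_\theta(z_t, t) \right\|_2^2 \right]
```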

Finally, the decoder returns the predicted $\hat{x}$:
$\hat{x} = D(\hat{z})$
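The encode, diffuse, decode flow can be sketched with NumPy as below. This is only a shape-level illustration under loud assumptions: `encode` and `decode` are toy stand-ins (average pooling and nearest-neighbour upsampling) for the pretrained autoencoder, the linear beta schedule is the DDPM default, and the "noise prediction" simply reuses the true noise in place of a trained $\epsilon_\theta$. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, f=8):
    """Toy stand-in for the encoder E: average-pool each f x f patch."""
    c, h, w = x.shape
    return x.reshape(c, h // f, f, w // f, f).mean(axis=(2, 4))

def decode(z, f=8):
    """Toy stand-in for the decoder D: nearest-neighbour upsampling by f."""
    return z.repeat(f, axis=1).repeat(f, axis=2)

def q_sample(z0, t, alpha_bar, noise):
    """DDPM forward process applied to the latent: z_t ~ q(z_t | z_0)."""
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * noise

# Linear beta schedule, as in DDPM (an assumption for this sketch).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

x = rng.random((3, 512, 512))            # fake image
z = encode(x)                            # z = E(x), shape (3, 64, 64)

t = 500
noise = rng.standard_normal(z.shape)
z_t = q_sample(z, t, alpha_bar, noise)   # noised latent that eps_theta would see

# Placeholder for a trained eps_theta(z_t, t): we cheat and use the true noise.
eps_pred = noise
z_hat = (z_t - np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_bar[t])

x_hat = decode(z_hat)                    # x_hat = D(z_hat), shape (3, 512, 512)
print(z.shape, z_t.shape, x_hat.shape)   # (3, 64, 64) (3, 64, 64) (3, 512, 512)
```

The point is that the diffusion part only ever sees tensors of the latent shape; the full-resolution image appears only at the two ends of the pipeline.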

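(2) If conditioning is needed (Conditioning Mechanisms), the feature of the condition can be fed in as well:
$\epsilon_\theta(z_t, t, y)$, where $y = E_c(x_c)$
For example, to condition on text, first run the text through a text encoder to get text features, then feed them into the U-Net's condition embedding, usually by adding them to or concatenating them with the step embedding. This is the usual way conditional DDPMs work.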
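The add-or-concatenate conditioning described above can be sketched as follows. The sinusoidal step embedding is the common Transformer-style formulation, the condition vector `y` is random noise standing in for the output of a real condition encoder, and the embedding size of 128 is arbitrary; all of these are assumptions for illustration.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of the diffusion step t, shape (dim,)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

rng = np.random.default_rng(0)
dim = 128

t_emb = timestep_embedding(t=250, dim=dim)   # step embedding
y = rng.standard_normal(dim)                 # stand-in for y = E_c(x_c)

# Option 1: add the condition to the step embedding (dimensions must match).
emb_add = t_emb + y

# Option 2: concatenate them and let later layers mix the two.
emb_cat = np.concatenate([t_emb, y])

print(emb_add.shape, emb_cat.shape)  # (128,) (256,)
```

Either way the condition is squeezed into a single vector, which works for something like a class label but loses the token-level structure of a text prompt.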

But the authors consider this unsatisfactory: "however, combining the generative power of DMs with other types of conditionings beyond class-labels [15] or blurred variants of the input image [72] is so far an under-explored area of research."

The paper therefore introduces a cross-attention mechanism to inject the condition into the U-Net.
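A cross-attention layer of this kind computes $\mathrm{softmax}(QK^T / \sqrt{d})\, V$, where the queries $Q$ come from the U-Net's intermediate features and the keys $K$ and values $V$ come from the encoded condition $\tau_\theta(y)$. The single-head NumPy sketch below illustrates this; the sizes, random weights, and function names are assumptions for illustration only, and the code uses a row-vector convention (features times weight matrix).

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phi, cond, W_q, W_k, W_v):
    """
    phi:  (N, d_phi)  flattened U-Net feature map (N spatial positions)
    cond: (M, d_tau)  encoded condition tau_theta(y), e.g. M text tokens
    """
    Q = phi @ W_q                  # queries come from the image latent features
    K = cond @ W_k                 # keys come from the condition
    V = cond @ W_v                 # values come from the condition
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (N, M)
    return weights @ V, weights                        # (N, d), (N, M)

rng = np.random.default_rng(0)
N, M, d_phi, d_tau, d = 64, 77, 320, 768, 40   # illustrative sizes only
phi = rng.standard_normal((N, d_phi))
cond = rng.standard_normal((M, d_tau))
W_q = rng.standard_normal((d_phi, d)) / np.sqrt(d_phi)
W_k = rng.standard_normal((d_tau, d)) / np.sqrt(d_tau)
W_v = rng.standard_normal((d_tau, d)) / np.sqrt(d_tau)

out, weights = cross_attention(phi, cond, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (64, 40) (64, 77)
```

Each spatial position gets its own weighting over the condition tokens, so different regions of the latent can attend to different words of the prompt.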

Here $\tau_\theta$ is the encoder that processes the prompt $y$; for example, when $y$ is text, $\tau_\theta$ is a text encoder. Finally, $\epsilon_\theta$ and $\tau_\theta$ are updated jointly via the following objective: