Let's try out the powerful Stable Diffusion! Working from the Stable Diffusion pipeline, we'll dig into how the model is put together~
Diffusion in Practice series:
[Diffusion in Practice] Training a diffusion model to generate an S-curve (PyTorch code walkthrough)
[Diffusion in Practice] Training a diffusion model to generate butterfly images (PyTorch code walkthrough)
[Diffusion in Practice] Guiding a diffusion model to generate images from text (PyTorch code walkthrough)
[Diffusion in Practice] Training a class-conditioned diffusion model (PyTorch code walkthrough)
Diffusion Survey series:
[Diffusion Survey] Diffusion models in medical image analysis (Part 1)
[Diffusion Survey] Diffusion models in medical image analysis (Part 2)
1. First Look at Stable Diffusion: From Text to Image
First, let's see what Stable Diffusion can actually do.
Download the pretrained pipeline: stabilityai/stable-diffusion-2-1-base (there are quite a few model files, so the download takes a while…)
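If the weights aren't on disk yet, the pipeline can also be pulled straight from the Hugging Face Hub; a minimal sketch (the Hub ID is the official repo, while the local path used below is just where I keep a copy):
from diffusers import StableDiffusionPipeline
# Downloads on first use and caches locally; later runs load from the cache
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base")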
import torch
import requests
from PIL import Image
from io import BytesIO
from matplotlib import pyplot as plt
from diffusers import StableDiffusionPipeline
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the pipeline
model_id = "E:/Code/kuosan/stable-diffusion-2-1-base"
pipe = StableDiffusionPipeline.from_pretrained(model_id).to(device)
# Fix the generator seed so results are reproducible
generator = torch.Generator(device=device).manual_seed(42)
# Run the pipeline
pipe_output = pipe(
    prompt="Palette knife painting of an autumn cityscape",  # what we want to generate
    negative_prompt="Oversaturated, blurry, low quality",    # what we want to avoid
    height=1024, width=1024,    # output image size
    guidance_scale=10,          # how strongly the prompt steers generation
    num_inference_steps=50,     # number of denoising steps per generation
    generator=generator         # seeded generator for reproducibility
)
# Show the result
plt.figure(dpi=300)
plt.imshow(pipe_output.images[0])
plt.axis('off')
plt.show()
The output image:
Changing the text prompt produces images with different styles and content:
pipe_output = pipe(
    prompt="A realistic photo of a cute giant panda eating fresh green bamboo",  # what we want to generate
    negative_prompt="Oversaturated, blurry, low quality",  # what we want to avoid
    height=480, width=640,      # output image size
    guidance_scale=10,          # how strongly the prompt steers generation
    num_inference_steps=50,     # number of denoising steps per generation
    generator=generator         # seeded generator for reproducibility
)
The generated image is shown below. I have to say, this is impressive! It even got the teeth right!
Exploring the effect of the guidance_scale parameter: guidance_scale sets the strength of classifier-free guidance. Increasing it pushes the generated content closer to the text prompt, but if it is too large the image becomes oversaturated and less visually pleasing.
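For reference, this is the classifier-free guidance combination we will implement by hand in the custom sampling loop of Section 2.5, where $s$ is guidance_scale, $c$ is the text condition, and $\varnothing$ is the unconditional (negative) prompt:
$\tilde{\varepsilon} = \varepsilon_\theta(x_t, \varnothing) + s\,(\varepsilon_\theta(x_t, c) - \varepsilon_\theta(x_t, \varnothing))$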
cfg_scales = [2, 5, 8, 11, 14]
prompt = "A cute kitten sleeping in a pile of flower petals"
fig, axs = plt.subplots(1, len(cfg_scales), figsize=(16, 5))
for i, ax in enumerate(axs):
    im = pipe(prompt, height=480, width=480,
              guidance_scale=cfg_scales[i], num_inference_steps=35,
              generator=torch.Generator(device=device).manual_seed(42)).images[0]
    ax.imshow(im)
    ax.axis('off')
    ax.set_title(f'CFG Scale {cfg_scales[i]}')
The kitten naps without a care in the world, fallen petals piling up into a brocade quilt~
In general, guidance_scale values of 8 to 12 work well; of course, visual quality is also a subjective judgment.
2. A Deeper Look at Stable Diffusion: Dissecting the Structure
What components make up the Stable Diffusion pipeline? We can print them to find out:
print(list(pipe.components.keys()))
['vae', 'text_encoder', 'tokenizer', 'unet', 'scheduler', 'safety_checker', 'feature_extractor', 'image_encoder']
Architecture of Latent Diffusion Models:
2.1 Variational Autoencoder (VAE)
The VAE compresses the input image: its encoder maps from pixel space to latent space, the diffusion process runs on the latent image features, and the VAE decoder maps from latent space back to pixel space.
# Create a test image with values in (-1, 1)
images = torch.rand(1, 3, 512, 512).to(device) * 2 - 1
print("Input images shape:", images.shape)
# Encode into latent space (0.18215 is Stable Diffusion's latent scaling factor)
with torch.no_grad():
    latents = 0.18215 * pipe.vae.encode(images).latent_dist.mean
print("Encoded latents shape:", latents.shape)
# Decode back to pixel space (undoing the scaling first)
with torch.no_grad():
    decoded_images = pipe.vae.decode(latents / 0.18215).sample
print("Decoded images shape:", decoded_images.shape)
The output:
Input images shape: torch.Size([1, 3, 512, 512])
Encoded latents shape: torch.Size([1, 4, 64, 64])
Decoded images shape: torch.Size([1, 3, 512, 512])
As you can see, the original 512×512 image is compressed into a 64×64 latent representation; working in this latent space makes the diffusion model faster and more efficient~
A question that may come up here: why does the latent have 4 channels instead of 3? They are not RGB channels; 4 is simply the latent dimensionality the VAE was trained with, chosen as a trade-off between compression and reconstruction quality.
2.2 Tokenizer and Text Encoder
The text encoder's job is to turn the input string into a numerical form that can condition the UNet. The input prompt is first tokenized, and the resulting token IDs are then fed through the text encoder to produce the embedding.
Related post: [Chinese Encoding] Implementing Chinese text embedding with the Tokenizer from bert-base-chinese
# Tokenize the input text
input_ids = pipe.tokenizer(["A painting of a flooble"])['input_ids']
print("Input ID -> decoded token")
for input_id in input_ids[0]:
    print(f"{input_id} -> {pipe.tokenizer.decode(input_id)}")
# Feed the token IDs into the CLIP text encoder
input_ids = torch.tensor(input_ids).to(device)
with torch.no_grad():
    text_embeddings = pipe.text_encoder(input_ids)['last_hidden_state']
print("Text embeddings shape:", text_embeddings.shape)
# encode_prompt pads to 77 tokens and returns (prompt_embeds, negative_prompt_embeds)
text_embeddings = pipe.encode_prompt("A painting of a flooble", device, 1, False, '')
print(text_embeddings[0].shape)
The output:
Input ID -> decoded token
49406 -> <|startoftext|>
320 -> a
3086 -> painting
539 -> of
320 -> a
4062 -> floo
1059 -> ble
49407 -> <|endoftext|>
Text embeddings shape: torch.Size([1, 8, 1024])
torch.Size([1, 77, 1024])
The sequence length is 77: up to 75 prompt tokens plus one start token and one end token, with shorter prompts padded out to the full length.
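A minimal sketch of where that fixed length comes from: the tokenizer pads (and truncates) every prompt to its model_max_length, which is 77 for CLIP:
tok = pipe.tokenizer("A painting of a flooble",
                     padding="max_length",
                     max_length=pipe.tokenizer.model_max_length,  # 77 for CLIP
                     truncation=True,
                     return_tensors="pt")
print(tok.input_ids.shape)  # torch.Size([1, 77])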
2.3 The UNet
The UNet's job is to take a noisy input and predict the noise, i.e. to perform denoising. Its inputs are the text embedding of shape [1, 77, 1024], the image latents of shape [1, 4, 64, 64], and the timestep.
# Create dummy inputs
timestep = pipe.scheduler.timesteps[0]
latents = torch.randn(1, 4, 64, 64).to(device)
text_embeddings = torch.randn(1, 77, 1024).to(device)
# Model prediction
with torch.no_grad():
    unet_output = pipe.unet(latents, timestep, text_embeddings).sample
print('UNet output shape:', unet_output.shape)
The output:
UNet output shape: torch.Size([1, 4, 64, 64])
2.4 Scheduler
The scheduler stores the recipe for how noise is added; the default scheduler here is PNDMScheduler.
The forward noising process:
$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon$
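The scheduler's add_noise method implements exactly this equation for the default PNDMScheduler; a minimal sketch, using a random tensor as a stand-in for a clean latent x_0:
# Forward process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
x0 = torch.randn(1, 4, 64, 64).to(device)   # stand-in for a clean latent
eps = torch.randn_like(x0)                  # Gaussian noise
t = torch.tensor([500], device=device)      # an intermediate timestep
noisy = pipe.scheduler.add_noise(x0, eps, t)
print(noisy.shape)  # torch.Size([1, 4, 64, 64])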
plt.figure(dpi=300)
plt.plot(pipe.scheduler.alphas_cumprod, label=r'$\bar{\alpha}$')
plt.xlabel('Timestep (low noise to high noise ->)')
plt.title('Noise schedule')
plt.legend()
The plot:
from diffusers import LMSDiscreteScheduler
# Swap in a different scheduler
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
# Print its configuration
print('Scheduler config:', pipe.scheduler)
# Generate an image with the new scheduler
pipe(prompt="Beautiful pastoral scenery, beautiful mountains and waters",
     height=480, width=480, num_inference_steps=50,
     generator=torch.Generator(device=device).manual_seed(42)).images[0]
The printed configuration:
Scheduler config: LMSDiscreteScheduler {
"_class_name": "LMSDiscreteScheduler",
"_diffusers_version": "0.26.3",
"beta_end": 0.012,
"beta_schedule": "scaled_linear",
"beta_start": 0.00085,
"clip_sample": false,
"num_train_timesteps": 1000,
"prediction_type": "epsilon",
"set_alpha_to_one": false,
"skip_prk_steps": true,
"steps_offset": 1,
"timestep_spacing": "linspace",
"trained_betas": null,
"use_karras_sigmas": false
}
The generated image:
2.5 Custom Sampling Loop
Having explored each component of Stable Diffusion, we can now write our own sampling loop that assembles them into a text-to-image pipeline:
guidance_scale = 8
num_inference_steps = 60
prompt = "A cute little monkey is standing on a tree"
negative_prompt = "zoomed in, blurry, oversaturated, warped"
# Encode the prompt (returns the unconditional and text embeddings concatenated)
text_embeddings = pipe._encode_prompt(prompt, device, 1, True, negative_prompt)
# Start from random noise
latents = torch.randn((1, 4, 64, 64), device=device, generator=generator)
latents *= pipe.scheduler.init_noise_sigma
# Prepare the scheduler
pipe.scheduler.set_timesteps(num_inference_steps, device=device)
# Sampling loop
for i, t in enumerate(pipe.scheduler.timesteps):
    # Duplicate the latents for classifier-free guidance (uncond + text)
    latent_model_input = torch.cat([latents] * 2)
    # Let the scheduler scale the input as required for this timestep
    latent_model_input = pipe.scheduler.scale_model_input(latent_model_input, t)
    # Predict the noise
    with torch.no_grad():
        noise_pred = pipe.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
    # Apply classifier-free guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
    # Compute the previous sample: x_t -> x_t-1
    latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample
# Decode the latents back to pixel space
with torch.no_grad():
    image = pipe.decode_latents(latents.detach())
# Visualize
final_image = pipe.numpy_to_pil(image)[0]
plt.figure(dpi=300)
plt.imshow(final_image)
plt.axis('off')
plt.show()
The generated image:
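One caveat: _encode_prompt and decode_latents are deprecated in recent diffusers releases. A sketch of the equivalent prompt encoding via the public encode_prompt, which returns the two embeddings separately (the concatenation order below matches the chunk(2) unpacking in the loop above):
prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(prompt, device, 1, True, negative_prompt)
text_embeddings = torch.cat([negative_prompt_embeds, prompt_embeds])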
3. Full Code
import torch
import requests
from PIL import Image
from io import BytesIO
from matplotlib import pyplot as plt
from diffusers import StableDiffusionPipeline
# =============================================================================
# First look at Stable Diffusion
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the pipeline
model_id = "E:/Code/kuosan/stable-diffusion-2-1-base"
pipe = StableDiffusionPipeline.from_pretrained(model_id).to(device)
# Fix the generator seed so results are reproducible
generator = torch.Generator(device=device).manual_seed(42)
# Run the pipeline
pipe_output = pipe(
    prompt="A realistic photo of a cute giant panda eating fresh green bamboo",  # what we want to generate
    negative_prompt="Oversaturated, blurry, low quality",  # what we want to avoid
    height=480, width=640,      # output image size
    guidance_scale=10,          # how strongly the prompt steers generation
    num_inference_steps=50,     # number of denoising steps per generation
    generator=generator         # seeded generator for reproducibility
)
# Show the result
plt.figure(dpi=300)
plt.imshow(pipe_output.images[0])
plt.axis('off')
plt.show()
# Explore the effect of guidance_scale
cfg_scales = [2, 5, 8, 11, 14]
prompt = "A cute kitten sleeping in a pile of flower petals"
fig, axs = plt.subplots(1, len(cfg_scales), figsize=(16, 5))
for i, ax in enumerate(axs):
    im = pipe(prompt, height=480, width=480,
              guidance_scale=cfg_scales[i], num_inference_steps=35,
              generator=torch.Generator(device=device).manual_seed(42)).images[0]
    ax.imshow(im)
    ax.axis('off')
    ax.set_title(f'CFG Scale {cfg_scales[i]}')
# =============================================================================
# Exploring the structure of Stable Diffusion
# VAE -------------------------------------------------------------------------
print(list(pipe.components.keys()))
# Create a test image with values in (-1, 1)
images = torch.rand(1, 3, 512, 512).to(device) * 2 - 1
print("Input images shape:", images.shape)
# Encode into latent space (0.18215 is Stable Diffusion's latent scaling factor)
with torch.no_grad():
    latents = 0.18215 * pipe.vae.encode(images).latent_dist.mean
print("Encoded latents shape:", latents.shape)
# Decode back to pixel space (undoing the scaling first)
with torch.no_grad():
    decoded_images = pipe.vae.decode(latents / 0.18215).sample
print("Decoded images shape:", decoded_images.shape)
# Tokenizer and text encoder ---------------------------------------------------
# Tokenize the input text
input_ids = pipe.tokenizer(["A painting of a flooble"])['input_ids']
print("Input ID -> decoded token")
for input_id in input_ids[0]:
    print(f"{input_id} -> {pipe.tokenizer.decode(input_id)}")
# Feed the token IDs into the CLIP text encoder
input_ids = torch.tensor(input_ids).to(device)
with torch.no_grad():
    text_embeddings = pipe.text_encoder(input_ids)['last_hidden_state']
print("Text embeddings shape:", text_embeddings.shape)
# encode_prompt returns a tuple (prompt_embeds, negative_prompt_embeds)
text_embeddings = pipe.encode_prompt("A painting of a flooble", device, 1, False, '')
print(text_embeddings[0].shape)
# UNet ------------------------------------------------------------------------
# Create dummy inputs
timestep = pipe.scheduler.timesteps[0]
latents = torch.randn(1, 4, 64, 64).to(device)
text_embeddings = torch.randn(1, 77, 1024).to(device)
# Model prediction
with torch.no_grad():
    unet_output = pipe.unet(latents, timestep, text_embeddings).sample
print('UNet output shape:', unet_output.shape)
# Scheduler --------------------------------------------------------------------
plt.figure(dpi=300)
plt.plot(pipe.scheduler.alphas_cumprod, label=r'$\bar{\alpha}$')
plt.xlabel('Timestep (low noise to high noise ->)')
plt.title('Noise schedule')
plt.legend()
from diffusers import LMSDiscreteScheduler
# Swap in a different scheduler
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
# Print its configuration
print('Scheduler config:', pipe.scheduler)
# Generate an image with the new scheduler
pipe(prompt="Beautiful pastoral scenery, beautiful mountains and waters",
     height=480, width=480, num_inference_steps=50,
     generator=torch.Generator(device=device).manual_seed(42)).images[0]
# Custom sampling loop ---------------------------------------------------------
guidance_scale = 8
num_inference_steps = 60
prompt = "A cute little monkey is standing on a tree"
negative_prompt = "zoomed in, blurry, oversaturated, warped"
# Encode the prompt (returns the unconditional and text embeddings concatenated)
text_embeddings = pipe._encode_prompt(prompt, device, 1, True, negative_prompt)
# Start from random noise
latents = torch.randn((1, 4, 64, 64), device=device, generator=generator)
latents *= pipe.scheduler.init_noise_sigma
# Prepare the scheduler
pipe.scheduler.set_timesteps(num_inference_steps, device=device)
# Sampling loop
for i, t in enumerate(pipe.scheduler.timesteps):
    # Duplicate the latents for classifier-free guidance (uncond + text)
    latent_model_input = torch.cat([latents] * 2)
    # Let the scheduler scale the input as required for this timestep
    latent_model_input = pipe.scheduler.scale_model_input(latent_model_input, t)
    # Predict the noise
    with torch.no_grad():
        noise_pred = pipe.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
    # Apply classifier-free guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
    # Compute the previous sample: x_t -> x_t-1
    latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample
# Decode the latents back to pixel space
with torch.no_grad():
    image = pipe.decode_latents(latents.detach())
# Visualize
final_image = pipe.numpy_to_pil(image)[0]
plt.figure(dpi=300)
plt.imshow(final_image)
plt.axis('off')
plt.show()
No wonder people say AI image generation is fun; I could play with this all night…