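Intro
Many style LoRA training pipelines currently caption the training images first. Does the image-captioning step actually improve a style LoRA?
In the SDXL code implementation, LoRA is inserted into the Q, K, and V linear projections of the cross-attention layers in the UNet, and cross-attention mainly governs how text tokens are aligned with the image latents. If every image has a caption, training ties the image style to all of the words in the caption; without captions, it ties the style to a single trigger word, such as 'gfzz' in this post. To find out which approach works better, the comparison experiment below was run.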
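To make the "LoRA on a linear projection" idea concrete, here is a minimal pure-Python sketch of a frozen projection with a low-rank update added on top. It is not the kohya_ss or SDXL code: the class name `LoRALinear`, the helper `matvec`, the toy 16x16 weight, and the initialization scheme are all assumptions chosen for illustration. Only the rank and alpha values (8 and 8) come from this post.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear projection W plus a trainable low-rank update B @ A (illustrative)."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weight, shape (d_out, d_in)
        # A is (rank, d_in) with small random values; B is (d_out, rank) and
        # starts at zero, so the adapter changes nothing before training.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank  # 8 / 8 = 1.0 with the settings in this post

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def num_trainable(self):
        """Only A and B are trained; the base weight stays frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy 16x16 identity projection; real Q/K/V projections are far wider.
identity = [[1.0 if i == j else 0.0 for j in range(16)] for i in range(16)]
layer = LoRALinear(identity, rank=8, alpha=8)
```

In this sketch the adapter adds rank * (d_in + d_out) trainable values per projection, which is why a rank-8 LoRA stays small compared with the base weights it modifies.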
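Comparison experiment
Dataset: 200 anime-style images are used to train a style LoRA on sdxl base 1.0. About 80% of the images are characters and the remaining 20% are landscapes, architecture, animals, and other subjects. The style resembles 3D Chinese animation such as 凡人修仙传 (A Record of a Mortal's Journey to Immortality), and every image is 1024x1024. The captions are the merged output of blip2 and the wd14 tagger.
Training script: kohya_ss
Training parameters: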
class images=None
repeat=20
instance='gfzz'
lr=0.0001
batchsize=4
epoch=20 (checkpoint-00016 is the best)
LR Scheduler='constant'
Optimizer='Adafactor'
resolution=1024,1024
network alpha=8
network rank=8
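As a rough sanity check on what these settings imply, the arithmetic below estimates the number of optimizer steps. It is a sketch under several assumptions the post does not state: a single GPU, no gradient accumulation, no regularization images, all 200 images in one 1024x1024 bucket, and checkpoint-00016 meaning the checkpoint saved after epoch 16. The function name `training_schedule` is hypothetical.

```python
import math


def training_schedule(num_images, repeats, batch_size, epochs):
    """Estimate step counts when each image is seen `repeats` times per epoch."""
    samples_per_epoch = num_images * repeats
    steps_per_epoch = math.ceil(samples_per_epoch / batch_size)
    return {
        "samples_per_epoch": samples_per_epoch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


# 200 images, repeat=20, batchsize=4, epoch=20 (values from this post).
schedule = training_schedule(num_images=200, repeats=20, batch_size=4, epochs=20)

# Assumption: checkpoint-00016 corresponds to the end of epoch 16.
best_checkpoint_steps = schedule["steps_per_epoch"] * 16

# The LoRA update is commonly scaled by alpha / rank; here that is 8 / 8.
lora_scale = 8 / 8
```

Under these assumptions one epoch is 1000 steps, the full run is 20000 steps, and the best checkpoint sits at roughly 16000 steps, with a LoRA scale of 1.0.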
This gives three models to compare:
sdxl-base: generations from the original model
sdxl-base-gfzz: sdxl-base + the style LoRA trained without captions
sdxl-base-gfzz-tag: sdxl-base + the style LoRA trained with captions
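The difference between the two LoRA variants comes down to which text each training image is paired with. The sketch below is a simplified model of that choice, not the actual kohya_ss code: it assumes a DreamBooth-style folder named `<repeats>_<tokens>` (here `20_gfzz`, a hypothetical name built from the parameters above) and a `.txt` caption file next to each image. The function names and the example path are also assumptions.

```python
from pathlib import Path


def parse_folder_name(folder_name):
    """Split a '<repeats>_<tokens>' folder name, e.g. '20_gfzz' -> (20, 'gfzz')."""
    repeats, _, tokens = folder_name.partition("_")
    return int(repeats), tokens


def resolve_caption(image_path, use_captions, caption_ext=".txt"):
    """Return the text an image is trained against in this simplified model."""
    image_path = Path(image_path)
    _, tokens = parse_folder_name(image_path.parent.name)
    caption_file = image_path.with_suffix(caption_ext)
    if use_captions and caption_file.exists():
        # Captioned variant: the style is tied to every word in the caption.
        return caption_file.read_text(encoding="utf-8").strip()
    # No-caption variant: the style is tied to the trigger word alone.
    return tokens


# Hypothetical path; with captions disabled no file is read.
trigger_only = resolve_caption("train/20_gfzz/0001.png", use_captions=False)
```

Whether the captioned run also placed the trigger word inside each caption is not stated in this post, so the sketch returns the caption file content unchanged.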
The test prompts are sampled from PartiPrompts, the Parti evaluation set, and include both basic and more complex prompts.
a rabbit
a green pepper
a portrait of an old man
a close-up photo of a wombat wearing a red backpack and raising both arms in the air. Mount is in the background.
a Saint Bernard standing up with its paws in the air. A young girl is seated on the dog's shoulders.
a man and a woman standing in the back up an old pickup truck
a wooden deck overlooking a mountain valley
a man riding a camel on the beach
a volcano spewing fish into the sky
a young man wielding a sword, moonlight sprinkled around trees, engraved sword marks, determined expression, trees, sword
negative prompt:
blurry, low quality, worst quality, ugly, duplicate, mutated body parts, extra arms, (extra heads), extra legs, fused fingers, extra fingrers, bad anatomy, bad proportions, lowres, fewer digits, cloned face, repeated person, unclear eyes, blurry eyes, malformed limbs, out of focus, cropped, monochrome, text, JPEG artifacts
Generated results
In each image, the left is sdxl-base, the middle is sdxl-base-gfzz-tag (trained with captions), and the right is sdxl-base-gfzz (trained without captions).
Seed: 40551640821; sampling: 30 steps with Euler_a.
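For anyone reproducing the comparison, the grid can be described as one job per (prompt, model) pair with the seed, step count, and sampler held fixed, so that the LoRA is the only thing that differs within a row. The sketch below only builds that job list and does not call any image-generation library. The names `build_jobs`, `MODELS`, and the dictionary keys are assumptions. Whether 'gfzz' was added to the prompt at inference time is not stated here, so prompts are passed through unchanged.

```python
from itertools import product

# Column order matches the layout described above: left, middle, right.
MODELS = ["sdxl-base", "sdxl-base-gfzz-tag", "sdxl-base-gfzz"]
SEED = 40551640821
STEPS = 30
SAMPLER = "Euler_a"


def build_jobs(prompts, negative_prompt):
    """Return one generation job per (prompt, model) pair with shared settings."""
    return [
        {
            "row": row,
            "col": col,
            "model": model,
            "prompt": prompt,
            "negative_prompt": negative_prompt,
            "seed": SEED,
            "steps": STEPS,
            "sampler": SAMPLER,
        }
        for (row, prompt), (col, model) in product(enumerate(prompts), enumerate(MODELS))
    ]


# Two of the test prompts, with a shortened negative prompt for brevity.
jobs = build_jobs(["a rabbit", "a green pepper"], "blurry, low quality")
```

Each row of the resulting grid then shares a prompt and a seed, which makes differences between the three columns attributable to the LoRA rather than to sampling noise.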