Paper: https://arxiv.org/pdf/2311.16933
Code: https://guoyww.github.io/projects/SparseCtrl
MOTIVATION
Relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community therefore leverages dense structure signals, e.g., per-frame depth or edge sequences, to enhance controllability, but collecting such signals increases the burden at inference time.
Textual prompts, being inherently abstract expressions, struggle to accurately define complex structural attributes such as spatial layouts, poses, and shapes.
For precise output control, existing works require temporally dense structural map sequences, meaning users must furnish a condition map for every frame of the generated video, which increases the practical cost.
CONTRIBUTION
We introduce SparseCtrl, an efficient approach that controls text-to-video generation via temporally sparse condition maps with an add-on encoder.
More specifically, to control the synthesis we apply the philosophy of ControlNet [62], which implements an auxiliary encoder while preserving the integrity of the original generator. This design allows us to incorporate additional conditions by training only the encoder network on top of the pre-trained T2V model, eliminating the need for comprehensive retraining.
This design facilitates control not only over the original T2V model but also over derived personalized models when combined with the plug-and-play motion module of AnimateDiff. To achieve this, we design a condition encoder equipped with temporal-aware layers that propagate the sparse condition signals from conditioned keyframes to unconditioned frames.
RELATED WORKS
Text-to-video diffusion models: lack fine-grained controllability over the synthesized results
Early attempts focused on training T2V models from scratch.
For example, Video Diffusion Model extends the standard image architecture to accommodate video data and trains on images and videos jointly.
Imagen Video uses a cascaded structure for high-resolution T2V generation.
Make-A-Video uses a text-image prior model to reduce the reliance on paired text-video data.
Building T2V models on top of T2I models:
Some researchers build T2V models on top of powerful text-to-image (T2I) models such as Stable Diffusion,
adding extra layers to model inter-frame motion and consistency.
Characteristics of specific T2V models:
MagicVideo adopts a causal design and performs training in a compressed latent space to reduce computational demands.
Align-Your-Latents effectively turns a T2I model into a video generator by aligning independently sampled noise maps.
AnimateDiff uses a pluggable motion module to enable high-quality animation creation on top of personalized image backbones.
Controllable T2V generation: how to resolve the ambiguity of prompts
High-level video motion control:
LoRA-layer learning: some works propose learning LoRA layers to adapt to specific motion patterns.
Trajectories, motion vectors, or pose sequences: other methods use extracted trajectories, motion vectors, or pose sequences to control video motion.
Managing specific synthesized keyframes:
Separate image encoding: keyframe images are encoded separately and passed to the generator.
Concatenation with the noised input: keyframe information is concatenated with the noised input to strengthen the control signal.
Multi-level feature injection: multi-level feature injection is used to control keyframes.
Fine-grained spatial structure control:
Monocular depth sequences: Gen-1, for example, uses monocular depth sequences as structural guidance.
Sketch and depth sequence encoding: VideoComposer encodes sketches and depth sequences through a shared encoder, enabling flexible combination at inference time.
Limitations of existing methods: although the methods above achieve a degree of fine-grained control, they typically require a condition for every synthesized frame, which is too costly in practice.
Add-on network for additional control
Training a foundational T2I/T2V generative model is computationally demanding, so to add extra control without retraining the whole model, a preferred approach is to train an additional condition encoder while keeping the original backbone intact.
ControlNet's innovation: ControlNet [62] pioneered training a pluggable condition encoder on top of a pre-trained T2I model. It creates a trainable copy of the pre-trained layers that accepts the condition input; the encoder's output is then reintegrated into the T2I model through zero-initialized layers.
T2I-Adapter and IP-Adapter: T2I-Adapter [30] injects control through a lightweight structure, while IP-Adapter [58] incorporates style conditions by converting a reference image into supplementary embeddings and concatenating them with the text embeddings.
The authors' goal is to enhance control over the generation process through an add-on encoder network, rather than making large-scale changes to the model architecture or retraining it.
METHODS-SparseCtrl
T2V Diffusion Models
T2V generators: Recent T2V models [4, 14, 61] typically extend a pre-trained T2I generator to video by incorporating temporal layers between the 2D image layers. This arrangement enables cross-frame information exchange, thereby effectively modeling cross-frame motion and temporal consistency.
Training objective:
$$\mathbb{E}_{z_0^{1:N},\,c_t,\,\epsilon,\,t}\left[\|\epsilon-\epsilon_\theta(\alpha_t z_0^{1:N}+\sigma_t\epsilon,\,c_t,\,t)\|_2^2\right]$$
where:
$z_0^{1:N}$ denotes the clean RGB video (or its latent features) without added noise, and $N$ is the number of video frames;
$c_t$ is the embedding of the text description;
$\epsilon$ is sampled Gaussian noise with the same shape as $z_0^{1:N}$;
$\alpha_t$ and $\sigma_t$ are terms controlling the strength of the added noise, varying with the diffusion step $t$;
$t$ is the uniformly sampled diffusion step, $t = 1, \ldots, T$, where $T$ is the total number of diffusion steps;
$\epsilon_\theta$ denotes the noise predicted by the model parameterized by $\theta$.
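A minimal PyTorch sketch of this noise-prediction objective for intuition (the names `model`, `alphas`, and `sigmas` are illustrative placeholders, not the authors' code):

```python
import torch

def diffusion_loss(model, z0, text_emb, alphas, sigmas, num_steps):
    """One training step of the epsilon-prediction objective.

    z0:       clean video latents, shape (B, N, C, H, W)
    text_emb: text-prompt embeddings c_t
    alphas, sigmas: per-step noise schedule, each of shape (num_steps,)
    """
    b = z0.shape[0]
    # sample a diffusion step t uniformly for each video in the batch (0-indexed here)
    t = torch.randint(0, num_steps, (b,), device=z0.device)
    eps = torch.randn_like(z0)                    # Gaussian noise epsilon
    a = alphas[t].view(b, 1, 1, 1, 1)
    s = sigmas[t].view(b, 1, 1, 1, 1)
    z_t = a * z0 + s * eps                        # noised sample alpha_t * z0 + sigma_t * eps
    eps_pred = model(z_t, text_emb, t)            # epsilon_theta
    return torch.mean((eps - eps_pred) ** 2)      # MSE between true and predicted noise
```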
Sparse Condition Encoder
ControlNet: In the T2I domain, ControlNet successfully adds structure control to a pre-trained image generator by replicating part of the pre-trained model and its input, adding the conditions, and reintegrating the output back into the original model through zero-initialized layers.
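A minimal sketch of the zero-initialization trick (hypothetical helper, assuming PyTorch): the zero-initialized projection makes the control branch a no-op at the start of training, so the frozen backbone's behavior is preserved.

```python
import torch.nn as nn

def zero_module(module: nn.Module) -> nn.Module:
    """Zero-initialize a layer so the control branch initially adds nothing to the backbone."""
    for p in module.parameters():
        nn.init.zeros_(p)
    return module

# e.g. the projection that feeds encoder features back into the frozen generator
zero_conv = zero_module(nn.Conv2d(320, 320, kernel_size=1))
```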
Limited controllability of a frame-wise encoder
Naive attempt: we first try training a ControlNet-like encoder to incorporate sparse condition signals. We build a frame-wise encoder akin to ControlNet, replicate it across the temporal dimension, and add the conditions to the desired keyframes through this auxiliary structure. For frames that are not directly conditioned, we input a zero image to the encoder and indicate the unconditioned state through an additional mask channel.
Result: this encoder fails to maintain temporal consistency with sparse input conditions, e.g., in the image-animation scenario where only the first frame is conditioned. In such cases, only the keyframes react to the condition, leading to abrupt content changes between the conditioned and unconditioned frames.
Condition propagation across frames.
Assumption: the problem above arises because the T2V backbone has difficulty inferring the intermediate condition states for the unconditioned frames.
Solution: add temporal layers to the sparse condition encoder that allow the condition signal to propagate from frame to frame.
Result: different frames within a video clip share similarities in both appearance and structure. The temporal layers can thus propagate such implicit information from the conditioned keyframes to the unconditioned frames, thereby enhancing consistency.
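The paper does not spell out the exact layer design here; below is a hedged PyTorch sketch of one plausible temporal layer (self-attention over the frame axis) that would let keyframe features reach unconditioned frames:

```python
import torch
import torch.nn as nn

class TemporalMixer(nn.Module):
    """Hypothetical temporal layer: self-attention along the frame axis, so features
    from conditioned keyframes can propagate to unconditioned frames."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C, H, W); attend over the N frames at every spatial location
        b, n, c, h, w = x.shape
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, n, c)
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed)
        tokens = tokens + out                      # residual connection
        return tokens.reshape(b, h, w, n, c).permute(0, 3, 4, 1, 2)
```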
Quality degradation caused by manually noised latents
PROBLEM
Although the sparse condition encoder with temporal layers can handle sparse inputs, it sometimes degrades the visual quality of the generated videos.
The input to the original ControlNet encoder is the sum of the condition (after the zero-initialized layers and the U-Net encoder) and the noised sample $z_t$.
However, for the unconditioned frames in our setting, the only informative input to the sparse encoder is the noised sample.
This might encourage the sparse encoder to overlook the condition maps and rely on the noised sample $z_t$ during training, which contradicts our goal of enhancing controllability.
SOLUTION
Our proposed sparse encoder eliminates the noised-sample input and only accepts the condition maps $[c_s, m]$ after concatenation.
$m$: binary mask, $m \in \{0, 1\}^{h \times w}$
$c_s$: condition signal
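A small shape-level sketch of this input change (tensor sizes are illustrative assumptions): in the original ControlNet the control branch consumes the noised sample $z_t$ together with the condition, whereas the sparse encoder drops $z_t$ and only consumes the channel-wise concatenation $[c_s, m]$.

```python
import torch

N, C_cond, H, W = 16, 3, 64, 64   # illustrative clip length and condition-map size

c_s = torch.zeros(1, N, C_cond, H, W)   # per-frame condition maps (zero images when unconditioned)
m   = torch.zeros(1, N, 1, H, W)        # binary mask channel: 1 = conditioned frame
m[:, 0] = 1.0                           # e.g. image animation: only the first frame is conditioned

sparse_encoder_input = torch.cat([c_s, m], dim=2)   # shape (1, N, C_cond + 1, H, W); no z_t involved
```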
Unifying sparsity via masking (unify different sparsity patterns with a single model)
Use zero images as the input placeholder for unconditioned frames.
Concatenate a binary mask sequence to the input conditions.
Concretely, we concatenate a mask $m \in \{0, 1\}^{h \times w}$ channel-wise with the condition signal $c_s$ at each frame to form the input of the sparse encoder.
$m = 0$ indicates the current frame is unconditioned;
$m = 1$ indicates the current frame is conditioned.
In this way, different sparse input cases can be represented with a unified input format.
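A minimal sketch of assembling this unified input for an arbitrary set of keyframes (the helper name and shapes are assumptions for illustration, not the authors' code):

```python
import torch

def build_sparse_encoder_input(conds: dict[int, torch.Tensor],
                               num_frames: int, c: int, h: int, w: int) -> torch.Tensor:
    """Hypothetical helper: assemble the unified sparse-encoder input for one clip.

    `conds` maps a frame index to its condition map of shape (c, h, w) (sketch, depth, or RGB).
    Unconditioned frames keep a zero image and a zero mask channel.
    Returns a tensor of shape (num_frames, c + 1, h, w).
    """
    x = torch.zeros(num_frames, c + 1, h, w)
    for idx, cond in conds.items():
        x[idx, :c] = cond   # condition signal c_s on a keyframe
        x[idx, c] = 1.0     # mask channel m = 1 marks the frame as conditioned
    return x

# e.g. a 16-frame clip conditioned on sketches of the first and last frames
inp = build_sparse_encoder_input({0: torch.rand(3, 256, 256), 15: torch.rand(3, 256, 256)},
                                 num_frames=16, c=3, h=256, w=256)
```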
Multiple Modalities and Applications
Sketch-to-video generation.
With SparseCtrl, users can supply any number of sketches to shape the video content. For instance, a single sketch can establish the overall layout of the video, while sketches of the first, last, and selected intermediate frames can define coarse motion.
Depth-guided generation
Users can render a video by directly exporting sparse depth maps from engines or 3D representations [29], or conduct video translation using depth as an intermediate representation.
Image animation and transition; video prediction and interpolation
Within the context of RGB video, numerous tasks can be unified into a single problem of video generation with RGB image conditions. In this scheme, image animation corresponds to video generation conditioned on the first frame; transition is conditioned on the first and last frames; video prediction is conditioned on a small number of beginning frames; and interpolation is conditioned on uniformly spaced sparse keyframes.
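These tasks differ only in which frames receive an RGB condition; a toy sketch (the frame counts and spacing are assumptions for illustration, not values from the paper):

```python
def keyframe_indices(task: str, n: int) -> list[int]:
    """Which frames of an n-frame clip carry an RGB condition for each task."""
    if task == "animation":      # condition on the first frame only
        return [0]
    if task == "transition":     # condition on the first and last frames
        return [0, n - 1]
    if task == "prediction":     # condition on a few beginning frames
        return [0, 1, 2]
    if task == "interpolation":  # condition on uniformly spaced keyframes
        return list(range(0, n, 4))
    raise ValueError(f"unknown task: {task}")
```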
Experiments
Training
Training objective:
The training objective of SparseCtrl is the same as the formula above,
$$\mathbb{E}_{z_0^{1:N},\,c_t,\,\epsilon,\,t}\left[\|\epsilon-\epsilon_\theta(\alpha_t z_0^{1:N}+\sigma_t\epsilon,\,c_t,\,t)\|_2^2\right],$$
i.e., predicting the noise added to the clean RGB video or latent features.
The only difference during training is that the proposed sparse condition encoder is integrated into the pre-trained T2V backbone.
Training strategy:
To help the condition encoder learn strong controllability, the authors adopt a simple strategy of randomly masking out some conditions during training.
In each iteration, a number $N_c$ between 1 and $N$ is first sampled at random to determine how many frames receive a condition.
Then $N_c$ indices are drawn without replacement from the set $\{1, 2, \ldots, N\}$, and the conditions are kept for the corresponding frames.
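A short sketch of this sampling strategy in Python (the function name is illustrative):

```python
import random

def sample_conditioned_frames(num_frames: int) -> list[int]:
    """Draw N_c uniformly from {1, ..., N}, then pick N_c frame indices without replacement."""
    n_c = random.randint(1, num_frames)               # how many frames keep their condition
    return sorted(random.sample(range(num_frames), n_c))

# e.g. for a 16-frame training clip
print(sample_conditioned_frames(16))
```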
Main Results
The authors report performance under different masking ratios ($r_{mask}$): as the control becomes sparser, SparseCtrl maintains error rates comparable to the dense-control baselines. In contrast, the combination of AnimateDiff with ControlNet using frame-wise control signals shows increasing error as the control becomes sparser, suggesting that it tends to ignore the condition signals under sparse control.