Video multimodal models: VideoChat, Video-LLaMA, Video-ChatGPT, Video-LLaVA, and others

VideoChat

VideoChat: a chatbot fine-tuned on video instruction data

https://arxiv.org/pdf/2305.06355.pdf

https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat

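VideoChat is a video-centric multimodal dialogue system. It uses open-source vision models to convert video content into text, which recasts video understanding as natural language processing (NLP) question answering.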

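The paper also introduces a video-centric multimodal instruction fine-tuning dataset. It contains thousands of videos paired with detailed textual descriptions and conversations, generated by feeding dense captions to ChatGPT in temporal order. The dataset emphasizes spatiotemporal objects, actions, events, and causal relationships, which makes it a useful resource for training video-centric multimodal dialogue systems.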

Model description

Vision models extract concepts from the video:

$$
[\mathbf{E}]_i^j = f_{\text{img}}^j(\mathbf{I}_i) \quad\text{or}\quad \mathbf{E}^j = f_{\text{vid}}^j(\mathbf{V}) \quad\text{w.r.t.}\quad \mathbf{V} = [\mathbf{I}_i]_{i=1,2,\dots,T},
$$

where $\mathbf{E}$ denotes a textual description or contextual embedding, $f_{\text{img}}^j$ (respectively $f_{\text{vid}}^j$) denotes the $j$-th image (video) model that predicts human-readable descriptions or visual features, and $\mathbf{I}$ and $\mathbf{V}$ denote an image and a video. The LLM's prediction, conditioned on the user's questions, is then decoded as:

$$
\mathbf{W}_t^a = f_{\text{llm}}(\mathbf{E} \mid \mathbf{W}_{\leq t}^q, \mathbf{W}_{<t}^a),
$$

where $\mathbf{W}_t^a$ is the LLM's answer in round $t$, $\mathbf{W}_{\leq t}^q$ is the set of all user questions up to and including round $t$, $\mathbf{W}_{<t}^a$ is the set of answers from earlier rounds, and $f_{\text{llm}}$ denotes the LLM.
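The text-based route described above can be sketched as follows: the per-frame descriptions (the E in the equations) are serialized with timestamps and placed ahead of the dialogue history, so that the answer in round t is conditioned on E, the questions up to round t, and the answers before round t. This is only a minimal sketch. The function name `build_prompt`, the prompt wording, and the sample captions are illustrative assumptions, not VideoChat's actual template.

```python
from typing import List, Tuple


def build_prompt(
    frame_captions: List[Tuple[float, str]],
    questions: List[str],
    answers: List[str],
) -> str:
    """Serialize E, the questions so far, and earlier answers (hypothetical format)."""
    if len(questions) != len(answers) + 1:
        raise ValueError("expected exactly one unanswered question")
    lines = ["Video description, in temporal order:"]
    # Sort by timestamp so the LLM sees events in the order they occur.
    for second, caption in sorted(frame_captions):
        lines.append(f"[{second:.1f}s] {caption}")
    lines.append("Dialogue:")
    # Earlier rounds: each question is paired with its answer.
    for q, a in zip(questions, answers):
        lines.append(f"User: {q}")
        lines.append(f"Assistant: {a}")
    # Current round: the question the LLM should answer next.
    lines.append(f"User: {questions[-1]}")
    lines.append("Assistant:")
    return "\n".join(lines)


prompt = build_prompt(
    frame_captions=[(4.0, "the man throws a ball"), (0.0, "a man stands in a park")],
    questions=["What is in the video?", "What happens next?"],
    answers=["A man standing in a park."],
)
print(prompt)
```

The resulting string would be passed to whichever LLM plays the role of $f_{\text{llm}}$; the model call itself is left out because it depends on the deployment.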

VideoChat(
  (visual_encoder): VisionTransformer()
  (ln_vision): LayerNorm((1408,), eps=1e-05, elementwise_affine=True)
  (Qformer): BertLMHeadModel(
   )
  (llama_model): LlamaForCausalLM(
    (lm_head): Linear(in_features=4096, out_features=32001, bias=False)
  )
  (llama_proj): Linear(in_features=768, out_features=4096, bias=True)
)
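The printout fixes the widths of the pipeline: 1408 for the ViT features, 768 for the Q-Former output, 4096 for the LLaMA hidden state, and 32001 for the vocabulary. As a quick sanity check, the parameter counts of the layers shown can be computed from these numbers alone. The helper names below are illustrative and not part of the VideoChat code.

```python
def linear_params(in_features: int, out_features: int, bias: bool) -> int:
    """Parameter count of a fully connected layer: weight matrix plus optional bias."""
    return in_features * out_features + (out_features if bias else 0)


def layernorm_params(dim: int, elementwise_affine: bool = True) -> int:
    """An affine LayerNorm has one scale and one shift per feature."""
    return 2 * dim if elementwise_affine else 0


llama_proj = linear_params(768, 4096, bias=True)    # Q-Former width -> LLaMA width
lm_head = linear_params(4096, 32001, bias=False)    # LLaMA width -> vocabulary
ln_vision = layernorm_params(1408)                  # normalizes ViT features

print(llama_proj, lm_head, ln_vision)
```

The projection layer is small next to the language model head, which is consistent with using a lightweight adapter between the Q-Former and the LLM.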

Code and results

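Loading the 7B model takes 18850 MiB of GPU memory. In the demo, two functions back the two buttons: upload_img handles Upload & Start Chat, and gradio_ask handles Send. upload_img calls image_emb, _ = self.model.encode_img(image), which turns a video sampled to 8 frames, torch.Size([24, 224, 224]) reshaped to torch.Size([1, 8, 3, 224, 224]), into features of size torch.Size([1, 96, 4096]). get_context_emb() then merges the text and image features into the LLM input, and answer() runs inference and returns output_text, output_token.cpu().numpy(), conv.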
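The merging step performed by get_context_emb() can be pictured as splitting the prompt at a video placeholder and interleaving the text embeddings with the video features. The sketch below uses plain Python lists in place of tensors. The placeholder string `<VideoHere>`, the function `interleave_segments`, and the toy embedding `toy_embed` are assumptions made for illustration and are not copied from the VideoChat code.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def interleave_segments(
    prompt: str,
    video_embs: Sequence[List[Vector]],
    embed_text: Callable[[str], List[Vector]],
    placeholder: str = "<VideoHere>",
) -> List[Vector]:
    """Replace each placeholder in the prompt with one block of video embeddings."""
    segments = prompt.split(placeholder)
    if len(segments) != len(video_embs) + 1:
        raise ValueError("number of placeholders must match number of videos")
    mixed: List[Vector] = []
    for i, segment in enumerate(segments):
        mixed.extend(embed_text(segment))   # text tokens around the video
        if i < len(video_embs):
            mixed.extend(video_embs[i])     # the video's feature tokens
    return mixed


def toy_embed(text: str) -> List[Vector]:
    """Toy embedding: one 2-d vector per whitespace-separated word."""
    return [[float(len(word)), 0.0] for word in text.split()]


video = [[-1.0, -1.0]] * 3                  # stand-in for the 96 video tokens
sequence = interleave_segments("Human: <VideoHere> describe it", [video], toy_embed)
print(len(sequence))
```

In the real model the text segments would go through the LLM's token embedding table and the result would be concatenated along the sequence dimension; the list version only shows the ordering.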

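    def encode_img(self, image):
        # Remember the caller's device so the features can be moved back to it.
        device = image.device
        if self.low_resource:
            # Low-resource mode runs the vision encoder on the CPU.
            self.vit_to_cpu()
            image = image.to("cpu")

        with self.maybe_autocast():
            T = image.shape[1]
            # use_image = True if T == 1 else False
            image = image.permute(0, 2, 1, 3, 4) # [B,T,C,H,W] -> [B,C,T,H,W]

            # ViT features, normalized, plus an all-ones attention mask over them.
            image_embeds = self.ln_vision(self.visual_encoder(image)).to(device)
            image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(device)

            # Concatenate the base and extra learnable query tokens along the token
            # dimension, then broadcast them over the batch.
            query_tokens = torch.cat([self.query_tokens, self.extra_query_tokens], dim=1)
            query_tokens = query_tokens.expand(image_embeds.shape[0], -1, -1)
            # The Q-Former cross-attends from the query tokens to the visual features.
            query_output = self.Qformer.bert(
                query_embeds=query_tokens,
                encoder_hidden_states=image_embeds,
                encoder_attention_mask=image_atts,
                return_dict=True,
            )

            # Project the Q-Former output (768) to the LLaMA hidden size (4096).
            inputs_llama = self.llama_proj(query_output.last_hidden_state)
            atts_llama = torch.ones(inputs_llama.size()[:-1], dtype=torch.long).to(image.device)
        return inputs_llama, atts_llama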
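The shapes quoted earlier can be traced through encode_img without running the model. The point worth noticing is that the output length is set by the number of query tokens, not by the number of ViT patch tokens. The sketch below works on shape tuples only. The helper names are illustrative, and the value 96 for the total number of query tokens is read off the reported output shape rather than taken from a config file.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def frames_to_batch(shape: Shape, channels: int = 3) -> Shape:
    """[T*C, H, W] -> [1, T, C, H, W], matching the 24 -> (8, 3) split."""
    tc, h, w = shape
    if tc % channels:
        raise ValueError("first dimension must be a multiple of the channel count")
    return (1, tc // channels, channels, h, w)


def permute(shape: Shape, order: Tuple[int, ...]) -> Shape:
    """Shape after a tensor permute with the given axis order."""
    return tuple(shape[axis] for axis in order)


def qformer_and_proj(batch: int, num_queries: int, llm_dim: int) -> Shape:
    """The Q-Former emits one vector per query token; llama_proj widens it to llm_dim."""
    return (batch, num_queries, llm_dim)


video = frames_to_batch((24, 224, 224))              # demo input after reshaping
vit_input = permute(video, (0, 2, 1, 3, 4))          # [B,T,C,H,W] -> [B,C,T,H,W]
llm_input = qformer_and_proj(batch=vit_input[0], num_queries=96, llm_dim=4096)
print(video, vit_input, llm_input)
```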

Video-LLaVA

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Video-ChatGPT

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

Video-LLaMA

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

A Simple LLM Framework for Long-Range Video Question-Answering

VTimeLLM: Empower LLM to Grasp Video Moments

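https://paperswithcode.com/paper/vtimellm-empower-llm-to-grasp-video-moments

Large language models (LLMs) have shown remarkable text understanding capabilities, and these have been extended into video LLMs that process video data to comprehend visual details. However, existing video LLMs can only give a coarse description of the whole video and fail to capture the precise start and end time boundaries of specific events. In this paper, we address this problem by proposing VTimeLLM, a novel video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundaries. Specifically, VTimeLLM adopts a boundary-aware three-stage training strategy: image-text pairs for feature alignment, multi-event videos to increase temporal boundary awareness, and high-quality video instruction tuning to further improve temporal understanding and align with human intent. Extensive experiments show that on fine-grained, time-related video understanding tasks such as temporal video grounding and dense video captioning, VTimeLLM significantly outperforms existing video LLMs. VTimeLLM also beats existing video LLMs on a video dialogue benchmark, demonstrating its superior cross-modal understanding and reasoning abilities.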
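To make the notion of a time boundary concrete: a predicted event segment is commonly compared with the annotated one by temporal intersection over union (IoU). The sketch below shows that generic measure. It is an assumption that this matches the scoring used for temporal video grounding in the paper, and the function name and sample segments are illustrative.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Segment, truth: Segment) -> float:
    """Overlap of two time segments divided by the length of their union."""
    start_p, end_p = pred
    start_t, end_t = truth
    if end_p < start_p or end_t < start_t:
        raise ValueError("each segment must end at or after its start")
    # The overlap is zero when the segments do not intersect.
    intersection = max(0.0, min(end_p, end_t) - max(start_p, start_t))
    union = (end_p - start_p) + (end_t - start_t) - intersection
    return intersection / union if union > 0 else 0.0


partial = temporal_iou((10.0, 20.0), (15.0, 25.0))   # 5 s overlap over a 15 s union
exact = temporal_iou((0.0, 5.0), (0.0, 5.0))
disjoint = temporal_iou((0.0, 1.0), (2.0, 3.0))
print(partial, exact, disjoint)
```

A model that only describes the whole video has no segment to score at all, which is the gap the abstract points to.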

Learning Video Representations from Large Language Models

https://paperswithcode.com/paper/learning-video-representations-from-large

ImageBind-LLM: Multi-modality Instruction Tuning

https://paperswithcode.com/paper/imagebind-llm-multi-modality-instruction

CG

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT

VideoChat(
  (visual_encoder): VisionTransformer(
    (patch_embed): PatchEmbed((proj): Conv3d(3, 1408, kernel_size=(1, 14, 14), stride=(1, 14, 14)))
    (pos_drop): Dropout(p=0.0, inplace=False)
    (blocks): ModuleList( (0-38): 39 x Block( (norm1): LayerNorm((1408,), eps=1e-06, elementwise_affine=True) + (attn): Attention() ) )
    (gmhra): ModuleList((0-7): 8 x Global_MHRA())
  )
  (ln_vision): LayerNorm((1408,), eps=1e-05, elementwise_affine=True)
  (Qformer): BertLMHeadModel(
    (bert): BertModel( (embeddings): BertEmbeddings( (word_embeddings): None + (position_embeddings): None + (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) + (dropout): Dropout(p=0.0, inplace=False) ) + (encoder): BertEncoder() )
    (cls): None
  )
  (llama_model): LlamaForCausalLM(
    (model): LlamaModel(
      (embed_tokens): Embedding(32001, 4096, padding_idx=0)
      (layers): ModuleList(
        (0-31): 32 x LlamaDecoderLayer(
          (self_attn): LlamaAttention( ) + (mlp): LlamaMLP( ) + (input_layernorm): LlamaRMSNorm() + (post_attention_layernorm): LlamaRMSNorm()
        )
      )
      (norm): LlamaRMSNorm()
    )
    (lm_head): Linear(in_features=4096, out_features=32001, bias=False)
  )
  (llama_proj): Linear(in_features=768, out_features=4096, bias=True)
)
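The patch embedding in this printout is a Conv3d with kernel and stride (1, 14, 14), so each frame is cut into non-overlapping 14 x 14 patches and frames are not merged along time. Under the assumptions of no padding (none is shown in the printout) and the 8-frame, 224 x 224 input from the demo, the number of patch tokens can be computed as below. The helper names are illustrative, and the count excludes any class token or other extra tokens the encoder may add.

```python
from typing import Tuple


def conv_output_size(size: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


def patch_token_count(
    frames: int,
    height: int,
    width: int,
    kernel: Tuple[int, int, int] = (1, 14, 14),
    stride: Tuple[int, int, int] = (1, 14, 14),
) -> int:
    """Number of positions produced by the 3D patch embedding."""
    t = conv_output_size(frames, kernel[0], stride[0])
    h = conv_output_size(height, kernel[1], stride[1])
    w = conv_output_size(width, kernel[2], stride[2])
    return t * h * w


tokens = patch_token_count(frames=8, height=224, width=224)
print(tokens)  # 8 frames x 16 x 16 patches
```

The Q-Former then compresses these patch tokens, each 1408 wide, into the fixed set of query outputs that llama_proj maps to 4096.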