Abstract
Video question answering (VideoQA) is an important and challenging task in visual understanding. Current visual question answering (VQA) methods mainly focus on a single static image, whereas real-world visual data takes the form of dynamic, sequential video. In addition, because textual questions are diverse and complex, a VideoQA model must handle multiple kinds of visual features appropriately for each question in order to produce high-quality answers. This paper presents a multi-shared attention network that uses local and global frame-level visual information for VideoQA. Specifically, video frames are sampled at different frame rates, and a two-pathway model extracts global and local frame-level features from them; each of the two feature types comprises multiple frame-level features and is used to model the temporal dynamics of the video. The two pathways are fused through multi-shared attention, which applies the same attention function to model the correlation between the global and local visual features, and the result is combined with the textual question to infer the answer. Extensive experiments on the Tianchi VideoQA dataset validate the effectiveness of the proposed method.
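The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter-free scaled dot-product scoring stands in for whatever learned attention function the paper uses, the assignment of the low frame rate to the global pathway and the high frame rate to the local pathway is an assumption, and all names (`sample_indices`, `shared_attention`, the toy features) are hypothetical.

```python
import math


def sample_indices(num_frames, stride):
    """Pick every `stride`-th frame index; a smaller stride means a higher frame rate."""
    return list(range(0, num_frames, stride))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shared_attention(frame_feats, question):
    """Question-guided attention over frame features.

    Assumption: scaled dot-product scoring with no learned weights.
    The same function is called for both pathways, which is the
    "shared" part of the sketch.
    """
    scale = math.sqrt(len(question))
    scores = [sum(f * q for f, q in zip(feat, question)) / scale
              for feat in frame_feats]
    weights = softmax(scores)
    dim = len(frame_feats[0])
    pooled = [sum(w * feat[i] for w, feat in zip(weights, frame_feats))
              for i in range(dim)]
    return pooled, weights


# Toy data: 8 frames, each with a 2-dimensional feature vector (hypothetical).
num_frames = 8
video = [[float(t), 1.0] for t in range(num_frames)]
question = [0.5, -0.5]  # stand-in for an encoded textual question

# Assumed pathway setup: sparse sampling for global, dense sampling for local.
global_feats = [video[i] for i in sample_indices(num_frames, 4)]
local_feats = [video[i] for i in sample_indices(num_frames, 1)]

g_vec, g_w = shared_attention(global_feats, question)
l_vec, l_w = shared_attention(local_feats, question)

# Fuse the two pathways by concatenating their attended vectors.
fused = g_vec + l_vec
```

In a full model the fused vector would be passed, together with the question encoding, to an answer classifier; that stage is omitted here because the abstract does not specify it.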
Authors
王雷全
候文艳
袁韶祖
赵欣
林瑶
吴春雷
WANG Lei-quan; HOU Wen-yan; YUAN Shao-zu; ZHAO Xin; LIN Yao; WU Chun-lei (College of Computer Science and Technology, China University of Petroleum, Qingdao, Shandong 266555, China; College of Oceanography and Space Informatics, China University of Petroleum, Qingdao, Shandong 266555, China)
Source
《计算机科学》
CSCD
Peking University Core Journals (北大核心)
2021, Issue 8, pp. 145-149 (5 pages)
Computer Science
Funding
Key R&D Program of the Ministry of Science and Technology (2018YFC1406204); Fundamental Research Funds for the Central Universities (19CX05003A-11).
Keywords
Video question answering
Shared attention mechanism
Global and local frame-level features
Global and local pathways