Abstract
This paper proposes a multilingual video captioning algorithm based on dynamic visual attention. Building on a basic encoder-decoder structure, it extracts spatiotemporal features and semantic attribute information from video clips to represent the video. In the decoding stage, a decoder composed of two long short-term memory (LSTM) layers processes the spatiotemporal and semantic information separately. An embedded attention module and a dynamic selection module allow the model to attend to the moments at which the most important information appears and to dynamically select the best information at the current time step for generating each word. On top of this network, multilingual caption generation is implemented on the large public video captioning dataset VATEX by sharing the encoder-decoder, and the accuracy of the generated captions is tested. Compared with the baseline method, the proposed method achieves better results.
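The attention and dynamic selection steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything here is an assumption made for illustration: dot-product scoring, a scalar sigmoid gate, and the hypothetical names `temporal_attention`, `dynamic_select`, and `gate_w`. It is not the authors' implementation.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(hidden, frame_feats):
    """Weight each frame feature by its relevance to the decoder state.

    Assumption: relevance is a plain dot product; the paper may use a
    learned scoring function instead.
    """
    weights = softmax([dot(hidden, f) for f in frame_feats])
    dim = len(frame_feats[0])
    context = [sum(w * f[i] for w, f in zip(weights, frame_feats))
               for i in range(dim)]
    return context, weights


def dynamic_select(hidden, visual_ctx, semantic_vec, gate_w):
    """Blend visual context and semantic information with a scalar gate.

    Assumption: the gate is sigmoid(gate_w . hidden); gate_w is a
    hypothetical learned parameter vector.
    """
    gate = 1.0 / (1.0 + math.exp(-dot(gate_w, hidden)))
    mixed = [gate * v + (1.0 - gate) * s
             for v, s in zip(visual_ctx, semantic_vec)]
    return mixed, gate


# Toy example: three frames with 2-dimensional features.
hidden = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = temporal_attention(hidden, frames)
mixed, gate = dynamic_select(hidden, context, [0.5, 0.5], [0.0, 0.0])
```

In this toy setup the frames whose features align with the decoder state receive larger attention weights, and the gate decides at each step how much of the visual context versus the semantic vector feeds word generation.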
Source
Industrial Control Computer (《工业控制计算机》)
2021, No. 7, pp. 62-64 (3 pages)
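Funding
Science and Technology Commission of Shanghai Municipality, Hong Kong, Macao and Taiwan Science and Technology Cooperation Project (18510760300)
China Postdoctoral Science Foundation (2020M681264)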
Keywords
video captioning
semantic attribute
long short-term memory
dynamic attention
multilingual