摘要
文中探究了弹幕信息协助下的视频多标签分类任务。多标签视频分类任务根据视频内容从不同角度赋予视频多个标签,与视频推荐等应用紧密相关。多标签视频数据集的高标注成本和对视频内容的多角度理解是该研究领域面临的主要问题。弹幕是一种新近出现的用户评论形式,受到了众多用户的欢迎。由于用户参与度高,弹幕视频网站的视频拥有大量用户自发添加的标签,这些标签是天然的多标签数据。文中以此构建了一个多标签视频数据集,并整理出了视频标签间的层级语义关系,该数据集在未来将公开发布。同时,弹幕文本模态包含大量与视频内容相关的细粒度信息,因此在以往视频分类工作融合视觉和音频模态的基础上,引入弹幕文本模态进行视频多标签分类研究。在基于聚类的NeXtVLAD模型、注意力Dbof模型和基于时序的GRU模型上进行实验,在增加弹幕模态后,GAP指标最高提升了23%,证明了弹幕信息对该任务具有辅助作用。此外,还探索了如何在分类中利用标签层级关系,通过构建标签关系矩阵来改造标签,进而将标签语义融入训练。实验结果表明,加入标签关系后,Hit@1指标提升了15%,因此其能优化多标签分类的效果。此外,MAP指标在细粒度小类上提升了4%,说明标签语义的引入有利于预测样本量较少的类别,具有研究价值。
This work explores the multi-label video classification task assisted by danmaku.Multi-label video classification can associate multiple tags to a video from different aspects,which can benefit video understanding tasks such as video recommendation.There are two challenges in this task,one is the high annotation cost of dataset,and the other is how to understand video from multi-aspect and multimodal perspectives.Danmaku is a new trend of online commenting.Danmaku video has lots of manual annotations added by website users for high user engagement.It can be used as classification data directly.This work collects a multi-label danmaku video dataset and builds a hierarchical label correlation structure for the first time on danmaku video data.The dataset will be released in the future.Danmaku contains informative and fine-grained interaction data with the video content.This paper introduces danmaku modality to assist classification based on previous works,most of which only combine the visual and audio modalities.This paper choses cluster-based model NeXtVLAD,attention Dbof and temporal based GRU models as baselines.Experiments show that danmaku data is helpful,which improves GAP by 0.23.This paper also explores the use of label correlation,updating the video labels by a relationship matrix to integrate the semantic information into training.Experiments show that the leverage of label correlation improves Hit@1 by 0.15.Besides,the MAP can be improved by 0.04 in fine-grained labels,which indicates that the label semantic information benefits the prediction of small classes and it is valuable to explore.
作者
陈洁婷
王维莹
金琴
CHEN Jie-ting;WANG Wei-ying;JIN Qin(School of Information,Renmin University of China,Beijing 100872,China)
出处
《计算机科学》
CSCD
北大核心
2021年第1期167-174,共8页
Computer Science
基金
国家自然科学基金(61772535)
北京市自然科学基金(4192028)
国家重点研发计划(2016YFB1001202)。
关键词
分类
多标签
弹幕
视频
标签关系
多模态
Classification
Multi-label
Danmaku
Video
Label correlation
Multi-modal