摘要
针对现有局部近重复视频检测算法特征存储消耗大、整体查询效率低、提取特征时并未考虑近重复帧之间细微的语义差异等问题,文中提出了一种基于Transformer紧凑编码的局部近重复视频检测算法。首先,提出了一个基于Transformer的特征编码器,其学习了大量近重复帧之间细微的语义差异,可以在编码帧特征时对各个区域特征图引入自注意力机制,在有效降低帧特征维度的同时也提高了编码后特征的表示性。该特征编码器通过孪生网络训练得到,该网络不需要负样本就可以有效学习近重复帧之间的相似语义信息,因此无需沉重和困难的难负样本标注工作,使得训练过程更加简易和高效。其次,提出了一个基于视频自相似度矩阵的关键帧提取方法,可以从视频中提取丰富但不冗余的关键帧,从而使关键帧特征序列能够更全面地描述原视频内容,提升算法的性能,同时也大幅减少了存储和计算冗余关键帧带来的开销。最后,基于关键帧的低维紧凑编码特征,采用基于图网络的时间对齐算法,实现局部近重复视频片段的检测和定位。该算法在公开的局部近重复视频检测数据集VCDB上取得了优于现有算法的实验性能。
To address the issues of existing partial near-duplicate video detection algorithms,such as high storage consumption,low query efficiency,and feature extraction module that does not consider subtle semantic differences between near-duplicate frames,this paper proposes a partial near-duplicate video detection algorithm based on Transformer.First,a Transformer-based feature encoder is proposed,which canlearn subtle semantic differences between a large number of near-duplicate frames.The feature maps of frame regions are introduced with self-attention mechanism during frame feature encoding,effectively reducing the dimensionality of the feature while enhancing its representational capacity.The feature encoder is trained using a siamese network,which can effectively learn the semantic similarities between near-duplicate frames without negative samples.This eliminates the need for heavy and difficult negative sample annotation work,making the training process simpler and more efficient.Secondly,a key frame extraction method based on video self-similarity matrix is proposed.This method can extract rich,non-redundant key frames from the video,allowing for a more comprehensive description of the original video content and improved algorithm performance.Additionally,this approach significantly reduces the overhead associated with storing and computing redundant key frames.Finally,a graph network-based temporal alignment algorithm is used to detect and locate partial near-duplicate video clips based on the low-dimensional,compact encoded features of key frames.The proposed algorithm achieves impressive experimental results on the publicly available partial near-duplicate video detection dataset VCDB and outperforms existing algorithms.
作者
王萍
余圳煌
鲁磊
WANG Ping;YU Zhenhuang;LU Lei(School of Information and Communication Engineering,Xi’an Jiaotong University,Xi’an 710049,China)
出处
《计算机科学》
CSCD
北大核心
2024年第5期108-116,共9页
Computer Science