Abstract
To address the "semantic gap" between low-level features and high-level semantic features in video scenes, together with the problem of fusing multiple features, and exploiting the temporally correlated, co-occurring nature of the multiple modalities in video, a multi-modal video scene segmentation algorithm based on a deep network is proposed. Rich low-level features and semantic concept features are extracted from each shot, and the concatenation of their feature vectors is fed as a single overall feature vector into a deep network that learns an embedding space. The distance between the overall feature vectors of two shots then serves as a measure of their semantic similarity, and shots are clustered by minimizing the sum of squared distances within a temporal window, yielding scenes at the semantic level. Experimental results show that the algorithm performs well in classification accuracy and segments video scenes effectively.
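As a concrete illustration of the pipeline described above, the following is a minimal sketch in Python, not the authors' implementation: random shot-level vectors stand in for the concatenated low-level and semantic-concept features, a single linear projection stands in for the learned deep embedding network, and a dynamic program over temporally contiguous segments minimizes the within-segment sum of squared distances. All function and parameter names (embed, temporal_scene_segmentation, num_scenes) are illustrative assumptions.

import numpy as np

def embed(shot_features: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Toy stand-in for the deep embedding network: one linear projection
    followed by L2 normalization. In the paper this is a learned
    multi-layer network over the concatenated shot features."""
    z = shot_features @ W
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)

def segment_cost(emb: np.ndarray) -> np.ndarray:
    """cost[i, j] = sum of squared distances of shots i..j to their mean,
    i.e. the within-segment scatter that the clustering minimizes."""
    n = emb.shape[0]
    cost = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            seg = emb[i:j + 1]
            cost[i, j] = np.sum((seg - seg.mean(axis=0)) ** 2)
    return cost

def temporal_scene_segmentation(emb: np.ndarray, num_scenes: int):
    """Dynamic program over temporally contiguous segments: place
    num_scenes - 1 boundaries so that the total within-segment sum of
    squared distances is minimal. Returns (start, end) shot index pairs."""
    n = emb.shape[0]
    cost = segment_cost(emb)
    dp = np.full((num_scenes + 1, n + 1), np.inf)
    back = np.zeros((num_scenes + 1, n + 1), dtype=int)
    dp[0, 0] = 0.0
    for k in range(1, num_scenes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):          # last scene covers shots i..j-1
                c = dp[k - 1, i] + cost[i, j - 1]
                if c < dp[k, j]:
                    dp[k, j], back[k, j] = c, i
    # Recover the scene boundaries by walking back through the table.
    bounds, j = [], n
    for k in range(num_scenes, 0, -1):
        i = back[k, j]
        bounds.append((i, j - 1))
        j = i
    return bounds[::-1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(12, 32))          # 12 shots, concatenated features
    W = rng.normal(size=(32, 8))               # untrained projection, demo only
    scenes = temporal_scene_segmentation(embed(feats, W), num_scenes=3)
    print(scenes)                              # e.g. [(0, 3), (4, 8), (9, 11)]

The temporal constraint is what separates this from ordinary clustering: because scenes must be contiguous runs of shots, the exact optimum can be found with a dynamic program instead of an iterative assignment scheme.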
Authors
SU Xiaohan; FENG Hongcai; WU Shiyao (School of Mathematics & Computer Science, Wuhan Polytechnic University, Wuhan 430023, China; other affiliation not specified)
Source
Journal of Wuhan University of Technology: Information & Management Engineering
Indexed in CAS
2020, No. 3, pp. 246-251, 259 (7 pages)
Fund
Key Scientific Research Program of the Hubei Provincial Department of Education (D20101703).
Keywords
scene segmentation
multi-modal
deep network embedding
time constrained clustering
semantic features