Abstract
Event recognition in surveillance video has recently become one of the research hotspots in computer vision. However, surveillance video captured in natural scenes typically suffers from cluttered backgrounds and severe occlusion of objects within the event region, which leads to large intra-class variation and small inter-class variation and makes recognition very difficult. To address event recognition under complex backgrounds, this paper proposes a spatial-temporal consistent video event recognition method based on a deep residual dual unidirectional double LSTM (DRDU-DLSTM). The method first extracts spatio-temporal deep features of the video from a trained temporal CNN and a trained spatial CNN; after synchronous parsing by LSTMs, these features are joined into a double-LSTM (DLSTM) unit, which serves as the input of the residual network. Two unidirectionally propagating DLSTMs are concatenated to form a DU-DLSTM layer; several DU-DLSTM layers plus an identity mapping form a residual module; and stacked residual modules constitute the deep residual network architecture. To further optimize the recognition results, a 2C-softmax objective function based on a double-center loss is designed, which maximizes the inter-class distance while minimizing the intra-class distance. Experiments on the surveillance video datasets VIRAT 1.0 and VIRAT 2.0 show that the proposed event recognition method achieves good performance and stability, improving recognition accuracy by 5.1% and 7.3%, respectively.
Event recognition in surveillance video has attracted growing interest in recent years. Nevertheless, event recognition in real-world surveillance video still faces great challenges due to factors such as cluttered backgrounds, severe occlusion within the event bounding box, and tremendous intra-class variation coupled with small inter-class variation. A pronounced tendency is that more research focuses on learning deep features from raw data. The two-stream CNN (Convolutional Neural Network) architecture has become a very successful model in the video analysis field, in which appearance features and short-term motion features are utilized. In contrast, the Long Short-Term Memory (LSTM) network can learn long-term motion features from the input sequence and is widely used for tasks with an inherently sequential structure. To combine the advantages of the two types of networks, this paper proposes a deep residual dual unidirectional double LSTM (DRDU-DLSTM) for video event recognition in surveillance video with complex scenes. First, deep features are extracted from the fine-tuned temporal CNN and spatial CNN. Since fully connected (FC) layers carry more semantic information than convolutional layers and are therefore more suitable as inputs to the LSTM network, we extract the FC6 feature of the spatial CNN and the FC7 feature of the temporal CNN, respectively. Second, to reinforce spatial-temporal consistency, the deep features are transformed by a spatial LSTM (SLSTM) and a temporal LSTM (TLSTM), respectively, and conjugated into a unit called double-LSTM (DLSTM), which forms the input of the residual network. DLSTM cells increase the number of hidden nodes of LSTM cells and expand the width of the network. The input features of the spatial CNN and the temporal CNN are deeply intertwined by the DLSTM cells and are transmitted and evolved simultaneously, which increases the consistency of the spatial and temporal features. Furthermore, dual unidirectional DLSTMs are concatenated to form a DU-DLSTM layer; several DU-DLSTM layers together with an identity mapping constitute a residual module, and stacking multiple residual modules yields the deep residual network architecture. To further optimize the recognition results, a 2C-softmax objective function based on a double-center loss is designed, which maximizes the inter-class distance while minimizing the intra-class distance. Experiments on the surveillance video datasets VIRAT 1.0 and VIRAT 2.0 demonstrate that the proposed method achieves good performance and stability, improving recognition accuracy by 5.1% and 7.3%, respectively.
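To make the layered structure described above more concrete, the following PyTorch sketch illustrates one plausible reading of the architecture. The module names (DLSTMUnit, DUDLSTMResidualBlock, DRDUDLSTM), the hidden sizes, the temporal average pooling, and the way the two unidirectional branches are merged are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class DLSTMUnit(nn.Module):
    """Sketch: a spatial LSTM (SLSTM) and a temporal LSTM (TLSTM) run in parallel
    over CNN features (e.g. FC6 of the spatial stream, FC7 of the temporal stream);
    their hidden sequences are concatenated into one "double LSTM" feature."""

    def __init__(self, spat_dim=4096, temp_dim=4096, hidden=512):
        super().__init__()
        self.slstm = nn.LSTM(spat_dim, hidden, batch_first=True)
        self.tlstm = nn.LSTM(temp_dim, hidden, batch_first=True)

    def forward(self, spat_feats, temp_feats):
        hs, _ = self.slstm(spat_feats)       # (B, T, hidden)
        ht, _ = self.tlstm(temp_feats)       # (B, T, hidden)
        return torch.cat([hs, ht], dim=-1)   # (B, T, 2*hidden)


class DUDLSTMResidualBlock(nn.Module):
    """Sketch of one residual block: two unidirectional LSTM branches ("dual
    unidirectional") are concatenated, projected back to the input width, and
    added to an identity shortcut."""

    def __init__(self, dim):
        super().__init__()
        self.branch_a = nn.LSTM(dim, dim, batch_first=True)
        self.branch_b = nn.LSTM(dim, dim, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x):
        ya, _ = self.branch_a(x)
        yb, _ = self.branch_b(x)
        return x + self.proj(torch.cat([ya, yb], dim=-1))  # identity mapping


class DRDUDLSTM(nn.Module):
    """Sketch of the full pipeline: DLSTM unit -> stacked residual blocks -> classifier."""

    def __init__(self, num_classes, hidden=512, num_blocks=3):
        super().__init__()
        self.dlstm = DLSTMUnit(hidden=hidden)
        self.blocks = nn.Sequential(*[DUDLSTMResidualBlock(2 * hidden) for _ in range(num_blocks)])
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, spat_feats, temp_feats):
        x = self.dlstm(spat_feats, temp_feats)   # (B, T, 2*hidden)
        x = self.blocks(x)                       # residual DU-DLSTM stack
        return self.classifier(x.mean(dim=1))    # temporal average pooling (assumption)
```

Under these assumptions, a call such as model(spat_feats, temp_feats) would take per-frame spatial-stream FC6 features and temporal-stream FC7 features as tensors of shape (batch, time, 4096) and return class scores.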
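The 2C-softmax objective based on a double-center loss is described only at a high level in the abstract. As a rough, hypothetical sketch of the idea (cross-entropy plus a term that pulls features toward their own class center and a hinge that pushes different class centers at least a margin apart), one could write something like the following, where DualCenterLoss, lambda_intra, lambda_inter, and margin are illustrative names and weights rather than the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualCenterLoss(nn.Module):
    """Hypothetical sketch of a double-center ("2C") softmax-style objective:
    cross-entropy + pull features to their class center + push centers apart."""

    def __init__(self, num_classes, feat_dim, lambda_intra=0.01, lambda_inter=0.01, margin=10.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.lambda_intra = lambda_intra
        self.lambda_inter = lambda_inter
        self.margin = margin

    def forward(self, features, logits, labels):
        # Standard softmax cross-entropy on the classifier logits.
        ce = F.cross_entropy(logits, labels)
        # Intra-class term: distance of each feature to its own class center.
        intra = (features - self.centers[labels]).pow(2).sum(dim=1).mean()
        # Inter-class term: hinge on pairwise distances between class centers.
        dists = torch.cdist(self.centers, self.centers, p=2)
        mask = ~torch.eye(self.centers.size(0), dtype=torch.bool, device=dists.device)
        inter = F.relu(self.margin - dists[mask]).mean()
        return ce + self.lambda_intra * intra + self.lambda_inter * inter
```

In this reading, minimizing the intra-class term shrinks the within-class spread while the hinge on center distances enforces separation between classes, which matches the stated goal of maximizing inter-class distance while minimizing intra-class distance.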
Authors
李永刚
王朝晖
万晓依
董虎胜
龚声蓉
刘纯平
季怡
朱蓉
LI Yong-Gang; WANG Zhao-Hui; WAN Xiao-Yi; DONG Hu-Sheng; GONG Sheng-Rong; LIU Chun-Ping; JI Yi; ZHU Rong (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006; College of Mathematics Physics and Information Engineering, Jiaxing University, Jiaxing, Zhejiang 314001; School of Computer Science and Engineering, Changshu Institute of Science and Technology, Changshu, Jiangsu 215500; School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012)
Source
《计算机学报》
EI
CSCD
Peking University Core Journal (北大核心)
2018, No. 12, pp. 2852-2866 (15 pages)
Chinese Journal of Computers
Funding
National Natural Science Foundation of China (61773272, 61170124, 61272258, 61301299)
"Cloud and Data Fusion, Science and Education Innovation" Fund of the Science and Technology Development Center, Ministry of Education (2017B03112)
Natural Science Foundation of Jiangsu Province (BK20151260, BK20151254)
Natural Science Foundation of Zhejiang Province (LY15F020039)
"Six Talent Peaks" Project of Jiangsu Province (DZXX-027)
Fund of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172016K08)
Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX17_2006)
Keywords
event recognition
spatial-temporal consistency
residual network
LSTM (long short-term memory)
dual unidirectional
DLSTM (double long short-term memory)
deep feature
surveillance video