In order to effectively solve the problems of low accuracy,large amount of computation and complex logic of deep learning algorithms in behavior recognition,a kind of behavior recognition based on the fusion of 3 dime...In order to effectively solve the problems of low accuracy,large amount of computation and complex logic of deep learning algorithms in behavior recognition,a kind of behavior recognition based on the fusion of 3 dimensional batch normalization visual geometry group(3D-BN-VGG)and long short-term memory(LSTM)network is designed.In this network,3D convolutional layer is used to extract the spatial domain features and time domain features of video sequence at the same time,multiple small convolution kernels are stacked to replace large convolution kernels,thus the depth of neural network is deepened and the number of network parameters is reduced.In addition,the latest batch normalization algorithm is added to the 3-dimensional convolutional network to improve the training speed.Then the output of the full connection layer is sent to LSTM network as the feature vectors to extract the sequence information.This method,which directly uses the output of the whole base level without passing through the full connection layer,reduces the parameters of the whole fusion network to 15324485,nearly twice as much as those of 3D-BN-VGG.Finally,it reveals that the proposed network achieves 96.5%and 74.9%accuracy in the UCF-101 and HMDB-51 respectively,and the algorithm has a calculation speed of 1066 fps and an acceleration ratio of 1,which has a significant predominance in velocity.展开更多
Image paragraph generation aims to generate a long description composed of multiple sentences,which is different from traditional image captioning containing only one sentence.Most of previous methods are dedicated to...Image paragraph generation aims to generate a long description composed of multiple sentences,which is different from traditional image captioning containing only one sentence.Most of previous methods are dedicated to extracting rich features from image regions,and ignore modelling the visual relationships.In this paper,we propose a novel method to generate a paragraph by modelling visual relationships comprehensively.First,we parse an image into a scene graph,where each node represents a specific object and each edge denotes the relationship between two objects.Second,we enrich the object features by implicitly encoding visual relationships through a graph convolutional network(GCN).We further explore high-order relations between different relation features using another graph convolutional network.In addition,we obtain the linguistic features by projecting the predicted object labels and their relationships into a semantic embedding space.With these features,we present an attention-based topic generation network to select relevant features and produce a set of topic vectors,which are then utilized to generate multiple sentences.We evaluate the proposed method on the Stanford image-paragraph dataset which is currently the only available dataset for image paragraph generation,and our method achieves competitive performance in comparison with other state-of-the-art(SOTA)methods.展开更多
当前端到端自动驾驶系统的研究方法主要是采用图像或图像序列作为输入,使用卷积神经网络直接预测方向盘转角,取得了较好的效果,但仅通过转向命令并不足以完成自动驾驶车辆的控制。为了更好地实现对自动驾驶车辆的横纵向控制,构建基于端...当前端到端自动驾驶系统的研究方法主要是采用图像或图像序列作为输入,使用卷积神经网络直接预测方向盘转角,取得了较好的效果,但仅通过转向命令并不足以完成自动驾驶车辆的控制。为了更好地实现对自动驾驶车辆的横纵向控制,构建基于端到端学习的CNN-LSTM(卷积神经网络-长短时记忆)多模态多任务神经网络模型,将图像、速度序列和方向盘转角序列作为输入,从而同时预测车辆的方向盘转角和速度值。在搭建的基于GTAV(Grand Theft Auto V,侠盗猎车5)仿真平台数据集和真实场景数据集上进行实验和测试,实验结果表明模型能够较好地完成车道保持的驾驶行为和基本实现自动驾驶避障测试。展开更多
基金the National Natural Science Foundation of China(No.61772417,61634004,61602377)Key R&D Program Projects in Shaanxi Province(No.2017GY-060)Shaanxi Natural Science Basic Research Project(No.2018JM4018).
文摘In order to effectively solve the problems of low accuracy,large amount of computation and complex logic of deep learning algorithms in behavior recognition,a kind of behavior recognition based on the fusion of 3 dimensional batch normalization visual geometry group(3D-BN-VGG)and long short-term memory(LSTM)network is designed.In this network,3D convolutional layer is used to extract the spatial domain features and time domain features of video sequence at the same time,multiple small convolution kernels are stacked to replace large convolution kernels,thus the depth of neural network is deepened and the number of network parameters is reduced.In addition,the latest batch normalization algorithm is added to the 3-dimensional convolutional network to improve the training speed.Then the output of the full connection layer is sent to LSTM network as the feature vectors to extract the sequence information.This method,which directly uses the output of the whole base level without passing through the full connection layer,reduces the parameters of the whole fusion network to 15324485,nearly twice as much as those of 3D-BN-VGG.Finally,it reveals that the proposed network achieves 96.5%and 74.9%accuracy in the UCF-101 and HMDB-51 respectively,and the algorithm has a calculation speed of 1066 fps and an acceleration ratio of 1,which has a significant predominance in velocity.
基金supported in part by National Natural Science Foundation of China(Nos.61721004,61976214,62076078 and 62176246).
文摘Image paragraph generation aims to generate a long description composed of multiple sentences,which is different from traditional image captioning containing only one sentence.Most of previous methods are dedicated to extracting rich features from image regions,and ignore modelling the visual relationships.In this paper,we propose a novel method to generate a paragraph by modelling visual relationships comprehensively.First,we parse an image into a scene graph,where each node represents a specific object and each edge denotes the relationship between two objects.Second,we enrich the object features by implicitly encoding visual relationships through a graph convolutional network(GCN).We further explore high-order relations between different relation features using another graph convolutional network.In addition,we obtain the linguistic features by projecting the predicted object labels and their relationships into a semantic embedding space.With these features,we present an attention-based topic generation network to select relevant features and produce a set of topic vectors,which are then utilized to generate multiple sentences.We evaluate the proposed method on the Stanford image-paragraph dataset which is currently the only available dataset for image paragraph generation,and our method achieves competitive performance in comparison with other state-of-the-art(SOTA)methods.
文摘当前端到端自动驾驶系统的研究方法主要是采用图像或图像序列作为输入,使用卷积神经网络直接预测方向盘转角,取得了较好的效果,但仅通过转向命令并不足以完成自动驾驶车辆的控制。为了更好地实现对自动驾驶车辆的横纵向控制,构建基于端到端学习的CNN-LSTM(卷积神经网络-长短时记忆)多模态多任务神经网络模型,将图像、速度序列和方向盘转角序列作为输入,从而同时预测车辆的方向盘转角和速度值。在搭建的基于GTAV(Grand Theft Auto V,侠盗猎车5)仿真平台数据集和真实场景数据集上进行实验和测试,实验结果表明模型能够较好地完成车道保持的驾驶行为和基本实现自动驾驶避障测试。