Abstract
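Objective Deep-learning methods based on convolutional neural networks (CNNs) have made great progress in 3D object detection. However, the features extracted by a CNN lack dependencies both between different regions and between different channels, and it is difficult to enlarge the receptive field without losing spatial resolution. To address these shortcomings, a 3D object detection method that combines mixed-domain attention with dilated convolution is proposed. Method A spatial-domain attention mechanism is integrated at the input layer; it transforms the spatial positions of the input information and retains the regional features that deserve the most attention. A channel-domain attention mechanism is integrated within the network to extract the channel weights of the features and obtain the key channel features. By fusing the spatial-domain and channel-domain attention mechanisms, the features receive mixed spatial and channel attention. At the output layer of the feature extractor, a network layer that combines dilated convolution with channel attention is added; it enlarges the receptive field without losing spatial resolution, extracts the channel weights of the features under different receptive fields, and then fuses them to obtain key channel features with a global receptive field. A feature pyramid structure is introduced to build the feature extractor, which extracts high-resolution feature maps and substantially improves the detection performance of the network. A two-stage region proposal network is used to regress more accurately localized 3D bounding boxes. Results Experiments on the KITTI (A project of Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) dataset show that, as the degree of occlusion increases from light to heavy, for the car class in the test set the average precision of 3D detection boxes (AP3D) is 83.45%, 74.29%, and 67.92%, and the average precision of bird's-eye-view 2D detection boxes (APBEV) is 89.61%, 87.05%, and 79.69%, respectively. For the pedestrian and cyclist classes, the AP3D and APBEV values likewise show some advantage over the results of other methods. Conclusion The proposed 3D object detection network alleviates, to some extent, the lack of visual attention in the features that a CNN extracts for 3D detection, so that 3D object detection can be applied more effectively to outdoor autonomous driving.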
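The channel-domain attention step described in the abstract (extract a weight per channel, then emphasize the key channels) can be illustrated with a minimal sketch. This is not the authors' layer: the abstract does not specify its structure, so the sketch assumes a squeeze-and-excitation style gate with a single linear layer followed by a sigmoid. The function name `channel_attention` and the parameters `weights` and `biases` are hypothetical.

```python
import math


def channel_attention(feature_map, weights, biases):
    """Reweight channels with a squeeze-and-excitation style gate (assumed design).

    feature_map: list of C channels, each an H x W nested list of floats.
    weights:     C x C matrix standing in for learned parameters (hypothetical).
    biases:      list of C floats (hypothetical).
    Returns (gates, scaled_feature_map).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: one linear layer, then a sigmoid so each gate lies in (0, 1).
    logits = [
        sum(w * p for w, p in zip(weight_row, pooled)) + b
        for weight_row, b in zip(weights, biases)
    ]
    gates = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return gates, scaled


if __name__ == "__main__":
    fmap = [[[0.0, 0.0], [0.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    gates, _ = channel_attention(fmap, identity, [0.0, 0.0])
    print(gates)
```

In a trained network the gate parameters would be learned, so channels that help detection receive gates near 1 and uninformative channels are suppressed.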
Objective With the steady development of convolutional neural networks (CNNs) in deep learning in recent years, 3D object detection networks based on deep learning have also advanced considerably. 3D object detection aims to identify the class, location, orientation, and size of a target object in 3D space, and it is widely used in vision applications such as autonomous driving, intelligent monitoring, and medical analysis. The features extracted by a deep learning network strongly affect detection accuracy. The detection task resembles human vision in that it must also distinguish objects from the background: human vision attends to target objects and disregards the background. Object detection in an image therefore benefits from paying more attention to the target area and less to the background area. However, a CNN does not distinguish which areas and channels in an image deserve more or less attention, so the features it extracts lack dependencies both between different regions and between different channels. Current 3D object detection methods based on deep learning place pooling layers after stacks of convolution layers. These network structures generally apply maximum or average pooling to feature maps in order to adjust the receptive field size of the extracted features. However, a pooling layer changes the receptive field by discarding some information, which causes a considerable loss of feature information, and this loss may lead to detection errors. A CNN should therefore expand the receptive field without losing information in order to obtain good detection results.
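The contrast between pooling and dilated convolution can be made concrete with a small 1D sketch. It is an illustration under stated assumptions, not code from the paper: the helper names are hypothetical, the kernel length is assumed odd, and zero padding is assumed. A kernel of size k with dilation d covers k + (k - 1)(d - 1) input positions, and with suitable padding the output keeps the input length, whereas pooling shortens the signal.

```python
def effective_kernel(kernel_size, dilation):
    """Number of input positions spanned by one dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d_same(signal, kernel, dilation):
    """1D dilated convolution with zero padding so the output length equals the input length.

    Assumes an odd kernel length so the padding is symmetric.
    """
    k = len(kernel)
    pad = (effective_kernel(k, dilation) - 1) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]


def max_pool1d(signal, size):
    """Non-overlapping max pooling; the output is shorter than the input."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 5, 6, 7, 8]
    dilated = dilated_conv1d_same(signal, [1, 1, 1], dilation=2)
    pooled = max_pool1d(signal, 2)
    print(len(signal), len(dilated), len(pooled))
    print(effective_kernel(3, 1), effective_kernel(3, 2), effective_kernel(3, 4))
```

Raising the dilation widens the span of each output value while every input position still maps to an output position, which is the property the abstract relies on when it says the receptive field is enlarged without losing spatial resolution.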
To address the shortcomings of the aforementioned 3D object detection methods, this study proposes a two-stage 3D object detection network that combines mixed-domain attention and dilated convolution. Method I
Authors
严娟
方志军
高永彬
Yan Juan; Fang Zhijun; Gao Yongbin (Department of Electrical and Electronic Engineering, Shanghai University of Engineering Science, Shanghai 201620, China)
Source
《中国图象图形学报》
CSCD
Peking University Core Journal
2020, No. 6, pp. 1221-1234 (14 pages)
Journal of Image and Graphics
Funding
National Natural Science Foundation of China (61802253, 61772328).
Keywords
3D object detection
attention mechanism
dilated convolution
receptive field
feature pyramid network
convolutional neural network(CNN)