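Abstract
Objective: 3D multi-object tracking is a highly challenging task. Multimodal fusion of images and point clouds can improve multi-object tracking performance, but because of scene complexity and the differences between the modal data types, the adequacy of fusion and the robustness of association remain open problems. A 3D multi-object tracking method based on multi-information perception association of images and point clouds is therefore proposed. Method: First, a hybrid soft attention module is proposed that uses channel separation to enhance image semantic features and improve the information interaction between channel attention and spatial attention. Next, a semantic-feature-guided multimodal fusion network is proposed that performs deep, adaptive, continuous fusion of point cloud features, image features, and point-wise image features, suppressing interfering information from the different modalities and improving tracking of distant small objects and occluded objects. Finally, a multi-information perception affinity matrix is constructed that uses intersection over union (IoU), Euclidean distance, appearance information, and direction similarity for data association, raising the match rate between trajectories and detections and improving tracking performance. Results: The method is evaluated on two benchmark datasets, KITTI and NuScenes, and compared with state-of-the-art tracking methods. On KITTI, HOTA (higher order tracking accuracy) and MOTA (multi-object tracking accuracy) reach 76.94% and 88.12%, improvements of 1.48% and 3.49% over the best-performing compared model. On NuScenes, AMOTA (average multi-object tracking accuracy) and MOTA reach 68.3% and 57.9%, improvements of 0.6% and 1.1% over the best-performing compared model. Overall performance on both datasets exceeds that of the state-of-the-art tracking methods. Conclusion: The proposed method tracks objects accurately in complex scenes, is more robust, and is better suited to 3D multi-object tracking in autonomous driving scenarios.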
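As a rough illustration of the multi-information affinity matrix described in the abstract, the sketch below combines the four named cues (IoU, Euclidean distance, appearance, direction similarity) into one score per track-detection pair. The abstract does not say how the cues are combined, so the equal weights, the `dist_scale` parameter, the axis-aligned bird's-eye-view IoU, and the dictionary layout of a box are all assumptions made for this example, not the authors' formulation.

```python
import math

def bev_iou(a, b):
    """Axis-aligned bird's-eye-view IoU. Ignores yaw: a simplification for this sketch."""
    iw = min(a["x"] + a["l"] / 2, b["x"] + b["l"] / 2) - max(a["x"] - a["l"] / 2, b["x"] - b["l"] / 2)
    ih = min(a["y"] + a["w"] / 2, b["y"] + b["w"] / 2) - max(a["y"] - a["w"] / 2, b["y"] - b["w"] / 2)
    inter = max(0.0, iw) * max(0.0, ih)
    union = a["l"] * a["w"] + b["l"] * b["w"] - inter
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Cosine similarity between two appearance feature vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    return dot / norm if norm > 0 else 0.0

def affinity(track, det, weights=(0.25, 0.25, 0.25, 0.25), dist_scale=5.0):
    """Weighted sum of four cues, each mapped to [0, 1] (weights are hypothetical)."""
    dist = math.hypot(track["x"] - det["x"], track["y"] - det["y"])
    cues = (
        bev_iou(track, det),                                 # overlap cue
        math.exp(-dist / dist_scale),                        # distance cue: 1 at zero distance
        max(0.0, cosine(track["feat"], det["feat"])),        # appearance cue
        0.5 * (1.0 + math.cos(track["yaw"] - det["yaw"])),   # direction cue: 1 when headings agree
    )
    return sum(w * c for w, c in zip(weights, cues))

def affinity_matrix(tracks, dets):
    """Rows are predicted tracks, columns are detections."""
    return [[affinity(t, d) for d in dets] for t in tracks]

# Usage: one track, one matching detection, one distant detection facing the other way.
track = {"x": 0.0, "y": 0.0, "l": 4.0, "w": 2.0, "yaw": 0.0, "feat": [1.0, 0.0]}
near = dict(track)
far = {"x": 30.0, "y": 0.0, "l": 4.0, "w": 2.0, "yaw": math.pi, "feat": [0.0, 1.0]}
matrix = affinity_matrix([track], [near, far])
```

A matrix like this would typically be passed to an assignment step (for example Hungarian matching) to pair tracks with detections; using several cues means a pair can still score well when one cue, such as IoU for a fast-moving object, fails.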
Objective: 3D multi-object tracking is a challenging task in autonomous driving and plays a crucial role in the safety and reliability of the perception system. RGB cameras and LiDAR sensors are the sensors most commonly used for this task. RGB cameras provide rich semantic features but lack depth information. LiDAR point clouds provide accurate position and geometric information, but they are dense at close range and sparse at long range, unordered, and unevenly distributed. Multimodal fusion of images and point clouds can improve multi-object tracking performance, but because of scene complexity and the differences between the modal data types, existing fusion methods are less effective and cannot obtain rich fused features. In addition, existing methods compute the similarity between objects from the intersection over union (IoU) or the Euclidean distance between predicted and detected bounding boxes, which easily leads to trajectory fragmentation and identity switches. The adequacy of multimodal data fusion and the robustness of data association therefore remain urgent problems. To this end, a 3D multi-object tracking method based on multi-information perception association of images and point clouds is proposed. Method: First, a hybrid soft attention module is proposed that enhances image semantic features using channel separation to improve the information interaction between channel attention and spatial attention. The module consists of two submodules. The first is the soft channel attention submodule. It compresses the spatial information of the image features into a channel feature vector with a global average pooling layer, applies two fully connected layers to capture the correlation between channels, applies a Sigmoid function to obtain the channel attention map, and finally multiplies the map with the original features to obtain the channel-enhanced features. The second is the soft spatial attention submodule. To
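The soft channel attention submodule described in the abstract (global average pooling, two fully connected layers, Sigmoid, then multiplication with the original features) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the ReLU between the two fully connected layers, the hidden size, and the function and parameter names (`soft_channel_attention`, `w1`, `b1`, `w2`, `b2`) are not given in the abstract and are chosen here for the example.

```python
import math

def soft_channel_attention(features, w1, b1, w2, b2):
    """features: list of C channels, each an H x W nested list of floats.
    w1: hidden x C weights, b1: hidden biases (first fully connected layer).
    w2: C x hidden weights, b2: C biases (second fully connected layer).
    Returns (attention weights per channel, channel-enhanced features)."""
    # 1. Global average pooling: squeeze each H x W map into one scalar.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    # 2. First fully connected layer. The ReLU is an assumption of this sketch.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled)) + b)
              for row, b in zip(w1, b1)]
    # 3. Second fully connected layer followed by Sigmoid gives the attention map.
    attention = [1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + b)))
                 for row, b in zip(w2, b2)]
    # 4. Rescale every value of each channel by that channel's attention weight.
    enhanced = [[[a * v for v in row] for row in ch]
                for ch, a in zip(features, attention)]
    return attention, enhanced

# Usage: two channels of size 1 x 2 and a hidden layer of size 1 (toy weights).
feats = [[[2.0, 4.0]], [[6.0, 8.0]]]
attn, out = soft_channel_attention(
    feats, w1=[[0.1, -0.2]], b1=[0.0], w2=[[1.0], [-1.0]], b2=[0.0, 0.0])
```

Because the Sigmoid output lies strictly between 0 and 1, each channel is softly re-weighted rather than kept or discarded outright, which is presumably what "soft" refers to in the module name.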
Authors
Liu Xiang (刘祥)
Li Hui (李辉)
Cheng Yuanzhi (程远志)
Kong Xiangzhen (孔祥振)
Chen Shuangmin (陈双敏)
Liu Xiang; Li Hui; Cheng Yuanzhi; Kong Xiangzhen; Chen Shuangmin (School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China; School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150006, China; Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology, Eindhoven 5612, the Netherlands)
Source
《中国图象图形学报》
CSCD
Peking University Core Journals (北大核心)
2024, Issue 1, pp. 163-178 (16 pages)
Journal of Image and Graphics
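Funding
National Natural Science Foundation of China (62002190, 61702295)
Youth Innovation Science and Technology Support Program of Shandong Province Higher Education Institutions (2019KJN047)
Natural Science Foundation of Shandong Province (ZR2020MF036).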
Keywords
point cloud
3D multi-object tracking
attention
multimodal fusion
data association