Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection

Abstract: Monocular 3D object detection is challenging due to the lack of accurate depth information. Some methods estimate pixel-wise depth maps with off-the-shelf depth estimators and use them as an additional input to augment the RGB images. Depth-based methods either convert the estimated depth maps to pseudo-LiDAR and apply LiDAR-based object detectors, or focus on image and depth fusion learning. However, they show limited performance and efficiency because of depth inaccuracy and complex convolution-based fusion. In contrast to these approaches, our proposed depth-guided vision transformer with normalizing flows (NF-DVT) network uses normalizing flows to build priors in depth maps and obtain more accurate depth information. We then develop a novel Swin-Transformer-based backbone with a fusion module that processes RGB image patches and depth map patches in two separate branches and fuses them using cross-attention so that the branches exchange information. Furthermore, with the help of pixel-wise relative depth values in the depth maps, we develop new relative position embeddings in the cross-attention mechanism to capture more accurate sequence ordering of input tokens. Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection. Experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of our proposed method and its superior performance over previous counterparts.
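
The fusion idea in the abstract (cross-attention between RGB tokens and depth tokens, with a position term derived from relative depth) can be illustrated with a small sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the function names `depth_bias` and `cross_attention`, the `scale` parameter, and the choice of a negative absolute depth difference as the additive bias are hypothetical stand-ins, not the NF-DVT design. The normalizing-flow depth prior and the Swin window partitioning are left out entirely.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def depth_bias(query_depths, key_depths, scale=1.0):
    """Hypothetical relative-depth bias (an assumption, not the paper's formula).

    Token pairs with similar relative depth get a bias near 0; pairs that are
    far apart in depth get a more negative bias and so attend to each other less.
    """
    return [[-scale * abs(dq - dk) for dk in key_depths] for dq in query_depths]


def cross_attention(queries, keys, values, bias):
    """Single-head cross-attention with an additive bias on the logits.

    queries: tokens from one branch (e.g. RGB patches)
    keys, values: tokens from the other branch (e.g. depth-map patches)
    bias: one row per query, one entry per key
    Returns the fused tokens and the attention weights.
    """
    dim = len(queries[0])
    fused, weights = [], []
    for q, bias_row in zip(queries, bias):
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) + b
            for k, b in zip(keys, bias_row)
        ]
        w = softmax(logits)
        weights.append(w)
        fused.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return fused, weights


# Toy inputs: two RGB tokens attend over three depth tokens.
rgb_tokens = [[1.0, 0.0], [0.0, 1.0]]
depth_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
rgb_rel_depth = [0.2, 0.8]         # relative depth at each RGB token
depth_rel_depth = [0.2, 0.5, 0.8]  # relative depth at each depth token

bias = depth_bias(rgb_rel_depth, depth_rel_depth, scale=4.0)
fused_tokens, attn = cross_attention(rgb_tokens, depth_tokens, depth_tokens, bias)
```

In a two-branch backbone of the kind the abstract describes, the same routine would presumably be applied in both directions (RGB attending to depth, and depth attending to RGB) so that each branch receives information from the other.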

Authors: Cong Pan, Junran Peng, Zhaoxiang Zhang

Affiliations: Center for Research on Intelligent Perception and Computing (CRIPAC); School of Future Technology; Huawei Inc.; IEEE; Institute of Automation; University of Chinese Academy of Sciences (UCAS); Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation

Source: IEEE/CAA Journal of Automatica Sinica (自动化学报（英文版）), 2024, Issue 3, pp. 673-689 (17 pages). Indexed in SCIE, EI, CSCD.

Funding: Supported in part by the Major Project for New Generation of AI (2018AAA0100400), the National Natural Science Foundation of China (61836014, U21B2042, 62072457, 62006231), and the InnoHK Program.

Keywords: monocular 3D object detection; normalizing flows; Swin Transformer

Classification: TP391.41 [Automation and Computer Technology: Computer Application Technology]
