
3D Referring Expression Comprehension Based on Modal Pre-Fusion

MP3DVG: modal pre-fusion for 3D visual grounding
Abstract 3D visual grounding (3D VG) aims to accurately localize a target object in a 3D scene by understanding a referring expression. Existing 3D VG work optimizes the text and visual encoders by introducing text and visual classification tasks. Because the textual and visual features may be semantically misaligned, the model can then struggle to localize the visual object that the text describes. In addition, the limited size of 3D VG datasets and the complexity of model structures often lead to overfitting. To address these issues, this paper proposes MP3DVG, which learns a unified multi-modal feature representation for both the unimodal classification tasks and the 3D VG task, and reduces overfitting. Based on cross-modal feature interaction, it proposes the TGV and VGT modules, which pre-fuse textual and visual features before the unimodal tasks and so reduce the adverse effects of semantic misalignment between modalities. Exploiting the property that a linear classifier can evaluate the diversity of sample features, it proposes a periodically initialized auxiliary classifier and adaptively adjusts sample losses through a dynamic loss-adjustment term, which weakens overfitting. Extensive experiments demonstrate the superiority of the proposed method: compared with MVT, MP3DVG improves performance on the Nr3D and Sr3D datasets by 1.1% and 1.8% respectively, and overfitting is markedly reduced.
Authors Yuan Kunpeng (袁琨鹏); Mi Jinpeng (米金鹏); Chen Zhiqian (陈智谦) (Institute of Machine Intelligence, University of Shanghai for Science & Technology, Shanghai 200093, China; School of Opto-electronic Information & Computer Engineering, University of Shanghai for Science & Technology, Shanghai 200093, China)
Source Application Research of Computers (《计算机应用研究》), indexed in CSCD and the Peking University Core Journals list; 2023, No. 12, pp. 3666-3671, 3677 (7 pages)
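Funding Key Program of the National Natural Science Foundation of China (92048205); National Natural Science Foundation of China (62106026); China Postdoctoral Science Foundation (2020M683243).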
Keywords 3D visual grounding; multi-modal fusion; overfitting; attention
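
The pre-fusion idea in the abstract can be illustrated with a minimal sketch. The abstract does not expand the acronyms TGV and VGT or give their architecture, so the sketch assumes they stand for text-guided visual and visual-guided text fusion, and it swaps in a plain single-head scaled dot-product cross-attention with a residual connection and no learned projections. The function names `cross_attend` and `pre_fuse` and the toy feature values are hypothetical, not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, context):
    """Update each query vector with an attention-weighted mix of context vectors.

    Assumption: single head, no learned projections, residual connection.
    A real model would most likely use learned multi-head attention.
    """
    scale = math.sqrt(len(queries[0]))
    fused = []
    for q in queries:
        weights = softmax([dot(q, c) / scale for c in context])
        mixed = [sum(w * c[i] for w, c in zip(weights, context))
                 for i in range(len(q))]
        fused.append([qi + mi for qi, mi in zip(q, mixed)])
    return fused


def pre_fuse(text_feats, visual_feats):
    """Hypothetical pre-fusion step applied before the unimodal classifiers."""
    # Assumed TGV: object features attend to the text tokens.
    fused_visual = cross_attend(visual_feats, text_feats)
    # Assumed VGT: text tokens attend to the object features.
    fused_text = cross_attend(text_feats, visual_feats)
    return fused_visual, fused_text


# Toy inputs: 2 text tokens and 3 candidate objects, feature dimension 4.
text_feats = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
visual_feats = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
fused_visual, fused_text = pre_fuse(text_feats, visual_feats)
```

Under this assumption, `fused_visual` and `fused_text` keep the shapes of their inputs, so the text and object classification heads could consume them in place of the raw encoder outputs.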
