A visual question answering model based on spatial relationship aggregation and global feature injection
Abstract: Existing visual question answering (VQA) models have limited ability to understand relationships between visual objects, which lowers answer prediction accuracy on complex questions. To address this problem, a VQA model based on spatial relationship aggregation and global feature injection is proposed. The model first uses spatial relationships to aggregate visual regional features, converts them into visual global features, and injects these features into the network. It then introduces a bilateral gating mechanism for feature fusion, so that the model adaptively adjusts the contributions of visual global features and visual regional features to answer prediction according to the input question. Finally, the fused features are fed into a classification network to obtain the prediction. Experiments on the public VQA 2.0 and GQA datasets show that the model achieves overall accuracies of 71.12% and 71.54% on the VQA 2.0 test-dev and test-std sets and 57.71% on GQA, outperforming mainstream models such as MCAN and SCAVQAN. By introducing visual global features that carry spatial relationships, the model better understands relationships between visual objects and thereby improves VQA accuracy.
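The abstract does not give the equations of the bilateral gating mechanism, so the following is only a minimal sketch of the general idea of question-conditioned gated fusion, not the authors' formulation. It assumes two scalar gates, each computed from the question vector by a sigmoid of a linear projection, which weight the visual global and visual regional features before they are summed. The function name `gated_fusion`, the weight vectors `w_global` and `w_region`, and the scalar form of the gates are all hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def gated_fusion(question, v_global, v_region, w_global, w_region):
    """Fuse two visual feature vectors with question-dependent gates.

    Assumption (not taken from the paper): each gate is a scalar
    sigmoid(w . question), and the fused feature is the gate-weighted
    element-wise sum of the global and regional features.
    """
    g_global = sigmoid(dot(w_global, question))
    g_region = sigmoid(dot(w_region, question))
    fused = [g_global * a + g_region * b for a, b in zip(v_global, v_region)]
    return fused, g_global, g_region


if __name__ == "__main__":
    # Hypothetical toy inputs: a 2-d question vector and 2-d visual features.
    question = [1.0, 0.0]
    fused, g_global, g_region = gated_fusion(
        question,
        v_global=[2.0, 4.0],
        v_region=[0.0, 2.0],
        w_global=[3.0, 0.0],
        w_region=[-3.0, 0.0],
    )
    print(fused, g_global, g_region)
```

In this toy setup, a question vector that aligns with `w_global` pushes the global gate toward 1 and the regional gate toward 0, which illustrates how the relative contribution of the two feature types can vary with the question. A real model would typically learn these weights and use vector-valued gates.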
Authors: CHEN Qiaohong (陈巧红); LOU Yangbo (漏杨波); FANG Xian (方贤) (School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China)
Source: Journal of Zhejiang Sci-Tech University (Natural Sciences), 2023, No. 6, pp. 764-774 (11 pages)
Funding: Natural Science Foundation of Zhejiang Province (LQ23F020021); Zhejiang Sci-Tech University Research Start-up Project (22232262-Y).
Keywords: visual question answering; spatial relationship aggregation; global feature injection; visual regional feature; visual global feature; bilateral gating mechanism