Abstract
Visual question answering (VQA) is a multi-modal task that combines computer vision and natural language processing and is extremely challenging. However, current VQA models suffer from a severe language-bias problem that degrades their robustness: they are misled by spurious correlations in the data, and their outputs are directly driven by language priors. Previous research has mainly focused on generating counterfactual samples to help models overcome language bias. These studies, however, ignore the prediction differences between counterfactual and original samples, as well as the pairwise differences between key and non-key features. This paper builds a counterfactual-thinking pipeline that combines causal reasoning with contrastive learning, enabling the model to distinguish among the original sample, the factual sample, and the counterfactual sample. On this basis, a contrastive learning paradigm based on counterfactual samples is proposed. By contrasting the feature gaps and prediction gaps of the three sample pairs, the language bias of the model is reduced. Experiments on VQA-CP v2 and other datasets demonstrate the effectiveness of the proposed method. Compared with the CL-VQA method, the overall accuracy improves by 0.19%, the average accuracy by 0.89%, and in particular the Num accuracy by 2.6%. Compared with the CSS-VQA method, the auxiliary robustness metric Gap improves from 0.96 to 0.45.
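The contrastive learning paradigm described above can be sketched as an InfoNCE-style loss: the original sample's feature is pulled toward the factual sample (which keeps the key features) and pushed away from the counterfactual sample (which masks them). This is a minimal illustrative sketch, not the paper's implementation; the function names, the feature vectors, and the temperature value are all hypothetical.

```python
import numpy as np

def cosine(u, v):
    # cosine similarity between two feature vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def counterfactual_contrastive_loss(z_orig, z_fact, z_cf, tau=0.5):
    """InfoNCE-style contrastive loss over one sample triple.

    z_orig : feature of the original sample
    z_fact : feature of the factual sample (key features kept) -> positive
    z_cf   : feature of the counterfactual sample (key features masked) -> negative
    tau    : temperature (hypothetical value)
    """
    pos = np.exp(cosine(z_orig, z_fact) / tau)
    neg = np.exp(cosine(z_orig, z_cf) / tau)
    return -np.log(pos / (pos + neg))

# Toy 3-d features: the factual vector points near the original,
# the counterfactual vector points away from it.
z_o = np.array([1.0, 0.0, 0.0])
z_f = np.array([0.9, 0.1, 0.0])
z_c = np.array([-1.0, 0.2, 0.0])
loss = counterfactual_contrastive_loss(z_o, z_f, z_c)  # small loss: triple is well ordered
```

Minimizing this loss over all triples widens the gap between factual and counterfactual samples in feature space, which is the mechanism the abstract credits with reducing language bias.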
Authors
YUAN De-sen; LIU Xiu-jing; WU Qing-bo; LI Hong-liang; MENG Fan-man; NGAN King-ngi; XU Lin-feng (School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611730, China)
Source
Computer Science (《计算机科学》), CSCD, Peking University Core Journal
2022, Issue 12, pp. 229-235 (7 pages)
Funding
National Natural Science Foundation of China (61831005, 61971095).
Keywords
Visual question answering
Causal inference
Counterfactual thinking
Contrastive learning
Deep learning