Abstract
To address the problems that traditional recurrent neural network (RNN) architectures for Chinese image captioning are poorly suited to generating long sentences and lack detailed semantic information, we propose a method that uses a Transformer multi-head attention (MHA) network to fuse coarse-grained global features with fine-grained features of regional target entities. By fusing multi-scale features, the method makes it easier for image attention to focus on fine-grained target regions and yields an image representation with richer fine-grained semantics, which effectively improves the generated captions. The model is evaluated on the ICC dataset with multiple evaluation metrics, and the results show that it achieves better captioning performance on all of them.
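The fusion idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`attend`, `multi_head_fuse`), the toy feature values, and the omission of learned query/key/value/output projections are all assumptions made for illustration. The sketch only shows how one coarse global token and several fine-grained region tokens can share a single multi-head attention step.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Scaled dot-product attention for a single query vector.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights

def multi_head_fuse(query, global_feat, region_feats, num_heads):
    # Hypothetical helper: attends over [global token] + region tokens.
    # Assumption: learned projections are omitted (identity), so each head
    # simply works on its own slice of the feature dimension.
    d_model = len(query)
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    tokens = [global_feat] + region_feats  # coarse token first, then fine tokens
    fused, all_weights = [], []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        sub_tokens = [t[lo:hi] for t in tokens]
        ctx, w = attend(query[lo:hi], sub_tokens, sub_tokens)
        fused.extend(ctx)          # concatenate the per-head outputs
        all_weights.append(w)
    return fused, all_weights

# Toy inputs (made-up values): one global feature, two region features.
global_feat = [0.1, 0.1, 0.1, 0.1]
region_feats = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
query = [2.0, 0.0, 0.0, 2.0]
fused, weights = multi_head_fuse(query, global_feat, region_feats, num_heads=2)
```

In this toy setup both heads place most of their attention weight on the first region token, which mirrors the abstract's claim that attention can concentrate on fine-grained target regions while the global token remains available as context.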
Authors
肖雄
徐伟峰
王洪涛
苏攀
高思华
XIAO Xiong; XU Weifeng; WANG Hongtao; SU Pan; GAO Sihua (Department of Computer, North China Electric Power University (Baoding), Baoding 071003, Hebei Province, China; School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China)
Source
《吉林大学学报(理学版)》
CAS
Peking University Core Journals (北大核心)
2022, No. 5, pp. 1103-1112 (10 pages)
Journal of Jilin University: Science Edition
Funding
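National Natural Science Foundation of China (Grant No. 61802124)
Project of the Association of Fundamental Computing Education in Chinese Universities (Grant No. 2019-AFCEC-125)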
Keywords
image Chinese caption
fine-grained feature
multi-head attention (MHA)