Abstract
Current image captioning methods describe an image with a black-box architecture that is difficult to control from the outside. To address this, the paper recasts image captioning as a sequence-to-sequence (seq2seq) problem so that caption generation becomes controllable. A set or sequence of entities, each composed of image regions, serves as the control signal. Guided by a chunk sentinel that marks the switch between entity chunks and by an adaptive attention mechanism with a visual sentinel, the control signal is fed in a regular order into a two-layer long short-term memory (LSTM) network, which steers the model toward the corresponding caption. In addition, the baseline is trained with cross-entropy loss and early stopping, and reinforcement learning is then introduced to resolve the mismatch between the training objective and the metrics used for evaluation, further improving the model. Experiments on the MSCOCO and Flickr30k datasets show that the proposed algorithm performs well in caption controllability, caption quality, and diversity.
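The abstract does not spell out how the adaptive attention with a visual sentinel is computed, so the following is only a minimal sketch of the general idea under stated assumptions: the decoder state scores each image region and one extra sentinel vector, a softmax normalizes the scores, and the weight that falls on the sentinel indicates how much the model relies on its own language memory instead of the image. The function names, the dot-product scoring, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def adaptive_attention(query, regions, sentinel):
    """Attend over region features plus one sentinel vector.

    Assumption: scores are plain dot products with the query; a real
    model would typically use a learned scoring network instead.
    Returns (context, region_weights, sentinel_weight).
    """
    candidates = regions + [sentinel]
    weights = softmax([dot(query, c) for c in candidates])
    dim = len(query)
    # Context is the weighted sum of all candidates, sentinel included.
    context = [sum(w * c[i] for w, c in zip(weights, candidates))
               for i in range(dim)]
    return context, weights[:-1], weights[-1]

# Toy inputs (hypothetical): two regions and a zero sentinel.
query = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0]]
sentinel = [0.0, 0.0]
context, region_w, sentinel_w = adaptive_attention(query, regions, sentinel)
```

In this toy case the first region matches the query best and receives the largest weight, while the second region and the sentinel tie. A larger sentinel weight would correspond to a word that needs little visual grounding.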
Authors
Wang Yuanshun (王源顺)
Duan Xun (段迅)
Wu Yun (吴云)
College of Computer Science & Technology, Guizhou University, Guiyang 550025, China
Source
Application Research of Computers (《计算机应用研究》)
CSCD
PKU Core Journal (北大核心)
2021, No. 11, pp. 3510-3516 (7 pages)
Funding
Supported by the National Natural Science Foundation of China (61662009).