Abstract
Dense captioning aims to simultaneously localize and describe regions of interest (RoIs) in images in natural language. Specifically, we identify three key problems: 1) dense and highly overlapping RoIs, which make accurate localization of each target region challenging; 2) visually ambiguous target regions that are hard to recognize by appearance alone; 3) the need for an extremely deep image representation, which is of central importance for visual recognition. To tackle these three challenges, we propose a novel end-to-end dense captioning framework consisting of a joint localization module, a contextual reasoning module, and a deep convolutional neural network (CNN). We also evaluate five deep CNN structures to explore the benefits of each. Extensive experiments on the Visual Genome (VG) dataset demonstrate the effectiveness of our approach, which compares favorably with state-of-the-art methods.
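The three-module framework described above can be sketched at a high level as a forward pipeline: a CNN backbone extracts features, a joint localization module proposes dense RoIs, and a contextual reasoning module enriches each region's representation with context from neighboring regions before captioning. All function names, shapes, and placeholder logic below are illustrative assumptions, not the authors' actual implementation:

```python
# Hypothetical sketch of the dense captioning pipeline from the abstract.
# Every component here is a stand-in; the real model uses learned networks.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    box: Tuple[int, int, int, int]  # (x, y, w, h) bounding box of an RoI
    feature: List[float]            # region feature vector


def cnn_backbone(image: List[List[float]]) -> List[List[float]]:
    # Stand-in for a deep CNN feature extractor (the paper evaluates five
    # such structures); here it just passes the input through.
    return image


def joint_localization(features: List[List[float]], num_regions: int = 3) -> List[Region]:
    # Propose dense, possibly overlapping RoIs; dummy boxes for illustration.
    return [Region(box=(i, i, 4, 4), feature=[float(i)]) for i in range(num_regions)]


def contextual_reasoning(regions: List[Region]) -> List[Region]:
    # Augment each region's feature with context pooled from all regions,
    # so visually ambiguous regions are not described by appearance alone.
    ctx = sum(r.feature[0] for r in regions) / len(regions)
    return [Region(r.box, r.feature + [ctx]) for r in regions]


def caption(region: Region) -> str:
    # Placeholder for the language decoder: one caption string per region.
    return f"region at {region.box} with context {region.feature[-1]:.1f}"


def dense_caption(image: List[List[float]]) -> List[Tuple[Tuple[int, int, int, int], str]]:
    feats = cnn_backbone(image)
    regions = contextual_reasoning(joint_localization(feats))
    return [(r.box, caption(r)) for r in regions]
```

The key design point reflected here is that localization and description are produced jointly per region, with contextual reasoning inserted between region proposal and caption generation.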
Authors
KONG Rui
XIE Wei
(School of Intelligent Systems Science and Engineering, Jinan University, Zhuhai 519070, China)
Funding
Project (2020A1515010718) supported by the Basic and Applied Basic Research Foundation of Guangdong Province, China.