Stacked Attention Networks for Referring Expressions Comprehension

Abstract: Referring expressions comprehension is the task of locating the image region described by a natural language expression, which refers to properties of the region or to its relationships with other regions. Most previous work handles this problem by selecting the most relevant region from a set of candidate regions, which is inefficient when the candidate set is large. Inspired by the recent success of deep learning methods in image captioning, we propose a framework that understands referring expressions through multiple steps of reasoning. We present a model for referring expressions comprehension that selects the most relevant region directly from the image. The core of our model is a recurrent attention network, which can be seen as an extension of the Memory Network. The proposed model is capable of improving its results through multiple computational hops. We evaluate the proposed model on two referring expression datasets: Visual Genome and Flickr30k Entities. The experimental results demonstrate that the proposed model outperforms previous state-of-the-art methods in both accuracy and efficiency. We also conduct an ablation experiment, which shows that the model's performance does not keep improving as the number of attention layers increases.
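The abstract gives no architectural details, so the following is only a minimal sketch of the general idea of multi-hop (stacked) attention over image regions, not the authors' model. It assumes that the expression and each region are already encoded as equal-length vectors, that relevance is a plain dot product, and that the query is refined by adding the attended context after each hop, as in common memory-network formulations. The names `attention_hop` and `stacked_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_hop(query, regions):
    """One hop: weight every region by its relevance to the query.

    Returns the attention weights and a refined query (assumed update
    rule: query plus the attention-weighted sum of region vectors).
    """
    weights = softmax([dot(query, r) for r in regions])
    context = [
        sum(w * r[i] for w, r in zip(weights, regions))
        for i in range(len(query))
    ]
    refined = [q + c for q, c in zip(query, context)]
    return weights, refined


def stacked_attention(query, regions, hops=2):
    """Run several hops and return (best region index, final weights)."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    weights = []
    for _ in range(hops):
        weights, query = attention_hop(query, regions)
    best = max(range(len(regions)), key=lambda i: weights[i])
    return best, weights


if __name__ == "__main__":
    # Toy 2-D features; real region and expression encodings would be learned.
    regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    expression = [0.0, 2.0]
    index, weights = stacked_attention(expression, regions, hops=2)
    print(index, [round(w, 3) for w in weights])
```

In this toy setting each additional hop concentrates more weight on the best-matching region. Whether extra hops help on real data is an empirical question, which is what the ablation mentioned in the abstract addresses.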
Source: Computers, Materials & Continua (SCIE, EI; English-language journal), 2020, Issue 12, pp. 2529-2541, 13 pages in total
Funding: This work was supported in part by the audio-visual new media laboratory operation and maintenance of the Academy of Broadcasting Science (Grant No. 200304) and in part by the National Key Research and Development Program of China (Grant No. 2019YFB1406201).