Abstract
Object referring, framed here as cross-modal coreference resolution, is the task of locating the target in a natural image that a person's interaction intention refers to. As one of the key technologies of intelligent human-computer interaction, it can be applied to scenarios such as emergency rescue and disaster relief, home service, and elderly care and assistance for the disabled. Existing object referring methods generally express human intention through a single modality, such as language or gaze. However, single-modality input conveys only limited interaction information, which makes natural and intelligent human-computer collaboration difficult to achieve. To address this problem, we fuse gaze and language information to build a cross-modal coreference resolution model that exploits the complementary strengths of multiple modalities to localize, in the image, the target that the human intention refers to. Comparative experiments verify that the proposed gaze-language cross-modal fusion method outperforms single-modality input.
Authors
张珺倩
宋明武
谢良
张亚坤
印二威
闫野
ZHANG Junqian; SONG Mingwu; XIE Liang; ZHANG Yakun; YIN Erwei; YAN Ye (Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin 300072, China; National Innovation Institute of Defense Technology, Academy of Military Sciences, Beijing 100071, China; Tianjin Artificial Intelligence Innovation Center, Tianjin 300450, China)
Keywords
deep learning
multi-modal
localization
gaze
natural language processing