Abstract
Representation learning, particularly deep learning, has received wide attention and been widely applied in speech recognition, image analysis, and natural language processing. It has not only advanced research in artificial intelligence but also prompted enterprises to consider new business and profit models. This paper surveys this body of work with the aim of forming a relatively complete review. Based on an investigation of the relevant Chinese and international literature, it reviews research on cross-modal retrieval and feature extraction based on representation learning along two dimensions: information extraction and representation, and cross-modal system modeling. The paper first outlines five classical representation learning algorithms: the autoencoder, sparse coding, the restricted Boltzmann machine, the deep belief network, and the convolutional neural network. It then summarizes the state of cross-modal system modeling research from three perspectives: establishing inter-modal associations through a shared layer, establishing inter-modal associations in a representation space, and deep-learning-based cross-modal modeling algorithms. Finally, it summarizes the evaluation metrics used for cross-modal retrieval. The study finds that existing research is rich for single-modal information retrieval, where queries and candidate sets belong to the same modality, whereas cross-modal retrieval remains limited to corpora in which only two modalities, image and text, are aligned. Future research needs to extend retrieval to multimodal data such as audio, video, images, and text, improve deep learning algorithms to construct multimodal retrieval models, and achieve cross-modal retrieval over three or more modalities. In addition, evaluation metrics suited to multimodal retrieval systems still need to be established.
Authors
Li Zhiyi; Huang Zifeng; Xu Xiaomian (Economic & Management College of South China Normal University, Guangzhou 510006)
Source
《情报学报》
CSSCI
CSCD
Peking University Core Journals
2018, No. 4, pp. 422-435 (14 pages)
Journal of the China Society for Scientific and Technical Information
Funding
National Social Science Fund of China project "Research on Cross-modal Retrieval Models and Feature Extraction Based on Representation Learning" (17BTQ062)
Keywords
representation learning
cross-modal retrieval
feature extraction
model
review