A Distributed Representation Model for Short Text Analysis (cited by: 7)
Abstract: Learning distributed representations of short texts has become an important task in text data mining. However, applying the distributed representation model Paragraph Vector directly falls short: its training does not use corpus-level information, so it cannot compensate for the sparse contextual information in short texts. To address this, the paper proposes a distributed representation model for short text analysis, the biterm topic paragraph vector (BTPV) model. BTPV incorporates the topic information produced by the biterm topic model (BTM) into Paragraph Vector, so that training draws on global corpus information and the explicit topic representation of BTM refines the implicit vector-space representation of Paragraph Vector. The experiments use crawled comments on popular news as the data set and apply the K-Means clustering algorithm to compare how well each model represents short texts. The results show that BTPV representations yield better short-text clustering than the common distributed representation models word2vec and Paragraph Vector, which indicates the advantage of the model for short text analysis.
Authors: Liang Jiye (梁吉业); Qiao Jie (乔洁); Cao Fuyuan (曹付元); Liu Xiaolin (刘晓琳) (School of Computer and Information Technology, Shanxi University, Taiyuan 03000; Key Laboratory of Computational Intelligence and Chinese Information Processing (Shanxi University), Ministry of Education, Taiyuan 03000)
Source: Journal of Computer Research and Development (《计算机研究与发展》; indexed in EI, CSCD, and the Peking University Core list), 2018, No. 8, pp. 1631-1640 (10 pages)
Funding: National Natural Science Foundation of China (U1435212, 61432011, 61573229); Shanxi Province Key Science and Technology Project (MQ2014-09)
Keywords: distributed representation; short text; document analysis; paragraph vector; biterm topic model (BTM)
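The abstract evaluates each representation by clustering the short-text vectors with K-Means and comparing the results. The following is a minimal sketch of that kind of evaluation, not the authors' implementation. The toy vectors, the gold labels, the farthest-point initialisation, and the use of purity as the score are all assumptions made for illustration; the abstract does not state which clustering metric the paper reports.

```python
import math
from collections import Counter


def kmeans(vectors, k, iters=50):
    """Lloyd's K-Means with a deterministic farthest-point initialisation (an assumption)."""
    centers = [list(vectors[0])]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        far = max(vectors, key=lambda v: min(math.dist(v, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(iters):
        # Assignment step: index of the nearest center for every vector.
        new_labels = [min(range(k), key=lambda j: math.dist(v, centers[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def purity(labels, gold):
    """Fraction of documents that belong to the majority gold class of their cluster."""
    hits = 0
    for j in set(labels):
        members = [g for lab, g in zip(labels, gold) if lab == j]
        hits += Counter(members).most_common(1)[0][1]
    return hits / len(gold)


# Hypothetical 2-D document vectors standing in for learned short-text representations.
docs = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
        [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
gold = ["sports"] * 3 + ["finance"] * 3

if __name__ == "__main__":
    print(purity(kmeans(docs, 2), gold))
```

To compare models in this setup, the same `kmeans` and `purity` calls would be run on the vectors produced by each model (for example word2vec averages, Paragraph Vector, and BTPV) over the same documents, with the higher score indicating the representation that separates the gold classes better.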