
Precise text mining using low-rank matrix decomposition

Cited by: 2
Abstract: Applications such as full-text retrieval require a precise representation of text content, while traditional topic models can only extract the topic background of a text and cannot describe its emphasis in fine detail. A low-rank and sparse text representation model was proposed that decomposes a text into two parts: a low-rank component representing the topic background, and a sparse component representing the keywords of the topic's different aspects. To implement this decomposition, a topic matrix was defined, and Robust Principal Component Analysis (RPCA) was introduced to perform the matrix factorization. Experimental results on a news corpus show that the model's complexity is 25% lower than that of Latent Dirichlet Allocation (LDA). In practical applications, the low-rank component reduces the features needed for text classification by 28.7%, which helps reduce the dimensionality of the feature set; the sparse component improves the precision of full-text retrieval by 10.8% compared with LDA, which helps improve the hit rate of retrieval results.
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core, 2014, No. 6, pp. 1626-1630 (5 pages)
Keywords: text mining; topic background; keyword; low-rank decomposition
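
The decomposition described in the abstract — splitting a topic matrix into a low-rank "background" part plus a sparse "keyword" part — is an instance of Robust PCA (Principal Component Pursuit). The sketch below is a minimal, illustrative implementation of the inexact augmented Lagrange multiplier (IALM) method from reference [8], run on a synthetic matrix; it is not the paper's exact algorithm, and the function name `rpca_ialm`, the matrix sizes, and all parameter defaults are assumptions for illustration.

```python
import numpy as np

def rpca_ialm(M, lam=None, tol=1e-7, max_iter=500):
    """Decompose M into L (low-rank) + S (sparse) by solving
        min ||L||_* + lam * ||S||_1   s.t.   M = L + S
    with the inexact ALM iteration (Lin, Chen, Ma 2010)."""
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))      # standard weight from Candes et al.
    norm_two = np.linalg.norm(M, 2)          # spectral norm
    Y = M / max(norm_two, np.abs(M).max() / lam)  # dual variable init
    mu = 1.25 / norm_two                     # penalty parameter
    mu_bar, rho = mu * 1e7, 1.5              # penalty cap and growth factor
    S = np.zeros_like(M)
    L = np.zeros_like(M)
    for _ in range(max_iter):
        # L-update: singular value thresholding of (M - S + Y/mu)
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = U @ np.diag(np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # S-update: elementwise soft-thresholding
        T = M - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        Z = M - L - S                        # residual
        Y = Y + mu * Z
        mu = min(mu * rho, mu_bar)
        if np.linalg.norm(Z, 'fro') <= tol * np.linalg.norm(M, 'fro'):
            break
    return L, S

# Synthetic stand-in for a topic matrix: rank-3 background + 5% sparse spikes.
rng = np.random.default_rng(0)
L0 = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 40))
S0 = np.zeros((40, 40))
mask = rng.random((40, 40)) < 0.05
S0[mask] = 10.0 * rng.standard_normal(mask.sum())
M = L0 + S0
L, S = rpca_ialm(M)
```

In the paper's setting, the rows and columns of `M` would come from a term-document topic matrix rather than random data; the default weight `lam = 1/sqrt(max(m, n))` is the usual theoretical choice and may need tuning on real corpora.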

References (20)

  • 1 BLEI D M, NG A Y, JORDAN M I. Latent Dirichlet allocation [J]. The Journal of Machine Learning Research, 2003, 3: 993-1022.
  • 2 BLEI D M, GRIFFITHS T L, JORDAN M I, et al. Hierarchical topic models and the nested Chinese restaurant process [C]// Advances in Neural Information Processing Systems: Proceedings of the 2003 Conference. Cambridge: MIT Press, 2004, 16: 106-114.
  • 3 CHEMUDUGUNTA C, SMYTH P, STEYVERS M. Modeling general and specific aspects of documents with a probabilistic topic model [C]// Advances in Neural Information Processing Systems: Proceedings of the 2006 Conference. Cambridge: MIT Press, 2007, 19: 241.
  • 4 MEI Q, LING X, WONDRA M, et al. Topic sentiment mixture: modeling facets and opinions in weblogs [C]// Proceedings of the 16th International Conference on World Wide Web. New York: ACM Press, 2007: 171-180.
  • 5 CANDÈS E J, LI X, MA Y, et al. Robust principal component analysis? [J]. Journal of the ACM, 2011, 58(3): 11.
  • 6 MIN K, ZHANG Z, WRIGHT J, et al. Decomposing background topics from keywords by principal component pursuit [C]// Proceedings of the 19th ACM International Conference on Information and Knowledge Management. New York: ACM Press, 2010: 269-278.
  • 7 LANDAUER T K, FOLTZ P W, LAHAM D. An introduction to latent semantic analysis [J]. Discourse Processes, 1998, 25(2/3): 259-284.
  • 8 LIN Z, CHEN M, MA Y. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices [EB/OL]. [2013-08-16]. http://arxiv.org/pdf/1009.5055v3.pdf.
  • 9 BLEI D M, LAFFERTY J D. Dynamic topic models [C]// Proceedings of the 23rd International Conference on Machine Learning. New York: ACM Press, 2006: 113-120.
  • 10 BLEI D M, LAFFERTY J D. A correlated topic model of science [J]. The Annals of Applied Statistics, 2007, 1(1): 17-35.

Co-cited references: 9

Citing articles: 2

Second-level citing articles: 3
