Precise text mining using low-rank matrix decomposition (基于低秩分解的精细文本挖掘方法)

Cited by: 2


Abstract: Applications such as full-text retrieval require a fine-grained representation of text. Traditional topic models capture only the topic background of a document and cannot describe its specific focus. To address this, a low-rank and sparse text representation model is proposed that splits the representation of a text into two parts: a low-rank component, which represents the topic background, and a sparse component, which holds the keywords describing different aspects within the topic. To separate the two components, a topic matrix is defined and Robust Principal Component Analysis (RPCA) is used to perform the matrix decomposition. Experiments on a news corpus show that the model complexity is 25% lower than that of Latent Dirichlet Allocation (LDA). In practical applications, applying the low-rank component to text classification reduces the number of features required by 28.7%, so it can be used to reduce the dimensionality of the feature set; applying the sparse component to full-text retrieval improves the precision of the retrieval results by 10.8% compared with LDA, which helps improve the retrieval hit rate.
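The low-rank plus sparse split described in the abstract can be illustrated with a minimal sketch of RPCA solved as principal component pursuit. This is not the authors' implementation: the abstract does not give the definition of the topic matrix, so the matrix `M` here is a synthetic stand-in, and the function names (`shrink`, `svd_shrink`, `rpca`) and the default values of `lam` and `mu` are assumptions taken from common practice for this solver.

```python
import numpy as np


def shrink(X, tau):
    """Soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def svd_shrink(X, tau):
    """Singular value thresholding, the proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt


def rpca(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Split M into L (low-rank) + S (sparse) by minimizing
    ||L||_* + lam * ||S||_1 subject to L + S = M, using ADMM.
    Hypothetical helper; defaults are conventional choices, not from the paper."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    norm_M = np.linalg.norm(M)  # Frobenius norm
    if norm_M == 0.0:
        return L, S  # nothing to decompose
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))
    if mu is None:
        mu = m * n / (4.0 * np.abs(M).sum())
    Y = np.zeros_like(M)  # Lagrange multiplier for the constraint L + S = M
    for _ in range(max_iter):
        L = svd_shrink(M - S + Y / mu, 1.0 / mu)  # update low-rank part
        S = shrink(M - L + Y / mu, lam / mu)      # update sparse part
        R = M - L - S                             # constraint residual
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * norm_M:
            break
    return L, S


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in: a rank-2 "topic background" plus a few large "keyword" entries.
    background = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 30))
    keywords = np.zeros((40, 30))
    idx = rng.choice(keywords.size, size=60, replace=False)
    keywords.flat[idx] = rng.normal(scale=10.0, size=60)
    M = background + keywords
    L, S = rpca(M)
    print("relative residual:", np.linalg.norm(M - L - S) / np.linalg.norm(M))
    print("nonzeros in S:", int(np.count_nonzero(np.abs(S) > 1e-6)))
```

In this sketch `L` plays the role of the topic background and `S` the role of the keyword component. How well the two are separated depends on `lam`, and any real use would need the paper's own construction of the topic matrix.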

Authors: Huang Xiaohai (黄晓海), Guo Zhi (郭智), Huang Yu (黄宇)

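Affiliations: Institute of Electronics, Chinese Academy of Sciences; Key Laboratory of Technology in Geo-spatial Information Processing and Application System, Chinese Academy of Sciences; School of Information Science and Engineering, University of Chinese Academy of Sciences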

Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list; 2014, No. 6, pp. 1626-1630 (5 pages)

Keywords: text mining; topic background; keyword; low-rank decomposition

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]



