Abstract
Applications such as full-text retrieval require a fine-grained representation of text. Traditional topic models can only extract the topic background of a text and cannot describe its specific emphasis. To address this, a low-rank and sparse text representation model is proposed that splits the representation of a text into two parts: a low-rank component that represents the topic background, and a sparse component that describes different aspects of the topic through keywords. To perform the decomposition, a topic matrix is defined and Robust Principal Component Analysis (RPCA) is used to factor it. Experiments on a news corpus show that the model complexity is 25% lower than that of the Latent Dirichlet Allocation (LDA) model. In practical applications, using the low-rank component for text classification reduces the number of features required by 28.7%, so it can serve for dimensionality reduction of the feature set; using the sparse component for full-text retrieval improves retrieval precision by 10.8% over the LDA model, which helps improve the hit rate of retrieval results.
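The abstract does not say which RPCA solver is used or how the topic matrix is built, so the following is only a minimal sketch of the general idea: splitting a matrix M into a low-rank part L and a sparse part S by principal component pursuit, here with a fixed-penalty augmented Lagrangian iteration. The function names (`shrink`, `svd_threshold`, `rpca`), the default parameters, and the synthetic matrix standing in for a topic matrix are all assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def shrink(X, tau):
    """Soft-thresholding: the proximal operator of the L1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Singular value thresholding: the proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def rpca(M, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Approximately solve  min ||L||_* + lam * ||S||_1  s.t.  L + S = M."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))        # common default weight (assumption)
    if mu is None:
        mu = m * n / (4.0 * np.abs(M).sum())  # common default penalty (assumption)
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                      # Lagrange multiplier
    norm_M = np.linalg.norm(M)
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)  # low-rank update
        S = shrink(M - L + Y / mu, lam / mu)         # sparse update
        R = M - L - S                                # constraint residual
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * norm_M:
            break
    return L, S

# Synthetic stand-in for a topic matrix: rank-2 background plus a few large spikes.
rng = np.random.default_rng(0)
background = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 30))
spikes = np.zeros((40, 30))
mask = rng.random((40, 30)) < 0.05
spikes[mask] = 5.0 * rng.choice([-1.0, 1.0], size=int(mask.sum()))
M = background + spikes

L, S = rpca(M)
```

In the abstract's terms, `L` plays the role of the topic background and the nonzero entries of `S` play the role of keyword-level detail; how rows and columns map to documents, topics, or terms depends on the paper's definition of the topic matrix, which the abstract does not give.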
Source
Journal of Computer Applications
CSCD
PKU Core Journal
2014, Issue 6, pp. 1626-1630 (5 pages)
Keywords
text mining
topic background
keyword
low-rank decomposition