Abstract
Topic identification and keyword extraction for a specific domain have wide application, but topic categories that are assigned manually or generated automatically by text clustering lack an objective measure. This paper combines model selection theory based on the Bayesian Information Criterion (BIC) with Independent Component Analysis (ICA) to estimate the number of topics probabilistically, giving the statistical distribution of the topic number in the BIC sense. On that basis, it performs an ICA decomposition of the term-document matrix and derives each topic's keywords and their weights from the separated independent components. Experiments show that, without domain knowledge, the method can estimate a topic number that reflects the text collection and extract the corresponding keywords.
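The abstract does not give the likelihood model or the exact estimator, so the following is only a minimal sketch of the general idea of turning BIC scores into a distribution over candidate topic numbers. It assumes the standard definition BIC = -2 ln L + k ln N and the usual approximation P(K | data) ∝ exp(-BIC_K / 2) under a uniform prior over K. The function names, the log-likelihood values, and the parameter counts are all hypothetical and are not taken from the paper.

```python
import math


def bic(log_likelihood, n_params, n_samples):
    """Standard BIC: -2 ln L + k ln N (lower is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)


def topic_number_distribution(log_likelihoods, n_params, n_samples):
    """Approximate P(K | data) from BIC, assuming a uniform prior over K.

    log_likelihoods: {K: maximized log-likelihood of the K-topic model}
    n_params:        {K: number of free parameters of the K-topic model}
    """
    scores = {
        k: -0.5 * bic(ll, n_params[k], n_samples)
        for k, ll in log_likelihoods.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical inputs, for illustration only.
log_likelihoods = {2: -1250.0, 3: -1180.0, 4: -1172.0, 5: -1169.0}
n_params = {k: 10 * k for k in log_likelihoods}  # assumed: 10 parameters per topic
n_samples = 200                                   # assumed number of documents

dist = topic_number_distribution(log_likelihoods, n_params, n_samples)
best_k = max(dist, key=dist.get)
print(best_k, dist)
```

In a pipeline like the one the abstract describes, the log-likelihoods would presumably come from fitting an ICA model for each candidate K, and the most probable K would then fix the number of independent components used for keyword extraction.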
Source
《计算机工程》 (Computer Engineering)
CAS
CSCD
北大核心 (Peking University Core Journals)
2009, Issue 7, pp. 183-185 (3 pages)
Keywords
topic identification
keyword extraction
Independent Component Analysis (ICA)
Bayesian Information Criterion (BIC)