Abstract
Feature selection is an important foundation of text filtering. However, the characters and words commonly used as feature terms have a notable drawback: they cannot express the semantic information of a text. Building on the vector space model, this paper proposes a semantic-based method for selecting text feature terms that uses HowNet as its semantic knowledge base. Compared with purely lexical features, the method better represents the conceptual features of a text and improves the performance of the filtering system. It also reduces the dimensionality of the text vectors, which lowers the computational load and improves filtering efficiency. Experiments on a Chinese text filtering system that incorporates the method confirm its effectiveness.
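As a rough illustration of the idea summarized above, the following Python sketch compares a word-based vector space representation with a concept-based one. It is only a sketch under stated assumptions, not the paper's method: the `TOY_CONCEPTS` table is a hypothetical stand-in for a HowNet lookup (it contains no real HowNet data), and the function names, the raw-count weighting, and the cosine comparison are choices made for this example.

```python
from collections import Counter
from math import sqrt

# Hypothetical word-to-concept table; a stand-in for a HowNet lookup,
# NOT actual HowNet data.
TOY_CONCEPTS = {
    "car": "vehicle",
    "truck": "vehicle",
    "bus": "vehicle",
    "doctor": "medical",
    "nurse": "medical",
    "hospital": "medical",
}


def word_vector(tokens):
    """Bag-of-words vector: one dimension per distinct word."""
    return Counter(tokens)


def concept_vector(tokens, concepts=TOY_CONCEPTS):
    """Concept vector: words sharing a concept collapse into one dimension.

    Words missing from the table are kept as their own dimension.
    """
    return Counter(concepts.get(tok, tok) for tok in tokens)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Two toy documents with related topics but no words in common.
doc_a = ["car", "truck", "doctor"]
doc_b = ["bus", "nurse", "hospital"]

# Size of the feature space under each representation.
vocab_words = set(doc_a) | set(doc_b)
vocab_concepts = set(concept_vector(doc_a)) | set(concept_vector(doc_b))

# Similarity under each representation.
word_sim = cosine(word_vector(doc_a), word_vector(doc_b))
concept_sim = cosine(concept_vector(doc_a), concept_vector(doc_b))
```

In this toy case the two documents share no words, so their word-level similarity is zero, while the concept-level vectors overlap and use fewer dimensions. This mirrors the two benefits the abstract claims, but the numbers say nothing about the reported experiments.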
Source
《通信学报》
EI
CSCD
PKU Core (北大核心)
2004, No. 7, pp. 46-54 (9 pages)
Journal on Communications
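Funding
National Information Security Assurance Sustainable Development Program Fund
National Natural Science Foundation of China (69873011, 69935010, 60103014)
National "863" Program of China (2001AA114120, 2002AA142090)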
Keywords
text filtering
feature selection
vector space model
HowNet