Abstract
This paper proposes a vector space model that represents documents using both keyword features and co-occurrence word-pair features. First, candidate keywords are extracted from documents through word segmentation and stop-word removal, and keyword features are selected by document frequency. Second, candidate co-occurrence word pairs are constructed by pairing the selected keyword features, and support and confidence measures are defined to select the co-occurrence word-pair features. Finally, the keyword features and the co-occurrence word-pair features are combined to build the vector space model. Text classification experiments show that the proposed model achieves stronger text classification performance.
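The three steps in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything specific here is an assumption: support is taken as the fraction of documents containing both words, confidence as the pair's document count divided by the smaller of the two words' document frequencies (both borrowed from association-rule mining), co-occurrence is counted at document level, and feature weights are binary. The names `STOP_WORDS`, `tokenize`, `build_features`, `vectorize`, and all thresholds are hypothetical, and whitespace splitting stands in for a real word segmenter.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list; a real system would use a full list.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    # Whitespace splitting stands in for a real word segmenter (assumption).
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_features(docs, min_df=2, min_support=0.3, min_confidence=0.6):
    """Select keyword features and co-occurrence word-pair features."""
    token_sets = [set(tokenize(d)) for d in docs]
    n_docs = len(token_sets)

    # Step 1: keep candidate keywords whose document frequency is high enough.
    df = Counter(w for tokens in token_sets for w in tokens)
    keywords = sorted(w for w, count in df.items() if count >= min_df)
    keyword_set = set(keywords)

    # Step 2: pair up the selected keywords that occur in the same document.
    pair_df = Counter()
    for tokens in token_sets:
        present = sorted(tokens & keyword_set)
        pair_df.update(combinations(present, 2))

    pairs = []
    for (a, b), count in sorted(pair_df.items()):
        support = count / n_docs                # assumed definition
        confidence = count / min(df[a], df[b])  # assumed definition
        if support >= min_support and confidence >= min_confidence:
            pairs.append((a, b))
    return keywords, pairs


def vectorize(text, keywords, pairs):
    # Step 3: concatenate keyword indicators and word-pair indicators.
    tokens = set(tokenize(text))
    keyword_part = [1 if w in tokens else 0 for w in keywords]
    pair_part = [1 if a in tokens and b in tokens else 0 for a, b in pairs]
    return keyword_part + pair_part
```

Calling `build_features` on a training corpus returns the two feature lists, and `vectorize` then maps any document to a vector of length `len(keywords) + len(pairs)` that can be passed to a standard classifier.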
Source
《计算机工程与科学》
CSCD
Peking University Core Journals (北大核心)
2014, Issue 5, pp. 971-976 (6 pages)
Computer Engineering & Science
Funding
Science and Technology Support Program project of the Twelfth Five-Year Plan (2011BAH10B04)
Keywords
vector space model
co-occurrence word pair
semantic relatedness
text classification