Abstract
Feature selection is a process of dimensionality reduction and denoising. Whether a term has strong class-discriminating power is measured by the weight that a feature selection evaluation function assigns to it. Many factors affect feature selection, chiefly the dimensionality, importance, and semantics of the features. Short texts carry little information, so their feature representations are high-dimensional and sparse, and traditional feature extraction methods lack semantics. To address both problems, a multi-factor feature selection function, FS, is constructed. Compared with the traditional feature selection function TF-IDF, FS not only incorporates the semantics of features but also removes a large number of redundant features and raises the weights of features that discriminate between classes. Short text classification experiments on the Chinese corpus from Sogou Lab, with FS as the feature selection function, verify the effectiveness of the method.
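For orientation, the TF-IDF baseline named in the abstract can be sketched as follows. This is only a minimal illustration of conventional TF-IDF term weighting followed by top-k selection. It is not the authors' FS function, whose definition the abstract does not give. The function names `tfidf_scores` and `select_features`, the toy corpus, and the choice to rank each term by its maximum weight across documents are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return, for each term, its maximum TF-IDF weight over all documents.

    `docs` is a list of tokenized documents (lists of strings).
    Ranking by the maximum weight is an illustrative assumption.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    best = {}
    for doc in docs:
        counts = Counter(doc)
        for term, count in counts.items():
            tf = count / len(doc)              # relative term frequency
            idf = math.log(n_docs / df[term])  # plain, unsmoothed IDF
            best[term] = max(best.get(term, 0.0), tf * idf)
    return best


def select_features(docs, k):
    """Keep the k terms with the highest TF-IDF weight (ties broken alphabetically)."""
    scores = tfidf_scores(docs)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus of three tokenized short texts.
docs = [
    ["stock", "market", "news"],
    ["football", "match", "news"],
    ["stock", "price", "news"],
]
scores = tfidf_scores(docs)
selected = select_features(docs, 3)
```

In this toy corpus, "news" appears in every document, so its IDF and therefore its weight is zero and it is never selected. The weighting uses only term counts, which is the lack of semantics that the abstract attributes to traditional methods.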
Authors
LI Wen-Hui (李文慧)
ZHANG Ying-Jun (张英俊)
PAN Li-Hu (潘理虎)
LI Wen-Hui; ZHANG Ying-Jun; PAN Li-Hu (School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China; Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China)
Source
Computer Systems & Applications (《计算机系统应用》)
2018, No. 12, pp. 216-221 (6 pages)
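Funding
Shanxi Province and Chinese Academy of Sciences Science and Technology Cooperation Project (20141101001)
Shanxi Province Major Science and Technology Special Project of the 12th Five-Year Plan (20121101001)
Shanxi Province Social Development Science and Technology Research Project (20140313020-1)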