n-grams Feature Weighting Algorithm Based on Relevance and Semantics (cited by: 2)
Abstract: Using n-grams as text classification features tends to lower classification accuracy, and n-gram weighting schemes usually ignore the redundancy and relevance between words. To address these problems, an n-grams feature weighting algorithm based on relevance and semantics is proposed. During text preprocessing, feature reduction is applied to the n-grams to decrease their internal redundancy. The n-grams are then weighted according to the relevance between the words inside each n-gram and the classes, and the semantic similarity between the n-grams and the testing dataset. Experimental results on the Sogou Chinese news corpus and the NetEase text classification corpus show that the proposed algorithm selects n-gram features with high class relevance and low redundancy, and reduces the sparse data produced when quantifying the testing dataset.
Source: Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》), indexed in EI, CSCD, and the Peking University Core Journals list, 2015, No. 11, pp. 992-1001 (10 pages)
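Funding: Supported by the National Natural Science Foundation of China (No.70971059), the Liaoning Province Innovation Team Project (No.2009T045), and the Liaoning Province Growth Program for Outstanding Young Scholars in Higher Education Institutions (No.LJQ2012027)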
Keywords: Maximum Relevance Minimum Redundancy (mRMR), Semantic Similarity, n-grams, Feature Weighting
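The abstract gives no formulas, so the following is only a minimal sketch of the general mRMR idea it names: score an n-gram by how relevant its words are to the class labels, minus how redundant the words are with each other. It is not the paper's algorithm. The function names (`mutual_information`, `mrmr_weight`), the use of binary word presence, the toy documents, and the "mean relevance minus mean redundancy" combination are all assumptions made for illustration, and the semantic-similarity term described in the abstract is omitted.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


def mrmr_weight(ngram, docs, labels):
    """Hypothetical mRMR-style weight for one n-gram (assumption, not the paper's formula).

    relevance  = mean MI between each word's presence and the class label
    redundancy = mean MI between the presence of each pair of words in the n-gram
    """
    presence = {w: [int(w in doc) for doc in docs] for w in ngram}
    relevance = sum(mutual_information(presence[w], labels) for w in ngram) / len(ngram)
    pairs = list(combinations(ngram, 2))
    if not pairs:  # a unigram has no internal redundancy
        return relevance
    redundancy = sum(
        mutual_information(presence[a], presence[b]) for a, b in pairs
    ) / len(pairs)
    return relevance - redundancy


# Toy data (invented for illustration): each document is a set of tokens.
docs = [
    {"stock", "market", "rise"},
    {"stock", "price", "fall"},
    {"team", "win", "match"},
    {"team", "lose", "match"},
]
labels = ["finance", "finance", "sports", "sports"]

w_redundant = mrmr_weight(("team", "match"), docs, labels)
w_mixed = mrmr_weight(("stock", "market"), docs, labels)
```

In this toy corpus "team" and "match" always occur together, so the second word adds no information and the penalty cancels the relevance; "stock market" keeps a positive weight because its two words are only partly redundant.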