Abstract
[Objective] This paper proposes a term weighting method to address the high dimensionality and sparsity that arise in topic classification of university-related Sina Weibo posts. [Methods] We first calculated the probability that each term belongs to each category, then used these probabilities to predict the probability that a document belongs to each category. This converts the features from a word-based representation to a category-based one, and a support vector machine then classifies the converted feature matrix. [Results] When combined with the proposed method, the traditional tf, tf×idf and tf×rf schemes improved in Micro-F1/Macro-F1 by 7.2%/7.8%, 7.5%/7.9% and 6.4%/5.7%, respectively. [Limitations] The study explores only the term weighting step of topic classification and does not examine its other components. [Conclusions] For topic classification of university Internet public opinion, the proposed method effectively reduces the dimension of the feature matrix while improving both classification performance and classification efficiency.
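The conversion from word-based to category-based features described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything here is an assumption: the function names `term_class_probs` and `to_class_features` are hypothetical, the term-category probability is estimated as a Laplace-smoothed share of a term's occurrences per category, and a document's category scores are the tf-weighted average of its terms' probabilities. The SVM step is not shown.

```python
from collections import Counter, defaultdict


def term_class_probs(docs, labels, alpha=1.0):
    """Estimate P(category | term) from tokenized training documents.

    Assumption: the probability is the Laplace-smoothed share of a term's
    occurrences that fall in each category (not the paper's exact formula).
    """
    classes = sorted(set(labels))
    counts = defaultdict(Counter)  # term -> Counter(category -> occurrences)
    for tokens, label in zip(docs, labels):
        for tok in tokens:
            counts[tok][label] += 1
    probs = {}
    for term, by_class in counts.items():
        total = sum(by_class.values()) + alpha * len(classes)
        probs[term] = [(by_class[c] + alpha) / total for c in classes]
    return classes, probs


def to_class_features(tokens, classes, probs):
    """Map one document to a vector with one entry per category.

    Every occurrence of a known term adds that term's category distribution
    (so the weighting is raw tf); the sum is divided by the number of known
    tokens. Documents with no known term get a uniform vector.
    """
    vec = [0.0] * len(classes)
    known = 0
    for tok in tokens:
        if tok in probs:
            known += 1
            for i, p in enumerate(probs[tok]):
                vec[i] += p
    if known == 0:
        return [1.0 / len(classes)] * len(classes)
    return [v / known for v in vec]


# Toy data, invented purely for illustration.
train_docs = [
    ["exam", "library", "exam"],
    ["canteen", "food"],
    ["exam", "grade"],
    ["food", "price"],
]
train_labels = ["study", "life", "study", "life"]

classes, probs = term_class_probs(train_docs, train_labels)
features = [to_class_features(d, classes, probs) for d in train_docs]
```

In this sketch each document becomes a vector whose length equals the number of categories rather than the vocabulary size, which is where the dimension reduction would come from. The resulting matrix could then be passed to an SVM classifier, and tf×idf or tf×rf weights could stand in for the raw tf weighting used here.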
Authors
Jia Longjia (贾隆嘉)
Zhang Bangzuo (张邦佐)
Jia Longjia; Zhang Bangzuo (School of Mathematics and Statistics, Northeast Normal University, Changchun 130024, China; Department of Planning and Development, Northeast Normal University, Changchun 130024, China; School of Computer Science and Information Technology, Northeast Normal University, Changchun 130024, China)
Source
Data Analysis and Knowledge Discovery (《数据分析与知识发现》)
Indexed in: CSSCI, CSCD, Peking University Core Journals
2018, Issue 7, pp. 55-62 (8 pages)
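Funding
This work is one of the research outcomes of the following projects:
National Natural Science Foundation of China project "Research on Community Knowledge Organization and Knowledge Emergence in the Folksonomy Model Based on Network Structure Evolution" (Grant No. 71473035)
National Natural Science Foundation of China Young Scientists Fund project "Statistical Inference for Massive Short-Text Data Based on Bayesian Graphical Models" (Grant No. 11501095)
Jilin Provincial Department of Science and Technology Key Science and Technology Project "Research and Development of Key Technologies for E-Commerce Recommender Systems Integrating Social Relations Based on Heterogeneous Information Networks" (Grant No. 20150204040GX)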
Keywords
Internet Public Opinion Security
Theme Classification
Term Weighting
Machine Learning