Abstract
This paper studies machine-learning-based automatic classification (binary classification) of two categories in the Chinese Library Classification whose contents are very similar. Using the bibliographic records of categories E271 and E712.51 as the two classes, it compares the classification performance of representative techniques at each main stage of the pipeline: the feature selection methods CHI, IG and MI; the feature weighting schemes TF and TF*IDF; and the classification algorithms KNN, NB and SVM. The aim is to provide baseline data for future automatic classification research targeted at highly similar categories in the Chinese Library Classification. The experimental results show that, for feature selection, CHI and IG perform better and MI is somewhat weaker, although the performance of MI improves markedly once the number of features exceeds 4000. For classification algorithms, NB performs well when MI is used for feature selection, SVM performs better still with CHI and IG, and KNN is worse than both. For feature weighting, TF outperforms TF*IDF in most cases, but the result is sensitive to the classification algorithm, the number of features and the feature selection method. Combinations of these techniques can handle automatic classification of similar categories, but their performance varies, so the techniques need to be adapted to similar-category classification to improve performance further.
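To make the three feature selection scores named in the abstract concrete, the sketch below computes CHI, IG and MI for a single term in a two-class setting from a 2x2 document-count table. These are the standard textbook definitions, not necessarily the exact variants used in the paper; the function names, the argument order `(a, b, c, d)` and the toy counts are assumptions made for illustration only.

```python
import math

# Document counts for one term t and a target class c (two-class setting):
#   a = docs in c that contain t        b = docs not in c that contain t
#   c = docs in c that lack t           d = docs not in c that lack t
# (Argument names and layout are illustrative assumptions.)


def entropy(counts):
    """Shannon entropy in bits of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)


def chi_square(a, b, c, d):
    """CHI statistic of the 2x2 term/class table; 0.0 if a margin is empty."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def information_gain(a, b, c, d):
    """Class entropy minus the expected entropy after observing the term."""
    n = a + b + c + d
    prior = entropy([a + c, b + d])
    with_term = (a + b) / n * entropy([a, b])
    without_term = (c + d) / n * entropy([c, d])
    return prior - with_term - without_term


def mutual_information(a, b, c, d):
    """Pointwise MI (bits) between term presence and the target class."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never occurs in the target class
    return math.log2(a * n / ((a + c) * (a + b)))


if __name__ == "__main__":
    # Toy counts, not data from the paper.
    for name, table in [("frequent", (40, 10, 10, 40)), ("rare", (1, 0, 49, 50))]:
        print(name, chi_square(*table), information_gain(*table),
              mutual_information(*table))
```

With these toy counts, pointwise MI scores the term that appears in a single document higher than the frequent discriminative term, while CHI and IG rank them the other way round. This sensitivity of MI to rare terms is a general property of the formula; whether it accounts for the MI results reported here is not something the abstract states.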
Authors
李湘东
阮涛
Li Xiangdong; Ruan Tao (School of Information Management, Wuhan University; Center for Electronic Commerce Research and Development, Wuhan University)
Source
《图书馆杂志》
CSSCI
Peking University Core Journals (北大核心)
2018, No. 6, pp. 11-21, 30 (12 pages)
Library Journal
Keywords
Two-class classification (两类分类)
Chinese Library Classification (《中国图书馆分类法》)
Feature selection (特征选择)
Feature weighting (特征加权)
Text classification (文本分类)