Abstract
Conventional multi-language text classification (MLTC) handles only several independent single-language classification problems. To address this limitation, a feature-enhanced multi-language text classification method based on a statistical bilingual dictionary is proposed. When text in one language is classified, the training texts of the other languages are also taken into account, so that training texts are available across the text collections of several languages, which relaxes the requirements of MLTC. Feature enhancement is a cross-checking process: after the chi-square statistics of all features in the two languages are obtained, the discriminative power of a language's features is re-evaluated using the discriminative power of the related features, in order to improve the reliability of classification. Chinese or English is chosen as the target language in the experiments. The experimental results show that the proposed method achieves higher classification accuracy and is less sensitive to the size of the training set.
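The cross-checking idea can be illustrated with a minimal Python sketch. The chi-square statistic below is the standard one computed from a 2x2 term/category contingency table. The `enhance` function, its `alpha` mixing weight, and the translation-weighted average are assumptions made here for illustration only; the abstract does not state the formula the authors actually use to combine the scores.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a term/category pair from a 2x2 table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table carries no evidence of association.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def enhance(own_score, related, alpha=0.5):
    """Hypothetical cross-check (not the paper's formula).

    Blends a feature's own chi-square score with the weighted mean of the
    scores of its dictionary counterparts in the other language.

    related: list of (translation_weight, chi_square_in_other_language)
    alpha:   assumed mixing weight between own and cross-language evidence
    """
    total = sum(weight for weight, _ in related)
    if total == 0:
        # No dictionary counterpart: keep the original score.
        return own_score
    cross = sum(weight * score for weight, score in related) / total
    return alpha * own_score + (1 - alpha) * cross


# Example: an English term and two Chinese counterparts from the dictionary.
english_score = chi_square(40, 10, 10, 40)
chinese_scores = [(0.7, chi_square(35, 15, 15, 35)),
                  (0.3, chi_square(20, 30, 30, 20))]
print(enhance(english_score, chinese_scores))
```

In this sketch a feature whose counterparts are also discriminative keeps a high score, while a feature that is strong in only one language is pulled down, which is one plausible reading of "cross-checking".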
Authors
龚静
李英杰
黄欣阳
GONG Jing; LI Ying-jie; HUANG Xin-yang (Department of Public Basic Course, Hunan Polytechnic of Environment and Biology, Hengyang, Hunan 421005, China; Computer School, University of South China, Hengyang, Hunan 421001, China)
Source
《西南师范大学学报(自然科学版)》
CAS
Peking University Core Journals
2018, No. 9, pp. 45-50 (6 pages)
Journal of Southwest China Normal University(Natural Science Edition)
Funding
National Natural Science Foundation of China (60572137)
Hunan Provincial Department of Education Project (12C1056, 17C0599)
Keywords
multiple language text classification
bilingual dictionary
feature enhancing
cross examination
sensitivity