The Method of Web Text Classification of Using Non-labeled Training Sample (original Chinese title: 无标记训练样本的Web文本分类方法)

Cited by: 2

Abstract: Bayesian learning theory estimates unlabeled samples from prior information and sample data, and text classification applies it by learning from class-labeled samples in order to classify unlabeled texts. Obtaining labeled training samples, however, is costly. This paper approaches the problem from the viewpoint of clustering, an unsupervised learning method that does not depend on predefined classes or labeled training samples. It improves on traditional fuzzy clustering and proposes the Fuzzy Partition Clustering Method (FPCM), a dynamic, centroid-based clustering method. FPCM combines the unsupervised nature of clustering with prior knowledge about the samples: related texts are clustered by a similarity measure, which yields relatively objective clusters and a small number of labeled texts that give supervised learning a basis for classification. The classifier is then learned using naive Bayes incremental learning. In addition, candidate samples are selected in a balanced way by estimating the classification error loss, which improves classification accuracy. The result is a learning model for classifying unlabeled text that has a wider range of application.
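The pipeline described in the abstract (cluster unlabeled texts, keep the confidently clustered ones as labeled samples, then train a naive Bayes classifier incrementally) can be illustrated with a minimal sketch. This is not the paper's FPCM, whose details the abstract does not give: it substitutes standard fuzzy c-means, seeds the centroids with two hand-picked documents as a stand-in for prior knowledge, and omits the loss-based balanced candidate selection. All function names, the toy documents, and the membership threshold of 0.6 are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def vectorize(docs):
    """Turn token lists into term-frequency vectors over a shared vocabulary."""
    vocab = sorted({t for d in docs for t in d})
    return [[float(d.count(t)) for t in vocab] for d in docs]


def memberships(points, centroids, m=2.0):
    """Standard fuzzy c-means membership of every point in every cluster."""
    rows = []
    for p in points:
        # Clamp distances so a point lying on a centroid does not divide by zero.
        dists = [max(math.dist(p, c), 1e-12) for c in centroids]
        rows.append([
            1.0 / sum((d_j / d_k) ** (2.0 / (m - 1.0)) for d_k in dists)
            for d_j in dists
        ])
    return rows


def fuzzy_c_means(points, centroids, m=2.0, iters=20):
    """Alternate membership and centroid updates for a fixed number of rounds."""
    dims = range(len(points[0]))
    for _ in range(iters):
        u = memberships(points, centroids, m)
        new_centroids = []
        for j in range(len(centroids)):
            w = [row[j] ** m for row in u]
            total = sum(w)
            new_centroids.append(
                [sum(wi * p[d] for wi, p in zip(w, points)) / total for d in dims]
            )
        centroids = new_centroids
    return centroids, memberships(points, centroids, m)


class IncrementalNaiveBayes:
    """Multinomial naive Bayes whose counts are updated one sample at a time."""

    def __init__(self):
        self.class_docs = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def update(self, tokens, label):
        self.class_docs[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.class_docs.values())
        best, best_score = None, -math.inf
        for label, n in self.class_docs.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n / total_docs)
            for t in tokens:
                # Laplace (add-one) smoothing over the vocabulary seen so far.
                count = self.word_counts[label][t] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best, best_score = label, score
        return best


def pseudo_label_and_train(docs, seed_indices, threshold=0.6):
    """Cluster docs, then feed confidently clustered docs to the classifier."""
    points = vectorize(docs)
    seeds = [points[i] for i in seed_indices]  # hypothetical prior knowledge
    _, u = fuzzy_c_means(points, seeds)
    labels = [max(range(len(row)), key=row.__getitem__) for row in u]
    clf = IncrementalNaiveBayes()
    for tokens, row, label in zip(docs, u, labels):
        if row[label] >= threshold:  # keep only high-membership samples
            clf.update(tokens, label)
    return labels, clf


# Toy corpus (invented for illustration): three sports and three finance texts.
DOCS = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["match", "team", "win"],
    ["stock", "market", "price"],
    ["market", "trade", "stock"],
    ["price", "stock", "rise"],
]

if __name__ == "__main__":
    labels, clf = pseudo_label_and_train(DOCS, seed_indices=[0, 3])
    print(labels, clf.predict(["goal", "team"]), clf.predict(["stock", "price"]))
```

The class labels here are just cluster indices. Because the classifier only accumulates counts, further pseudo-labeled texts can be added later by calling `update` again without retraining, which is the property the incremental learning step relies on.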

Authors: Liu Lizhen (刘丽珍), Song Hantao (宋瀚涛), Lu Yuchang (陆玉昌)

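Affiliations: College of Information Engineering, Capital Normal University; Department of Computer Science, Beijing Institute of Technology; Department of Computer Science, Tsinghua University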

Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2006, No. 3, pp. 200-201, 211 (3 pages)

Funding: National Key Basic Research Program of China (973 Program) (G1998030414); Beijing Municipal Special Fund for Outstanding Talent (20042D0501604)

Keywords: Web text classification, fuzzy clustering, naive Bayes

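Classification codes: TP301.2 [Automation and Computer Technology: Computer System Architecture]; TP18 [Automation and Computer Technology: Computer Science and Technology]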
