Abstract
With the spread of the Internet, ever more data are generated online, yet in practice most of these data are never labeled, while supervised learning algorithms need enough labels to train on before a data mining task can be carried out. To address this shortage of labeled samples, two optimized methods for positive-unlabeled learning are proposed and implemented. The first method scores the unlabeled samples, assigns labels to them according to those scores, and then trains a model on the resulting labels. The second method controls the sample weights so that the information in the large body of data can be exploited. Experimental results show that both methods perform comparably to earlier methods, and in some cases better, while being simpler to implement and more robust.
Authors
XIONG Zhixiang (熊智翔); LU Qing (陆青); WANG Yin (王胤)
Department of Computer Science and Technology, Tongji University, Shanghai 201804, China; Key Laboratory of Embedded Systems and Service Computing (Tongji University), Shanghai 201804, China; Technique Center, Eleme, Shanghai 200333, China
Source
《计算机应用》
CSCD
Peking University Core Journal (北大核心)
2018, Issue A02, pp. 11-15, 41 (6 pages in total)
Journal of Computer Applications