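Abstract
As the number of microblog robot accounts keeps growing, detecting them has become a prominent problem in data mining. Existing studies on microblog robot identification mostly rely on crawled data and train and validate their models on small, balanced sets of robot and ordinary-user accounts, which limits them in real settings where the class distribution is imbalanced. Resampling is a common technique for classification on imbalanced data sets. To examine how resampling affects supervised robot-identification algorithms, this paper builds on real data from the Weiredian (微热点) data mining competition and proposes a microblog robot identification framework combined with resampling. Using 5 different sampling methods and several evaluation metrics, it evaluates the classification performance of 7 supervised learning algorithms on imbalanced validation sets. The experimental results show that models trained on small balanced samples suffer a large drop in Recall under real conditions, whereas the framework combined with resampling substantially improves the identification rate of robot accounts: NearMiss undersampling greatly increases Recall, and ADASYN oversampling improves G_mean. In general, a microblog user's posting time, posting region, and posting interval are important features for distinguishing normal users from robots. Resampling adjusts the features that the machine learning algorithm relies on, which yields better predictive performance.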
With the increasing number of microblog robot accounts, their identification has become a prominent problem in the data mining field. To deal with the imbalanced-data issue in this task, we choose a large data set to explore the influence of resampling on supervised learning algorithms and propose a novel microblog robot recognition framework combined with resampling. A variety of metrics are used to evaluate the performance of 7 supervised learning algorithms on imbalanced validation sets under 5 different resampling methods. The experimental results show that the Recall of a model trained on a small balanced training set is seriously reduced in real situations, while the framework combined with resampling can significantly improve the recognition of robot accounts. The NearMiss undersampling method increases Recall, while the ADASYN oversampling method improves the G_mean measure. Generally speaking, the release time, publishing region, and release interval are important features for distinguishing normal users from robots. At the same time, resampling adjusts the ranking of the features that the machine learning algorithm depends on, so that the model achieves better performance.
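The abstract reports results in terms of Recall and G_mean on imbalanced validation sets. As a minimal sketch of how these two metrics are commonly computed, the Python below assumes the usual definition of G_mean as the geometric mean of recall on the positive (robot) class and specificity on the negative class; the abstract does not spell out the paper's exact formula, so this definition is an assumption. The function name `recall_and_gmean` and the toy labels are hypothetical and are not taken from the paper's data.

```python
import math


def recall_and_gmean(y_true, y_pred, positive=1):
    """Return (recall, g_mean) for a binary task.

    Assumes G-mean = sqrt(recall * specificity), a common convention;
    the paper's own definition is not given in the abstract.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return recall, math.sqrt(recall * specificity)


# Hypothetical toy validation set: 10 robots (label 1) among 100 accounts.
y_true = [1] * 10 + [0] * 90
# Hypothetical predictions: 4 robots found, 6 missed, 9 normal users flagged.
y_pred = [1] * 4 + [0] * 6 + [1] * 9 + [0] * 81

recall, gmean = recall_and_gmean(y_true, y_pred)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
print(f"accuracy={accuracy:.2f} recall={recall:.2f} g_mean={gmean:.2f}")
```

On a set this imbalanced, accuracy can stay high even when most robots are missed, which is one reason Recall and G_mean are the more informative measures here.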
Authors
罗云松
黄慕宇
贾韬
LUO Yunsong; HUANG Muyu; JIA Tao (School of Computer and Information Science, Southwest University, Chongqing 400715, China; Chang'an Automobile Finance, Chongqing 400020, China)
Source
《中文信息学报》
CSCD
Peking University Core Journals (北大核心)
2021, Issue 12, pp. 133-148 (16 pages)
Journal of Chinese Information Processing
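Funding
National Natural Science Foundation of China (62006198)
Ministry of Education China University Industry-University-Research Innovation Fund (2021ALA03016)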