Abstract
To address the poor generalization of sentiment classification models trained on unbalanced interactive text, where minority-class instances are scarce, an emotional instance transfer method based on hyperplane distance is proposed. The method treats source-dataset instances lying between the minority-class and majority-class support vectors as transfer candidates, and constructs an offset hyperplane from the classification hyperplane on the target dataset. Following the principle of optimal information utility, it selects the candidates with the shortest distance to the offset hyperplane, and it uses a migration ratio to control how many instances are transferred when generating the synthetic dataset. Experimental results show that as more instances are transferred, the synthetic dataset deviates further from the original distribution, and the generalization performance of the model trained with sequential minimal optimization (SMO) first rises and then falls, which resembles the Wundt curve of information utility. Compared with three data-level processing methods (SMOTE, Subsampling, and Oversampling), five classification models (SMO, LibSVM, random forest, cost-sensitive, and CNN) trained with the proposed method obtain an average increase of 11% in the F-value for minority-class recognition, and the optimal migration ratio lies between 20% and 30%. The proposed method therefore alleviates the class imbalance while improving generalization on the minority class.
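The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: the classifier is taken to be linear with known weights `w` and bias `b`; "between the support vectors" is read as lying strictly inside the margin, |w·x + b| < 1; the offset hyperplane is taken to be parallel to the classification hyperplane, w·x + b = `offset`; and the migration ratio is applied to a caller-supplied `base_size`. The function and parameter names are hypothetical.

```python
import math


def decision_value(x, w, b):
    """Return w.x + b for a linear classifier with weights w and bias b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def select_transfer_instances(source, w, b, offset, ratio, base_size):
    """Pick the source instances closest to an offset hyperplane.

    Assumptions (not stated in the abstract):
    - candidates are source instances inside the margin, |w.x + b| < 1;
    - the offset hyperplane is w.x + b = offset, parallel to the classifier;
    - the number of transferred instances is floor(ratio * base_size).
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    # Keep only instances that fall between the two margin boundaries.
    candidates = [x for x in source if abs(decision_value(x, w, b)) < 1.0]
    # Rank candidates by geometric distance to the offset hyperplane.
    ranked = sorted(
        candidates,
        key=lambda x: abs(decision_value(x, w, b) - offset) / norm,
    )
    k = int(ratio * base_size)
    return ranked[:k]


# Toy data: five 2-D source instances and a classifier that uses only x[0].
source = [[0.4, 1.0], [0.9, 0.0], [2.0, 0.0], [-0.2, 3.0], [0.55, -1.0]]
chosen = select_transfer_instances(
    source, w=[1.0, 0.0], b=0.0, offset=0.5, ratio=0.25, base_size=8
)
print(chosen)
```

In this toy run the instance at `[2.0, 0.0]` is excluded because it lies outside the margin, and the two remaining instances nearest to the offset hyperplane are returned. Raising `ratio` admits instances that lie farther from the offset hyperplane.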
Authors
田锋
王媛媛
吴凡
郑庆华
TIAN Feng; WANG Yuanyuan; WU Fan; ZHENG Qinghua (Shaanxi Key Laboratory of Satellite and Terrestrial Network Technology Research and Development, Xi'an Jiaotong University, Xi'an 710049, China; School of Electronics and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China)
Source
《西安交通大学学报》
EI
CAS
CSCD
PKU Core (Peking University core journal list)
2018, No. 10, pp. 1-7 (7 pages)
Journal of Xi'an Jiaotong University
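Funding
National Natural Science Foundation of China (61472315)
Innovative Research Group Project of the National Natural Science Foundation of China (61721002)
National Key Research and Development Program of China (2016YFB1000903)
Ministry of Education "Innovation Team" Program (IRT17R86)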
Keywords
instance transfer
information utility
unbalanced classification
hyperplane