Funding: Supported by the National Key R&D Program (No. 2022ZD0114803) and the National Natural Science Foundation of China (Grant Nos. 62136005, 61922087).
Abstract: Clustering is widely used in data mining. Embedding weak label priors into clustering has been shown to improve its performance. Previous research has mainly focused on a single type of prior. However, in many real scenarios, two kinds of weak label prior information, e.g., pairwise constraints and cluster ratio, are easily obtained or already available. How to incorporate both to improve clustering performance is important but rarely studied. We propose a novel constrained Clustering with Weak Label Prior method (CWLP), which is an integrated framework. Within a unified spectral clustering model, the pairwise constraints are employed as a regularizer in spectral embedding, and the label proportion is added as a constraint in spectral rotation. To approximate a variant of the embedding matrix more precisely, we replace the cluster indicator matrix with its scaled version. Instead of fixing an initial similarity matrix, we propose a new similarity matrix that is more suitable for deriving clustering results. In addition to theoretical convergence and computational complexity analyses, we validate the effectiveness of CWLP on several benchmark datasets, together with its ability to discriminate suspected breast cancer patients from healthy controls. The experimental evaluation illustrates the superiority of the proposed approach.
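The abstract does not define the "scaled version" of the cluster indicator matrix. As a minimal sketch, the code below assumes the definition commonly used in spectral clustering, G = Y (Y^T Y)^{-1/2}, where Y is the one-hot indicator matrix; the function names `scaled_indicator` and `gram` are hypothetical and not taken from the paper. Under this assumption each column of Y is divided by the square root of its cluster size, so G has orthonormal columns, as a spectral embedding does.

```python
import math


def scaled_indicator(labels, n_clusters):
    """Return G = Y (Y^T Y)^{-1/2} as a list of rows (assumed definition).

    Y is the one-hot cluster indicator. Y^T Y is diagonal and holds the
    cluster sizes, so the scaling divides column k by sqrt(size of cluster k).
    """
    sizes = [0] * n_clusters
    for k in labels:
        sizes[k] += 1
    if min(sizes) == 0:
        raise ValueError("every cluster needs at least one member")
    return [
        [1.0 / math.sqrt(sizes[k]) if k == j else 0.0 for j in range(n_clusters)]
        for k in labels
    ]


def gram(matrix):
    """Compute G^T G for a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [
        [sum(row[a] * row[b] for row in matrix) for b in range(n_cols)]
        for a in range(n_cols)
    ]


# Six samples assigned to three clusters of sizes 2, 1 and 3.
labels = [0, 0, 1, 2, 2, 2]
G = scaled_indicator(labels, 3)
GtG = gram(G)  # numerically the 3 x 3 identity matrix
```

The cluster sizes computed inside `scaled_indicator` are also the quantities a cluster-ratio prior would constrain, although how CWLP enforces that constraint during spectral rotation is not specified in the abstract.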
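Abstract: Deep active learning (DAL) has been applied successfully to annotating classification data, but selecting the samples that most improve model performance remains difficult. This paper proposes a semi-automatic annotation method for classification data based on disputes over weak labels, Dispute about Weak Label based Deep Active Learning (DWLDAL), which iteratively selects the samples the model finds hard to distinguish and hands them to human annotators for accurate labeling. The method comprises a pseudo-label generator and weak-label generators: the pseudo-label generator is trained on the accurately labeled dataset and produces pseudo-labels for unlabeled data, while each weak-label generator is trained on a random subset of the pseudo-labeled data. A committee of weak-label generators decides which unlabeled samples are most disputed, and those samples are passed to human annotators. For text classification, the method is evaluated on the public datasets IMDB (Internet Movie Database), 20NEWS (20 Newsgroups), and chnsenticorp (chnsenticorp_htl_all). Three voting schemes are assessed from two perspectives: annotation accuracy and classification accuracy. Compared with the existing method Snuba, DWLDAL improves the F1 score of data annotation by 30.22%, 14.07%, and 2.57%, and the F1 score of the classification task by 1.01%, 22.72%, and 4.83%, respectively.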
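The abstract does not say which three voting schemes were compared. As an illustration of how a committee can rank samples by dispute, the sketch below uses vote entropy, a standard query-by-committee disagreement measure; it is an assumption here, not the paper's confirmed scheme. The names `vote_entropy` and `most_disputed` and the toy votes are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the label distribution among committee votes.

    Zero when all members agree; larger when the votes are more evenly split.
    """
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def most_disputed(committee_votes, budget):
    """Return the ids of the `budget` samples with the highest vote entropy."""
    ranked = sorted(
        committee_votes.items(),
        key=lambda item: vote_entropy(item[1]),
        reverse=True,
    )
    return [sample_id for sample_id, _ in ranked[:budget]]


# Each unlabeled sample receives one weak label per committee member.
committee_votes = {
    "doc_a": ["pos", "pos", "pos"],      # full agreement
    "doc_b": ["pos", "neg", "pos"],      # mild dispute
    "doc_c": ["pos", "neg", "neutral"],  # maximal dispute
}
to_annotate = most_disputed(committee_votes, budget=2)
```

In an active learning loop of the kind described, the samples in `to_annotate` would go to human annotators, and the generators would be retrained on the enlarged labeled set before the next round.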
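Abstract: Weak-label learning is an important branch of multi-label learning. It has been widely studied in recent years and applied to problems such as completing and predicting the missing labels of multi-label samples. However, high-dimensional data have large feature sets, are more likely to carry multiple semantic labels, and are more prone to missing labels, and existing weak-label learning methods are generally susceptible to the noisy and redundant features such data contain. To classify high-dimensional multi-label data accurately, a weak-label ensemble classification method based on maximizing the dependence between labels and features, EnWL, is proposed. EnWL first applies affinity propagation clustering repeatedly in the feature space of the high-dimensional data, each time selecting the cluster centers to form a representative feature subset, which reduces the interference of noisy and redundant features. It then trains, on each feature subset, a semi-supervised multi-label classifier based on maximizing label-feature dependence. Finally, it combines these classifiers by voting to perform multi-label classification. Experiments on several high-dimensional datasets show that EnWL outperforms existing related methods on multiple evaluation metrics.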
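The final voting step can be sketched as follows. The abstract does not give the exact voting rule, so this example assumes a simple per-label majority vote over binary predictions; the function name `majority_vote` and the toy predictions are hypothetical.

```python
def majority_vote(predictions):
    """Combine binary multi-label predictions by per-label majority vote.

    `predictions[c][i][j]` is 1 if classifier c assigns label j to sample i.
    A label is kept when strictly more than half of the classifiers predict it.
    """
    n_classifiers = len(predictions)
    n_samples = len(predictions[0])
    n_labels = len(predictions[0][0])
    return [
        [
            1 if 2 * sum(p[i][j] for p in predictions) > n_classifiers else 0
            for j in range(n_labels)
        ]
        for i in range(n_samples)
    ]


# Three classifiers (one per feature subset), two samples, three labels.
predictions = [
    [[1, 0, 1], [0, 0, 1]],
    [[1, 1, 0], [0, 1, 1]],
    [[0, 0, 1], [0, 0, 0]],
]
ensemble = majority_vote(predictions)
```

Because each classifier sees a different feature subset, a label that survives the vote is one that several views of the data support, which is the usual motivation for this kind of ensemble.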
Funding: Supported by the National Natural Science Foundation of China (61672433), the Fundamental Research Fund for Shenzhen Science and Technology Innovation Committee (201703063000511, 201703063000517), the National Cryptography Development Fund (MMJJ20170210), and the Science and Technology Project of State Grid Corporation of China (522722180007).
Abstract: This paper presents a novel algorithm for an extreme form of weak label learning, in which only one of the relevant labels is given for each training sample. Using a genetic algorithm, the labels in the training set are divided into several non-overlapping groups so as to maximize label distinguishability within every group. Multiple classifiers are trained separately and ensembled for label prediction. Experimental results show significant improvement over previous weak label learning algorithms.