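Abstract
Cross-domain learning and classification aims to transfer what is learned under supervision from multiple source domains to a target domain, so that unlabeled target-domain data can be classified. Current cross-domain learning methods generally focus on learning from a single source domain to the target domain, and typically on small sample sizes. Such methods adapt poorly across domains and cope even less well with large-sample data, which directly affects the classification accuracy and efficiency of cross-domain learning. To exploit as much useful data from related domains as possible, this paper proposes a multiple sources cross-domain classification (MSCC) algorithm. The algorithm builds multiple source-domain classifiers from the logistic regression model and a consensus method, both shown to be effective in numerous experiments, and combines them to guide classification of the target-domain data. To make full and efficient use of large-sample source-domain data and to support fast computation on large samples, the paper combines MSCC with the recent CDdual (dual coordinate descent method) algorithm to obtain a fast version, MSCC-CDdual, and gives the associated theoretical analysis. Experiments on artificial, text, and image datasets show that the algorithm achieves high classification accuracy, fast running speed, and good domain adaptability on large-sample datasets. The main contributions are threefold: 1) a new consensus method is proposed for multi-source cross-domain classification, which facilitates developing MSCC into the fast algorithm MSCC-CDdual; 2) the fast algorithm MSCC-CDdual is proposed, which suits both small and large datasets; 3) MSCC-CDdual shows a distinctive advantage over other algorithms on high-dimensional datasets.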
The cross-domain learning and classification studied in this paper attempts to transfer the classification results obtained from multiple supervised source domains to an unsupervised target domain. Although current cross-domain learning methods have been very successful on single-source cross-domain problems, they run into serious difficulties in both classification accuracy and running speed when applied to large multi-source datasets. In this paper, based on the logistic regression model and a proposed consensus measure, a multi-source cross-domain classification (MSCC) algorithm is proposed to realize effective cross-domain classification for the target domain. To enable MSCC to work well on large datasets, a fast version, MSCC-CDdual, is derived and theoretically analysed; it builds on CDdual (dual coordinate descent method), a recent advance in large-scale logistic regression. Experimental results on artificial, text, and image data indicate that MSCC-CDdual offers fast running speed, high classification accuracy, and good domain adaptation on large multi-source datasets. The contributions of this work are threefold: 1) a novel consensus measure is proposed, which is suitable for boosting multiple classifiers and makes it convenient to develop MSCC's fast version for large datasets; 2) the proposed algorithm MSCC-CDdual is demonstrated to be suitable for multi-source cross-domain learning on both small and large datasets; 3) MSCC-CDdual has the additional advantage of being applicable to high-dimensional datasets, which are "large" in another sense.
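The general idea of training one logistic regression classifier per source domain and combining their outputs on the target domain can be illustrated with a minimal sketch. This is not the paper's MSCC or MSCC-CDdual: the abstract does not define the consensus measure, so a plain average of posterior probabilities stands in for it, and ordinary batch gradient descent stands in for the CDdual solver. All function names, hyperparameters, and the synthetic two-class data are assumptions made for illustration only.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def posterior(w, x):
    """P(y = 1 | x) under a logistic model; w[-1] is the bias term."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


def fit_logreg(X, y, lr=0.1, epochs=200, l2=0.01):
    """Batch gradient descent for L2-regularised logistic regression.

    Labels are in {0, 1}. The bias is regularised too, for brevity.
    """
    d, n = len(X[0]), len(X)
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = posterior(w, xi) - yi
            for j in range(d):
                grad[j] += err * xi[j]
            grad[d] += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad)]
    return w


def predict_target(source_domains, X_target):
    """Train one classifier per source domain, then average their posteriors.

    The plain average is an assumed stand-in for the paper's consensus measure.
    """
    models = [fit_logreg(X, y) for X, y in source_domains]
    probs = [sum(posterior(w, x) for w in models) / len(models) for x in X_target]
    labels = [1 if p >= 0.5 else 0 for p in probs]
    return labels, probs


def make_domain(rng, shift, n=100):
    """Hypothetical 2-D two-class data; `shift` moves the domain along axis 0."""
    X = [[rng.gauss(-2 + shift, 1), rng.gauss(-2, 1)] for _ in range(n)]
    X += [[rng.gauss(2 + shift, 1), rng.gauss(2, 1)] for _ in range(n)]
    y = [0] * n + [1] * n
    return X, y


rng = random.Random(0)
sources = [make_domain(rng, s) for s in (-0.5, 0.0, 0.5)]  # labeled source domains
X_t, y_t = make_domain(rng, 0.2)  # target labels are used only for evaluation
labels, probs = predict_target(sources, X_t)
accuracy = sum(int(a == b) for a, b in zip(labels, y_t)) / len(y_t)
print(f"target accuracy: {accuracy:.3f}")
```

The target labels never enter training, which mirrors the unsupervised target setting described in the abstract. A weighted combination or a different agreement criterion could replace the average without changing the rest of the sketch.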
Source
Acta Automatica Sinica (《自动化学报》)
EI
CSCD
PKU Core Journal (北大核心)
2014, No. 3, pp. 531-547 (17 pages)
Funding
Supported by the National Natural Science Foundation of China (60903100, 60975027)
Keywords
Cross-domain, multi-source, logistic regression, posterior probability, classification