Abstract
In machine learning, K-fold cross-validation evaluates and selects models by splitting the data into multiple training and test sets, yet the choice of the fold number K remains an open problem. A premise of this data splitting is that the training set and the test set follow the same distribution, but in practice this is often not the case. The fold number K can therefore be chosen by measuring the distributional consistency between the training and test sets produced by K-fold cross-validation. Intuitively, the KL (Kullback-Leibler) distance is a suitable measure, since it quantifies the discrepancy between two distributions. However, when K is selected directly from the KL distance, experiments on multiple datasets show that the KL distance keeps growing as K increases, which is clearly unsuitable. To address this, a selection criterion for the fold number K in K-fold cross-validation based on a regularized KL distance is proposed, and the appropriate K is chosen by minimizing this regularized KL distance. Experiments on several real datasets further verify the effectiveness and rationality of the proposed criterion.
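The abstract does not give the exact form of the regularized KL distance, so the following is only a minimal Python sketch of the kind of criterion described: the KL distance between each training/test split is estimated under a per-feature Gaussian assumption, and a placeholder penalty term lambda_ * ln(K) stands in for the paper's regularizer. The function names gaussian_kl, fold_kl, select_k and the parameter lambda_ are illustrative, not taken from the paper.

```python
# Minimal sketch of KL-distance-based selection of the fold number K.
# Assumptions (not from the paper): the KL distance between a training and a
# test split is estimated per feature under a univariate Gaussian model, and
# the penalty lambda_ * ln(K) is a placeholder for the paper's regularizer.
import numpy as np
from sklearn.model_selection import KFold


def gaussian_kl(p_mean, p_var, q_mean, q_var):
    """KL divergence KL(N(p_mean, p_var) || N(q_mean, q_var)) for univariate Gaussians."""
    return 0.5 * (np.log(q_var / p_var) + (p_var + (p_mean - q_mean) ** 2) / q_var - 1.0)


def fold_kl(train, test, eps=1e-8):
    """Average per-feature Gaussian KL distance between one training split and its test split."""
    kl = 0.0
    for j in range(train.shape[1]):
        p_mean, p_var = train[:, j].mean(), train[:, j].var() + eps
        q_mean, q_var = test[:, j].mean(), test[:, j].var() + eps
        kl += gaussian_kl(p_mean, p_var, q_mean, q_var)
    return kl / train.shape[1]


def select_k(X, candidate_ks=(2, 3, 5, 10), lambda_=1.0, seed=0):
    """Choose the K minimizing mean fold KL plus a (hypothetical) penalty lambda_ * ln(K)."""
    scores = {}
    for k in candidate_ks:
        kf = KFold(n_splits=k, shuffle=True, random_state=seed)
        kls = [fold_kl(X[tr], X[te]) for tr, te in kf.split(X)]
        scores[k] = np.mean(kls) + lambda_ * np.log(k)  # regularized KL score (assumed form)
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))  # toy data only
    best_k, scores = select_k(X)
    print(best_k, scores)
```

The penalty term illustrates the role of the regularization described in the abstract: since the raw KL distance tends to grow as K increases (test folds become smaller and their empirical distributions noisier), an increasing function of K must be traded off against it before the minimizer becomes a meaningful choice of fold number.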
Authors
褚荣燕 (CHU Rong-yan), 王钰 (WANG Yu), 杨杏丽 (YANG Xing-li), 李济洪 (LI Ji-hong)
School of Mathematical Sciences, Shanxi University, Taiyuan 030006, China; School of Modern Educational Technology, Shanxi University, Taiyuan 030006, China; School of Software, Shanxi University, Taiyuan 030006, China
Source
Computer Technology and Development (《计算机技术与发展》), 2021, No. 3, pp. 52-57 (6 pages)
Funding
Applied Basic Research Program of Shanxi Province (201901D111034, 201801D211002)
National Natural Science Foundation of China (61806115)