Abstract
This paper studies parameter estimation algorithms for generalized linear models under massive data. In the usual maximum likelihood or quasi-likelihood estimating equation algorithms, every iteration uses all of the observations, which leads to insufficient storage space and a heavy computational burden; the estimation method is improved to address this problem. Combining the divide-and-conquer algorithm with the Newton-Raphson algorithm, an aggregated quasi-likelihood estimating equation algorithm is proposed for solving for generalized linear model parameters in both single-machine and distributed parallel environments, and the asymptotic properties of the aggregated quasi-likelihood estimator are further studied. The results show that, when the number of data blocks satisfies certain conditions, the aggregated quasi-likelihood estimator has the same asymptotic properties as the maximum quasi-likelihood estimator computed directly from all the data. In the numerical simulations, the algorithm is implemented on a single machine and on a Spark cluster, and the results show that the aggregated quasi-likelihood estimation method improves computational efficiency while solving the data storage problem. Finally, the algorithm is used to estimate the parameters of a Probit model, and the fitted model is applied to the supersymmetric particle classification problem.
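The divide-and-conquer idea described in the abstract can be illustrated with a minimal single-machine sketch. The abstract does not give the aggregation formula, so the code assumes one common choice for divide-and-conquer estimating equations: fit each block separately, then combine the block estimates weighted by their information matrices, beta_hat = (sum_k A_k)^(-1) sum_k A_k beta_k. The per-block solver is Fisher scoring, a Newton-Raphson variant that uses the expected information. All function names, the simulated data, and the block count are hypothetical and are not taken from the paper.

```python
import math
import random

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def score_and_info(X, y, beta):
    """Probit quasi-score U(beta) and expected information A(beta) on one block."""
    p = len(beta)
    U = [0.0] * p
    A = [[0.0] * p for _ in range(p)]
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        mu = min(max(norm_cdf(eta), 1e-10), 1.0 - 1e-10)  # guard against 0 or 1
        d = norm_pdf(eta)            # d(mu)/d(eta) for the probit link
        var = mu * (1.0 - mu)        # Bernoulli variance function
        for j in range(p):
            U[j] += xi[j] * d * (yi - mu) / var
            for k in range(p):
                A[j][k] += xi[j] * xi[k] * d * d / var
    return U, A

def fit_block(X, y, iters=25, tol=1e-8):
    """Fisher scoring on a single block; returns (beta_k, A_k)."""
    beta = [0.0] * len(X[0])
    for _ in range(iters):
        U, A = score_and_info(X, y, beta)
        step = solve(A, U)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    _, A = score_and_info(X, y, beta)
    return beta, A

def aggregate(blocks):
    """Assumed aggregation rule: (sum_k A_k)^(-1) * sum_k A_k beta_k."""
    p = len(blocks[0][0][0])
    A_sum = [[0.0] * p for _ in range(p)]
    rhs = [0.0] * p
    for X, y in blocks:
        beta_k, A_k = fit_block(X, y)
        for j in range(p):
            rhs[j] += sum(A_k[j][k] * beta_k[k] for k in range(p))
            for k in range(p):
                A_sum[j][k] += A_k[j][k]
    return solve(A_sum, rhs)

def simulate(n, beta, rng):
    """Simulated probit data with an intercept and one standard normal covariate."""
    X, y = [], []
    for _ in range(n):
        xi = [1.0, rng.gauss(0.0, 1.0)]
        eta = sum(b * v for b, v in zip(beta, xi))
        X.append(xi)
        y.append(1.0 if rng.random() < norm_cdf(eta) else 0.0)
    return X, y

rng = random.Random(0)
true_beta = [0.5, -1.0]                      # hypothetical true parameters
X, y = simulate(4000, true_beta, rng)
K = 4                                        # hypothetical number of blocks
blocks = [(X[k::K], y[k::K]) for k in range(K)]
beta_agg = aggregate(blocks)                 # uses one block at a time
beta_full, _ = fit_block(X, y)               # uses all data in every iteration
print("aggregated:", beta_agg)
print("full data: ", beta_full)
```

Only the pair (beta_k, A_k) from each block needs to be kept, so no single step has to hold the full data set. In a distributed setting the loop over blocks in `aggregate` would correspond to independent per-partition tasks followed by a sum, although how the paper maps this onto Spark is not stated in the abstract.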
Authors
陈少东
李志强
CHEN Shao-dong; LI Zhi-qiang (College of Mathematics and Science, Beijing University of Chemical Technology, Beijing 100029, China)
Source
《统计与信息论坛》
CSSCI
Peking University Core Journals (北大核心)
2020, No. 7, pp. 18-24 (7 pages in total)
Journal of Statistics and Information
Keywords
generalized linear model
massive data
divide-and-conquer algorithm
aggregated quasi-likelihood estimating equation