
Integrative Analysis for Big Data (大数据的整合分析方法)

Cited by: 27
Abstract: Big data are characterized by heterogeneous data sources, high dimensionality, and sparsity. Identifying the heterogeneity and commonality across datasets while reducing dimension and noise is one of the main goals and challenges of big data analysis. Integrative analysis analyzes multiple independent datasets simultaneously, which avoids the model instability that arises from sample differences caused by region, time, and similar factors, and it is an effective way to study the heterogeneity of big data. Its defining feature is that the coefficients of each covariate across all datasets are treated as a group, and a penalty function shrinks these coefficient groups in order to study the associations among variables and achieve dimension reduction. This paper reviews the principles, algorithms, and current state of research on penalized integrative analysis from three aspects: homogeneity integrative analysis, heterogeneity integrative analysis, and integrative analysis that accounts for network structure. Simulations under weak, moderate, and strong correlation show that L1 Group Bridge, L1 Group MCP, and Composite MCP all perform well, and that L1 Group Bridge has the fewest false positives and is the most stable. Finally, integrative analysis is applied to household medical expenditure data from the New Rural Cooperative Medical Scheme, which exhibit source differences, and to cancer gene data, which have typical big data features such as ultra-high dimensionality and small sample size, yielding some meaningful conclusions.
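The grouping idea in the abstract, where each covariate's coefficients across all datasets form one group that a penalty shrinks together, can be illustrated with a small sketch. This is not one of the penalties compared in the paper (L1 Group Bridge, L1 Group MCP, Composite MCP). It assumes the simpler group lasso penalty, whose shrinkage step is the group soft-thresholding operator. The function names and the toy coefficients are hypothetical.

```python
import math

def group_soft_threshold(group, threshold):
    """Shrink one covariate's coefficients (one per dataset) as a single group.

    Assumption: a group lasso (L2-norm) penalty, used here only for illustration.
    """
    norm = math.sqrt(sum(b * b for b in group))
    if norm <= threshold:
        # The whole group is set to zero: the covariate is dropped in every dataset.
        return [0.0] * len(group)
    scale = 1.0 - threshold / norm
    return [scale * b for b in group]

def shrink_all(coefs, threshold):
    """Apply the group shrinkage to every covariate (rows = covariates)."""
    return [group_soft_threshold(g, threshold) for g in coefs]

def selected(coefs):
    """Indices of covariates that keep a nonzero coefficient in some dataset."""
    return [j for j, g in enumerate(coefs) if any(b != 0.0 for b in g)]

# Hypothetical coefficients: 3 covariates (rows) estimated in 2 datasets (columns).
coefs = [[3.0, 4.0], [0.3, -0.4], [0.0, 2.0]]
shrunk = shrink_all(coefs, 1.0)
print(selected(shrunk))  # → [0, 2]
```

Under this penalty a covariate is either kept in all datasets or dropped from all of them. Composite penalties such as those named in the abstract add a second level of shrinkage inside each group.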
Source: Statistical Research (《统计研究》), CSSCI, PKU Core Journal, 2015, Issue 11, pp. 3-11 (9 pages)
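Funding: Supported by the National Bureau of Statistics Major Project "Research on Statistical Methods for Big Data" (2012LD001); the National Bureau of Statistics Key Project "Development and Innovation of Linear Theory and Processing Techniques for Big Data" (2013LZ53); the National Social Science Fund Major Project "Research on Big Data and the Development of Statistical Theory" (13&ZD148); the National Social Science Fund Youth Project "High-Dimensional Variable Selection Methods for Big Data and Their Applications" (13CTJ001); and the National Natural Science Foundation of China General Program "Group Variable Selection in Generalized Linear Models and Its Application to Credit Scoring" (71471152).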
Keywords: Big Data; Integrative Analysis; Variable Selection; Medical Expenditure; Cancer Genetics Data

