Funding: Supported partially by the China National Key R&D Program under Grant Nos. 2019YFC1908502, 2022YFA1003703, 2022YFA1003802, and 2022YFA1003803, and the National Natural Science Foundation of China under Grant Nos. 11925106, 12231011, 11931001, and 11971247.
Abstract: This paper focuses on support recovery for the Gaussian graphical model (GGM) with false discovery rate (FDR) control. The elegant symmetrized data aggregation (SDA) technique, which involves sample splitting, data screening, and information pooling, is applied in a node-based way. A matrix of test statistics with a symmetry property is constructed, and a data-driven threshold is chosen to control the FDR for the support recovery of the GGM. The proposed method is shown to control the FDR asymptotically under some mild conditions. Extensive simulation studies and a real-data example demonstrate that it yields better FDR control while offering reasonable power in most cases.
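The abstract above does not give the form of the statistics or the threshold, so the following is only a minimal sketch of one common symmetry-based rule, not the paper's procedure. It assumes each statistic is roughly symmetric about zero when the corresponding edge is absent and large and positive when it is present, so the count of statistics below -t can stand in for the number of false discoveries above t. The function names `symmetry_threshold` and `select` are hypothetical.

```python
def symmetry_threshold(stats, alpha):
    """Smallest t > 0 whose estimated false discovery proportion is <= alpha.

    Assumption: null statistics are symmetric about zero, so the number of
    statistics <= -t approximates the number of false discoveries >= t.
    """
    candidates = sorted({abs(w) for w in stats if w != 0})
    for t in candidates:
        negatives = sum(1 for w in stats if w <= -t)
        positives = sum(1 for w in stats if w >= t)
        # The +1 makes the estimate slightly conservative.
        if (1 + negatives) / max(positives, 1) <= alpha:
            return t
    return float("inf")  # no threshold meets the target level


def select(stats, alpha):
    """Indices of statistics at or above the data-driven threshold."""
    t = symmetry_threshold(stats, alpha)
    return [j for j, w in enumerate(stats) if w >= t]
```

For a GGM, `stats` would hold one entry per candidate edge, for example the upper triangle of the statistic matrix flattened into a list.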
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 11771332, 11771220, 11671178, 11925106, and 11971247; the National Science Foundation of Tianjin under Grant Nos. 18JCJQJC46000 and 18ZXZNGX00140; and the 111 Project B20016. Mushtaq was also supported by the Fundamental Research Funds for the Central Universities.
Abstract: In the era of big data, high-dimensional data often arrive in streams, making timely and accurate decisions necessary. It has become particularly important to rapidly and sequentially identify individuals whose behavior deviates from the norm. Aiming to identify as many irregular behavioral patterns as possible, the authors develop a large-scale dynamic testing system in the framework of false discovery rate (FDR) control. By fully exploiting the sequential nature of data streams, the authors propose a screening-assisted procedure that filters streams and then tests only the streams that pass the filter at each time point. A data-driven optimal screening threshold is derived, giving the new method an edge over existing methods. Under some mild conditions on the dependence structure of the data streams, the FDR is shown to be strongly controlled, and the suggested approach for determining screening thresholds is shown to be asymptotically optimal. Simulation studies show that the proposed method is both accurate and powerful, and a real-data example is used for illustration.
Funding: Financial support of the Austrian Ministry for Transport, Innovation and Technology (BMVIT) and the Austrian Science Fund (FWF) via the project TRP46-B19. Part of the study was conducted using a travel grant provided by the European Science Foundation (ESF).
Abstract: Longevity is regarded as the most important functional trait in cattle breeding, with high economic value yet low heritability. To identify genomic regions associated with longevity, a genome-wide association study was performed using data from 4887 Fleckvieh bulls and 33,556 SNPs after quality control. Single-SNP regression was used to identify important SNPs, with eigenvectors included to correct for population structure. SNPs selected with a false discovery rate threshold of 0.05 and with the local false discovery rate identified genomic regions associated with longevity, which were subsequently cross-checked against the National Center for Biotechnology Information (NCBI) database to identify interesting genes in cattle and their homologous forms in other species. The most notable genes were SYT10 on chromosome 5, ADAMTS3 on chromosome 6, NTRK2 on chromosome 8, and SNTG1 on chromosome 14 of the cattle genome. Several of the genes found have previously been associated with cattle fertility. Poor fertility is an important reason for culling and thereby affects longevity in cattle. Several signals were located in regions with few described genes, which suggests that there might be several other unidentified genetic pathways for this important trait.
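The abstract above does not say which procedure was used to apply the 0.05 false discovery rate threshold. As an illustration only, the sketch below assumes the widely used Benjamini-Hochberg step-up rule applied to per-SNP p-values; the function name `benjamini_hochberg` and the input list are hypothetical.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg step-up rule.

    Assumption: pvalues holds one p-value per SNP from single-SNP regression.
    """
    m = len(pvalues)
    # Visit p-values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under the BH line q * rank / m.
        if pvalues[i] <= q * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])
```

Every SNP ranked at or before the cutoff is selected, even if its own p-value exceeds its individual line, which is what distinguishes the step-up rule from a fixed per-test threshold.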
Funding: Supported by the National Key R&D Program of China (No. 2018YFB0704304), the National Natural Science Foundation of China (Nos. 32070668, 62002231, 61832003, 61433014), and the K. C. Wong Education Foundation.
Abstract: Traditional approaches to false discovery rate (FDR) control in multiple hypothesis testing are usually based on the null distribution of a test statistic. However, all types of null distributions, including theoretical, permutation-based, and empirical ones, have inherent drawbacks. For example, the theoretical null might fail because of improper assumptions about the sample distribution. Here, we propose a null-distribution-free approach to FDR control for multiple hypothesis testing in case-control studies. This approach, named the target-decoy procedure, simply builds on the ordering of tests by some statistic or score, the null distribution of which need not be known. Competitive decoy tests are constructed from permutations of the original samples and are used to estimate the number of false target discoveries. We prove that this approach controls the FDR when the score function is symmetric and the scores are independent across tests. Simulation demonstrates that it is more stable and powerful than two popular traditional approaches, even in the presence of dependency. The approach is also evaluated on two real datasets: an Arabidopsis genomics dataset and a COVID-19 proteomics dataset.
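The abstract above describes the idea but not the exact steps, so the following is a generic target-decoy competition sketch rather than the paper's procedure. It assumes a decoy score has already been computed for each test from permuted samples; each test keeps the larger of its two scores, and decoy wins among the top-ranked tests estimate the false target discoveries. The function name `target_decoy_select` and the tie handling are hypothetical.

```python
def target_decoy_select(target_scores, decoy_scores, alpha):
    """Indices of target discoveries at an estimated FDR level of alpha.

    Assumption: decoy_scores[i] was computed on permuted samples for test i.
    """
    winners = []
    for i, (t, d) in enumerate(zip(target_scores, decoy_scores)):
        if t > d:
            winners.append((t, True, i))    # target wins the competition
        elif d > t:
            winners.append((d, False, i))   # decoy wins the competition
        # Exact ties are dropped here; a random tie-break is another option.
    winners.sort(key=lambda x: x[0], reverse=True)

    best_k = 0
    targets = decoys = 0
    for k, (_, is_target, _) in enumerate(winners, start=1):
        if is_target:
            targets += 1
        else:
            decoys += 1
        # Decoy wins (plus one) estimate the false targets among the top k.
        if targets > 0 and (1 + decoys) / targets <= alpha:
            best_k = k
    return sorted(i for _, is_target, i in winners[:best_k] if is_target)
```

Only the ordering of the scores matters in this rule, which is why no null distribution is needed.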
Abstract: The paper discusses the generalization of the constrained Bayesian method (CBM) to arbitrary loss functions and its application to testing directional hypotheses. The problem is stated in terms of false and true discovery rates. A further criterion for assessing the quality of directional hypothesis tests, the Type III error rate, is also considered, along with the ratio between the discovery rates and the Type III error rate in CBM. The advantage of CBM over Bayes and frequentist methods is proved theoretically and demonstrated by an example.
Funding: Project (61472026) supported by the National Natural Science Foundation of China; Project (2014J410081) supported by the Guangzhou Scientific Research Program, China.
Abstract: When detecting deletions in complex human genomes, split-read approaches using short reads generated with next-generation sequencing still face the challenge that either the false discovery rate is high or the sensitivity is low. To address this problem, an integrated strategy is proposed. It combines the fundamental theories of the three mainstream methods (read-pair approaches, split-read technologies, and read-depth analysis) with modern machine learning algorithms, using feature extraction as a bridge. Compared with state-of-the-art split-read methods for deletion detection at both low and high sequence coverage, the machine-learning-aided strategy balances sensitivity against false discovery rate and produces a call set that is both more sensitive and more precise at single-base-pair resolution. Thus, users no longer need to rely on prior experience to make an unnecessary trade-off beforehand or to adjust parameters repeatedly. The results suggest that modern machine learning models can play an important role in structural variation prediction.