Abstract: Driven by large-scale optimization problems arising from machine learning, the development of stochastic optimization methods has witnessed enormous growth. Numerous types of methods have been developed based on the vanilla stochastic gradient descent method. However, for most algorithms the convergence rate in the stochastic setting cannot simply match that in the deterministic setting. Better understanding of the gap between deterministic and stochastic optimization is the main goal of this paper. Specifically, we are interested in Nesterov acceleration of gradient-based approaches. In our study, we focus on acceleration of the stochastic mirror descent method with an implicit regularization property. Assuming that the problem objective is smooth and convex or strongly convex, our analysis prescribes the method parameters which ensure fast convergence of the estimation error and satisfactory numerical performance.
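To make the idea concrete, the sketch below is a minimal, illustrative implementation of stochastic mirror descent with a Nesterov-style (linear-coupling) acceleration scheme. It is not the paper's specific method or parameter schedule: the step-size rules, the mirror map interface, and the Euclidean usage note are assumptions chosen only for illustration.

```python
import numpy as np

def accelerated_stochastic_mirror_descent(grad_fn, x0, n_steps, L,
                                          mirror_grad, mirror_grad_inv):
    """Illustrative Nesterov-style acceleration of stochastic mirror descent.

    grad_fn(x)         -- returns a stochastic gradient estimate at x
    L                  -- smoothness constant of the (assumed smooth convex) objective
    mirror_grad(x)     -- gradient of the mirror map (primal -> dual coordinates)
    mirror_grad_inv(z) -- its inverse (dual -> primal coordinates)
    """
    x = np.asarray(x0, dtype=float)
    y, z = x.copy(), x.copy()
    for k in range(1, n_steps + 1):
        alpha = 2.0 / (k + 1)       # coupling weight (illustrative schedule)
        gamma = k / (2.0 * L)       # mirror-step size (illustrative schedule)
        x = (1.0 - alpha) * y + alpha * z                 # couple the two sequences
        g = grad_fn(x)                                    # stochastic gradient at the coupled point
        y = x - g / L                                     # short gradient step (primal sequence)
        z = mirror_grad_inv(mirror_grad(z) - gamma * g)   # mirror-descent step (dual sequence)
    return y

# With the Euclidean mirror map, mirror_grad and mirror_grad_inv are both the
# identity, and the scheme reduces to an accelerated form of SGD:
# accelerated_stochastic_mirror_descent(grad_fn, x0, 1000, L=10.0,
#                                       mirror_grad=lambda v: v,
#                                       mirror_grad_inv=lambda v: v)
```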
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 31870017, 31760011), the Science and Technology Development Fund of Guidance from the Central Government to Locals (KC1610530), the Department of Science and Technology of Yunnan Province (Grant Nos. 2018IA075, 2018FY001006), the Biodiversity Survey and Assessment Project of the Ministry of Ecology and Environment, China (Grant No. 2019HJ2096001006), and the China Postdoctoral Science Foundation (Grant No. 2017M613017).
Abstract: The fungus Ophiocordyceps sinensis is endemic to the vast region of the Qinghai-Tibetan Plateau (QTP). The unique and complex geographical and environmental conditions have led to a "sky island" distribution structure of O. sinensis. Because of limited and unbalanced sample collections, previous data on the genetic diversity and spatial structure of O. sinensis have been insufficient. In this study, we analyzed the diversity and phylogeographic structure of O. sinensis using the internal transcribed spacer (ITS) region and 5-locus datasets obtained from large-scale sampling. A total of 111 ITS haplotypes were identified from 948 samples of the fungus, representing high genetic diversity, and 8 phylogenetic clades were recognized in O. sinensis. Both southeastern Tibet and northwestern Yunnan were centers of genetic diversity and genetic differentiation of the fungus, and they were inferred to be glacial refugia in the Quaternary. Three distribution patterns were identified to correspond to the 8 clades, including but not limited to the coexistence of widespread and locally restricted distribution structures. The results also revealed that the differentiation pattern of O. sinensis did not fit the isolation-by-distance model. The differentiation into the 8 clades occurred between 1.56 Myr and 6.62 Myr. The ancestor of O. sinensis most likely originated in the late Miocene (6.62 Myr) in northwestern Yunnan, and Scenes A-C of the Qinghai-Tibetan movements may have played an important role in the differentiation of O. sinensis during the late Miocene-Pliocene periods. Our current results provide a much clearer and more detailed understanding of the genetic diversity and geographical distribution of the endemic alpine fungus O. sinensis. They also show that geochronology derived from paleogeology can be cross-examined against the biomolecular clock at a finer scale.
Abstract: A composite random variable is a product (or sum of products) of statistically distributed quantities. Such a variable can represent the solution to a multi-factor quantitative problem submitted to a large, diverse, independent, anonymous group of non-expert respondents (the “crowd”). The objective of this research is to examine the statistical distribution of solutions from a large crowd to a quantitative problem involving image analysis and object counting. Theoretical analysis by the author, covering a range of conditions and types of factor variables, predicts that composite random variables are distributed log-normally to an excellent approximation. If the factors in a problem are themselves distributed log-normally, then their product is rigorously log-normal. A crowdsourcing experiment devised by the author and implemented with the assistance of a BBC (British Broadcasting Corporation) television show yielded a sample of approximately 2000 responses consistent with a log-normal distribution. The sample mean was within ~12% of the true count. However, a Monte Carlo simulation (MCS) of the experiment, employing either normal or log-normal random variables as factors to model the processes by which a crowd of 1 million might arrive at their estimates, resulted in a visually perfect log-normal distribution with a mean response within ~5% of the true count. The results of this research suggest that a well-modeled MCS, by simulating a sample of responses from a large, rational, and incentivized crowd, can provide a more accurate solution to a quantitative problem than might be attainable by direct sampling of a smaller crowd or an uninformed crowd, irrespective of size, that guesses randomly.
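The central claim, that a product of independent positive factors is approximately log-normal, can be checked with a few lines of Monte Carlo code. The sketch below is only a hedged illustration, not the author's actual MCS: the three factors, their distributions, and their parameters are invented purely to show that the logarithm of a composite estimate is the sum of the logarithms of its factors and therefore tends toward normality.

```python
import numpy as np

rng = np.random.default_rng(0)
n_respondents = 1_000_000   # size of the simulated crowd (assumption)

# Hypothetical factorization of an object-counting estimate:
# count ~ area * density * visible_fraction. Distributions and parameters
# below are illustrative assumptions, not those used in the paper.
area    = np.clip(rng.normal(50.0, 10.0, n_respondents), 1.0, None)
density = rng.lognormal(mean=np.log(20.0), sigma=0.3, size=n_respondents)
visible = np.clip(rng.normal(0.8, 0.1, n_respondents), 0.1, 1.0)

estimates = area * density * visible      # composite random variable

# log(product) = sum of logs, so log(estimates) is approximately normal
# (exactly normal if every factor is log-normal), i.e. the estimates are
# approximately log-normally distributed.
log_est = np.log(estimates)
skew = ((log_est - log_est.mean()) ** 3).mean() / log_est.std() ** 3
print("skewness of log-estimates (near 0 for log-normal):", skew)
print("sample mean of the composite estimates:", estimates.mean())
```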
Funding: Partially supported by the National Natural Science Foundation of China (No. 61572250) and the Science and Technology Program of Jiangsu Province (No. BE2017155).
Abstract: Recently, topic models such as Latent Dirichlet Allocation (LDA) have been widely used in large-scale web mining. Many large-scale LDA training systems have been developed, which usually prefer a customized design from top to bottom with sophisticated synchronization support. We propose an LDA training system named ZenLDA, which follows a generalized design for distributed data-parallel platforms. The novelty of ZenLDA consists of three main aspects: (1) it converts the commonly used serial Collapsed Gibbs Sampling (CGS) inference algorithm into a Monte-Carlo Collapsed Bayesian (MCCB) estimation method, which is embarrassingly parallel; (2) it decomposes the LDA inference formula into parts that can be sampled more efficiently, reducing computational complexity; (3) it proposes a distributed LDA training framework, which represents the corpus as a directed graph with the parameters annotated as corresponding vertices, and implements ZenLDA and other well-known inference methods based on Spark. Experimental results indicate that MCCB converges with accuracy similar to that of CGS while running much faster. On top of MCCB, the ZenLDA formula decomposition achieved the fastest speed among the well-known inference methods compared. ZenLDA also showed good scalability when dealing with large-scale topic models on the data-parallel platform. Overall, ZenLDA achieves comparable and even better computing performance than state-of-the-art dedicated systems.
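For readers unfamiliar with the baseline that ZenLDA starts from, the sketch below is a minimal serial collapsed Gibbs sampler for LDA. It illustrates the per-token conditional that the paper decomposes and parallelizes; it is not ZenLDA's MCCB estimator or its Spark/graph implementation, and all names and hyperparameter values are illustrative assumptions.

```python
import numpy as np

def lda_cgs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Minimal serial collapsed Gibbs sampling for LDA (the baseline, not MCCB).

    docs: list of documents, each a list of integer word ids.
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))     # document-topic counts
    n_kw = np.zeros((n_topics, vocab_size))    # topic-word counts
    n_k = np.zeros(n_topics)                   # total tokens per topic
    z = []                                     # topic assignment per token
    for d, doc in enumerate(docs):             # random initialization
        zd = rng.integers(0, n_topics, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                    # remove the token's current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Collapsed conditional:
                # p(z = k | rest) proportional to (n_dk + alpha)(n_kw + beta)/(n_k + V*beta)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k                    # resample and restore counts
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_dk, n_kw
```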
Funding: Funded by the National Natural Science Foundation of China Youth Project (61603127).
Abstract: Traditional models for semantic segmentation of point clouds primarily focus on smaller scales. However, in real-world applications, point clouds often exhibit larger scales, leading to heavy computational and memory requirements. The key to handling large-scale point clouds lies in leveraging random sampling, which offers higher computational efficiency and lower memory consumption than other sampling methods. Nevertheless, the use of random sampling can result in the loss of crucial points during the encoding stage. To address these issues, this paper proposes the cross-fusion self-attention network (CFSA-Net), a lightweight and efficient network architecture specifically designed for directly processing large-scale point clouds. At the core of this network is the incorporation of random sampling alongside a local feature extraction module based on cross-fusion self-attention (CFSA). This module effectively integrates long-range contextual dependencies between points by employing hierarchical position encoding (HPC). Furthermore, it enhances the interaction between each point's coordinates and feature information through cross-fusion self-attention pooling, enabling the acquisition of more comprehensive geometric information. Finally, a residual optimization (RO) structure is introduced to extend the receptive field of individual points by stacking hierarchical position encoding and cross-fusion self-attention pooling, thereby reducing the impact of information loss caused by random sampling. Experimental results on the Stanford Large-Scale 3D Indoor Spaces (S3DIS), Semantic3D, and SemanticKITTI datasets demonstrate the superiority of this algorithm over advanced approaches such as RandLA-Net and KPConv. These findings underscore the excellent performance of CFSA-Net in large-scale 3D semantic segmentation.
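As a rough illustration of why random sampling is the scalable choice, the sketch below downsamples a point cloud at random and gathers k-nearest-neighbor neighborhoods with relative position offsets for the kept points. It is only a simplified stand-in for the encoder front end described above; the actual HPC and cross-fusion self-attention pooling modules of CFSA-Net are not reproduced, and the brute-force neighbor search and all parameter names are assumptions made for brevity.

```python
import numpy as np

def random_sample_with_knn(points, features, n_keep, k=16, seed=0):
    """Randomly downsample a point cloud and gather k-NN neighborhoods for the kept points.

    Random sampling costs O(1) per point, which is why it scales to large clouds;
    the k-NN gathering here is a brute-force O(N^2) illustration, not the
    grid or KD-tree search a real implementation would use.
    points:   (N, 3) xyz coordinates
    features: (N, C) per-point features
    """
    rng = np.random.default_rng(seed)
    keep = rng.choice(len(points), size=n_keep, replace=False)   # random sampling
    centers = points[keep]
    # Brute-force k nearest neighbors of each kept point within the full cloud.
    d2 = ((centers[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    nn_idx = np.argsort(d2, axis=1)[:, :k]
    neighbor_feats = features[nn_idx]                 # (n_keep, k, C) gathered features
    # Relative offsets of each neighborhood, a simplified stand-in for position encoding.
    rel_pos = points[nn_idx] - centers[:, None, :]    # (n_keep, k, 3)
    return centers, neighbor_feats, rel_pos
```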