Extending the income dynamics approach in Quah (2003), the present paper studies the enlarging income inequality in China over the past three decades from the viewpoint of rural-urban migration and economic transition. We establish non-parametric estimations of rural and urban income distribution functions in China, and aggregate a population-weighted, nationwide income distribution function taking into account rural-urban differences in technological progress and price indexes. We calculate 12 inequality indexes through non-parametric estimation to overcome the biases in existing parametric estimation and, therefore, provide a more accurate measurement of income inequality. Policy implications are drawn based on our research.
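Inequality indexes of the kind counted above can be computed directly from pooled samples. As an illustration (the income samples, population weights, and lognormal parameters below are hypothetical, not the paper's data), the sketch pools population-weighted rural and urban draws and computes one common index, the Gini coefficient:

```python
import numpy as np

def gini(x):
    # Gini coefficient via the sorted cumulative-sum formula
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

rng = np.random.default_rng(0)
# Hypothetical lognormal rural and urban income samples
rural = rng.lognormal(mean=8.0, sigma=0.5, size=5000)
urban = rng.lognormal(mean=9.0, sigma=0.6, size=5000)
# Population-weighted national pool (weights are illustrative)
w_rural, w_urban = 0.6, 0.4
national = np.concatenate([
    rng.choice(rural, size=int(10000 * w_rural)),
    rng.choice(urban, size=int(10000 * w_urban)),
])
national_gini = gini(national)
```

The same pooled sample could feed other indexes (Theil, Atkinson, percentile ratios) in exactly the same way.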
Short-term traffic flow forecasting is one of the core technologies needed to realize traffic flow guidance. In this article, in view of the repeatedly changing characteristics of traffic flow, a short-term traffic flow forecasting method based on a three-layer K-nearest neighbor non-parametric regression algorithm is proposed. Specifically, two screening layers based on shape similarity were introduced into the K-nearest neighbor non-parametric regression method, and the forecasting results were output using weighted averaging on the reciprocal values of the shape similarity distances together with the most-similar-point distance adjustment method. According to the experimental results, the proposed algorithm improves the predictive ability of the traditional K-nearest neighbor non-parametric regression method, and greatly enhances the accuracy and real-time performance of short-term traffic flow forecasting.
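A minimal single-layer version of K-nearest neighbor non-parametric regression with reciprocal-distance weighting can be sketched as follows; the paper's two shape-similarity screening layers and most-similar-point adjustment are not reproduced, and the synthetic series and window length are assumptions:

```python
import numpy as np

def knn_forecast(history, pattern, k=3, horizon=1):
    """Forecast the next value by matching the recent flow pattern
    against all equal-length windows in the history (a rough sketch
    of one KNN regression layer)."""
    m = len(pattern)
    # Windows whose forecast target still lies inside the history
    windows = np.lib.stride_tricks.sliding_window_view(history[:-horizon], m)
    targets = history[m + horizon - 1:]
    # Euclidean shape distance between the query pattern and each window
    d = np.linalg.norm(windows - pattern, axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / (d[idx] + 1e-9)          # reciprocal-distance weights
    return float(np.sum(w * targets[idx]) / np.sum(w))

flow = np.sin(np.linspace(0, 20, 200)) * 50 + 100   # synthetic flow series
# Hold out the last point and predict it from the preceding pattern
pred = knn_forecast(flow[:-1], flow[-6:-1], k=5)
```

Holding out the final observation, as above, mimics real-time use: the pattern of the most recent five intervals is matched against the archive and the weighted average of the neighbors' successors is the forecast.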
The study investigates long-term changes in annual and seasonal rainfall patterns in the Indira Sagar Region of Madhya Pradesh, India, from 1901 to 2010. Agricultural sustainability, food supply, natural resource development, and hydropower system reliability in the region rely heavily on monsoon rainfall. Monthly rainfall data from three stations (East Nimar, Barwani, and West Nimar) were analyzed. Initially, the pre-whitening method was applied to eliminate serial correlation effects from the rainfall data series. Subsequently, statistical trends in annual and seasonal rainfall were assessed using both a parametric test (Student's t-test) and non-parametric tests (Mann-Kendall and the Cumulative Sum (CUSUM) test). The magnitude of the rainfall trend was determined using the Theil-Sen slope estimator. Spatial analysis of the Mann-Kendall test on an annual basis revealed a statistically insignificant decreasing trend for Barwani and East Nimar and an increasing trend for West Nimar. On a seasonal basis, the monsoon season contributes a large share (88.33%) of the total annual rainfall. The CUSUM test results indicated a shift change detected in the annual rainfall data for Barwani in 1997, while shifts were observed at the West and East Nimar stations in 1929. These findings offer valuable insights into regional rainfall behavior, aiding the planning and management of water resources and ecological systems.
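The two core non-parametric tools named above, the Mann-Kendall S statistic and the Theil-Sen slope, are straightforward to implement. The sketch below applies them to a hypothetical annual rainfall series, not the Nimar station data:

```python
import numpy as np

def mann_kendall_S(x):
    # Mann-Kendall S statistic: sum of signs over all ordered pairs
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = 0.0
    for i in range(n - 1):
        s += np.sign(x[i + 1:] - x[i]).sum()
    return float(s)

def sens_slope(x):
    # Theil-Sen estimator: median of all pairwise slopes
    x = np.asarray(x, dtype=float)
    n = len(x)
    slopes = [(x[j] - x[i]) / (j - i)
              for i in range(n - 1) for j in range(i + 1, n)]
    return float(np.median(slopes))

rng = np.random.default_rng(1)
# Hypothetical 60-year annual rainfall with a mild downward drift
rain = 900 - 0.8 * np.arange(60) + rng.normal(0, 40, 60)
S = mann_kendall_S(rain)
b = sens_slope(rain)
```

In practice S is standardized (with a tie-corrected variance) to a Z score for the significance decision, and pre-whitening is applied first when the series is serially correlated, as the study does.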
This study aimed to examine the performance of the Siegel-Tukey and Savage tests on data sets with heterogeneous variances. The analysis, considering Normal, Platykurtic, and Skewed distributions and a standard deviation ratio of 1, was conducted for both small and large sample sizes. For small sample sizes, two main categories were established: equal and different sample sizes. Analyses were performed using Monte Carlo simulations with 20,000 repetitions for each scenario, and the simulations were evaluated using SAS software. For small sample sizes, the Type I error rate of the Siegel-Tukey test generally ranged from 0.045 to 0.055, while the Type I error rate of the Savage test ranged from 0.016 to 0.041. Similar trends were observed for the Platykurtic and Skewed distributions. In scenarios with different sample sizes, the Savage test generally exhibited lower Type I error rates. For large sample sizes, the same two categories were used: equal and different sample sizes. Here the Type I error rate of the Siegel-Tukey test ranged from 0.047 to 0.052, while that of the Savage test ranged from 0.043 to 0.051. In cases of equal sample sizes, both tests generally had lower error rates, with the Savage test providing more consistent results for large sample sizes. In conclusion, the Savage test provides lower Type I error rates for small sample sizes, and both tests have similar error rates for large sample sizes. These findings suggest that the Savage test could be a more reliable option when analyzing variance differences.
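The Type I error simulation design can be illustrated as follows. Since the Siegel-Tukey and Savage tests are not available in SciPy, the sketch substitutes the Mann-Whitney U test as a stand-in rank test; the sample sizes and repetition count are illustrative, not the study's 20,000-repetition SAS setup:

```python
import numpy as np
from scipy import stats

def type1_rate(n1, n2, reps=2000, alpha=0.05, seed=0):
    """Monte Carlo estimate of a rank test's Type I error rate:
    both samples come from the same normal distribution, so every
    rejection is a false positive."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        a = rng.normal(0, 1, n1)
        b = rng.normal(0, 1, n2)
        if stats.mannwhitneyu(a, b).pvalue < alpha:
            hits += 1
    return hits / reps

rate = type1_rate(10, 10)
```

Swapping in Siegel-Tukey or Savage scores only changes the test statistic inside the loop; the surrounding simulation logic (generate under the null, count rejections, divide by repetitions) is exactly the design the study describes.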
A non-parametric method is used in this study to analyze and predict short-term rainfall due to tropical cyclones (TCs) at a coastal meteorological station. All 427 TCs during 1953-2011 that made landfall along the Southeast China coast within 700 km of a given meteorological station, Shenzhen, are analyzed and grouped according to their landfalling direction, distance, and intensity. The corresponding daily rainfall records at Shenzhen Meteorological Station (SMS) during the TC landfalling period (a couple of days before and after TC landfall) are collected. The maximum daily rainfall (R-24) and maximum 3-day accumulative rainfall (R-72) records at SMS for each TC category are analyzed by a non-parametric statistical method, percentile estimation. The results are plotted as statistical boxplots, expressed as probability of precipitation. The performance of the statistical boxplots in forecasting short-term rainfall at SMS is evaluated for the TC seasons in 2012 and 2013. Results show that the boxplot scheme can be used as a valuable reference to predict the short-term rainfall at SMS due to TCs landfalling along the Southeast China coast.
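Percentile estimation for one TC category reduces to computing boxplot-style quantiles of the grouped rainfall records. The gamma-distributed R-24 values below are synthetic stand-ins for the SMS records:

```python
import numpy as np

# Hypothetical R-24 (maximum daily rainfall, mm) records for one TC category
rng = np.random.default_rng(2)
r24 = rng.gamma(shape=2.0, scale=30.0, size=80)

# Boxplot-style percentile summary used as a probability-of-precipitation guide:
# e.g. 90% of historical storms in this category stayed below summary["p90"]
q = np.percentile(r24, [10, 25, 50, 75, 90])
summary = dict(zip(["p10", "p25", "median", "p75", "p90"], np.round(q, 1)))
```

Repeating this per category (direction × distance × intensity) yields the family of boxplots that the study evaluates against the 2012-2013 seasons.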
Travel time reliability (TTR) modeling has gained attention among researchers due to its ability to represent road user satisfaction as well as to provide predictability of trip travel time. Despite this significant effort, its impact on the severity of a crash is not well explored. This study analyzes the effect of TTR and other variables on the probability of crash severity on arterial roads. To address the unobserved heterogeneity problem, two random-effect regressions were applied: the Dirichlet random-effect (DRE) and the traditional random-effect (TRE) logistic regression. The difference between the two models is that the random effect in the DRE is non-parametrically specified, while in the TRE model it is parametrically specified. Markov Chain Monte Carlo simulations were adopted to infer the posterior distributions of the parameters of the two developed models. Using four-year police-reported crash data and travel speeds from Northeast Florida, the goodness-of-fit analysis found the DRE model to best fit the data. Hence, it was used in studying the influence of TTR and other variables on crash severity. The DRE model findings suggest that TTR is statistically significant, at 95 percent credible intervals, in influencing the severity level of a crash. A unit increase in TTR reduces the likelihood of a severe crash occurrence by 25 percent. Moreover, among the significant variables, alcohol/drug impairment was found to have the highest impact in influencing the occurrence of severe crashes. Other significant factors included traffic volume, weekends, speed, work zone, land use, visibility, seatbelt usage, segment length, undivided/divided highway, and age.
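Before TTR can enter a severity model, it must be quantified from observed travel times. The sketch below computes two common reliability measures, the buffer index and the planning time index, on synthetic trip times; these are generic definitions, not necessarily the exact TTR measure used in the study:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical segment travel times (minutes) over many trips;
# lognormal gives the right-skewed shape typical of congested arterials
tt = rng.lognormal(mean=2.5, sigma=0.3, size=1000)

# Buffer index: extra time (relative to the mean) a traveler must budget
# to arrive on time on 95% of trips
mean_tt = tt.mean()
p95 = np.percentile(tt, 95)
buffer_index = (p95 - mean_tt) / mean_tt

# Planning time index: 95th-percentile travel time relative to the median
planning_time_index = p95 / np.median(tt)
```

A segment-level reliability measure like either of these, computed from probe speeds, is the kind of TTR covariate that would then appear alongside volume, speed, and impairment indicators in the logistic severity regression.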
In this paper, sixty-eight research articles published between 2000 and 2017, as well as textbooks, which employed four classification algorithms: K-Nearest-Neighbor (KNN), Support Vector Machines (SVM), Random Forest (RF), and Neural Network (NN) as the main statistical tools, were reviewed. The aim was to examine and compare these nonparametric classification methods on the following attributes: robustness to training data, sensitivity to changes, data fitting, stability, ability to handle large data sizes, sensitivity to noise, time invested in parameter tuning, and accuracy. The performances, strengths, and shortcomings of each of the algorithms were examined, and finally, a conclusion was reached on which one has higher performance. It was evident from the literature reviewed that RF is too sensitive to small changes in the training dataset, is occasionally unstable, and tends to overfit the model. KNN is easy to implement and understand but has a major drawback of becoming significantly slower as the size of the data in use grows, while the ideal value of K for the KNN classifier is difficult to set. SVM and RF are insensitive to noise or overtraining, which shows their ability to deal with unbalanced data. Larger input datasets will lengthen classification times for NN and KNN more than for SVM and RF. Among these nonparametric classification methods, NN has the potential to become a more widely used classification algorithm, but because of its time-consuming parameter tuning procedure, its high level of computational complexity, the numerous types of NN architectures to choose from, and the high number of algorithms used for training, most researchers recommend SVM and RF as easier and more widely used methods that repeatedly achieve results with high accuracies and are often faster to implement.
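The KNN behavior discussed above (simple to implement, sensitive to the choice of K, slow on large data because every prediction scans the whole training set) is visible in a bare-bones implementation; the two-class Gaussian data are illustrative:

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, k=3):
    # Plain KNN classifier: majority vote among the k nearest training points.
    # The full pairwise distance matrix below is exactly why KNN slows
    # down as the training set grows.
    d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]
    votes = ytr[idx]
    return np.array([np.bincount(v).argmax() for v in votes])

rng = np.random.default_rng(4)
# Two well-separated Gaussian clusters in 2-D
X0 = rng.normal(0, 1, (50, 2))
X1 = rng.normal(3, 1, (50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)
acc = (knn_predict(X, y, X, k=5) == y).mean()
```

Sweeping k in such a sketch shows the tuning difficulty the review points out: small k overfits noise, large k blurs class boundaries, and no single value is ideal across datasets.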
Background: Bivariate count data are commonly encountered in medicine, biology, engineering, epidemiology, and many other applications. The Poisson distribution has been the model of choice to analyze such data. In most cases mutual independence among the variables is assumed; however, this fails to take into account the correlation between the outcomes of interest. A special bivariate form of the multivariate Lagrange family of distributions, named the Generalized Bivariate Poisson Distribution, is considered in this paper. Objectives: We estimate the model parameters using the method of maximum likelihood and show that the model fits the count variables representing components of metabolic syndrome in spousal pairs. We use the local likelihood score to test the significance of the correlation between the counts. We also construct confidence intervals on the ratio of the two correlated Poisson means. Methods: Based on a random sample of pairs of count data, we show that the score test of independence is locally most powerful. We also provide a formula for sample size estimation for a given level of significance and given power. The confidence intervals on the ratio of correlated Poisson means are constructed using the delta method, Fieller's theorem, and the nonparametric bootstrap. We illustrate the methodologies on metabolic syndrome data collected from 4000 spousal pairs. Results: The bivariate Poisson model fitted the metabolic syndrome data quite satisfactorily. Moreover, the three methods of confidence interval estimation produced almost identical intervals, with nearly the same interval widths.
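Of the three interval methods named, the delta method is the easiest to sketch. The version below assumes independent Poisson samples (the paper's correlated case would add a covariance term to the variance), with hypothetical count data:

```python
import numpy as np

def ratio_ci_delta(x, y, z=1.96):
    """Delta-method CI for the ratio of two Poisson means, assuming
    independent samples. For Poisson data, Var(sample mean) = mean / n,
    so Var(r) ~= r^2 * (1/(n1*m1) + 1/(n2*m2))."""
    m1, m2 = np.mean(x), np.mean(y)
    n1, n2 = len(x), len(y)
    r = m1 / m2
    se = r * np.sqrt(1.0 / (n1 * m1) + 1.0 / (n2 * m2))
    return r - z * se, r + z * se

rng = np.random.default_rng(5)
# Hypothetical counts, e.g. number of metabolic syndrome components
x = rng.poisson(2.0, 400)
y = rng.poisson(2.5, 400)
lo, hi = ratio_ci_delta(x, y)
```

Fieller's theorem instead inverts a quadratic in the ratio, and the bootstrap resamples pairs; with large samples all three tend to agree, consistent with the near-identical widths the paper reports.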
The probability distributions of wave characteristics from three groups of sampled ocean data with different significant wave heights have been analyzed using two transformation functions estimated by non-parametric and parametric methods. The marginal wave characteristic distributions and the joint densities of wave properties have been calculated using the two transformations, with the results and accuracy of both transformations presented here. The two transformations deviate slightly from each other in the calculation of the crest and trough height marginal wave distributions, as well as the joint densities of wave amplitude with other wave properties. The transformation methods for the calculation of the wave crest and trough height distributions are shown to provide good agreement with real ocean data. Our work will help in determining the most appropriate transformation procedure for the prediction of extreme values.
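As a generic illustration of checking an empirical wave-height distribution against a parametric model (the paper's two transformation functions are not specified here, so this is only an analogous exercise), the sketch compares empirical exceedance probabilities of synthetic heights with the narrow-band Rayleigh model:

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical wave-height record; a narrow-band sea gives Rayleigh-
# distributed heights with P(H > h) = exp(-2 h^2 / Hs^2),
# i.e. a Rayleigh scale of Hs / 2
hs = 2.0                      # significant wave height (m), illustrative
sigma = hs / 2.0
h = rng.rayleigh(scale=sigma, size=5000)

# Empirical vs model exceedance probability at a threshold
thr = 3.0
p_emp = np.mean(h > thr)
p_ray = np.exp(-thr**2 / (2 * sigma**2))
```

Comparing empirical and model tail probabilities at several thresholds is the basic accuracy check behind choosing a transformation procedure for extreme-value prediction.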
The analysis of survival data is a major focus of statistics. Interval-censored data reflect uncertainty as to the exact times the units failed within an interval. This type of data frequently comes from tests or situations where the objects of interest are not constantly monitored, so events are known only to have occurred between two observation times. Interval censoring has become increasingly common in the areas that produce failure time data. This paper explores the statistical analysis of interval-censored failure time data with applications. Three data sets, namely Breast Cancer, Hemophilia, and AIDS data, are used to illustrate the methods. Both parametric and non-parametric methods of analysis are carried out: the theory and methodology of the fitted models for interval-censored data are described, and the fitting of parametric and non-parametric models to the three real data sets is considered. Results derived from the different methods are presented and compared.
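The parametric side of interval-censored fitting maximizes a likelihood built from interval probabilities F(r) − F(l). The sketch below fits a Weibull model to synthetic data observed only between unit-spaced visits; the distribution choice and visit scheme are assumptions, not those of the three data sets:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

def fit_weibull_interval(left, right):
    """Parametric MLE for interval-censored data: each failure is only
    known to lie in (left_i, right_i], so each observation contributes
    log[F(right) - F(left)] to the log-likelihood."""
    left, right = np.asarray(left, float), np.asarray(right, float)

    def nll(theta):
        shape, scale = np.exp(theta)          # keep parameters positive
        p = (weibull_min.cdf(right, shape, scale=scale)
             - weibull_min.cdf(left, shape, scale=scale))
        return -np.sum(np.log(np.clip(p, 1e-12, None)))

    x0 = [0.0, np.log(np.mean(right))]        # crude but serviceable start
    res = minimize(nll, x0=x0, method="Nelder-Mead")
    return np.exp(res.x)                      # (shape, scale)

rng = np.random.default_rng(7)
t = rng.weibull(1.5, 300) * 10.0              # true shape 1.5, scale 10
left = np.floor(t)                            # failure seen only between visits
right = left + 1.0
shape_hat, scale_hat = fit_weibull_interval(left, right)
```

The non-parametric counterpart would replace the Weibull CDF with the Turnbull NPMLE over a set of candidate intervals; the likelihood structure, a product of interval masses, is the same.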
Fragility curves are commonly used in civil engineering to assess the vulnerability of structures to earthquakes. The probability of failure associated with a prescribed criterion (e.g., the maximal inter-storey drift of a building exceeding a certain threshold) is represented as a function of the intensity of the earthquake ground motion (e.g., peak ground acceleration or spectral acceleration). The classical approach relies on assuming a lognormal shape of the fragility curves; it is thus parametric. In this paper, we introduce two non-parametric approaches to establish the fragility curves without employing the above assumption, namely binned Monte Carlo simulation and kernel density estimation. As an illustration, we compute the fragility curves for a three-storey steel frame using a large number of synthetic ground motions. The curves obtained with the non-parametric approaches are compared with respective curves based on the lognormal assumption. A similar comparison is presented for a case when a limited number of recorded ground motions is available. It is found that the accuracy of the lognormal curves depends on the ground motion intensity measure, the failure criterion and most importantly, on the employed method for estimating the parameters of the lognormal shape.
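The binned Monte Carlo approach mentioned above can be sketched directly: run many analyses, bin them by intensity measure, and take the empirical failure fraction per bin. The lognormal "true" fragility used to generate outcomes here is purely synthetic, standing in for structural analyses:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
# Synthetic "analyses": an intensity measure (IM) and a binary failure outcome
im = rng.uniform(0.05, 2.0, 5000)          # e.g. PGA in g (illustrative)
# Hidden lognormal fragility (median 0.8 g, log-std 0.4) generates outcomes
p_true = norm.cdf((np.log(im) - np.log(0.8)) / 0.4)
fail = rng.random(5000) < p_true

# Binned Monte Carlo: empirical failure fraction per IM bin
bins = np.linspace(0.05, 2.0, 11)
centers = 0.5 * (bins[:-1] + bins[1:])
which = np.digitize(im, bins) - 1
frag_emp = np.array([fail[which == k].mean() for k in range(10)])
```

Plotting `frag_emp` against `centers` gives the non-parametric curve; fitting a lognormal CDF to the same outcomes and overlaying it reproduces the comparison the paper performs.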
Funding: Humanities and Social Sciences Youth Fund of the Ministry of Education of China, project "Theoretical Methods and Empirical Research on Differentiated Premium Rates and Subsidies for Agricultural Insurance" (11YJC630267); Humanities and Social Sciences General Project of the Liaoning Provincial Department of Education, "Ratemaking Models and Empirical Research on Policy-Oriented Agricultural Insurance" (W2011151); Doctoral Start-up Fund of Liaoning University of International Business and Economics, "Research on the Impact of Rural Financial Organization Innovation on Farmers' Income" (2013XJLXBSJJ007); international cooperation project of the Chinese and US agriculture ministries with Tarleton State University, Texas A&M University System (No. 53-3151-2-00017)
Funding: the National Science Foundation of China (No. 70673072) and the National Social Science Foundation of China (No. 10JZD013) for financial support
Funding: the Science and Technology Innovation Commission of Shenzhen Municipality (JCYJ20120617115926138); Scientific and Technological Project for the Regional Meteorological Center in South China, China Meteorological Administration (GRMC2012M15)
Funding: the Center for Accessibility and Safety for an Aging Population at Florida State University, Florida A&M University, and University of North Florida for funding support in research
Funding: Supported by the Marine Engineering Equipment Scientific Research Project of the Ministry of Industry and Information Technology of the PRC; the National Science and Technology Major Project of China (Grant No. 2016ZX05057020); and the National Natural Science Foundation of China (Grant No. 51809067)