Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 12031005 and 12101292); the National Natural Science Foundation of China (Grant No. 12031005); the National Natural Science Foundation of China (Grant No. 12171099) and the Natural Science Foundation of Shanghai (Grant No. 21ZR1432900).
Abstract: In this paper, we investigate the limiting spectral distribution of a high-dimensional Kendall's rank correlation matrix. The underlying population is allowed to have a general dependence structure. The resulting limit no longer follows the generalized Marčenko-Pastur law, which is a new phenomenon. It is the first result on rank correlation matrices with dependence. As applications, we study Kendall's rank correlation matrix for multivariate normal distributions with a general covariance matrix. From these results, we gain further insight into Kendall's rank correlation matrix and its connections with the sample covariance and sample correlation matrices.
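For orientation, the object studied above can be sketched in a few lines. The abstract does not state a definition, so this assumes the standard tau-a form, in which entry (k, l) averages sign(x_ik - x_jk) * sign(x_il - x_jl) over all sample pairs i < j. The function names `sign` and `kendall_tau_matrix` are illustrative, not taken from the paper.

```python
import itertools


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def kendall_tau_matrix(samples):
    """Kendall's rank correlation matrix (tau-a form, an assumption here).

    `samples` is a list of n observations, each a list of p coordinates.
    Entry (k, l) is the average over all pairs i < j of
    sign(x_ik - x_jk) * sign(x_il - x_jl).
    """
    n = len(samples)
    p = len(samples[0])
    acc = [[0.0] * p for _ in range(p)]
    for xi, xj in itertools.combinations(samples, 2):
        s = [sign(a - b) for a, b in zip(xi, xj)]
        for k in range(p):
            for l in range(p):
                acc[k][l] += s[k] * s[l]
    n_pairs = n * (n - 1) / 2
    return [[v / n_pairs for v in row] for row in acc]


# Perfectly concordant and perfectly discordant toy samples.
concordant = kendall_tau_matrix([[1, 1], [2, 2], [3, 3]])
discordant = kendall_tau_matrix([[1, 3], [2, 2], [3, 1]])
```

The limiting spectral distribution in the abstract concerns the eigenvalues of this p x p matrix as n and p grow together; the sketch only builds the matrix.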
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 11601375 and 11626250).
Abstract: We show large deviation expansions for sums of independent random variables that are bounded from above. Our moderate deviation expansions are similar to those of Cramér (1938), Bahadur and Ranga Rao (1960), and Sakhanenko (1991). In particular, our results extend Talagrand's inequality from bounded random variables to random variables having finite (2 + δ)-th moments, where δ ∈ (0, 1]. As a consequence, we obtain an improvement of Hoeffding's inequality. Applications to linear regression, self-normalized large deviations and the t-statistic are also discussed.
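For reference, the classical form of Hoeffding's inequality, which is the baseline being improved, can be written as follows. This is the textbook statement, not the refined bound of the paper, whose exact form the abstract does not give; the symbols $X_i$, $a_i$, $b_i$, $S_n$ and $t$ are notation chosen here.

```latex
% Classical Hoeffding inequality (1963), stated for reference only.
% Assumption: X_1, ..., X_n are independent with a_i <= X_i <= b_i almost surely.
\[
  S_n = \sum_{i=1}^{n} \bigl( X_i - \mathbb{E} X_i \bigr), \qquad
  \mathbb{P}\bigl( S_n \ge t \bigr)
  \le \exp\!\left( - \frac{2 t^{2}}{\sum_{i=1}^{n} (b_i - a_i)^{2}} \right),
  \quad t > 0 .
\]
```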
Abstract: This survey aims to deliver an extensive and well-structured overview of using machine learning to detect anomalies in streaming datasets. The objective is to assess the effectiveness of Hoeffding Trees as a machine learning solution for detecting anomalies in streaming cyber datasets. We group the existing research on Hoeffding Trees that is feasible for this type of study into three categories: distributed Hoeffding Trees, ensembles of Hoeffding Trees, and existing techniques that use Hoeffding Trees for anomaly detection. These categories are referred to as compositions within this paper and were selected based on their relation to streaming data and the flexibility of their techniques for use within different domains of streaming data. We discuss how combining the techniques of the research works within these compositions can address the anomaly detection problem in streaming cyber datasets. The goal is to show how a combination of techniques from different compositions can solve a prominent problem, anomaly detection.
Funding: The National Natural Science Foundation of China (Nos. 60873108, 61175047 and 61152001) and the Fundamental Research Funds for the Central Universities of China (No. SWJTU11ZT08).
Abstract: Classification using the decision tree algorithm is a widely studied problem in data streams. The challenge is deciding when to split a decision node into multiple leaves. Concentration inequalities that exploit variance information, such as Bernstein's and Bennett's inequalities, are often substantially tighter than Hoeffding's bound, which disregards variance. Many machine learning algorithms for stream classification, such as the very fast decision tree (VFDT) learner, AdaBoost and support vector machines (SVMs), use Hoeffding's bound as a performance guarantee. In this paper, we propose a new algorithm based on the recently proposed empirical Bernstein's bound to achieve a better probabilistic bound on the accuracy of the decision tree. Experimental results on four synthetic and two real-world data sets demonstrate the performance gain of our proposed technique.
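The contrast between the two bounds can be sketched numerically. The abstract does not give the paper's exact formulas, so this assumes the usual VFDT radius sqrt(R^2 ln(1/δ) / (2n)) and one common form of the empirical Bernstein radius, sqrt(2 V ln(3/δ) / n) + 3 R ln(3/δ) / n, where R is the range of the split statistic and V its empirical variance. All function names are illustrative.

```python
import math


def hoeffding_radius(value_range, delta, n):
    """Hoeffding confidence radius as used in VFDT-style split tests."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def empirical_bernstein_radius(value_range, delta, n, sample_variance):
    """One common empirical Bernstein radius (assumed form, see lead-in)."""
    log_term = math.log(3.0 / delta)
    return (math.sqrt(2.0 * sample_variance * log_term / n)
            + 3.0 * value_range * log_term / n)


def should_split(best_gain, second_gain, radius):
    """Split when the gap between the two best attributes exceeds the radius."""
    return (best_gain - second_gain) > radius


# With low variance and enough samples, the Bernstein radius is smaller,
# so a split can be justified with fewer observations.
h = hoeffding_radius(1.0, 1e-7, 10000)
eb_low = empirical_bernstein_radius(1.0, 1e-7, 10000, 1e-4)
eb_high = empirical_bernstein_radius(1.0, 1e-7, 10000, 0.25)
```

Under these assumptions the variance-aware radius wins only when the observed variance is small relative to the range; at the maximal variance for range 1 (0.25) it is larger than the Hoeffding radius.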
Funding: This work was supported by a JSPS Grant-in-Aid for Young Scientists (Grant No. 18K12873) and Waseda University Grants for Special Research Projects ("Tokutei Kadai") (Grant No. 2019C-688).
Abstract: When addressing various financial problems, such as estimating stock portfolio risk, it is necessary to derive the distribution of a sum of dependent random variables. Although deriving this distribution requires identifying the joint distribution of these random variables, exact estimation of the joint distribution of dependent random variables is difficult. Therefore, in recent years, studies have been conducted on bounds for the sum of dependent random variables under dependence uncertainty. In this study, we obtain an improved Hoeffding inequality for dependent bounded variables. Further, we extend this result to the case of sub-Gaussian random variables.
Funding: This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 11971486, 11771452), the Natural Science Foundation of Hunan Province (Grant Nos. 2019JJ40357, 2020JJ4674) and the Innovation Program of Central South University (Grant No. 2020zzts039).
Abstract: We investigate Hoeffding's inequality for both discrete-time Markov chains and continuous-time Markov processes on a general state space. Our results relax the usual aperiodicity restriction in the literature, and the explicit upper bounds in the inequalities are obtained via the solution of Poisson's equation. The results are further illustrated with applications to queueing theory and reflective diffusion processes.
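Abstract: To address the problems of concept representation and classifier selection in recurring concept drift detection, we propose an algorithm for classifying data streams with recurring concept drift: conceptual clustering and prediction through main feature extraction (MFCCP). MFCCP identifies recurring concepts by computing the dissimilarity between the main features, and their impact factors, of different batches of samples. It maintains and promptly updates one classifier per concept, and uses Hoeffding's inequality to select the most suitable classifier for the current sample set, which improves its responsiveness to concept drift. Experiments on three data sets show that, on data sets with recurring concept drift, MFCCP clearly outperforms four competing algorithms in classification accuracy, responsiveness to concept drift, and accuracy of concept drift detection, and that MFCCP is also applicable to data streams without recurring concept drift.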