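Abstract: Class imbalance is one of the common problems in data mining, and various strategies have been proposed to handle it. When the training samples are imbalanced, a kNN classifier tends to misclassify samples from small classes into large classes, which lowers the macro-F1 of the classification. To address this weakness of kNN, the concept of the critical point (CP) of a text training set is proposed and its properties are examined, and algorithms are given for computing CP, its lower approximation LA, and its upper approximation UA. The traditional kNN decision function is then modified according to LA or UA and the number of training samples; this is adaptive weighted kNN text classification. To verify its effectiveness, two groups of comparative experiments were designed. One compares different shrink factors, which can be viewed as a comparison with Tan's work and also serves to confirm that the classifier's macro-F1 is better at LA or UA. The other compares the method with random resampling, using the traditional kNN method as the baseline. The experiments show that the proposed adaptive weighted kNN text classification outperforms random resampling and clearly raises macro-F1. The method is somewhat similar to cost-sensitive learning.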
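To make the idea of a class-weighted kNN vote concrete, here is a minimal sketch in plain Python. It does not reproduce the CP, LA, or UA algorithms, which the abstract names but does not specify. The weighting rule `(n_min / n) ** (1 / exponent)`, the function name `weighted_knn_predict`, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def weighted_knn_predict(train_x, train_y, query, k=5, exponent=2.0):
    """Classify `query` with kNN votes scaled by inverse class size.

    Assumed weighting (not the LA/UA-based rule from the abstract): a class
    with n training samples gets weight (n_min / n) ** (1 / exponent), so the
    smallest class keeps weight 1 and larger classes are shrunk.
    With exponent=math.inf every weight is 1, i.e. plain majority-vote kNN.
    """
    class_sizes = Counter(train_y)
    n_min = min(class_sizes.values())
    weights = {c: (n_min / n) ** (1.0 / exponent) for c, n in class_sizes.items()}

    # The k nearest training samples by Euclidean distance.
    neighbours = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query)
    )[:k]

    # Each neighbour votes for its class with that class's weight.
    scores = defaultdict(float)
    for _, label in neighbours:
        scores[label] += weights[label]
    return max(scores, key=scores.get)

# Toy data (hypothetical): 8 "big" samples crowd around 2 "small" samples.
train_x = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
           (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (0.3, 0.1),
           (0.5, 0.5), (0.6, 0.5)]
train_y = ["big"] * 8 + ["small"] * 2
query = (0.4, 0.4)

print(weighted_knn_predict(train_x, train_y, query, exponent=math.inf))  # big
print(weighted_knn_predict(train_x, train_y, query, exponent=2.0))       # small
```

In this toy case the two nearest neighbours belong to the small class, but three of the five nearest belong to the big class, so the unweighted vote goes to the big class. Shrinking the big-class votes reverses the decision.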
Funding: This work was funded by the National Natural Science Foundation of China (Grant No. 41941019) and the State Key Laboratory of Hydroscience and Engineering (Grant No. 2019-KY-03).
Abstract: Real-time prediction of the rock mass class in front of the tunnel face is essential for the adaptive adjustment of tunnel boring machines (TBMs). During the TBM tunnelling process, a large amount of operation data is generated, reflecting the interaction between the TBM system and the surrounding rock, and these data can be used to evaluate the rock mass quality. This study proposed a stacking ensemble classifier for the real-time prediction of the rock mass classification using TBM operation data. Based on the Songhua River water conveyance project, a total of 7538 TBM tunnelling cycles and the corresponding rock mass classes are obtained after data preprocessing. Then, through the tree-based feature selection method, 10 key TBM operation parameters are selected, and the mean values of the 10 selected features in the stable phase after removing outliers are calculated as the inputs of the classifiers. The preprocessed data are randomly divided into a training set (90%) and a test set (10%) using simple random sampling. Besides the stacking ensemble classifier, seven individual classifiers are established for comparison: support vector machine (SVM), k-nearest neighbors (KNN), random forest (RF), gradient boosting decision tree (GBDT), decision tree (DT), logistic regression (LR), and multilayer perceptron (MLP). The hyper-parameters of each classifier are optimised using the grid search method. The prediction results show that the stacking ensemble classifier performs better than the individual classifiers and shows a more powerful learning and generalisation ability for small and imbalanced samples. Additionally, a relatively balanced training set is obtained by the synthetic minority oversampling technique (SMOTE), and the influence of sample imbalance on the prediction performance is discussed.
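The core step of SMOTE, as mentioned in the abstract above, can be sketched in a few lines of plain Python: each synthetic sample is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. This is only a simplified illustration, not the configuration used in the study. The function name `smote_sketch`, the value `k=3`, and the toy points are assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours.

    Simplified sketch: needs at least two minority samples and uses a
    brute-force neighbour search.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of `base`, excluding `base` itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(p, base))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> partner
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, partner)))
    return synthetic

# Hypothetical minority-class feature vectors (two features each).
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point is a convex combination of two real minority samples, it always falls inside the region already spanned by the minority class. In practice a maintained implementation would normally be preferred over a hand-written one.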
Funding: Support from the National Academy of Sciences India (NASI), Allahabad, India, is acknowledged, as is the Director, National Institute of Advanced Studies (NIAS), Bengaluru, India, for providing the infrastructure facilities to carry out this work. Supported by the Shanghai High-Level Base-Building Project for Industrial Technology Innovation.
Abstract: This article reviews the theory of fairness in AI, from machine learning to federated learning, and also discusses the constraints on precision AI fairness and perspective solutions. For a reliable and quantitative evaluation of AI fairness, many associated concepts have been proposed, formulated, and classified. However, the inexplicability of machine learning systems makes it almost impossible to include all necessary details in the modelling stage to ensure fairness. Privacy worries induce data unfairness, and hence biases in the datasets used for evaluating AI fairness are unavoidable. The imbalance between algorithms' utility and humanization has further reinforced such worries. Even for federated learning systems, these constraints on precision AI fairness still exist. A perspective solution is to reconcile the federated learning processes and reduce biases and imbalances accordingly.
Abstract: Classification of sheep behaviour from a sequence of tri-axial accelerometer data has the potential to enhance sheep management. Sheep behaviour is inherently imbalanced (e.g., more ruminating than walking), resulting in underperforming classification for the minority activities, which hold importance. Existing works have not addressed class imbalance and use traditional machine learning techniques, e.g., Random Forest (RF). We investigated Deep Learning (DL) models appropriate for sequential data, namely Long Short Term Memory (LSTM) and Bidirectional LSTM (BLSTM), trained on imbalanced data. Two data sets were collected in normal grazing conditions using jaw-mounted and ear-mounted sensors. Novel to this study, alongside typical single classes, e.g., walking, data samples were labelled with compound classes, e.g., walking_grazing, depending on the behaviours. The number of steps a sheep performed in the observed 10 s time window was also recorded and incorporated in the models. We designed several multi-class classification studies with imbalance being addressed using synthetic data. DL models achieved superior performance to traditional ML models, especially with augmented data (e.g., 4-Class+Steps: LSTM 88.0%, RF 82.5%). DL methods showed superior generalisability on unseen sheep (i.e., F1-score: BLSTM 0.84, LSTM 0.83, RF 0.65). LSTM, BLSTM, and RF achieved sub-millisecond average inference time, making them suitable for real-time applications. The results demonstrate the effectiveness of DL models for sheep behaviour classification in grazing conditions and show that the DL techniques can generalise across different sheep. The study presents a strong foundation for the development of such models for real-time animal monitoring.
Funding: The authors would like to extend their gratitude to Universiti Teknologi PETRONAS (Malaysia) for funding this research through grant number 015LA0-037.
Abstract: Every application in a smart city environment, such as the smart grid, health monitoring, security, and surveillance, generates non-stationary data streams. Because of this, the statistical properties of the data change over time, leading to class imbalance and concept drift issues. Both issues cause model performance degradation. Most current work has focused on developing an ensemble strategy that trains a new classifier on the latest data to resolve the issue. These techniques suffer while training the new classifier if the data is imbalanced. Also, the class imbalance ratio may change greatly from one input stream to another, making the problem more complex. The existing solutions proposed for addressing the combined issue of class imbalance and concept drift lack an understanding of the correlation of one problem with the other. This work studies the association between concept drift and class imbalance ratio and then demonstrates how changes in the class imbalance ratio along with concept drift affect the classifier's performance. We analyzed the effect of both issues on minority and majority classes individually. To do this, we conducted experiments on benchmark datasets using state-of-the-art classifiers especially designed for data stream classification. Precision, recall, F1 score, and geometric mean were used to measure the performance. Our findings show that when class imbalance and concept drift occur together, the performance can decrease by up to 15%. Our results also show that an increase in the imbalance ratio can cause a 10% to 15% decrease in the precision scores of both minority and majority classes. The study findings may help in designing intelligent and adaptive solutions that can cope with the challenges of non-stationary data streams, such as concept drift and class imbalance.
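The per-class measures named in the abstract above (precision, recall, F1 score, geometric mean) can be computed from true and predicted labels as sketched below. The abstract does not define its geometric mean, so this sketch assumes the common imbalanced-learning convention of taking the geometric mean of the per-class recalls. The function names and the toy labels are illustrative only.

```python
import math

def per_class_metrics(y_true, y_pred):
    """Return precision, recall and F1 for every class label."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    out = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[c] = {"precision": precision, "recall": recall, "f1": f1}
    return out

def macro_f1(metrics):
    """Unweighted mean of per-class F1, so small classes count equally."""
    return sum(m["f1"] for m in metrics.values()) / len(metrics)

def g_mean(metrics):
    """Geometric mean of per-class recalls (assumed definition)."""
    recalls = [m["recall"] for m in metrics.values()]
    return math.prod(recalls) ** (1.0 / len(recalls))

# Hypothetical stream window: 8 majority and 2 minority samples,
# with one minority sample misclassified as majority.
y_true = ["maj"] * 8 + ["min"] * 2
y_pred = ["maj"] * 8 + ["maj", "min"]

metrics = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, round(macro_f1(metrics), 3), round(g_mean(metrics), 3))
```

Accuracy in this toy window is 0.9 even though half of the minority class is missed. Macro-F1 and the geometric mean drop noticeably lower, which is why such measures are preferred for imbalanced streams.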
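Abstract: To address class imbalance in network traffic classification, a traffic classification algorithm based on K-means and k-nearest neighbor (KMkNN) is proposed, and an ensemble classifier based on KMkNN (KKEC) is designed on top of it. Different subsets of the input features are first extracted and used to train separate classifiers; the ensemble output is then produced by a voting scheme that combines absolute-majority and plurality voting; finally, the method is tested on imbalanced datasets. Both theoretical analysis and experimental results show that the algorithm improves the recognition rate of minority-class flows when faced with imbalanced protocol flows.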
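The abstract above does not say how absolute-majority and plurality voting are combined, so the following is only one possible reading, sketched in plain Python. The ensemble first checks whether any class holds more than half of the votes and otherwise falls back to the plurality winner, reporting which rule decided. The function name `combined_vote` and the protocol labels are hypothetical.

```python
from collections import Counter

def combined_vote(votes):
    """Return (label, rule) for a list of base-classifier votes.

    rule is "absolute" when the winner has more than half of all votes,
    otherwise "plurality". Ties are broken by first occurrence in `votes`.
    """
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    rule = "absolute" if top > len(votes) / 2 else "plurality"
    return label, rule

# Hypothetical outputs of classifiers trained on different feature subsets.
print(combined_vote(["P2P", "P2P", "WEB"]))
print(combined_vote(["P2P", "WEB", "DNS", "P2P", "FTP"]))
```

Under this reading the returned label is always the plurality winner. The extra information is the rule flag, which a caller could treat as a confidence signal, for example by handling plurality-only decisions more cautiously.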