Time series clustering is a challenging problem due to the large-volume,high-dimensional,and warping characteristics of time series data.Traditional clustering methods often use a single criterion or distance measure,...Time series clustering is a challenging problem due to the large-volume,high-dimensional,and warping characteristics of time series data.Traditional clustering methods often use a single criterion or distance measure,which may not capture all the features of the data.This paper proposes a novel method for time series clustering based on evolutionary multi-tasking optimization,termed i-MFEA,which uses an improved multifactorial evolutionary algorithm to optimize multiple clustering tasks simultaneously,each with a different validity index or distance measure.Therefore,i-MFEA can produce diverse and robust clustering solutions that satisfy various preferences of decision-makers.Experiments on two artificial datasets show that i-MFEA outperforms single-objective evolutionary algorithms and traditional clustering methods in terms of convergence speed and clustering quality.The paper also discusses how i-MFEA can address two long-standing issues in time series clustering:the choice of appropriate similarity measure and the number of clusters.展开更多
Web log mining is analysis of web log files with web page sequences. Discovering user access patterns from web access are necessary for building adaptive web servers, to improve e-commerce, to carry out cross-marketin...Web log mining is analysis of web log files with web page sequences. Discovering user access patterns from web access are necessary for building adaptive web servers, to improve e-commerce, to carry out cross-marketing, for web personalization, to predict web access sequence etc. In this paper, a new agglomerative clustering technique is proposed to identify users with similar interest, and to determine the motivation for visiting a website. Using this approach, web usage mining is done through different stages namely data cleaning, preprocessing, pattern discovery and pattern analysis. Results are given to explain how this approach produces tight usage clusters than the existing web usage mining techniques. Rather than traditional distance based clustering, the similarity measure is considered during clustering process in order to reduce computational complexity. This paper also deals with the problem of assessing the quality of user session clusters and cluster validity is measured by using statistical test, which measures the distances of clusters distributions to infer their dissimilarity and distinguish level. Using such statistical measures, it is proved that cluster accuracy is improved to the extent of 0.83, over existing k-means clustering with validity measure 0.26, FCM (Fuzzy C Means) clustering with validity measure 0.56. Rough set based clustering with validity measure 0.54 Generation of dense clusters is essential for finding interesting patterns needed for further mining and analysis.展开更多
Purpose-The most commonly used approaches for cluster validation are based on indices but the majority of the existing cluster validity indices do not work well on data sets of different complexities.The purpose of th...Purpose-The most commonly used approaches for cluster validation are based on indices but the majority of the existing cluster validity indices do not work well on data sets of different complexities.The purpose of this paper is to propose a new cluster validity index(ARSD index)that works well on all types of data sets.Design/methodology/approach-The authors introduce a new compactness measure that depicts the typical behaviour of a cluster where more points are located around the centre and lesser points towards the outer edge of the cluster.A novel penalty function is proposed for determining the distinctness measure of clusters.Random linear search-algorithm is employed to evaluate and compare the performance of the five commonly known validity indices and the proposed validity index.The values of the six indices are computed for all nc ranging from(nc_(min),nc_(max))to obtain the optimal number of clusters present in a data set.The data sets used in the experiments include shaped,Gaussian-like and real data sets.Findings-Through extensive experimental study,it is observed that the proposed validity index is found to be more consistent and reliable in indicating the correct number of clusters compared to other validity indices.This is experimentally demonstrated on 11 data sets where the proposed index has achieved better results.Originality/value-The originality of the research paper includes proposing a novel cluster validity index which is used to determine the optimal number of clusters present in data sets of different complexities.展开更多
基金supported by the Open Project of Xiangjiang Laboratory(No.22XJ02003)the National Natural Science Foundation of China(No.62122093).
文摘Time series clustering is a challenging problem due to the large-volume,high-dimensional,and warping characteristics of time series data.Traditional clustering methods often use a single criterion or distance measure,which may not capture all the features of the data.This paper proposes a novel method for time series clustering based on evolutionary multi-tasking optimization,termed i-MFEA,which uses an improved multifactorial evolutionary algorithm to optimize multiple clustering tasks simultaneously,each with a different validity index or distance measure.Therefore,i-MFEA can produce diverse and robust clustering solutions that satisfy various preferences of decision-makers.Experiments on two artificial datasets show that i-MFEA outperforms single-objective evolutionary algorithms and traditional clustering methods in terms of convergence speed and clustering quality.The paper also discusses how i-MFEA can address two long-standing issues in time series clustering:the choice of appropriate similarity measure and the number of clusters.
文摘Web log mining is analysis of web log files with web page sequences. Discovering user access patterns from web access are necessary for building adaptive web servers, to improve e-commerce, to carry out cross-marketing, for web personalization, to predict web access sequence etc. In this paper, a new agglomerative clustering technique is proposed to identify users with similar interest, and to determine the motivation for visiting a website. Using this approach, web usage mining is done through different stages namely data cleaning, preprocessing, pattern discovery and pattern analysis. Results are given to explain how this approach produces tight usage clusters than the existing web usage mining techniques. Rather than traditional distance based clustering, the similarity measure is considered during clustering process in order to reduce computational complexity. This paper also deals with the problem of assessing the quality of user session clusters and cluster validity is measured by using statistical test, which measures the distances of clusters distributions to infer their dissimilarity and distinguish level. Using such statistical measures, it is proved that cluster accuracy is improved to the extent of 0.83, over existing k-means clustering with validity measure 0.26, FCM (Fuzzy C Means) clustering with validity measure 0.56. Rough set based clustering with validity measure 0.54 Generation of dense clusters is essential for finding interesting patterns needed for further mining and analysis.
文摘Purpose-The most commonly used approaches for cluster validation are based on indices but the majority of the existing cluster validity indices do not work well on data sets of different complexities.The purpose of this paper is to propose a new cluster validity index(ARSD index)that works well on all types of data sets.Design/methodology/approach-The authors introduce a new compactness measure that depicts the typical behaviour of a cluster where more points are located around the centre and lesser points towards the outer edge of the cluster.A novel penalty function is proposed for determining the distinctness measure of clusters.Random linear search-algorithm is employed to evaluate and compare the performance of the five commonly known validity indices and the proposed validity index.The values of the six indices are computed for all nc ranging from(nc_(min),nc_(max))to obtain the optimal number of clusters present in a data set.The data sets used in the experiments include shaped,Gaussian-like and real data sets.Findings-Through extensive experimental study,it is observed that the proposed validity index is found to be more consistent and reliable in indicating the correct number of clusters compared to other validity indices.This is experimentally demonstrated on 11 data sets where the proposed index has achieved better results.Originality/value-The originality of the research paper includes proposing a novel cluster validity index which is used to determine the optimal number of clusters present in data sets of different complexities.