Abstract
Hotspot detection aims to find the regions where data are concentrated and to present them in a form that people can readily understand. This paper designs a hotspot detection algorithm for multi-dimensional data containing both numerical and categorical features. Its core is CLTree+, a clustering algorithm that improves on CLTree. CLTree+ can directly cluster data that contain both numerical and categorical features, and it improves clustering results for numerical features with periodic behavior. Compared with CLTree, CLTree+ is also substantially more computationally efficient, which makes it usable on large-scale data. CLTree+ was applied to business data from a large Internet company, where it identified several data hotspots and presented them as easy-to-understand combinations of features and their values.
Authors
邹磊
朱晶
聂晓辉
苏亚
裴丹
孙宇
ZOU Lei; ZHU Jing; NIE Xiao-hui; SU Ya; PEI Dan; SUN Yu (Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; Beijing Didi Chuxing Company Limited, Beijing 100193, China)
Source
《小型微型计算机系统》
CSCD
Peking University Core Journal
2019, Issue 3, pp. 465-471 (7 pages)
Journal of Chinese Computer Systems
Keywords
hotspot detection
clustering
data mining
unsupervised decision tree
multi-dimensional data analysis