With massive amounts of data stored in databases, mining information and knowledge from databases has become an important research issue. Researchers in many different fields have shown great interest in data mining and knowledge discovery in databases. Several emerging applications in information-providing services, such as data warehousing and on-line services over the Internet, also call for data mining and knowledge discovery techniques to understand user behavior better, improve the service provided, and increase business opportunities. In response to this demand, this article provides a comprehensive survey of recently developed data mining and knowledge discovery techniques and introduces some real application systems. It concludes by listing problems and challenges for further research.
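Most rule-based classifiers cannot directly handle continuous data such as blood pressure. Discretization, a data preprocessing step, converts continuous data into a categorical format. Existing discretization algorithms do not take into account the multimodal class densities of continuous variables in a dataset, which may degrade the performance of rule-based classifiers. This paper proposes a new Discretization Algorithm based on Gaussian Mixture Model (DAGMM), which preserves the original patterns of the data by accounting for the multimodal distribution of continuous variables. The effectiveness of DAGMM is verified on four publicly available medical datasets. Experimental results show that DAGMM outperforms six other static discretization algorithms in terms of the number of rules generated and the classification accuracy of associative classification algorithms. Applying this method in clinical expert systems therefore has the potential to improve the performance of rule-based classifiers.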
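The abstract does not describe how DAGMM derives its bins, so the following is only a rough sketch of the general idea of mixture-based discretization, not the authors' algorithm. It assumes a one-dimensional Gaussian mixture fitted by plain EM with a deterministic initialization, and it assigns each value to the component with the highest posterior. The function names `fit_gmm_1d` and `discretize`, the number of components, and the sample readings are all hypothetical.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_gmm_1d(values, k, iterations=50, min_std=1e-3):
    """Fit a k-component 1-D Gaussian mixture with EM (illustrative only)."""
    lo, hi = min(values), max(values)
    # Deterministic start: means spread evenly over the observed range.
    means = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    stds = [max((hi - lo) / (2 * k), min_std)] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: posterior responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, stds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and spread of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, values)) / nj
            stds[j] = max(math.sqrt(var), min_std)  # floor avoids collapse
    return weights, means, stds


def discretize(values, params):
    """Map each value to the index of its most probable component.

    Indices are ordered by component mean, so 0 is the lowest bin.
    """
    weights, means, stds = params
    order = sorted(range(len(means)), key=lambda j: means[j])
    rank = {j: r for r, j in enumerate(order)}
    labels = []
    for x in values:
        dens = [w * normal_pdf(x, m, s)
                for w, m, s in zip(weights, means, stds)]
        best = max(range(len(dens)), key=dens.__getitem__)
        labels.append(rank[best])
    return labels


# Made-up systolic blood pressure readings with two visible modes.
readings = [110, 112, 115, 118, 120, 150, 152, 155, 158, 160]
params = fit_gmm_1d(readings, k=2)
bins = discretize(readings, params)
```

Compared with equal-width or equal-frequency binning, bin boundaries here follow the modes of the data, which is the property the abstract attributes to a mixture-based approach. Choosing the number of components is left open in this sketch.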
The amount of data available for decision making has increased tremendously in the age of the digital economy. Decision makers who fail to handle the data produced proficiently may make incorrect decisions and therefore harm their business. Thus, the task of extracting and classifying useful information efficiently and effectively from huge amounts of computational data is of special importance. In this paper, we consider data whose attributes may be both crisp and fuzzy. By examining suitable partial data, segments with different classes are formed; a multithreaded computation is then performed to generate crisp rules (if possible); and finally, the fuzzy partition technique is employed to deal with the fuzzy attributes for classification. The rules generated in classifying the overall data can be used to gain more knowledge from the data collected.
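The abstract names a fuzzy partition technique without specifying it. As a minimal sketch of what a fuzzy partition of a numeric attribute can look like, the code below assumes evenly spaced triangular membership functions over a known range, so that the memberships of any value sum to one. The function name `fuzzy_partition`, the range, and the labels are hypothetical, and this is not claimed to be the paper's method.

```python
def fuzzy_partition(x, lo, hi, labels):
    """Return the membership of x in each fuzzy set of an even partition.

    The sets have triangular membership functions whose peaks are spaced
    evenly between lo and hi; values outside the range are clamped, so the
    first and last sets act as shoulders. Requires at least two labels.
    """
    if len(labels) < 2:
        raise ValueError("a partition needs at least two fuzzy sets")
    width = (hi - lo) / (len(labels) - 1)  # distance between adjacent peaks
    x = min(max(x, lo), hi)
    return {
        label: max(0.0, 1.0 - abs(x - (lo + i * width)) / width)
        for i, label in enumerate(labels)
    }


def strongest_label(memberships):
    """Pick the fuzzy set in which the value has the highest membership."""
    return max(memberships, key=memberships.get)


# A made-up attribute on a 0-100 scale split into three fuzzy sets.
grades = fuzzy_partition(60, 0, 100, ["low", "medium", "high"])
label = strongest_label(grades)
```

A value such as 60 belongs mostly to "medium" and partly to "high", whereas a crisp attribute would fall into exactly one category. A rule learner can either keep the graded memberships or, as `strongest_label` does, reduce them to a single category.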
This paper focuses on improving decision tree induction algorithms when a kind of tie appears during the rule generation procedure for specific training datasets. The tie occurs when the target class outcomes in a leaf node's records appear in equal proportions, so that majority voting cannot be applied. To resolve this exception, we propose to base the prediction on the naive Bayes (NB) estimate, k-nearest neighbour (k-NN) and association rule mining (ARM). The other features used for splitting the parent nodes are also taken into consideration.
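To make the tie situation concrete, here is a small sketch of a leaf that falls back to a secondary predictor when majority voting is undecided. It illustrates only the k-NN option, and it assumes numeric features, Euclidean distance, and a vote among the leaf's own records; the abstract does not state these details, and the NB and ARM variants are not shown. The name `predict_at_leaf` and the sample records are hypothetical.

```python
import math
from collections import Counter


def predict_at_leaf(leaf_records, query, k=3):
    """Predict a class at a leaf, using k-NN only when the vote is tied.

    leaf_records: list of (feature_tuple, label) pairs that reached the leaf.
    query: feature tuple of the instance to classify.
    """
    counts = Counter(label for _, label in leaf_records).most_common()
    # Ordinary case: one class has strictly more records than any other.
    if len(counts) == 1 or counts[0][1] > counts[1][1]:
        return counts[0][0]
    # Tie: let the k records closest to the query decide instead.
    nearest = sorted(leaf_records,
                     key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# A made-up leaf holding two records of each class, so majority voting ties.
tied_leaf = [
    ((1.0, 1.0), "yes"),
    ((1.2, 0.9), "yes"),
    ((5.0, 5.0), "no"),
    ((5.1, 4.8), "no"),
]
prediction = predict_at_leaf(tied_leaf, (1.1, 1.0))
```

With two classes, an odd `k` keeps the neighbour vote itself from tying. With more classes a further fallback would be needed, which this sketch leaves out.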