Clustering is one of the most widely used techniques for exploratory data analysis. Spectral clustering, a popular modern clustering algorithm, has been shown to be more effective in detecting clusters than many traditional algorithms. It has applications ranging from computer vision and information retrieval to social science and biology. With the size of databases soaring, clustering algorithms face steeply growing computation time and memory use. In this paper, we propose a parallel spectral clustering implementation based on MapReduce. Both the computation and the data storage are distributed, which solves the scalability problems of most existing algorithms. We empirically analyze the proposed implementation on both benchmark networks and a real social network dataset of about two million vertices and two billion edges crawled from Sina Weibo. The results show that the proposed implementation scales well, speeds up the clustering without sacrificing quality, and processes massive datasets efficiently on commodity machine clusters.
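To make the spectral idea concrete, here is a minimal single-machine sketch of spectral bipartitioning, the simplest form of spectral clustering. It is not the paper's MapReduce implementation: the function name `fiedler_bipartition`, the shifted power iteration, and the toy graph are all assumptions chosen for illustration. It splits a graph by the sign of the eigenvector belonging to the second-smallest eigenvalue of the graph Laplacian L = D - A.

```python
import math


def fiedler_bipartition(n, edges, iterations=500):
    """Split an undirected graph into two clusters by the sign of the Fiedler vector.

    Hypothetical helper for illustration; assumes nodes are 0..n-1 and the
    graph is connected with at least one edge.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    degree = [len(neighbors) for neighbors in adj]

    # 2 * max degree is an upper bound on the largest eigenvalue of L,
    # so (shift * I - L) has only non-negative eigenvalues.
    shift = 2 * max(degree)

    # Deterministic starting vector; any vector that is not constant works
    # for this sketch.
    vec = [float(i) for i in range(n)]
    for _ in range(iterations):
        # Remove the component along the constant vector (eigenvalue 0 of L).
        mean = sum(vec) / n
        vec = [x - mean for x in vec]
        # Multiply by (shift * I - L): (shift - deg_i) * x_i + sum of neighbor values.
        nxt = [
            (shift - degree[i]) * vec[i] + sum(vec[j] for j in adj[i])
            for i in range(n)
        ]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]

    # The sign of each entry assigns the node to one of two clusters.
    return [0 if x < 0 else 1 for x in vec]


if __name__ == "__main__":
    # Two triangles joined by a single bridge edge (2, 3).
    toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    print(fiedler_bipartition(6, toy_edges))
```

On the toy graph the two triangles end up in different clusters. A distributed version would have to parallelize the matrix-vector product inside the loop, which is the expensive step on graphs with billions of edges.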
This paper investigates the characteristics of a clinical dataset using a combination of feature selection and classification methods to handle missing values and understand the underlying statistical characteristics of a typical clinical dataset. A large clinical dataset typically presents challenges such as missing values, high dimensionality, and unbalanced classes. These pose an inherent problem when implementing feature selection and classification algorithms. With most clinical datasets, an initial exploration of the dataset is carried out, and attributes with more than a certain percentage of missing values are eliminated from the dataset. Later, with the help of missing value imputation, feature selection, and classification algorithms, prognostic and diagnostic models are developed. This paper has two main conclusions: 1) Despite the nature of clinical datasets and their large size, the choice of missing value imputation method does not affect the final performance. What is crucial is that the dataset is an accurate representation of the clinical problem; the methods of imputing missing values are not critical for developing classifiers and prognostic/diagnostic models. 2) Supervised learning has proven to be more suitable for mining clinical data than unsupervised methods. It is also shown that non-parametric classifiers such as decision trees give better results than parametric classifiers such as radial basis function networks (RBFNs).
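The preprocessing described above (eliminate attributes with too many missing values, then impute what remains) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: the helper names `drop_sparse_columns` and `impute_mean`, the 50% threshold, mean imputation, and the toy records are all hypothetical. Missing values are represented as `None`.

```python
def drop_sparse_columns(rows, max_missing_fraction):
    """Keep only attributes whose fraction of missing values is within the limit.

    rows: non-empty list of dicts sharing the same keys; None marks a missing value.
    """
    kept = []
    for col in rows[0].keys():
        missing = sum(1 for row in rows if row[col] is None)
        if missing / len(rows) <= max_missing_fraction:
            kept.append(col)
    return [{col: row[col] for col in kept} for row in rows]


def impute_mean(rows):
    """Replace each missing numeric value with the mean of the observed values."""
    result = [dict(row) for row in rows]  # copy so the input is left unchanged
    for col in rows[0].keys():
        observed = [row[col] for row in rows if row[col] is not None]
        if not observed:
            continue  # nothing to estimate from; leave the column as-is
        mean = sum(observed) / len(observed)
        for row in result:
            if row[col] is None:
                row[col] = mean
    return result


if __name__ == "__main__":
    # Toy records (made up for illustration): "lab" is missing in 3 of 4 rows.
    records = [
        {"age": 61, "bp": 120, "lab": None},
        {"age": 54, "bp": None, "lab": None},
        {"age": 47, "bp": 130, "lab": 2.5},
        {"age": 70, "bp": 140, "lab": None},
    ]
    filtered = drop_sparse_columns(records, max_missing_fraction=0.5)
    print(impute_mean(filtered))
```

Mean imputation is used here only because it is the simplest choice; the abstract's first conclusion is that the choice of imputation method did not affect the final performance.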
Content-Based Image Retrieval (CBIR) performs an automated classification task for a queried image. It can relieve a user of the laborious and time-consuming task of assigning metadata to images when working with a massive image collection. A user's definition or description of an image is subjective, so the same image may belong to different categories as defined by different users. Human-based and computer-based categorization might produce different results because their categorization criteria differ and depend on the dataset structure and the clustering techniques. This paper presents an approach to planning the dataset structure and choosing the clustering algorithm for a CBIR implementation. The paper has five sections: CBIR and QBE concepts are introduced in Section 1, related image categorization research is listed in Section 2, the five types of image clustering are described in Section 3, a comparative analysis is given in Section 4, and Section 5 concludes the study. The outcome of this paper will benefit CBIR developers across various applications.
In the healthcare system, a surgical team is a unit of experienced personnel who provide medical care to surgical patients during surgery. Selecting a surgical team is challenging for a multispecialty hospital, as the performance of its members affects the efficiency and reliability of the hospital's patient care. The effectiveness of a surgical team depends not only on its individual members but also on the coordination among them. In this paper, we addressed the challenges of surgical team selection faced by a multispecialty hospital and proposed a decision-making framework for selecting the optimal list of surgical teams for a given patient. The proposed framework focused on improving the existing surgical history management system by arranging surgery-bound patients into optimal subgroups based on similar characteristics and selecting an optimal list of surgical teams for a new surgical patient based on the patient's subgroups. To this end, two population-based meta-heuristic algorithms were proposed, one for clustering mixed datasets and one for multi-objective optimization. The proposed algorithms were tested using different datasets and benchmark functions. Furthermore, the proposed framework was validated through a case study of a real postoperative surgical dataset obtained from the orthopedic surgery department of a multispecialty hospital in India. The results revealed that the proposed framework was efficient in arranging patients into optimal groups as well as in selecting optimal surgical teams for a given patient.