Abstract: Software defect prediction (SDP) is an active research field in software engineering that aims to identify defect-prone modules. With SDP, limited testing resources can be allocated effectively to defect-prone modules. Although SDP requires sufficient local data within a company, there are cases where local data are not available, e.g., pilot projects. Companies without local data can employ cross-project defect prediction (CPDP), which uses external data to build classifiers. The major challenge of CPDP is that training and test data follow different distributions. To tackle this, instances of source data similar to the target data are selected to build classifiers. Software datasets also have a class imbalance problem: the ratio of the defective class to the clean class is very low, which usually lowers the performance of classifiers. We propose a Hybrid Instance Selection Using Nearest-Neighbor (HISNN) method that performs a hybrid classification, selectively learning local knowledge (via k-nearest neighbor) and global knowledge (via naive Bayes). Instances having strong local knowledge are identified via nearest neighbors with the same class label. Previous studies showed low PD (probability of detection) or high PF (probability of false alarm), which is impractical to use. The experimental results show that HISNN produces high overall performance as well as high PD and low PF.
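The abstract does not spell out HISNN's selection rules, so the following is only a minimal sketch of the general idea of mixing local knowledge (nearest neighbors) with global knowledge (naive Bayes). The function names, the Gaussian form of naive Bayes, the value of k, and the rule "trust the neighbors only when all k share one label" are assumptions made for illustration, not the authors' method.

```python
import math

def nearest_labels(x, X_train, y_train, k):
    """Labels of the k training points closest to x (Euclidean distance)."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(x, X_train[i]))
    return [y_train[i] for i in order[:k]]

def fit_gaussian_nb(X_train, y_train):
    """Per-class prior, feature means, and feature variances."""
    model = {}
    for label in set(y_train):
        rows = [x for x, y in zip(X_train, y_train) if y == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(y_train), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Class with the highest log-posterior under the fitted Gaussian model."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

def hybrid_predict(x, X_train, y_train, model, k=3):
    """Assumed rule: if all k neighbors agree, use their label (local);
    otherwise fall back to naive Bayes (global)."""
    labels = nearest_labels(x, X_train, y_train, k)
    if len(set(labels)) == 1:
        return labels[0], "local"
    return predict_gaussian_nb(model, x), "global"
```

A call such as `hybrid_predict(x, X_train, y_train, fit_gaussian_nb(X_train, y_train))` returns both the predicted label and which branch produced it, which makes it easy to check how often the local rule fires on a given dataset.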
Abstract: Personal credit risk assessment is an important part of the development of financial enterprises. Big data credit investigation is an inevitable trend in personal credit risk assessment, but some data are missing and the amount of data is small, which makes models difficult to train. In addition, different financial platforms require different models trained according to the characteristics of their current samples, which is time-consuming. To address these two problems, this paper uses the idea of transfer learning to build a transferable personal credit risk model based on Instance-based Transfer Learning (Instance-based TL). The model balances the weights of the samples in the source domain, migrates samples from the existing large dataset to the small-sample target domain, and finds the commonality between them. We also ran extensive experiments on the selection of base learners, including traditional machine learning algorithms and ensemble learning algorithms such as decision tree, logistic regression, and xgboost. The datasets come from a P2P platform and a bank. The results show that the AUC of Instance-based TL is 24% higher than that of the traditional machine learning model, demonstrating that the model has good application value. The model is evaluated using AUC, precision, recall, and F1, and these criteria support its application value from several aspects. At present, we are trying to apply this model to more fields to improve its robustness and applicability; we are also conducting more in-depth research on domain adaptation to enrich the model.
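The abstract does not describe how the source-domain weights are balanced, so the sketch below shows just one simple possibility: pool source and target samples, give each domain a fixed share of the total weight, and fit a weighted logistic regression as the base learner. The helper names, the `source_share` parameter, and the plain gradient-descent learner are all hypothetical choices for illustration.

```python
import math

def balanced_weights(n_source, n_target, source_share=0.5):
    """Per-sample weights so that source samples jointly carry
    `source_share` of the total weight (assumed balancing scheme)."""
    w_source = source_share / n_source
    w_target = (1.0 - source_share) / n_target
    return [w_source] * n_source + [w_target] * n_target

def fit_weighted_logistic(X, y, weights, lr=0.1, epochs=5000):
    """Gradient descent on the weighted log-loss; returns (coefficients, bias)."""
    coef = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_coef = [0.0] * len(coef)
        grad_bias = 0.0
        for x, label, w in zip(X, y, weights):
            z = bias + sum(c * v for c, v in zip(coef, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = w * (p - label)  # weighted residual for this sample
            grad_bias += err
            for j, v in enumerate(x):
                grad_coef[j] += err * v
        coef = [c - lr * g for c, g in zip(coef, grad_coef)]
        bias -= lr * grad_bias
    return coef, bias

def predict_proba(model, x):
    """Probability of the positive class for a single sample."""
    coef, bias = model
    z = bias + sum(c * v for c, v in zip(coef, x))
    return 1.0 / (1.0 + math.exp(-z))
```

With `source_share=0.5`, a handful of target samples carries as much total weight as a much larger source set, which is one way to keep the large dataset from dominating the fit.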
Funding: Supported by the NSC under Grant Nos. NSC-100-2221-E-110-083-MY3 and NSC-101-2622-E-110-011-CC3, and by the "Aim for the Top University Plan" of the National Sun-Yat-Sen University and the Ministry of Education.
Abstract: In this paper, we propose two weighted learning methods for the construction of single-hidden-layer feedforward neural networks. Both methods incorporate weighted least squares. Our idea is to let the training instances nearer to the query contribute more to the estimated output. Optimal networks are obtained by minimizing the weighted mean square error function. The results of a number of experiments demonstrate the effectiveness of the proposed methods.
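The abstract gives no formulas, so the block below is a hedged illustration of what a query-dependent weighted least-squares fit of the output layer can look like. The notation, the Gaussian distance weighting, and the bandwidth \sigma are assumptions; the two methods proposed in the paper may differ.

```latex
% Assumed notation: q is the query, (x_i, y_i) are the N training pairs,
% h(x) is the hidden-layer output vector, \beta the output-layer weights.
E(\beta) = \frac{1}{N} \sum_{i=1}^{N} w_i \bigl( y_i - h(x_i)^{\top} \beta \bigr)^2,
\qquad
w_i = \exp\!\left( -\frac{\lVert x_i - q \rVert^2}{2\sigma^2} \right)

% With H the matrix whose rows are h(x_i)^{\top} and W = \mathrm{diag}(w_1, \dots, w_N),
% setting the gradient of E to zero gives the weighted normal equations:
H^{\top} W H \, \beta^{*} = H^{\top} W \, y
\quad\Longrightarrow\quad
\beta^{*} = \bigl( H^{\top} W H \bigr)^{-1} H^{\top} W \, y
```

The closed form holds when H^T W H is invertible. Instances close to q receive weights near 1 and dominate the fit, while distant instances contribute almost nothing.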
Funding: This work was supported by the National Natural Science Foundation of China [grant number 41201393] and the Open Research Fund of State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing of Wuhan University [grant number 14I03].
Abstract: Schema matching is a critical step in the integration of heterogeneous web services, which include various types of web services and multiple versions of services of the same type. Mapping loss or mismatch usually occurs because of schema differences in structure and content and the variety in concept definition and organization. Current instance schema matching methods are not mature enough for heterogeneous web services because they cannot deal with instance data in the web service domain or capture all the semantics, especially metadata semantics. When employed individually, the metadata-based and the instance-based matching methods are not efficient at determining the concept relationships, which are crucial for finding high-quality matches between schema attributes. In this paper, we propose an improved schema matching method based on a combination of instance and metadata (CIM) matcher. The main idea of our approach is to utilize schema structure, element labels, and the corresponding instance data information. The matching process is divided into two phases. In the first phase, the metadata-based matchers compute the element label similarity of multi-version Open Geospatial Consortium web service schemas; the generated matching results are raw mappings, which are reused in the subsequent instance matching phase. In the second phase, the designed instance matching algorithms are applied to the instance data of the raw mappings, and fine mappings are generated. Finally, the raw mappings and the fine mappings are combined to obtain the final mappings. Our experiments are executed on different versions of Web Coverage Service and Web Feature Service instance data deployed in Geoserver. The results indicate that the CIM method can obtain more accurate matching results and is flexible enough to handle web service instance data.
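The two-phase flow described above can be sketched roughly as follows. The abstract names neither the similarity measures nor the combination rule, so string similarity from `difflib`, Jaccard overlap of instance values, the thresholds, and the weighted average controlled by `alpha` are all assumptions, and every function name is hypothetical.

```python
from difflib import SequenceMatcher

def label_similarity(a, b):
    """Case-insensitive string similarity between two element labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def raw_mappings(labels_a, labels_b, threshold=0.6):
    """Phase 1: candidate pairs whose element labels look alike."""
    pairs = []
    for a in labels_a:
        for b in labels_b:
            score = label_similarity(a, b)
            if score >= threshold:
                pairs.append((a, b, score))
    return pairs

def instance_similarity(values_a, values_b):
    """Jaccard overlap of the distinct instance values of two elements."""
    set_a, set_b = set(values_a), set(values_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def fine_mappings(raw, instances_a, instances_b, threshold=0.5):
    """Phase 2: keep raw pairs whose instance data also overlap."""
    fine = []
    for a, b, _ in raw:
        score = instance_similarity(instances_a.get(a, []), instances_b.get(b, []))
        if score >= threshold:
            fine.append((a, b, score))
    return fine

def combine(raw, fine, alpha=0.5):
    """Assumed combination: weighted average of label and instance scores.
    Raw pairs that did not survive phase 2 get an instance score of 0."""
    instance_scores = {(a, b): s for a, b, s in fine}
    return {(a, b): alpha * s + (1 - alpha) * instance_scores.get((a, b), 0.0)
            for a, b, s in raw}
```

Running phase 2 only on the raw mappings keeps the instance comparison cheap, since element pairs with unrelated labels are never compared value by value.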
Abstract: Because of the complexity of data, interpreting patterns or extracting information becomes difficult; machine learning is therefore used to teach machines how to handle data more efficiently. As datasets grow, various organizations now apply machine learning applications and algorithms. Many industries apply machine learning to extract relevant information for analysis purposes. Many scholars, mathematicians, and programmers have carried out research and applied several machine learning approaches in order to find solutions to problems. In this paper, we present a general review of machine learning, including various machine learning techniques. These techniques can be applied to different fields such as image processing, data mining, and predictive analysis. The paper aims at reviewing machine learning techniques and algorithms. The research methodology is based on qualitative analysis, in which the literature on machine learning is reviewed.