Abstract: Conditional functional dependencies (CFDs) are an important technique for data consistency. However, CFDs are limited in their ability to 1) provide reasonable values for consistency repairing and 2) detect potential errors. This paper presents context-aware conditional functional dependencies (CCFDs), which help to provide reasonable values and detect potential errors. In particular, we focus on automatically discovering minimal CCFDs. We introduce context relativity to measure the relationship between CFDs. The overlap of related CFDs can provide reasonable values, which results in more accurate consistency repairing, and some related CFDs are combined into CCFDs. Moreover, we prove that discovering minimal CCFDs is NP-complete, and we design a precise method and a heuristic method. We also introduce the dominating value to facilitate the process in both methods. Additionally, the context relativity of the CFDs affects the cleaning results, so we give an approximate threshold of context relativity, chosen according to the data distribution, as a suggestion. Our empirical evaluation shows that the repairing results are more accurate.
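For readers unfamiliar with plain CFDs, the following is a minimal sketch of what it means for tuples to violate one. It is not the CCFD discovery or repair method of the paper. The function names, the `_` wildcard convention, and the sample attributes (`country`, `zip`, `city`) are all assumptions made for illustration.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed marker for "any value" in a pattern tuple


def matches(row, attrs, pattern):
    """True if the row agrees with the pattern on every listed attribute."""
    return all(pattern[a] == WILDCARD or row[a] == pattern[a] for a in attrs)


def cfd_violations(rows, lhs, rhs, pattern):
    """Return sorted indices of rows violating the CFD (lhs -> rhs, pattern).

    Only rows matching the pattern on the lhs attributes are constrained.
    Such a row violates the CFD if its rhs value differs from a constant
    in the pattern, or if another constrained row has the same lhs values
    but a different rhs value.
    """
    groups = defaultdict(list)
    violating = set()
    for i, row in enumerate(rows):
        if not matches(row, lhs, pattern):
            continue
        if pattern[rhs] != WILDCARD and row[rhs] != pattern[rhs]:
            violating.add(i)
        groups[tuple(row[a] for a in lhs)].append(i)
    for indices in groups.values():
        if len({rows[i][rhs] for i in indices}) > 1:
            violating.update(indices)
    return sorted(violating)


rows = [
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4", "city": "London"},
    {"country": "US", "zip": "EH4", "city": "Boston"},
    {"country": "UK", "zip": "W1B", "city": "London"},
]
# "Within the UK, zip determines city"; the US row is outside the condition.
uk_zip_city = {"country": "UK", "zip": WILDCARD, "city": WILDCARD}
conflicts = cfd_violations(rows, ["country", "zip"], "city", uk_zip_city)
```

In this example the first two rows conflict with each other. The check only reports the conflict and does not say which city is correct, which is the gap the abstract describes for repairing.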
Funding: partially supported by the National Key R&D Program of China (No. 2018YFB1004700), the National Natural Science Foundation of China (Nos. U1509216, U1866602, and 61602129), and Microsoft Research Asia.
Abstract: Current conditional functional dependency (CFD) discovery algorithms always need a well-prepared training dataset. This requirement makes them difficult to apply to large, low-quality datasets. To handle the volume issue of big data, we develop sampling algorithms that obtain a small representative training set. To address the low-quality issue, we design fault-tolerant rule discovery and conflict-resolution algorithms. We also propose a parameter selection strategy to ensure the effectiveness of the CFD discovery algorithms. Experimental results demonstrate that our method can discover effective CFD rules on billion-tuple data within a reasonable period.
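The abstract does not spell out the sampling or fault-tolerance algorithms, so the following is only a rough sketch of the general idea under stated assumptions. It uses uniform random sampling and a simple confidence threshold, neither of which is claimed to be the authors' design. The names `rule_confidence`, `holds_on_sample`, `sample_size`, and `min_confidence` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def rule_confidence(rows, lhs, rhs):
    """Fraction of rows that agree with the majority rhs value of their lhs group.

    A value of 1.0 means the dependency lhs -> rhs holds exactly; lower
    values mean some rows would have to be treated as errors.
    """
    groups = defaultdict(Counter)
    for row in rows:
        groups[tuple(row[a] for a in lhs)][row[rhs]] += 1
    total = sum(sum(counts.values()) for counts in groups.values())
    if total == 0:
        return 0.0
    kept = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return kept / total


def holds_on_sample(rows, lhs, rhs, sample_size, min_confidence, seed=0):
    """Accept lhs -> rhs if it holds with enough confidence on a sample.

    Assumption: a uniform random sample stands in for the paper's
    sampling algorithms, and the threshold tolerates dirty rows.
    """
    rng = random.Random(seed)
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    return rule_confidence(sample, lhs, rhs) >= min_confidence


# Nine consistent rows and one dirty row for the same zip code.
rows = [{"zip": "EH4", "city": "Edinburgh"} for _ in range(9)]
rows.append({"zip": "EH4", "city": "London"})

tolerant = holds_on_sample(rows, ["zip"], "city", sample_size=100, min_confidence=0.8)
strict = holds_on_sample(rows, ["zip"], "city", sample_size=100, min_confidence=0.95)
```

With one dirty row in ten, the rule is accepted at a threshold of 0.8 and rejected at 0.95. This shows why the choice of parameters matters for discovery on low-quality data.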