Abstract
Data with relation information are ubiquitous in environments such as social network platforms, e-commerce platforms, and scientific databases, and similarity queries over such data are a common operation in many applications. With the development and spread of social networking, e-commerce, and cloud computing, the volume of data with relation information has grown rapidly, making similarity queries over such data an active research topic in the database field. Against this background, this paper proposes a decision-tree-based distributed similarity query method for data with relation information. The method computes similarity according to the importance of attributes and can stop the computation once a given accuracy is reached, which reduces the computation cost while preserving accuracy. The paper also proposes two algorithms for computing decision trees over large data in a distributed environment; they incur less communication cost than existing methods, and their accuracy is guaranteed by probability theory. Extensive experiments verify the effectiveness and efficiency of the algorithms.
Source
《计算机科学与探索》
CSCD
2014, Issue 7, pp. 778-789 (12 pages)
Journal of Frontiers of Computer Science and Technology
Funding
National Natural Science Foundation of China (60973021, 61003060)
National Basic Research Program of China (973 Program) (2012CB316201)