Abstract
Modern data management must handle data from diverse sources and of varying quality. System-level support for data provenance, which lets users see where data comes from and how it was derived, has therefore become a critical research topic. Annotation is one of the basic approaches to tracking provenance. Its main drawback is storage overhead: the complete fine-grained annotations for a data set can exceed the size of the data itself. In this paper, we propose a framework for storing provenance information for data derived via relational queries. The framework represents provenance as trees that match the query structure, which avoids redundantly storing information about the derivation process. Within this framework, we present a series of storage optimization methods for relational queries that select the query tree nodes at which provenance information should be stored. The optimization algorithms run in time polynomial in the query size and linear in the size of the provenance, so provenance tracking and optimization do not incur large overheads. The framework offers a new approach to data provenance research and has a wide range of potential applications.
Source
Chinese Journal of Computers (《计算机学报》)
EI
CSCD
PKU Core
2011, Issue 10, pp. 1863-1875 (13 pages)
Funding
Doctoral Program Fund for New Teachers, Ministry of Education of China (200804861067)
Australian Research Council (ARC) project grant (LP0882957)
Keywords
provenance tree
provenance table
storage optimization
optimal reduction
rules I & II reduction