
Goldfish: A Large Scale Semantic Data Store and Query System Based on Boolean Matrix Factorization

Cited by: 11
Abstract: With the rapid growth of Internet applications and continued research on semantic web technology, the volume of semantic data is increasing explosively. Efficient storage and querying of semantic data underpin semantic web applications, and a growing number of them depend on it to provide better services. At the same time, this growth poses new challenges for storing and querying semantic data in big data environments: traditional systems that manage semantic data in relational databases can hardly meet the storage and query demands of large-scale data. To address the storage and querying of large-scale RDF data, this paper builds on the OpenRDF Sesame framework, adopts a distributed hierarchical storage architecture, and proposes and implements an attribute table structure to store semantic data in place of a plain RDF triple store. Because Boolean matrix factorization is slow at constructing attribute tables for large-scale semantic data, the paper proposes and implements a parallel frequent itemset mining algorithm on the Spark distributed computing framework to solve large-scale matrix factorization and thus speed up attribute table construction. In addition, query optimizations such as hash conversion are added at the query layer to avoid spending time on frequent hash table lookups during querying. Finally, a prototype system, Goldfish, is designed and implemented on top of the proposed index structure and optimizations, and it is compared with other systems on large-scale synthetic and real datasets. The results show that, on average, Goldfish is several times faster than Rainbow (about 6 times according to the Chinese abstract, around 8 times according to the English abstract), about 500 times faster than Jena-HBase, and about 1200 times faster than SHARD, a MapReduce-based RDF query system.
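The abstract names the technique but does not give the algorithm. As a rough illustration only, the sketch below shows how frequent predicate sets mined from a Boolean subject-by-predicate matrix could suggest the columns of an attribute table. Everything in it is an assumption made for illustration: the toy triples, the function names, the brute-force single-machine miner (the paper uses a parallel algorithm on Spark), and the choice of the largest frequent set as the column set. It is not the Goldfish implementation.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy triples (subject, predicate, object); not data from the paper.
TRIPLES = [
    ("s1", "name", "Alice"), ("s1", "email", "a@example.org"),
    ("s2", "name", "Bob"), ("s2", "email", "b@example.org"),
    ("s3", "name", "Carol"), ("s3", "email", "c@example.org"),
    ("s3", "phone", "555-0100"),
    ("s4", "title", "Goldfish"),
]


def predicate_sets(triples):
    """Rows of the Boolean subject-by-predicate matrix, stored as sets."""
    rows = defaultdict(set)
    for subject, predicate, _ in triples:
        rows[subject].add(predicate)
    return dict(rows)


def frequent_itemsets(rows, min_support):
    """Brute-force miner: support = number of subjects containing the itemset.

    Exponential in the number of predicates per subject, so only suitable
    for tiny examples like this one.
    """
    counts = Counter()
    for predicates in rows.values():
        items = sorted(predicates)
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


def build_attribute_table(triples, columns):
    """Place subjects that have every column into one wide table.

    Assumes each (subject, predicate) pair is single-valued. Triples that
    do not fit the table are returned separately as leftovers.
    """
    rows = predicate_sets(triples)
    members = {s for s, predicates in rows.items() if columns <= predicates}
    table = {s: {} for s in members}
    leftover = []
    for subject, predicate, obj in triples:
        if subject in members and predicate in columns:
            table[subject][predicate] = obj
        else:
            leftover.append((subject, predicate, obj))
    return table, leftover


rows = predicate_sets(TRIPLES)
frequent = frequent_itemsets(rows, min_support=3)
best = max(frequent, key=len)  # largest frequent predicate set as the columns
table, leftover = build_attribute_table(TRIPLES, best)
```

In this toy run the subjects sharing `name` and `email` end up in one wide row each, so a query touching both predicates reads a single row instead of joining two triple patterns; the remaining triples stay in a fallback list.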
Authors: GU Rong (顾荣), QIU Hong-Jian (仇红剑), YANG Wen-Jia (杨文家), HU Wei (胡伟), YUAN Chun-Feng (袁春风), HUANG Yi-Hua (黄宜华) (State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210093)
Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2017, Issue 10, pp. 2212-2230 (19 pages)
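Funding: Supported by the Special Fund of the National Natural Science Foundation of China (61223003), the National Natural Science Foundation of China (61370019), and the Jiangsu Province Science and Technology Support Program (BE2014131).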
Keywords: large scale RDF store; matrix factorization; hierarchical storage; big data; semantic web; Spark