
Design and Implementation of Distributed Full-text Search Framework Based on Spark SQL

Cited by: 5
Abstract: As information technology has matured, big data has generated great value in many fields, and storing and rapidly analyzing massive data sets has become a new challenge. Traditional relational databases struggle to meet big data storage and analysis needs because of their limited performance and scalability and their high cost. Spark SQL is a data analysis tool built on the big data processing framework Spark. It now supports the TPC-DS benchmark and has become an alternative to the traditional data warehouse in big data settings. Full-text search is an effective way to search text, and it can be combined with ordinary query operations to provide richer query and analysis capabilities. At present, Spark SQL supports only simple query operations and does not support full-text search. To meet the needs of migrating traditional workloads and serving existing ones, this paper proposes a distributed full-text search framework consisting of four modules: SQL grammar, an SQL translation framework, full-text search parallelization, and search optimization. The framework is implemented on Spark SQL. Experimental results show that, under the two search optimization strategies, the framework reduces index construction time to 0.6%/0.5% and query time to 1%/10% of those of a traditional database, and reduces index storage to 55.0% of that of a traditional database.
Authors: CUI Guang-fan; XU Li-jie; LIU Jie; YE Dan; ZHONG Hua (University of Chinese Academy of Sciences, Beijing 100049, China; Institute of Software, Chinese Academy of Sciences, Beijing 100049, China)
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2018, No. 9, pp. 104-112, 145 (10 pages)
Funding: Supported by the Beijing Municipal Science and Technology Major Project (D171100003417002)
Keywords: Spark SQL; Full-text search; Translation framework; Search parallelism; Search optimization
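
The abstract names full-text search parallelization as one of the four modules but does not describe how it works. As a rough illustration of the general idea only, the sketch below builds one inverted index per data partition and merges the per-partition hits at query time. It is plain Python rather than Spark code, and every name in it (`tokenize`, `build_index`, `search`, the sample documents) is a hypothetical choice made for this example, not the paper's design.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplifying assumption)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_index(partition):
    """Build an inverted index (term -> set of doc ids) for a single partition."""
    index = defaultdict(set)
    for doc_id, text in partition:
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(indexes, query):
    """Return ids of documents that contain every query term, merged across partitions."""
    terms = tokenize(query)
    hits = set()
    if not terms:
        return hits
    for index in indexes:
        # Each partition is searched independently; in a distributed
        # setting this loop body is the unit of parallel work.
        postings = [index.get(t, set()) for t in terms]
        hits |= set.intersection(*postings)
    return hits


# Hypothetical sample data split into two partitions.
docs = [
    (1, "Spark SQL supports the TPC-DS benchmark"),
    (2, "Full-text search over big data"),
    (3, "Distributed full-text search on Spark SQL"),
    (4, "Relational databases and data warehouses"),
]
partitions = [docs[:2], docs[2:]]
indexes = [build_index(p) for p in partitions]

print(sorted(search(indexes, "full-text search")))
print(sorted(search(indexes, "spark sql")))
```

Because each partition owns its own index, index construction and lookup need no coordination until the final merge, which is what makes this layout easy to distribute.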

