摘要
随着信息化的深入,大数据在各个领域产生了巨大的价值,海量数据的存储和快速分析成为新的挑战。传统的关系型数据库由于性能、扩展性的不足以及价格昂贵等方面的缺点,难以满足大数据的存储和分析需求。Spark SQL是基于大数据处理框架Spark的数据分析工具,目前已支持TPC-DS基准,成为大数据背景下传统数据仓库的替代解决方案。全文检索作为一种文本搜索的有效方式,能够与一般的查询操作结合使用,提供更加丰富的查询和分析操作。目前,Spark SQL仅支持简单的查询操作,不支持全文检索。为了满足传统业务迁移和现有业务的使用需求,提出了分布式全文检索框架,涵盖了SQL文法、SQL翻译转换框架、全文检索并行化、检索优化4个模块,并在Spark SQL上进行了实现。实验结果表明相比于传统的数据库,在两种检索优化策略下,该框架的索引构建时间、查询时间分别减少到传统数据库的0.6%/0.5%和1%/10%,索引存储量减少为传统数据库的55.0%。
With the development of information technology,big data has generated great value in various fields.Huge data storage and rapid analysis have become new challenges.The traditional relational database is difficult to meet the needs of big data storage and analysis because of its shortcomings in terms of performance,scalability and high cost.Spark SQL is a data analysis tool based on Spark,which is a big data processing framework.Spark SQL currently supports the TPC-DS benchmark and has become an alternative solution to the traditional data warehouse under the background of big data.Full-text search,as a kind of effective method of text search,can be used in combination with general query operation to provide richer queries and analysis operations.Spark SQL doesn’t support full-text search now.In order to meet the needs of traditional business migration and existing business,this paper proposed a Spark SQL distributed text retrieval framework,covering the design and implementation of 4 modules including SQL grammar,SQL translation framework,full-text search parallelism and search optimization.The results of experiment show that,under the two search optimization strategies,index construction time and query time of this framework are reduced to 0.6%/0.5%and 1%/10%respectively compared with the traditional database,and index storage volume is reduced to 55.0%.
作者
崔光范
许利杰
刘杰
叶丹
钟华
CUI Guang-fan;XU Li-jie;LIU Jie;YE Dan;ZHONG Hua(University of Chinese Academy of Sciences,Beijing 100049,China;Institute of Software,Chinese Academy of Sciences,Beijing 100049,China)
出处
《计算机科学》
CSCD
北大核心
2018年第9期104-112,145,共10页
Computer Science
基金
北京市科技重大项目(D171100003417002)资助