Abstract: The inverted index is a key component that enables search engines to manage billions of documents and respond quickly to users' queries. Whereas substantial effort has been devoted to reducing space occupancy and improving decoding speed, the encoding speed when constructing the index has been overlooked. Partitioning the index according to its clustered distribution can effectively minimize the compressed size while accelerating the construction procedure. In this study, we introduce compression speed as a criterion for evaluating compression techniques, and thoroughly analyze the performance of different partitioning strategies. We also propose optimizations that enhance state-of-the-art methods with faster compression speed and more flexibility in partitioning an index. Experiments show that our methods offer much better compression speed while retaining excellent space occupancy and decompression speed.
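To make the partitioning idea concrete, the following is a minimal illustrative sketch, not the paper's method: a sorted postings list is split into blocks wherever the gap between consecutive document IDs exceeds a threshold, so that each block of densely clustered IDs compresses well under delta-gap variable-byte coding. All function names and the `max_gap` parameter are hypothetical.

```python
def vbyte_encode(n):
    """Variable-byte encode a non-negative integer; high bit marks the last byte."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b)
        else:
            out.append(b | 0x80)
            return bytes(out)

def partition_by_gap(doc_ids, max_gap=64):
    """Greedy partitioning: start a new block whenever a gap exceeds max_gap."""
    blocks, cur = [], [doc_ids[0]]
    for prev, d in zip(doc_ids, doc_ids[1:]):
        if d - prev > max_gap:
            blocks.append(cur)
            cur = [d]
        else:
            cur.append(d)
    blocks.append(cur)
    return blocks

def compress(doc_ids, max_gap=64):
    """Compress each block as: absolute head, gap count, then delta gaps."""
    payload = bytearray()
    for block in partition_by_gap(doc_ids, max_gap):
        payload += vbyte_encode(block[0])
        payload += vbyte_encode(len(block) - 1)
        for prev, d in zip(block, block[1:]):
            payload += vbyte_encode(d - prev - 1)
    return bytes(payload)
```

Because each block begins with an absolute document ID, blocks can be encoded independently (and in parallel), which is one reason cluster-aligned partitioning can speed up index construction.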
Funding: This work was financially supported by the Natural Science Foundation of Hunan Province (No. 2016JJ2007).
Abstract: Inverted indexes are widely adopted in the vast majority of information systems. Growing requirements for efficient query processing have motivated the development of various compression techniques with different space-time characteristics. Although a single encoder yields a relatively stable point on the space-time tradeoff curve, flexibly moving its characteristic along the curve to fit different information retrieval tasks can be a better way to prepare the index. Recent research has proposed integrating different encoders within the same index, namely, exploiting access skewness by compressing frequently accessed regions with faster encoders and rarely accessed regions with succinct encoders, thereby improving efficiency while minimizing the compressed size. However, these methods are either inefficient or result in coarse granularity. To address these issues, we introduce the concept of bicriteria compression, which formalizes the problem of optimally trading compressed size against query processing time for an inverted index. We also adopt a Lagrangian relaxation algorithm that solves this problem by reducing it to a knapsack-type problem; it works in O(n log n) time and O(n) space, with a negligible additive approximation. Furthermore, the algorithm can be extended via dynamic programming to pursue improved query efficiency. Extensive experiments show that, given a bounded time/space budget, our method can optimally trade one resource for the other, yielding more efficient indexing and query performance.
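The Lagrangian-relaxation idea behind bicriteria compression can be sketched as follows. This is an illustrative toy, not the paper's algorithm: each index region may be compressed by one of several encoders, each with a (size, decode time) cost pair, and we want minimum total size subject to a total time budget. Relaxing the time constraint with a multiplier makes the per-region choice independent, and a binary search on the multiplier approximately meets the budget. All names (`choose`, `bicriteria`) are hypothetical.

```python
def choose(regions, lam):
    """Per-region pick minimizing size + lam * time; return total (size, time)."""
    size = time = 0.0
    for options in regions:                       # options: list of (size, time)
        s, t = min(options, key=lambda o: o[0] + lam * o[1])
        size += s
        time += t
    return size, time

def bicriteria(regions, time_budget, iters=50):
    """Binary search the Lagrange multiplier to satisfy the time budget."""
    lo, hi = 0.0, 1e9
    for _ in range(iters):
        lam = (lo + hi) / 2
        _, t = choose(regions, lam)
        if t > time_budget:
            lo = lam          # too slow: penalize decode time more heavily
        else:
            hi = lam
    return choose(regions, hi)
```

With two regions, each offering a compact-but-slow and a large-but-fast encoder, the search settles on the smallest mix that fits the budget; the gap to the true optimum is the additive approximation the abstract refers to.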
Abstract: 1 Introduction. Recently, the k-ary search tree has been gaining popularity as an infrastructure component in search engines. Due to its intrinsically cache- and SIMD-friendly design, the k-ary search tree is efficient in compression and query processing when combined with an inverted index [1-3]. In a k-ary tree, each node consists of k-1 entries, which evenly partition its range into k subranges (subnodes). By aligning the node size with the buffer size of a faster cache, the data is expected to be better utilized before being evicted, and fewer cache misses are triggered.
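The node layout described above can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: each node holds k-1 sorted separator keys, and a lookup locates the one subrange (of k) that can contain the key. A real implementation would pack each node into a cache line and compare all k-1 keys at once with SIMD; here a scalar `bisect_right` stands in for that comparison.

```python
from bisect import bisect_right

class KaryNode:
    def __init__(self, keys, children=None):
        self.keys = keys            # k-1 sorted separator keys
        self.children = children    # k child subnodes, or None for a leaf

def search(node, key):
    """Return True if key appears as a separator on the root-to-leaf path."""
    while node is not None:
        i = bisect_right(node.keys, key)        # index of the matching subrange
        if i > 0 and node.keys[i - 1] == key:
            return True
        node = node.children[i] if node.children else None
    return False
```

Because every descent inspects one node (one cache line) per level, the tree triggers at most one cache miss per level, which is the cache-friendliness the introduction appeals to.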