Funding: Supported by the National Natural Science Foundation of China under Grant No. 60472044. The authors would like to thank research fellow Dr. Yun Shi of China State Post Bureau and Professor Dr. Jun Zou of Tsinghua University for their helpful and constructive comments.
Abstract: The World Wide Web has become the primary means of information dissemination. Because network bandwidth is a limited resource, users often suffer long waiting times. Web prefetching and web caching are the primary approaches to reducing user-perceived access latency and improving quality of service. In this paper, a Stochastic Petri Net (SPN) based integrated web prefetching and caching model (IWPCM) is presented and its performance is evaluated. The performance metrics, namely access latency, throughput, HR (hit ratio), and BHR (byte hit ratio), are analyzed and discussed. Simulations show that, compared with a caching-only model (CM), IWPCM further improves throughput, HR, and BHR and reduces access latency. Performance evaluation based on the SPN model can provide a basis for implementing web prefetching and caching, and the combination of the two holds promise for improving the QoS of web systems.
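The abstract does not spell out how HR and BHR are computed; as a minimal sketch (assuming a toy byte-capacity LRU cache and an invented one-step prefetch hint table, neither taken from the paper), the two metrics can be measured over a request trace like this:

```python
from collections import OrderedDict

def simulate(trace, capacity, prefetch_next=None):
    """Replay (url, size) requests through a byte-capacity LRU cache and
    return (hit_ratio, byte_hit_ratio).  prefetch_next optionally maps a
    url to the url speculatively fetched after it (purely illustrative)."""
    cache, used = OrderedDict(), 0
    sizes = dict(trace)
    hits = hit_bytes = total_bytes = 0

    def admit(url, size):
        nonlocal used
        while used + size > capacity and cache:
            _, evicted = cache.popitem(last=False)   # evict the LRU entry
            used -= evicted
        if used + size <= capacity:
            cache[url] = size
            used += size

    for url, size in trace:
        total_bytes += size
        if url in cache:
            cache.move_to_end(url)                   # refresh recency
            hits += 1
            hit_bytes += size
        else:
            admit(url, size)
        nxt = prefetch_next.get(url) if prefetch_next else None
        if nxt in sizes and nxt not in cache:
            admit(nxt, sizes[nxt])                   # speculative fetch

    return hits / len(trace), hit_bytes / total_bytes

trace = [("/a", 100), ("/b", 50), ("/a", 100), ("/c", 200)]
print(simulate(trace, capacity=512, prefetch_next={"/b": "/c"}))  # (0.5, ~0.67)
```

With the prefetch hint, the later request for "/c" hits in the cache, raising both HR and BHR over the caching-only run, which is the qualitative effect the paper's simulations report.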
Abstract: World Wide Web (WWW) services have grown to levels where significant delays are common. Techniques such as prefetching can help anticipate users' needs and reduce their waiting times. This paper first describes the architecture of prefetching, then classifies prefetching algorithms into three types (branch-model based, tree-model based, and others) and presents the basic ideas of some existing prefetching algorithms. Next, several models for controlling prefetching are introduced. Finally, trends and future directions for prefetching algorithms are discussed.
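To make the branch-model category concrete, here is a minimal first-order Markov predictor of the kind such algorithms build on; it is a generic sketch, not any specific algorithm from the survey, and all names are invented:

```python
from collections import defaultdict, Counter

class BranchModelPredictor:
    """First-order Markov ('branch') model: for each page, count which
    page the same user requested next, and prefetch the most likely
    successor if its conditional probability clears a threshold."""

    def __init__(self, threshold=0.5):
        self.successors = defaultdict(Counter)
        self.threshold = threshold

    def train(self, session):
        # Count each observed (current page -> next page) transition.
        for cur, nxt in zip(session, session[1:]):
            self.successors[cur][nxt] += 1

    def predict(self, page):
        counts = self.successors[page]
        if not counts:
            return None
        nxt, n = counts.most_common(1)[0]
        return nxt if n / sum(counts.values()) >= self.threshold else None

predictor = BranchModelPredictor()
predictor.train(["/home", "/news", "/sports", "/home", "/news", "/weather"])
print(predictor.predict("/home"))   # -> "/news" (conditional probability 1.0)
```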
Funding: Supported by the Natural Science Foundation of Beijing under Grant No. 4192007 and the National Natural Science Foundation of China under Grant No. 61202076.
Abstract: Hardware prefetching and replacement policies are two techniques for improving the performance of the memory subsystem. While prefetching hides memory latency and improves performance, it interacts with the cache replacement policy, thereby introducing performance variability into the application. To improve the accuracy of reuse prediction for cache blocks in the presence of hardware prefetching, we propose the Prefetch-Adaptive Intelligent Cache Replacement Policy (PAIC). PAIC is designed with separate predictors for prefetch and demand requests, and uses machine learning to optimize reuse prediction in the presence of prefetching. By distinguishing reuse predictions for prefetch and demand requests, PAIC can better combine the performance benefits of prefetching and replacement policies. We evaluate PAIC on a set of 27 memory-intensive programs from SPEC 2006 and SPEC 2017. Under the single-core configuration, PAIC improves performance over the Least Recently Used (LRU) replacement policy by 37.22%, compared with improvements of 32.93% for Signature-based Hit Predictor (SHiP), 34.56% for Hawkeye, and 34.43% for Glider. Under the four-core configuration, PAIC improves performance over LRU by 20.99%, versus 13.23% for SHiP, 17.89% for Hawkeye, and 15.50% for Glider.
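The abstract does not describe PAIC's predictor internals; as a rough, heavily simplified sketch of the core idea (separate learned reuse predictors for demand and prefetch fills, in the spirit of PC-indexed predictors such as SHiP), consider the following; the table sizes, counter widths, and API are all assumptions:

```python
class SplitReusePredictor:
    """Two saturating-counter tables indexed by the requesting PC: one
    trained on demand fills, one on prefetch fills.  A fill predicted
    'no reuse' would be inserted at low priority so the replacement
    policy evicts it first (the separation of prefetch and demand reuse
    prediction that PAIC advocates, heavily simplified)."""

    def __init__(self, table_size=1024, max_counter=7):
        self.tables = {"demand": [max_counter // 2] * table_size,
                       "prefetch": [max_counter // 2] * table_size}
        self.size, self.max = table_size, max_counter

    def _idx(self, pc):
        return hash(pc) % self.size

    def predict_reuse(self, pc, kind):
        # kind is "demand" or "prefetch"; high counter value => likely reuse.
        return self.tables[kind][self._idx(pc)] > self.max // 2

    def train(self, pc, kind, was_reused):
        # Saturating increment on observed reuse, decrement otherwise.
        t, i = self.tables[kind], self._idx(pc)
        t[i] = min(t[i] + 1, self.max) if was_reused else max(t[i] - 1, 0)
```

On a cache fill, the insertion priority would come from predict_reuse(pc, "prefetch" if the fill was triggered by the prefetcher else "demand"); hits and evictions without reuse would call train() on the matching table.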
Funding: Supported by a grant from HP Lab China and the National Natural Science Foundation of China under Grant Nos. 60496325 and 60573092.
Abstract: As the speed gap between main memory and modern processors continues to widen, cache behavior becomes more important for main memory database systems (MMDBs). Indexing is a key component of MMDBs. Unfortunately, the predominant indexes, B^+-trees and T-trees, have been shown to utilize the cache poorly, which has triggered the development of many cache-conscious indexes, such as CSB^+-trees and pB^+-trees. Most of these cache-conscious indexes are variants of conventional B^+-trees and have better cache performance than B^+-trees. In this paper, we develop a novel J^+-tree index, inspired by the Judy structure (an associative array data structure), and propose a more cache-optimized index, the Prefetching J^+-tree (pJ^+-tree), which applies prefetching to the J^+-tree to accelerate range scan operations. The J^+-tree stores all keys in its leaf nodes and keeps the reference values of the leaf nodes in a Judy structure, which lets the J^+-tree retain the advantages of Judy (such as fast single-value search) while outperforming it in other respects; for example, J^+-trees achieve better performance on range queries than Judy. The pJ^+-tree exploits prefetching techniques to further improve the cache behavior of J^+-trees and yields a speedup of 2.0 on range scans. Our extensive experimental study shows that, compared with B^+-trees, CSB^+-trees, pB^+-trees, and T-trees, pJ^+-trees provide better performance in both time (search, scan, and update) and space.
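The following toy sketch (not the paper's implementation; class names and sizes are invented) shows the overall shape of the design: all keys in linked leaves, a sorted directory of leaf references standing in for the Judy array, and a comment marking where the pJ^+-tree would issue prefetch instructions during a range scan:

```python
import bisect

class LeafNode:
    """Leaf holding a sorted run of keys plus a link to the next leaf, so
    range scans walk the leaf chain sequentially (the access pattern that
    pJ^+-tree accelerates with prefetching)."""
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.next = None

class ToyJPlusTree:
    """All keys live in leaves; a sorted list of first keys stands in
    for the Judy structure that holds the leaf references."""
    def __init__(self, keys, leaf_size=4):
        keys = sorted(keys)
        self.leaves = [LeafNode(keys[i:i + leaf_size])
                       for i in range(0, len(keys), leaf_size)]
        for a, b in zip(self.leaves, self.leaves[1:]):
            a.next = b
        self.firsts = [leaf.keys[0] for leaf in self.leaves]

    def range_scan(self, lo, hi):
        # Locate the leaf that may contain lo via the directory.
        leaf = self.leaves[max(bisect.bisect_right(self.firsts, lo) - 1, 0)]
        out = []
        while leaf:
            # In the real pJ^+-tree, leaf.next would be prefetched here so
            # it is already cache-resident when the scan reaches it.
            out.extend(k for k in leaf.keys if lo <= k <= hi)
            if leaf.keys[-1] > hi:
                break
            leaf = leaf.next
        return out

tree = ToyJPlusTree(range(0, 40, 3))
print(tree.range_scan(5, 20))   # -> [6, 9, 12, 15, 18]
```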
Funding: Supported in part by the National Science Foundation of the USA under Grant Nos. EIA-0224377, CNS-0406328, CNS-0509118, and CCF-0621435.
Abstract: Data prefetching is an effective data access latency hiding technique that masks CPU stalls caused by cache misses and bridges the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to the processor before it is actually needed. Many prefetching techniques have been developed for single-core processors. Recent developments in processor technology have brought multicore processors into the mainstream. While some single-core prefetching techniques are directly applicable to multicore processors, numerous novel strategies have been proposed in the past few years to take advantage of multiple cores. This paper aims to provide a comprehensive review of state-of-the-art prefetching techniques and proposes a taxonomy that classifies the design concerns in developing a prefetching strategy, especially for multicore processors. We also compare existing methods through analysis.
Funding: Supported by the Higher Education Commission (Pakistan), the National High Technology Research and Development Program of China (863 Program) (No. 2008AA01A201), the Natural Science Foundation of China (Nos. 60833004 and 60970002), and the TNList Cross-discipline Foundation.
Abstract: Off-chip replacement (capacity and conflict) and coherent read misses in a distributed shared memory system cause execution to stall for hundreds of cycles. These off-chip replacement and coherent read misses recur, forming sequences of two or more misses called streams. Prior streaming techniques ignored the reordering of misses and not-recently-accessed streams while streaming data. In this paper, we present a stream prefetcher design that deals with both problems. Our design uses stream waiting rooms to store not-recently-accessed streams; the waiting rooms help remove more off-chip misses. Using trace-based simulation, our stream prefetcher removes 8% to 66% (on average 40%) of replacement misses and 17% to 63% (on average 39%) of coherent read misses. Using cycle-accurate full-system simulation, our design yields speedups of 1.00 to 1.17 on workloads from the Princeton Application Repository for Shared-Memory Computers (PARSEC) running on a distributed shared memory system, with the exception of the dedup and swaptions workloads.
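As a very rough illustration of the waiting-room idea (the paper's actual stream detection, sizing, and thresholds are more elaborate; everything below is an assumption), a temporal-stream prefetcher with a waiting room might look like:

```python
from collections import deque

class StreamPrefetcher:
    """Toy temporal-stream prefetcher: misses that recur in the same order
    form a stream; when the head of a recorded stream misses again, the
    rest of the stream is prefetched.  Streams not touched recently are
    parked in a 'waiting room' instead of being discarded, so a stream
    that pauses and later resumes can still trigger prefetches."""

    def __init__(self, stream_len=4, room_size=8):
        self.history = deque(maxlen=stream_len)   # recent miss addresses
        self.streams = {}       # stream head address -> rest of stream
        self.waiting = {}       # waiting room for not-recently-used streams
        self.room_size = room_size

    def on_miss(self, addr):
        # A hit in either table reactivates the stream and prefetches it.
        rest = self.streams.pop(addr, None) or self.waiting.pop(addr, None)
        if rest is not None:
            self.streams[addr] = rest
        self.history.append(addr)
        if len(self.history) == self.history.maxlen:
            head, *tail = self.history             # record a new stream
            self.streams.setdefault(head, tail)
        return rest or []                          # addresses to prefetch

    def park_cold_streams(self, hot_heads):
        """Demote streams whose head is not in the recently-hot set."""
        for head in [h for h in self.streams if h not in hot_heads]:
            if len(self.waiting) < self.room_size:
                self.waiting[head] = self.streams.pop(head)
```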
Funding: Supported by the National Natural Science Foundation of China (No. 60073018).
Abstract: A major overhead in software DSM (Distributed Shared Memory) is the cost of remote memory accesses necessitated by the protocol as well as induced by false sharing. This paper introduces a dynamic prefetching method, implemented in the JIAJIA software DSM, to reduce the system overhead caused by remote accesses. The method records the interleaving string of INV (invalidation) and GETP (getting a remote page) operations for each cached page and analyzes the periodicity of the string when a page is invalidated at a lock or barrier. A prefetching request is issued after the lock or barrier if the periodicity analysis indicates that GETP will be the next operation in the string. Multiple prefetching requests destined for the same host are merged into a single message. Performance evaluation with eight well-accepted benchmarks on a cluster of sixteen PowerPC workstations shows that the prefetching scheme can significantly reduce page fault overhead, achieving a performance increase of 15%-20% in three benchmarks and around 8%-10% in another three. The average extra traffic caused by useless prefetches is only 7%-13% in the evaluation.
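As a toy illustration of the periodicity test at the heart of this scheme (the single-character encoding, 'I' for INV and 'G' for GETP, and the max_period bound are invented; the paper's analysis is richer), the next operation on a page can be predicted from its recorded operation string:

```python
def predict_next_op(history, max_period=4):
    """Return the predicted next operation for a page whose past INV/GETP
    operations are recorded as a string such as "IGIGI" (I=INV, G=GETP),
    or None if no period up to max_period fits the whole string.  A
    prefetch is issued after the lock/barrier only if "G" is predicted."""
    n = len(history)
    for p in range(1, min(max_period, n - 1) + 1):
        # Accept period p if every element repeats p positions later.
        if all(history[i] == history[i - p] for i in range(p, n)):
            return history[n - p]      # the next element under period p
    return None

print(predict_next_op("IGIGI"))   # -> "G": GETP predicted, issue a prefetch
print(predict_next_op("IIGGI"))   # -> "I": INV predicted, do not prefetch
```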
Abstract: In this paper, we present a comparative study of informed and predictive prefetching mechanisms proposed to bridge the performance gap between I/O storage systems and the CPU. In particular, we focus on transparent informed prefetching (TIP) and predictive prefetching using the probability graph approach (PG). Our main objective is to show the main features, motivations, and implementation overview of each mechanism. We also present a performance evaluation comparing the two mechanisms under different cache sizes.
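To make the PG side concrete, here is a minimal probability-graph sketch: it counts how often each file is accessed within a small window after another file and prefetches successors whose conditional probability clears a threshold. The window size, threshold, and function names are illustrative, not taken from the paper:

```python
from collections import defaultdict, Counter

def build_probability_graph(trace, window=2):
    """Count, for each file, how often every other file is opened within
    `window` subsequent accesses; the counts form the graph's edge weights."""
    graph = defaultdict(Counter)
    for i, f in enumerate(trace):
        for g in trace[i + 1:i + 1 + window]:
            graph[f][g] += 1
    return graph

def prefetch_candidates(graph, f, min_prob=0.3):
    """Successors of f whose estimated conditional probability is high
    enough to justify a speculative read."""
    total = sum(graph[f].values())
    return [g for g, n in graph[f].items() if total and n / total >= min_prob]

g = build_probability_graph(["a", "b", "c", "a", "b", "d"])
print(prefetch_candidates(g, "a"))   # -> ['b'] (probability 0.5)
```

TIP, by contrast, relies on applications disclosing their future accesses rather than inferring them, which is why the two mechanisms respond differently to the cache sizes varied in the study.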