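Abstract: To address performance optimization on multicore processors, this paper studies management strategies for the shared cache on multicore processors and proposes MT-FTP (Memory Time based Fair and Throughput Partitioning), a shared-cache partitioning algorithm based on cache-time fairness and throughput. A mathematical model is built on two evaluation metrics, fairness and throughput, and the partitioning flow of the algorithm is analyzed. Simulation results show that MT-FTP performs well in system throughput: its average IPC (instructions per cycle) is 1.3% higher than that of the UCP (Utility-based Cache Partitioning) algorithm and 11.6% higher than that of the LRU (Least Recently Used) algorithm. The average system fairness under MT-FTP is 17% higher than under LRU and 16.5% higher than under UCP. The algorithm achieves fair partitioning of the shared cache while also taking system throughput into account.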
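The abstract names fairness and throughput as its two metrics but does not give their definitions. The sketch below uses two formulations that are common in the cache-partitioning literature, purely as an assumption: throughput as the sum of per-core IPC, and fairness as the ratio of the smallest to the largest per-core slowdown. The function names and sample values are hypothetical and are not taken from MT-FTP.

```python
def throughput(ipc_shared):
    """System throughput as the sum of per-core IPC values (assumed definition)."""
    return sum(ipc_shared)


def fairness(ipc_shared, ipc_alone):
    """Smallest per-core slowdown divided by the largest one.

    A value of 1.0 means every core is slowed down equally by sharing.
    This is an assumed metric, not the one defined in the paper.
    """
    slowdowns = [alone / shared for shared, alone in zip(ipc_shared, ipc_alone)]
    return min(slowdowns) / max(slowdowns)


if __name__ == "__main__":
    shared = [1.0, 0.5]   # hypothetical IPC of two cores sharing the cache
    alone = [2.0, 2.0]    # hypothetical IPC of each core running alone
    print(throughput(shared), fairness(shared, alone))
```

A partitioning algorithm that targets both metrics would compare candidate partitions by some combination of these two numbers; how MT-FTP combines them is not stated in the abstract.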
Funding: Supported in part by the National Natural Science Foundation of China (62125202).
Abstract: The emergence and wide use of memory hardware bring significant changes to the conventional vertical memory hierarchy, which fails to handle contention for shared hardware resources and expensive data movement. To deal with these problems, existing schemes have to rely on inefficient scheduling strategies that also cause extra temporal, spatial, and bandwidth overheads. Based on the insight that shared hardware resources tend to be uniformly and hierarchically offered to the requests of co-located applications in memory systems, we present an efficient abstraction of memory hierarchies, called Label, which establishes the connection between the application layer and the underlying hardware layer. Based on labels, our paper proposes LaMem, a labeled, resource-isolated, and cross-tiered memory system that leverages way-based partitioning of shared resources to guarantee the QoS demands of applications while supporting fast, low-overhead cache repartitioning. In addition, as a case study to verify the applicability of LaMem, we customize it for the learned index, which fundamentally replaces storage structures with computation models. Experimental results demonstrate the efficiency and efficacy of LaMem.
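As a rough illustration of way-based partitioning in general, the sketch below gives each label a contiguous, non-overlapping bitmask of cache ways in proportion to a requested weight. The function name `partition_ways`, the label names, the weights, and the 16-way cache are all assumptions made for illustration; none of them comes from LaMem.

```python
def partition_ways(demands, total_ways):
    """Split cache ways among labels in proportion to their demands.

    demands: dict mapping a label to a positive weight (hypothetical input).
    Returns a dict mapping each label to a contiguous way bitmask.
    """
    if len(demands) > total_ways:
        raise ValueError("need at least one way per label")
    total = sum(demands.values())
    # Every label gets at least one way; the rest is shared by weight.
    counts = {label: 1 for label in demands}
    spare = total_ways - len(demands)
    shares = {label: spare * w / total for label, w in demands.items()}
    for label in demands:
        counts[label] += int(shares[label])
    # Hand out the ways lost to rounding, largest fractional part first.
    leftover = total_ways - sum(counts.values())
    by_fraction = sorted(
        demands, key=lambda l: shares[l] - int(shares[l]), reverse=True
    )
    for label in by_fraction[:leftover]:
        counts[label] += 1
    # Turn the counts into non-overlapping contiguous bitmasks.
    masks, start = {}, 0
    for label in demands:
        masks[label] = ((1 << counts[label]) - 1) << start
        start += counts[label]
    return masks


if __name__ == "__main__":
    for label, mask in partition_ways({"db": 3, "web": 1}, 16).items():
        print(label, format(mask, "016b"))
```

Repartitioning in this model amounts to calling the function again with new weights and installing the new masks; lines already cached in ways a label no longer owns are a separate concern that the sketch does not model.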
Funding: This research was supported in part by the National Science Foundation of the U.S.A. under NSF Grant Nos. EIA-0224377, CNS-0406328, CNS-0509118, and CCF-0621435.
Abstract: Data access delay is a major bottleneck in utilizing current high-end computing (HEC) machines. Prefetching, where data is fetched before the CPU demands it, has been considered an effective solution for masking data access delay. However, current client-initiated prefetching strategies, where a computing processor initiates prefetching instructions, have many limitations. They do not work well for applications with complex, non-contiguous data access patterns. As technology advances continue to widen the gap between computing and data access performance, trading computing power for reduced data access delay has become a natural choice. In this paper, we present a server-based data-push approach and discuss its associated implementation mechanisms. In the server-push architecture, a dedicated server called the Data Push Server (DPS) initiates prefetching and proactively pushes data closer to the client in time. Issues such as what data to fetch, when to fetch it, and how to push it are studied. The SimpleScalar simulator is modified with a dedicated prefetching engine that pushes data for another processor to test DPS-based prefetching. Simulation results show that the L1 cache miss rate can be reduced by up to 97% (71% on average) over a superscalar processor for SPEC CPU2000 benchmarks that have high cache miss rates.
Funding: Supported by a grant from HP Lab China and by the National Natural Science Foundation of China under Grant Nos. 60496325 and 60573092.
Abstract: As the speed gap between main memory and modern processors continues to widen, cache behavior becomes more important for main memory database systems (MMDBs). Indexing is a key component of MMDBs. Unfortunately, the predominant indexes, B+-trees and T-trees, have been shown to utilize the cache poorly, which has triggered the development of many cache-conscious indexes, such as CSB+-trees and pB+-trees. Most of these cache-conscious indexes are variants of conventional B+-trees and have better cache performance than B+-trees. In this paper, we develop a novel J+-tree index, inspired by the Judy structure, which is an associative array data structure, and propose a more cache-optimized index, the Prefetching J+-tree (pJ+-tree), which applies prefetching to the J+-tree to accelerate range scan operations. The J+-tree stores all the keys in its leaf nodes and keeps the reference values of the leaf nodes in a Judy structure, which lets the J+-tree not only retain the advantages of Judy (such as fast single-value search) but also outperform it in other aspects. For example, J+-trees can achieve better performance on range queries than Judy. The pJ+-tree index exploits prefetching techniques to further improve the cache behavior of J+-trees and yields a speedup of 2.0 on range scans. Our extensive experimental study shows that, compared with B+-trees, CSB+-trees, pB+-trees, and T-trees, pJ+-trees provide better performance in both time (search, scan, update) and space.
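The two-level layout described above (keys in leaf nodes, leaf reference values in an associative structure) can be sketched as follows. This is a toy model under stated assumptions, not the authors' J+-tree: a sorted Python list searched with `bisect` stands in for the Judy array, the reference value of a leaf is assumed to be its smallest key, the class name `JPlusTreeSketch` and the `leaf_size` parameter are hypothetical, and prefetching is not modeled at all.

```python
import bisect


class JPlusTreeSketch:
    """Toy index: sorted leaves plus a directory of leaf reference keys."""

    def __init__(self, keys, leaf_size=4):
        ordered = sorted(set(keys))
        self.leaves = [
            ordered[i:i + leaf_size] for i in range(0, len(ordered), leaf_size)
        ]
        # Assumption: a leaf's reference value is its smallest key.
        self.refs = [leaf[0] for leaf in self.leaves]

    def _leaf_index(self, key):
        # Last leaf whose reference key is <= key (first leaf if none is).
        return max(bisect.bisect_right(self.refs, key) - 1, 0)

    def search(self, key):
        """Return True if key is stored in the index."""
        if not self.leaves:
            return False
        leaf = self.leaves[self._leaf_index(key)]
        pos = bisect.bisect_left(leaf, key)
        return pos < len(leaf) and leaf[pos] == key

    def range_scan(self, low, high):
        """Return all keys k with low <= k <= high, in ascending order."""
        result = []
        if not self.leaves:
            return result
        for leaf in self.leaves[self._leaf_index(low):]:
            if leaf[0] > high:
                break
            result.extend(k for k in leaf if low <= k <= high)
        return result


if __name__ == "__main__":
    tree = JPlusTreeSketch(range(0, 20, 2))
    print(tree.search(10), tree.range_scan(5, 13))
```

The range scan walks consecutive leaves after a single directory lookup, which is the access pattern a prefetching variant would target by fetching the next leaves ahead of use.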
Abstract: This research proposes a phase-change memory (PCM) based main memory system that effectively combines a superblock-based adaptive buffering structure with its associated set-divisible last-level cache (LLC). To achieve performance similar to that of dynamic random-access memory (DRAM) based main memory, the superblock-based adaptive buffer (SABU) comprises dual DRAM buffers, i.e., an aggressive superblock-based pre-fetching buffer (SBPB) and an adaptive sub-block reusing buffer (SBRB), and a set-divisible LLC based on a cache space optimization scheme. According to our experiment, the longer PCM access latency can typically be hidden using our proposed SABU, which can significantly reduce the number of writes over the PCM main memory by 26.44%. The SABU approach can reduce PCM access latency up to 0.43 times, compared with conventional DRAM main memory. Meanwhile, the average memory energy consumption can be reduced by 19.7%.