针对多核处理器性能优化问题,文中深入研究多核处理器上共享Cache的管理策略,提出了基于缓存时间公平性与吞吐率的共享Cache划分算法MT-FTP(Memory Time based Fair and Throughput Partitioning)。以公平性和吞吐率两个评价性指标建立...针对多核处理器性能优化问题,文中深入研究多核处理器上共享Cache的管理策略,提出了基于缓存时间公平性与吞吐率的共享Cache划分算法MT-FTP(Memory Time based Fair and Throughput Partitioning)。以公平性和吞吐率两个评价性指标建立数学模型,并分析了算法的划分流程。仿真实验结果表明,MT-FTP算法在系统吞吐率方面表现较好,其平均IPC(Instructions Per Cycles)值比UCP(Use Case Point)算法高1.3%,比LRU(Least Recently Used)算法高11.6%。MT-FTP算法对应的系统平均公平性比LRU算法的系统平均公平性高17%,比UCP算法的平均公平性高16.5%。该算法实现了共享Cache划分公平性并兼顾了系统的吞吐率。展开更多
在多核结构中,获得并行应用线程的安全、精确的最坏情况执行时间(worst case execution time,WCET)的最大挑战之一在于共享资源的竞争冲突检测.在共享Cache的多核处理器中,线程在共享Cache中的指令可能被其他并行线程的指令替换,从而导...在多核结构中,获得并行应用线程的安全、精确的最坏情况执行时间(worst case execution time,WCET)的最大挑战之一在于共享资源的竞争冲突检测.在共享Cache的多核处理器中,线程在共享Cache中的指令可能被其他并行线程的指令替换,从而导致了线程间在共享Cache上的干扰,因此多核结构下线程WCET需要考虑并行线程间在共享Cache上的干扰.在现有的简单地址映射干扰分析基础上,考虑了指令取指执行时序因素对干扰的影响,提出了非干扰状态的充分不必要条件,根据指令的取指执行时序范畴判断线程在共享Cache上的干扰状态.通过排除非干扰状态,可以进一步精确多核结构中线程的WCET估值.理论分析证明了该方法的有效性.实验结果表明,与当前现有的考虑执行周期和基于逻辑访问先后顺序的方法相比,基于时序方法下的WCET估值分别可以提高12%和7%的精确度.展开更多
Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to acc...Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to access the shared cache simultaneously.The main problem in improving memory performance is the shared cache architecture and cache replacement.This paper documents the implementation of a Dual-Port Content Addressable Memory(DPCAM)and a modified Near-Far Access Replacement Algorithm(NFRA),which was previously proposed as a shared L2 cache layer in a multi-core processor.Standard Performance Evaluation Corporation(SPEC)Central Processing Unit(CPU)2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer.Results show improved performance of the multicore processor’s DPCAM and NFRA algorithms,corresponding to a higher number of concurrent accesses to shared memory.The new architecture significantly increases system throughput and records performance improvements of up to 8.7%on various types of SPEC 2006 benchmarks.The miss rate is also improved by about 13%,with some exceptions in the sphinx3 and bzip2 benchmarks.These results could open a new window for solving the long-standing problems with shared cache in multi-core processors.展开更多
提出的访存时间最优Cache划分(OMTP,optimal memory time Cache partitioning)方法通过特征获取部件来获取不同应用程序的平均失效开销和Cache命中的路分布情况,以此作为划分依据来给竞争程序分配合适的Cache空间,达到优化程序整体执行...提出的访存时间最优Cache划分(OMTP,optimal memory time Cache partitioning)方法通过特征获取部件来获取不同应用程序的平均失效开销和Cache命中的路分布情况,以此作为划分依据来给竞争程序分配合适的Cache空间,达到优化程序整体执行性能的目的。实验结果表明,OMTP方法相比基于利用率的Cache划分(UCP)方法吞吐率平均提高3.1%,加权加速比平均提高1.3%,整体性能更优。展开更多
文摘针对多核处理器性能优化问题,文中深入研究多核处理器上共享Cache的管理策略,提出了基于缓存时间公平性与吞吐率的共享Cache划分算法MT-FTP(Memory Time based Fair and Throughput Partitioning)。以公平性和吞吐率两个评价性指标建立数学模型,并分析了算法的划分流程。仿真实验结果表明,MT-FTP算法在系统吞吐率方面表现较好,其平均IPC(Instructions Per Cycles)值比UCP(Use Case Point)算法高1.3%,比LRU(Least Recently Used)算法高11.6%。MT-FTP算法对应的系统平均公平性比LRU算法的系统平均公平性高17%,比UCP算法的平均公平性高16.5%。该算法实现了共享Cache划分公平性并兼顾了系统的吞吐率。
文摘在多核结构中,获得并行应用线程的安全、精确的最坏情况执行时间(worst case execution time,WCET)的最大挑战之一在于共享资源的竞争冲突检测.在共享Cache的多核处理器中,线程在共享Cache中的指令可能被其他并行线程的指令替换,从而导致了线程间在共享Cache上的干扰,因此多核结构下线程WCET需要考虑并行线程间在共享Cache上的干扰.在现有的简单地址映射干扰分析基础上,考虑了指令取指执行时序因素对干扰的影响,提出了非干扰状态的充分不必要条件,根据指令的取指执行时序范畴判断线程在共享Cache上的干扰状态.通过排除非干扰状态,可以进一步精确多核结构中线程的WCET估值.理论分析证明了该方法的有效性.实验结果表明,与当前现有的考虑执行周期和基于逻辑访问先后顺序的方法相比,基于时序方法下的WCET估值分别可以提高12%和7%的精确度.
文摘Modern shared-memory multi-core processors typically have shared Level 2(L2)or Level 3(L3)caches.Cache bottlenecks and replacement strategies are the main problems of such architectures,where multiple cores try to access the shared cache simultaneously.The main problem in improving memory performance is the shared cache architecture and cache replacement.This paper documents the implementation of a Dual-Port Content Addressable Memory(DPCAM)and a modified Near-Far Access Replacement Algorithm(NFRA),which was previously proposed as a shared L2 cache layer in a multi-core processor.Standard Performance Evaluation Corporation(SPEC)Central Processing Unit(CPU)2006 benchmark workloads are used to evaluate the benefit of the shared L2 cache layer.Results show improved performance of the multicore processor’s DPCAM and NFRA algorithms,corresponding to a higher number of concurrent accesses to shared memory.The new architecture significantly increases system throughput and records performance improvements of up to 8.7%on various types of SPEC 2006 benchmarks.The miss rate is also improved by about 13%,with some exceptions in the sphinx3 and bzip2 benchmarks.These results could open a new window for solving the long-standing problems with shared cache in multi-core processors.
文摘提出的访存时间最优Cache划分(OMTP,optimal memory time Cache partitioning)方法通过特征获取部件来获取不同应用程序的平均失效开销和Cache命中的路分布情况,以此作为划分依据来给竞争程序分配合适的Cache空间,达到优化程序整体执行性能的目的。实验结果表明,OMTP方法相比基于利用率的Cache划分(UCP)方法吞吐率平均提高3.1%,加权加速比平均提高1.3%,整体性能更优。