Dataflow architecture has shown its advantages in many high-performance computing cases. In dataflow computing, a large amount of data are frequently transferred among processing elements through the network-on-chip ...Dataflow architecture has shown its advantages in many high-performance computing cases. In dataflow computing, a large amount of data are frequently transferred among processing elements through the network-on-chip (NoC). Thus the router design has a significant impact on the performance of dataflow architecture. Common routers are designed for control-flow multi-core architecture and we find they are not suitable for dataflow architecture. In this work, we analyze and extract the features of data transfers in NoCs of dataflow architecture: multiple destinations, high injection rate, and performance sensitive to delay. Based on the three features, we propose a novel and efficient NoC router for dataflow architecture. The proposed router supports multi-destination; thus it can transfer data with multiple destinations in a single transfer. Moreover, the router adopts output buffer to maximize throughput and adopts non-flit packets to minimize transfer delay. Experimental results show that the proposed router can improve the performance of dataflow architecture by 3.6x over a state-of-the-art router.展开更多
Double buffering is an effective mechanism to hide the latency of data transfers between on-chip and off-chip memory. However, in dataflow architecture, the swapping of two buffers during the execution of many tiles d...Double buffering is an effective mechanism to hide the latency of data transfers between on-chip and off-chip memory. However, in dataflow architecture, the swapping of two buffers during the execution of many tiles decreases the performance because of repetitive filling and draining of the dataflow accelerator. In this work, we propose a non-stop double buffering mechanism for dataflow architecture. The proposed non-stop mechanism assigns tiles to the processing element array without stopping the execution of processing elements through optimizing control logic in dataflow architecture. Moreover, we propose a work-flow program to cooperate with the non-stop double buffering mechanism. After optimizations both on control logic and on work-flow program, the filling and draining of the array needs to be done only once across the execution of all tiles belonging to the same dataflow graph. Experimental results show that the proposed double buffering mechanism for dataftow architecture achieves a 16.2% average efficiency improvement over that without the optimization.展开更多
Graph processing is a vital component of many AI and big data applications.However,due to its poor locality and complex data access patterns,graph processing is also a known performance killer of AI and big data appli...Graph processing is a vital component of many AI and big data applications.However,due to its poor locality and complex data access patterns,graph processing is also a known performance killer of AI and big data applications.In this work,we propose to enhance graph processing applications by leveraging fine-grained memory access patterns with a dual-path architecture on top of existing software-based graph optimizations.We first identify that memory accesses to the offset,edge,and state array have distinct locality and impact on performance.We then introduce the Skyway architecture,which consists of two primary components:1)a dedicated direct data path between the core and memory to transfer state array elements efficiently,and 2)a data-type aware fine-grained memory-side row buffer hardware for both the newly designed direct data path and the regular memory hierarchy data path.The proposed Skyway architecture is able to improve the overall performance by reducing the memory access interference and improving data access efficiency with a minimal overhead.We evaluate Skyway on a set of diverse algorithms using large real-world graphs.On a simulated fourcore system,Skyway improves the performance by 23%on average over the best-performing graph-specialized hardware optimizations.展开更多
With the coming of exascale supercomputing era, power efficiency has become the most important obstacle to build an exascale system. Dataflow architecture has native advantage in achieving high power efficiency for sc...With the coming of exascale supercomputing era, power efficiency has become the most important obstacle to build an exascale system. Dataflow architecture has native advantage in achieving high power efficiency for scientific applications. However, the state-of-the-art dataflow architectures fail to exploit high parallelism for loop processing. To address this issue, we propose a pipelining loop optimization method (PLO), which makes iterations in loops flow in the processing element (PE) array of dataflow accelerator. This method consists of two techniques, architecture-assisted hardware iteration and instruction-assisted software iteration. In hardware iteration execution model, an on-chip loop controller is designed to generate loop indexes, reducing the complexity of computing kernel and laying a good f(mndation for pipelining execution. In software iteration execution model, additional loop instructions are presented to solve the iteration dependency problem. Via these two techniques, the average number of instructions ready to execute per cycle is increased to keep floating-point unit busy. Simulation results show that our proposed method outperforms static and dynamic loop execution model in floating-point efficiency by 2.45x and 1.1x on average, respectively, while the hardware cost of these two techniques is acceptable.展开更多
In recent years, the advent of emerging computing applications, such as cloud computing, artificial intelligence, and the Internet of Things, has led to three common requirements in computer system design: high utili...In recent years, the advent of emerging computing applications, such as cloud computing, artificial intelligence, and the Internet of Things, has led to three common requirements in computer system design: high utilization, high throughput, and low latency. Herein, these are referred to as the requirements of 'high-throughput computing (HTC)'. We further propose a new indicator called 'sysentropy' for measuring the degree of chaos and uncertainty within a computer system. We argue that unlike the designs of traditional computing systems that pursue high performance and low power consumption, HTC should aim at achieving low sysentropy. However, from the perspective of computer architecture, HTC faces two major challenges that relate to (1) the full exploitation of the application's data parallelism and execution concurrency to achieve high throughput, and (2) the achievement of low latency, even in the cases at which severe contention occurs in data paths with high utilization. To overcome these two challenges, we introduce two techniques: on-chip data flow architecture and labeled von Neumann architecture. We build two prototypes that can achieve high throughput and low latency, thereby significantly reducing sysentropy.展开更多
The dataflow architecture,which is characterized by a lack of a redundant unified control logic,has been shown to have an advantage over the control-flow architecture as it improves the computational performance and p...The dataflow architecture,which is characterized by a lack of a redundant unified control logic,has been shown to have an advantage over the control-flow architecture as it improves the computational performance and power efficiency,especially of applications used in high-performance computing(HPC).Importantly,the high computational efficiency of systems using the dataflow architecture is achieved by allowing program kernels to be activated in a simultaneous manner.Therefore,a proper acknowledgment mechanism is required to distinguish the data that logically belongs to different contexts.Possible solutions include the tagged-token matching mechanism in which the data is sent before acknowledgments are received but retried after rejection,or a handshake mechanism in which the data is only sent after acknowledgments are received.However,these mechanisms are characterized by both inefficient data transfer and increased area cost.Good performance of the dataflow architecture depends on the efficiency of data transfer.In order to optimize the efficiency of data transfer in existing dataflow architectures with a minimal increase in area and power cost,we propose a Look-Ahead Acknowledgment(LAA)mechanism.LAA accelerates the execution flow by speculatively acknowledging ahead without penalties.Our simulation analysis based on a handshake mechanism shows that our LAA increases the average utilization of computational units by 23.9%,with a reduction in the average execution time by 17.4%and an increase in the average power efficiency of dataflow processors by 22.4%.Crucially,our novel approach results in a relatively small increase in the area and power consumption of the on-chip logic of less than 0.9%.In conclusion,the evaluation results suggest that Look-Ahead Acknowledgment is an effective improvement for data transfer in existing dataflow architectures.展开更多
基金This work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2015AA01A301, the National Natural Science Foundation of China under Grant No. 61332009, the National HeGaoJi Project of China under Grant No. 2013ZX0102-8001-001-001, and the Beijing Municipal Science and Technology Commission under Grant Nos. Z15010101009 and Z151100003615006.
文摘Dataflow architecture has shown its advantages in many high-performance computing cases. In dataflow computing, a large amount of data are frequently transferred among processing elements through the network-on-chip (NoC). Thus the router design has a significant impact on the performance of dataflow architecture. Common routers are designed for control-flow multi-core architecture and we find they are not suitable for dataflow architecture. In this work, we analyze and extract the features of data transfers in NoCs of dataflow architecture: multiple destinations, high injection rate, and performance sensitive to delay. Based on the three features, we propose a novel and efficient NoC router for dataflow architecture. The proposed router supports multi-destination; thus it can transfer data with multiple destinations in a single transfer. Moreover, the router adopts output buffer to maximize throughput and adopts non-flit packets to minimize transfer delay. Experimental results show that the proposed router can improve the performance of dataflow architecture by 3.6x over a state-of-the-art router.
基金This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB0200501, the National Natural Science Foundation of China under Grant Nos. 61332009 and 61521092, the Open Project Program of State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2016A04, and the Beijing Municipal Science and Technology Commission under Grant No. Z15010101009.
文摘Double buffering is an effective mechanism to hide the latency of data transfers between on-chip and off-chip memory. However, in dataflow architecture, the swapping of two buffers during the execution of many tiles decreases the performance because of repetitive filling and draining of the dataflow accelerator. In this work, we propose a non-stop double buffering mechanism for dataflow architecture. The proposed non-stop mechanism assigns tiles to the processing element array without stopping the execution of processing elements through optimizing control logic in dataflow architecture. Moreover, we propose a work-flow program to cooperate with the non-stop double buffering mechanism. After optimizations both on control logic and on work-flow program, the filling and draining of the array needs to be done only once across the execution of all tiles belonging to the same dataflow graph. Experimental results show that the proposed double buffering mechanism for dataftow architecture achieves a 16.2% average efficiency improvement over that without the optimization.
基金supported in part by the U.S.National Science Foundation under Grant Nos.CCF-2008907 and CCF-2029014the Chinese Academy of Sciences Project for Young Scientists in Basic Research under Grant No.YSBR-029the Chinese Academy of Sciences Project for Youth Innovation Promotion Association.
文摘Graph processing is a vital component of many AI and big data applications.However,due to its poor locality and complex data access patterns,graph processing is also a known performance killer of AI and big data applications.In this work,we propose to enhance graph processing applications by leveraging fine-grained memory access patterns with a dual-path architecture on top of existing software-based graph optimizations.We first identify that memory accesses to the offset,edge,and state array have distinct locality and impact on performance.We then introduce the Skyway architecture,which consists of two primary components:1)a dedicated direct data path between the core and memory to transfer state array elements efficiently,and 2)a data-type aware fine-grained memory-side row buffer hardware for both the newly designed direct data path and the regular memory hierarchy data path.The proposed Skyway architecture is able to improve the overall performance by reducing the memory access interference and improving data access efficiency with a minimal overhead.We evaluate Skyway on a set of diverse algorithms using large real-world graphs.On a simulated fourcore system,Skyway improves the performance by 23%on average over the best-performing graph-specialized hardware optimizations.
基金This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB0200501, tile National Natural Science Foundation of China under Grant Nos. 61332009 and 61521092, the Open Project Program of State Key Laboratory of Mathematical Engineering and Advanced Computing under Grant No. 2016A04 and tile Beijing Municipal Science and Technology Commission under Grant No. Z15010101009, the Open Project Program of State Key Laboratory of Computer Architecture under Grant No. CARCH201503, China Scholarship Council, and Beijing Advanced hmovation Center for hnaging Technology.
文摘With the coming of exascale supercomputing era, power efficiency has become the most important obstacle to build an exascale system. Dataflow architecture has native advantage in achieving high power efficiency for scientific applications. However, the state-of-the-art dataflow architectures fail to exploit high parallelism for loop processing. To address this issue, we propose a pipelining loop optimization method (PLO), which makes iterations in loops flow in the processing element (PE) array of dataflow accelerator. This method consists of two techniques, architecture-assisted hardware iteration and instruction-assisted software iteration. In hardware iteration execution model, an on-chip loop controller is designed to generate loop indexes, reducing the complexity of computing kernel and laying a good f(mndation for pipelining execution. In software iteration execution model, additional loop instructions are presented to solve the iteration dependency problem. Via these two techniques, the average number of instructions ready to execute per cycle is increased to keep floating-point unit busy. Simulation results show that our proposed method outperforms static and dynamic loop execution model in floating-point efficiency by 2.45x and 1.1x on average, respectively, while the hardware cost of these two techniques is acceptable.
基金Project supported by the National Key R&D Program of China(No.2016YFB1000201)the National Natural Science Foundation of China(No.61420106013)the Youth Innovation Promotion Association of Chinese Academy of Sciences
文摘In recent years, the advent of emerging computing applications, such as cloud computing, artificial intelligence, and the Internet of Things, has led to three common requirements in computer system design: high utilization, high throughput, and low latency. Herein, these are referred to as the requirements of 'high-throughput computing (HTC)'. We further propose a new indicator called 'sysentropy' for measuring the degree of chaos and uncertainty within a computer system. We argue that unlike the designs of traditional computing systems that pursue high performance and low power consumption, HTC should aim at achieving low sysentropy. However, from the perspective of computer architecture, HTC faces two major challenges that relate to (1) the full exploitation of the application's data parallelism and execution concurrency to achieve high throughput, and (2) the achievement of low latency, even in the cases at which severe contention occurs in data paths with high utilization. To overcome these two challenges, we introduce two techniques: on-chip data flow architecture and labeled von Neumann architecture. We build two prototypes that can achieve high throughput and low latency, thereby significantly reducing sysentropy.
基金This work was supported by the Project of the State Grid Corporation of China in 2020"Integration Technology Research and Prototype Development for High End Controller Chip"under Grant No.5700-202041264A-0-0-00.
文摘The dataflow architecture,which is characterized by a lack of a redundant unified control logic,has been shown to have an advantage over the control-flow architecture as it improves the computational performance and power efficiency,especially of applications used in high-performance computing(HPC).Importantly,the high computational efficiency of systems using the dataflow architecture is achieved by allowing program kernels to be activated in a simultaneous manner.Therefore,a proper acknowledgment mechanism is required to distinguish the data that logically belongs to different contexts.Possible solutions include the tagged-token matching mechanism in which the data is sent before acknowledgments are received but retried after rejection,or a handshake mechanism in which the data is only sent after acknowledgments are received.However,these mechanisms are characterized by both inefficient data transfer and increased area cost.Good performance of the dataflow architecture depends on the efficiency of data transfer.In order to optimize the efficiency of data transfer in existing dataflow architectures with a minimal increase in area and power cost,we propose a Look-Ahead Acknowledgment(LAA)mechanism.LAA accelerates the execution flow by speculatively acknowledging ahead without penalties.Our simulation analysis based on a handshake mechanism shows that our LAA increases the average utilization of computational units by 23.9%,with a reduction in the average execution time by 17.4%and an increase in the average power efficiency of dataflow processors by 22.4%.Crucially,our novel approach results in a relatively small increase in the area and power consumption of the on-chip logic of less than 0.9%.In conclusion,the evaluation results suggest that Look-Ahead Acknowledgment is an effective improvement for data transfer in existing dataflow architectures.