Funding: the National Key R&D Program of China (No. 2018YFB1800804), the National Natural Science Foundation of China (Nos. 61871254, 61861136003, and 91638204), and Hitachi Ltd.
Abstract: With Mobile Edge Computing (MEC), computation-intensive tasks are offloaded from mobile devices to cloud servers, and the energy consumption of mobile devices can thus be notably reduced. In this paper, we study task offloading in multi-user MEC systems with heterogeneous clouds, including edge clouds and remote clouds. Tasks are forwarded from mobile devices to edge clouds via wireless channels, and they can be further forwarded to remote clouds via the Internet. Our objective is to minimize the total energy consumption of multiple mobile devices, subject to bounded-delay requirements of tasks. Based on dynamic programming, we propose an algorithm that minimizes the energy consumption by jointly allocating bandwidth and computational resources to mobile devices. The algorithm has pseudo-polynomial complexity. To further reduce the complexity, we propose an approximation algorithm with energy discretization, and its total energy consumption is proved to be within a bounded gap from the optimum. Simulation results show that task offloading saves nearly 82.7% of the energy of mobile devices compared with execution on the devices themselves.
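The abstract does not spell out the dynamic program, so the following is only a toy sketch of the general idea: allocating discrete bandwidth units among users to minimize total transmission energy subject to each task's delay bound. The task structure, the `options` energy/delay table, and all numbers are invented for illustration.

```python
def min_energy_allocation(tasks, total_bw):
    """Toy DP over a shared bandwidth budget.

    Each task lists, per integer bandwidth level, an (energy, delay) pair;
    an option is feasible only if its delay meets the task's deadline.
    dp[b] = minimum total energy using exactly b bandwidth units so far.
    """
    INF = float("inf")
    dp = [0.0] + [INF] * total_bw
    for t in tasks:
        new = [INF] * (total_bw + 1)
        for used in range(total_bw + 1):
            if dp[used] == INF:
                continue
            for bw, (energy, delay) in t["options"].items():
                if delay <= t["deadline"] and used + bw <= total_bw:
                    new[used + bw] = min(new[used + bw], dp[used] + energy)
        dp = new
    best = min(dp)
    return best if best < INF else None  # None: no feasible allocation

# Hypothetical example: two tasks competing for three bandwidth units.
tasks = [
    {"deadline": 2, "options": {1: (5.0, 3), 2: (3.0, 2)}},
    {"deadline": 3, "options": {1: (4.0, 3), 2: (2.5, 1)}},
]
best = min_energy_allocation(tasks, total_bw=3)  # 3.0 + 4.0 = 7.0
```

The pseudo-polynomial cost noted in the abstract is visible here: the table size grows with the (discretized) resource budget, not just the number of tasks.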
Funding: supported by the National Natural Science Foundation of China (Grant No. 50305035).
Abstract: As the gap between computing demand and achievable performance widens in the embedded computing domain, heterogeneous computing architectures, which deliver better performance as well as lower power within a limited size, are attracting more and more attention. First, the heterogeneous computing model is presented, and different tightly coupled single-chip heterogeneous architectures and their application domains are introduced. Then, task partitioning methods are described, and several programming model technologies are analyzed and discussed. Finally, the main challenges and future perspectives of High Performance Embedded Computing (HPEC) are summarized.
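The survey mentions task partitioning methods without detailing any particular one; as a minimal sketch, a common baseline is an earliest-finish-time greedy that assigns each task to whichever device completes it soonest. The workload, device names, and speed model below are all invented for illustration.

```python
def partition_tasks(tasks, speeds):
    """Greedy list scheduling across heterogeneous devices.

    `tasks` is a priority-ordered list of (name, work) pairs in abstract
    work units; `speeds` maps device name to relative throughput. Each
    task goes to the device with the earliest finish time.
    """
    finish = {dev: 0.0 for dev in speeds}  # per-device ready time
    placement = {}
    for name, work in tasks:
        best = min(speeds, key=lambda d: finish[d] + work / speeds[d])
        finish[best] += work / speeds[best]
        placement[name] = best
    return placement, max(finish.values())

# Hypothetical workload: a GPU twice as fast as the CPU.
placement, makespan = partition_tasks(
    [("a", 4), ("b", 4), ("c", 2)], {"cpu": 1.0, "gpu": 2.0})
```

Even this simple heuristic shows the core trade-off the survey discusses: a faster accelerator attracts work until queueing makes the slower device competitive.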
Funding: funded by the China Scholarship Council (2020091-10135).
Abstract: Real-time rendering applications leverage heterogeneous computing to optimize performance. However, software development across multiple devices presents challenges, including data layout inconsistencies, synchronization issues, resource management complexities, and architectural disparities. Additionally, building such systems requires verbose and unsafe programming models. Recent developments in domain-specific and unified shading languages aim to mitigate these issues, yet current programming models primarily address data layout consistency, neglecting other persistent challenges. In this paper, we introduce RenderKernel, a programming model designed to simplify the development of real-time rendering systems. Recognizing the need for a high-level approach, RenderKernel addresses the specific challenges of real-time rendering, enabling development on heterogeneous systems as if they were homogeneous. The model allows errors due to system heterogeneity to be detected and prevented at compile time. Furthermore, RenderKernel enables the use of common programming patterns from homogeneous environments, freeing developers from the complexities of the underlying heterogeneous systems. Developers can focus on coding unique application features, thereby enhancing productivity and reducing the cognitive load associated with real-time rendering system development.
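The abstract does not show RenderKernel's actual API, so the sketch below only illustrates the general principle of catching heterogeneity errors before a pipeline runs: kernels are tagged with a device, and a buffer produced on one device cannot be consumed on another without an explicit transfer. The `Kernel` class, device names, and buffer names are all hypothetical.

```python
class Kernel:
    """Toy device-tagged kernel: a name, a target device, and the
    buffers it reads and writes (illustrative only)."""
    def __init__(self, name, device, inputs, outputs):
        self.name, self.device = name, device
        self.inputs, self.outputs = inputs, outputs

def build_pipeline(kernels):
    """Reject, at pipeline-construction time, any buffer produced on
    one device and consumed on another without a transfer step."""
    producer_device = {}
    for k in kernels:
        for buf in k.inputs:
            dev = producer_device.get(buf)
            if dev is not None and dev != k.device:
                raise ValueError(
                    f"{k.name}: buffer '{buf}' lives on {dev}, "
                    f"but kernel runs on {k.device}")
        for buf in k.outputs:
            producer_device[buf] = k.device
    return kernels

# A consistent GPU-only pipeline passes validation...
ok = build_pipeline([
    Kernel("geometry", "gpu", [], ["gbuf"]),
    Kernel("shading", "gpu", ["gbuf"], ["img"]),
])
```

A compiled model such as RenderKernel can reject the same class of mistake at compile time rather than at pipeline-construction time; the runtime check here is only a stand-in for that idea.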
Funding: Project supported by the National Key Research and Development Program of China (No. 2021YFB0300101), the National Natural Science Foundation of China (No. 61972408), and the UK Royal Society International Collaboration Grant.
Abstract: As the hardware industry moves toward specialized heterogeneous many-core processors to avoid the effects of the power wall, software developers are finding it hard to deal with the complexity of these systems. In this paper, we share our experience of developing a programming model and its supporting compiler and libraries for Matrix-3000, which is designed for next-generation exascale supercomputers but has a complex memory hierarchy and processor organization. To assist its software development, we have built a software stack from scratch that includes a low-level programming interface and a high-level OpenCL compiler. Our low-level programming model offers native support for programming the bare-metal accelerators of Matrix-3000, while the high-level model allows programmers to use the OpenCL programming standard. We detail our design choices and highlight the lessons learned from developing the system software that enables programming of bare-metal accelerators. Our programming models have been deployed in the production environment of an exascale prototype system.
Funding: Project supported by the National Natural Science Foundation of China (Nos. 61033008, 61272145, 60903041, and 61103080), the Research Fund for the Doctoral Program of Higher Education of China (No. 20104307110002), the Hunan Provincial Innovation Foundation for Postgraduate (No. CX2010B028), and the Fund of Innovation in Graduate School of NUDT (Nos. B100603 and B120605), China.
Abstract: OpenCL programming provides full code portability between different hardware platforms and can serve as a good programming candidate for heterogeneous systems, which typically consist of a host processor and several accelerators. However, to make full use of the computing capacity of such a system, programmers must manage diverse OpenCL-enabled devices explicitly, including distributing the workload between devices and managing data transfers between them. These tedious jobs pose a huge challenge for programmers. In this paper, a distributed shared OpenCL memory (DSOM) is presented, which relieves users of having to manage data transfer explicitly by supporting shared buffers across devices. DSOM allocates shared buffers in the system memory and treats the on-device memory as a software-managed virtual cache buffer. To support fine-grained shared buffer management, we designed a kernel parser in DSOM for buffer access range analysis. A basic modified-shared-invalid (MSI) cache coherency protocol is implemented in DSOM to maintain coherency for cache buffers. In addition, we propose a novel strategy that minimizes the communication cost between devices by launching each necessary data transfer as early as possible, enabling data transfer to overlap with kernel execution. Our experimental results show that our method for buffer access range analysis has good applicability and that DSOM is efficient.
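The modified-shared-invalid protocol the abstract mentions can be sketched as a tiny state machine over each device's cached copy of a shared buffer. This is a simplified model of the bookkeeping a system like DSOM must perform, not DSOM's actual implementation; the device names and the whole-buffer (rather than per-range) granularity are simplifying assumptions.

```python
MODIFIED, SHARED, INVALID = "M", "S", "I"

class MSIBuffer:
    """Toy MSI coherency for one shared buffer cached on several devices."""
    def __init__(self, devices):
        self.state = {d: INVALID for d in devices}

    def read(self, dev):
        # A read miss fetches the buffer; a remote MODIFIED copy is
        # written back to system memory and downgraded to SHARED.
        for d, s in self.state.items():
            if s == MODIFIED and d != dev:
                self.state[d] = SHARED
        if self.state[dev] == INVALID:
            self.state[dev] = SHARED

    def write(self, dev):
        # A write invalidates every other cached copy and leaves the
        # writer's copy MODIFIED.
        for d in self.state:
            self.state[d] = MODIFIED if d == dev else INVALID
```

DSOM's kernel parser refines this picture by tracking which byte ranges a kernel actually touches, so that only conflicting ranges trigger invalidations and write-backs.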