Abstract
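Because grid resources are distributed, dynamic, and heterogeneous, computing faults are more likely to occur in a grid computing environment than in a traditional cluster system. Node faults also occur unpredictably, which makes detection and recovery harder. To execute applications reliably in a grid computing environment, this paper proposes a fault-tolerant grid architecture based on distributed fault detection, and studies how applications resume execution after node faults, network faults, and process faults. Different recovery mechanisms are studied according to the characteristics of each of these three fault types in a grid environment, with the goal of obtaining reliable application execution at low cost.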
Because grid resources are distributed, variable, and heterogeneous, faults are much more probable in a grid than in cluster systems. In particular, because node faults are unpredictable, fault detection and recovery are more difficult. In this paper, we study fault-tolerance techniques for the grid computing environment and propose a fault-tolerant grid architecture. Based on the HBM (Heartbeat Monitor) in Globus, we describe the detection of and recovery from network, grid node, and process faults, and establish a fault-tolerant grid structure oriented toward parallel computing. Using these strategies, users can recover or adjust a computation at small cost and with high performance.
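The abstract names heartbeat monitoring as the basis for fault detection but gives no detail. As a rough illustration only, timeout-based heartbeat detection can be sketched as follows. The class `HeartbeatMonitor`, its methods, the component names, and the 3-second timeout are all hypothetical; this is not the Globus HBM API and not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical timeout-based detector; not the Globus HBM API."""

    timeout: float  # seconds of silence after which a component is suspected
    last_seen: dict = field(default_factory=dict)

    def beat(self, component: str, now: float) -> None:
        # Record the time of the latest heartbeat from a node or process.
        self.last_seen[component] = now

    def suspected(self, now: float) -> list:
        # Components whose last heartbeat is older than the timeout.
        return sorted(
            c for c, t in self.last_seen.items() if now - t > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)  # assumed timeout value
monitor.beat("node-a", now=0.0)
monitor.beat("node-b", now=0.0)
monitor.beat("node-a", now=4.0)  # node-a keeps reporting, node-b goes silent
print(monitor.suspected(now=5.0))  # prints ['node-b']
```

A silent component is only suspected, not proven, to have failed: a missed heartbeat may come from a node, network, or process fault, which is why the abstract treats recovery for the three cases separately.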
Source
《微电子学与计算机》
CSCD
PKU Core (Peking University Core Journals, 北大核心)
2005, Issue 7, pp. 99-102, 106 (5 pages in total)
Microelectronics & Computer
Funding
National Natural Science Foundation of China (60273085)
National 863 Program of China (2001AA111081)
ChinaGrid Project of the Ministry of Education
Keywords
Fault-tolerant computing
Grid computing
Reliability
Fault detection
Fault recovery