期刊文献+

一种优化的分布式系统的失效恢复策略 被引量:2

LOW OVERHEAD CHECKPOINTING AND ROLLBACK RECOVERY IN DISTRIBUTED COMPUTING SYSTEMS
下载PDF
导出
摘要 本文对确定性进程组成的分布式系统的失效(包括处理机崩溃和进程出错)恢复策略做了深入的研究,独到地提出了应用数据流分析来静态地计算进程的最小备查点数据集的方法.从而允许每个备查点操作只需对那些充分必要的数据进行合法性检测与备份,这种方法使引入的备查点操作的附加时空消耗降到最低.本文还对因进程通信所产生的备查点间隔的依赖关系做了深入讨论,得出了进程错误的最大可能影响范围定理及出错后系统一致性状态的构造定理,从而可以把错误对整个系统的影响限制在一个可控的最小范围内,这不仅可减小因错误而造成的计算损失,而且将直接降低失效恢复过程中的通信开销.在理论分析之后,我们给出了相应的异步备查点与卷回算法,最后通过比较得出本文的算法在空间性能上优越于已发表的几个具有代表性的算法.本文提出的理论及其算法可以应用于以有限自动机为模型的分布式系统的容错设计方法中. In this paper, a deep study on failure recovery strategy for distributed systems comprising of deterministic processes is given. A new DFA based method to compute statically the minimal checkpointing data set is presented, so that only those necessary data would be checked and backupped into stable storage, which makes the overhead of both time and space of out strategy to be minimal. Besides, this paper also discusses the dependency between the checkpointing intervals of the processes, and gives the theorems about the maximal possible effecting range and consistent global checkpoints. By applying these theorems, the affection of an error can be limited into a small controlled set of processes, which in turn leads to less computation loss and communication overhead upon appearance of processor crash or process error. Based on these theorems, the data structure and the corresponding asynchronous algorithm are presented. At last, a comparison on the algorithm and some other existing algorithms with respect to their space overhead is resulted. The strategy presented in this paper can be applied widely in various distributed systems in which the processes can be modeled by the Extended Finite State Machine.
出处 《计算机学报》 EI CSCD 北大核心 1999年第3期249-257,共9页 Chinese Journal of Computers
关键词 分布式系统 容错 备查点 卷回 进程 Distributed systems, fault tolerance, checkpointing and rollback, data flow analysis, dependency between checkpointing intervals.
  • 相关文献

参考文献4

  • 1刘云龙,北京邮电大学国家重点实验室内部技术报告,1997年 被引量:1
  • 2Wang Yiming,IEEE Trans Comput,1997年,46卷,4期,456页 被引量:1
  • 3Chiu Geming,IEEE Trans Parallel Distributed Systems,1996年,7卷,6期,565页 被引量:1
  • 4陈火旺,编译原理(第2版),1984年 被引量:1

同被引文献9

引证文献2

二级引证文献1

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部