摘要
本文对确定性进程组成的分布式系统的失效(包括处理机崩溃和进程出错)恢复策略做了深入的研究,独到地提出了应用数据流分析来静态地计算进程的最小备查点数据集的方法.从而允许每个备查点操作只需对那些充分必要的数据进行合法性检测与备份,这种方法使引入的备查点操作的附加时空消耗降到最低.本文还对因进程通信所产生的备查点间隔的依赖关系做了深入讨论,得出了进程错误的最大可能影响范围定理及出错后系统一致性状态的构造定理,从而可以把错误对整个系统的影响限制在一个可控的最小范围内,这不仅可减小因错误而造成的计算损失,而且将直接降低失效恢复过程中的通信开销.在理论分析之后,我们给出了相应的异步备查点与卷回算法,最后通过比较得出本文的算法在空间性能上优越于已发表的几个具有代表性的算法.本文提出的理论及其算法可以应用于以有限自动机为模型的分布式系统的容错设计方法中.
In this paper, a deep study on failure recovery strategy for distributed systems comprising of deterministic processes is given. A new DFA based method to compute statically the minimal checkpointing data set is presented, so that only those necessary data would be checked and backupped into stable storage, which makes the overhead of both time and space of out strategy to be minimal. Besides, this paper also discusses the dependency between the checkpointing intervals of the processes, and gives the theorems about the maximal possible effecting range and consistent global checkpoints. By applying these theorems, the affection of an error can be limited into a small controlled set of processes, which in turn leads to less computation loss and communication overhead upon appearance of processor crash or process error. Based on these theorems, the data structure and the corresponding asynchronous algorithm are presented. At last, a comparison on the algorithm and some other existing algorithms with respect to their space overhead is resulted. The strategy presented in this paper can be applied widely in various distributed systems in which the processes can be modeled by the Extended Finite State Machine.
出处
《计算机学报》
EI
CSCD
北大核心
1999年第3期249-257,共9页
Chinese Journal of Computers
关键词
分布式系统
容错
备查点
卷回
进程
Distributed systems, fault tolerance, checkpointing and rollback, data flow analysis, dependency between checkpointing intervals.