Abstract
To meet the needs of distributed simulation, this paper builds a general-purpose, grid-based fault-tolerance system for distributed simulation. The system consists of three parts: a simulation resource status monitoring module, a data saving module, and an error recovery module. Resource status monitoring is built on the grid's MDS; data saving (covering both the process space and the interaction relationships between processes) and error recovery are implemented in user space using a checkpoint mechanism. The paper also analyzes how the added fault-tolerance mechanisms relate to the simulation system's existing functional modules. Finally, based on the grid and the modules above, a fault-tolerance broker in client/server (C/S) mode is designed and implemented to automate fault tolerance for the simulation system.
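The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of user-space checkpointing combined with status monitoring, not the paper's actual design. The names `Checkpointer` and `find_failed`, the use of `pickle`, and the heartbeat timeout (standing in here for status information that the paper obtains from the grid's MDS) are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


class Checkpointer:
    """Hypothetical user-space checkpoint store for one simulation process."""

    def __init__(self, directory, name):
        self.path = os.path.join(directory, f"{name}.ckpt")

    def save(self, state, pending_messages):
        # Write to a temporary file first, then rename it into place, so a
        # crash during the write never corrupts the previous checkpoint.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path))
        with os.fdopen(fd, "wb") as f:
            # Save both the process state and its in-flight interactions.
            pickle.dump({"state": state, "pending": list(pending_messages)}, f)
        os.replace(tmp, self.path)

    def restore(self):
        # Reload the last complete checkpoint after a failure.
        with open(self.path, "rb") as f:
            data = pickle.load(f)
        return data["state"], data["pending"]


def find_failed(last_heartbeat, now, timeout):
    """Return names of resources whose last heartbeat is older than timeout."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > timeout)
```

In a sketch like this, a broker would periodically call `find_failed`, then restart each failed process on another resource from the state returned by `restore`.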
Source
Journal of National University of Defense Technology (《国防科技大学学报》)
EI
CAS
CSCD
PKU Core
2005, No. 1, pp. 35-38 (4 pages)
Funding
Supported by a national ministry-level fund project (51404010403KG0155)
Keywords
HLA
fault tolerance
grid