
Analysis and Design of the High Availability of Parallel Computers
(Original title: 并行计算机高可用性分析与设计; cited by: 2)
Abstract: As the scale of a parallel computer system grows, its failure rate increases linearly. Ensuring that a large-scale parallel system can provide continuous service, that is, improving its availability until it qualifies as highly available, has become an important aspect of parallel system design. To this end, a variety of fault-tolerance and high availability technologies have been applied, and the concept of system-level fault tolerance has been proposed: the fault-tolerant model is designed from the viewpoint of the whole system, and fault tolerance technologies at different levels are integrated to improve the fault tolerance capability of the system as a whole. How to measure the availability that can actually be attained, however, still requires further study. In this paper, we model and analyze the reliability and availability of a parallel system using a compositional model and a Markov process model, and derive availability expressions from the Markov process model. The conclusion is that high availability technologies can increase the availability of a system. On this basis, we propose a high availability architecture for a large-scale parallel computer system and show how the architecture achieves high availability.
Author: 刘睿涛 (Liu Ruitao)
Source: Computer Engineering & Science (《计算机工程与科学》), CSCD, 2005, No. 5, pp. 104-107, 110 (5 pages)
Funding: National Science Fund for Distinguished Young Scholars (60025206)
Keywords: parallel computer; high availability analysis; design; reliability; availability; compositional model; Markov process; system-level fault tolerance
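
The abstract's two quantitative claims (a failure rate that grows linearly with system scale, and availability derived from a Markov process) can be illustrated with a minimal sketch. This is not the paper's model, which is not reproduced on this page. It assumes the textbook simplifications of independent, identical nodes with exponentially distributed lifetimes arranged in series, and a two-state (up/down) Markov process with constant failure rate and repair rate. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
def series_failure_rate(n_nodes: int, node_failure_rate: float) -> float:
    """Failure rate of a series system of n identical, independent nodes.

    Assumption: exponential lifetimes, and the system fails when any node
    fails, so the rates add and the total grows linearly with n_nodes.
    """
    return n_nodes * node_failure_rate


def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state availability of a two-state (up/down) Markov process.

    Balance equation: failure_rate * P_up = repair_rate * P_down, with
    P_up + P_down = 1, which gives P_up = repair_rate / (failure_rate + repair_rate).
    """
    return repair_rate / (failure_rate + repair_rate)


if __name__ == "__main__":
    # Hypothetical numbers: each node fails once per 10,000 hours on average.
    node_rate = 1e-4
    system_rate = series_failure_rate(1000, node_rate)

    # Hypothetical recovery times: 1 hour without high availability support,
    # 0.1 hour with it (for example, fast failover), i.e. repair rates 1 and 10.
    slow = steady_state_availability(system_rate, repair_rate=1.0)
    fast = steady_state_availability(system_rate, repair_rate=10.0)

    print(f"system failure rate per hour: {system_rate:.4f}")
    print(f"availability with slow recovery: {slow:.4f}")
    print(f"availability with fast recovery: {fast:.4f}")
```

Under these assumptions, shortening recovery time raises the repair rate and therefore the steady-state availability, which is consistent with the abstract's conclusion that high availability technologies increase system availability.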


