Study of a Fault-tolerant Grid Framework for Dependable Grid Computing (cited by: 7)
Abstract: Because grid resources are distributed, dynamic, and heterogeneous, faults are more likely in a grid computing environment than in a traditional cluster system. Node failures in particular occur unpredictably, which makes detection and recovery harder. To execute applications reliably in a grid environment, this paper proposes a fault-tolerant grid architecture based on distributed fault detection, building on the Heartbeat Monitor (HBM) in Globus. It studies how an application resumes execution after node, network, and process failures, and examines a separate recovery mechanism for each of these three fault types, matched to how that fault occurs in a grid. The resulting structure is oriented toward parallel computing, and its goal is to let users recover or adjust a computation reliably at low cost.
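Authors: Qiu Min (邱敏), Gui Xiaolin (桂小林)
Source: Microelectronics & Computer (《微电子学与计算机》), indexed in CSCD and the Peking University Core Journals list, 2005, No. 7, pp. 99-102, 106 (5 pages)
Funding: National Natural Science Foundation of China (60273085); National 863 Program (2001AA111081); Ministry of Education ChinaGrid Program
Keywords: fault-tolerant computing, grid computing, reliability, fault detection, fault recovery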
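
The abstract names heartbeat-based fault detection but gives no detail, so the following is only a minimal sketch of the general idea, not the paper's design and not the Globus HBM API. The class `HeartbeatMonitor`, its methods, the timeout value, and the rule "a node is suspected when all of its processes are silent" are assumptions made for illustration. Heartbeats alone cannot separate a node failure from a network failure, so the sketch reports both cases as a suspected node.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Toy heartbeat-based fault detector (hypothetical, not the Globus HBM API)."""
    timeout: float  # seconds of silence before a process is suspected
    last_seen: dict = field(default_factory=dict)  # (node, process) -> last heartbeat time

    def heartbeat(self, node: str, process: str, now: float) -> None:
        # Record the arrival time of a heartbeat message.
        self.last_seen[(node, process)] = now

    def check(self, now: float) -> dict:
        """Return suspected failures at time `now`, grouped by assumed fault type."""
        silent_by_node: dict = {}
        total_by_node: dict = {}
        for (node, process), seen in self.last_seen.items():
            total_by_node[node] = total_by_node.get(node, 0) + 1
            if now - seen > self.timeout:
                silent_by_node.setdefault(node, []).append(process)
        suspects = {"node": [], "process": []}
        for node, silent in silent_by_node.items():
            if len(silent) == total_by_node[node]:
                # Assumption: every process silent means the node (or its link) is down.
                suspects["node"].append(node)
            else:
                # Some processes still report, so only the silent ones are suspected.
                suspects["process"].extend((node, p) for p in silent)
        suspects["node"].sort()
        suspects["process"].sort()
        return suspects


monitor = HeartbeatMonitor(timeout=5.0)
monitor.heartbeat("n1", "p1", now=0.0)
monitor.heartbeat("n1", "p2", now=0.0)
monitor.heartbeat("n2", "p1", now=0.0)
monitor.heartbeat("n1", "p1", now=4.0)  # only n1/p1 keeps reporting
result = monitor.check(now=6.0)
print(result)
```

In this run the monitor suspects node `n2` as a whole and process `p2` on node `n1`. A recovery layer could then pick a different action per fault type, for example restarting a single process versus rescheduling the work on another node.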

