Abstract
During the development of a parallel processing system for remote sensing images, it was found that existing cluster monitoring systems do not provide enough support for debugging and running high-performance computing programs. Reflecting the run-time status of parallel programs written with the MPI library in the cluster monitoring interface makes it easier to debug the image processing modules and to monitor them while they run. The basic principle is to record the monitored items in a separate dynamic aggregation table; entries are added and deleted dynamically to record the data that need to be exchanged, including monitoring data for a specific MPI process. In addition, every data transfer is initiated by a subscription from an upper-level node. Compared with gmond as implemented by Gmon in Ganglia, this reduces data traffic and network occupation when no user is using the monitoring interface.
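The subscription-driven table described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `AggregationTable`, its methods, and the metric name `mpi.rank3.cpu` are all hypothetical, and the sketch assumes only what the abstract states, namely that entries are added and deleted dynamically and that data flows upward only for metrics an upper-level node has subscribed to.

```python
from dataclasses import dataclass, field


@dataclass
class AggregationTable:
    """Hypothetical per-node table of metrics subscribed to by upper nodes."""
    # metric name -> set of subscriber ids (assumed representation)
    subscribers: dict = field(default_factory=dict)
    # metric name -> latest locally sampled value
    latest: dict = field(default_factory=dict)

    def subscribe(self, metric, subscriber):
        # Dynamically add an entry when an upper node asks for a metric.
        self.subscribers.setdefault(metric, set()).add(subscriber)

    def unsubscribe(self, metric, subscriber):
        # Dynamically delete the entry once nobody is interested in it.
        subs = self.subscribers.get(metric)
        if subs is None:
            return
        subs.discard(subscriber)
        if not subs:
            del self.subscribers[metric]
            self.latest.pop(metric, None)

    def record(self, metric, value):
        # Keep a sample only if the metric is currently subscribed.
        if metric in self.subscribers:
            self.latest[metric] = value

    def outgoing(self):
        # Messages that would be sent upward: (subscriber, metric, value).
        return sorted(
            (s, m, v)
            for m, v in self.latest.items()
            for s in self.subscribers[m]
        )
```

With no subscribers, `record` stores nothing and `outgoing` returns an empty list, which corresponds to the claim that no monitoring traffic is generated while nobody is using the monitoring interface.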
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
PKU Core Journal (北大核心)
2008, No. 1, pp. 190-192, 212 (4 pages)
Keywords
cluster
monitoring system
high-performance scientific computing
subscription
dynamic aggregation table