Abstract
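Lustre's reboot recovery algorithm requires every client in the cluster to re-establish its connection to the server within a specified recovery time window. The clients retransmit their uncommitted transaction requests, and the server replays all uncommitted transactions strictly in transaction-sequence-number order, a requirement that is overly strict. To address Lustre's weak recoverability, version-based recovery and commit-on-share algorithms are proposed. They respectively improve and extend Lustre's existing metadata update and recovery mechanisms and, based on the dependencies between transactions, allow clients to recover and rejoin the cluster under more relaxed conditions without being evicted, which improves the availability and recoverability of the Lustre file system. Finally, the performance of the improved algorithms is evaluated through a series of experiments.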
Lustre's reboot recovery algorithm requires all clients to reconnect to the server within a specified recovery time window; the clients then resend their uncommitted transactional requests, and the server replays these requests strictly in transaction-number order. These recovery conditions are too strict. To improve Lustre's recoverability and availability, this paper proposes version-based recovery and commit-on-share algorithms. They extend Lustre's metadata update algorithm and recovery algorithm, respectively, and allow clients to rejoin the cluster through recovery under more relaxed conditions, according to the dependencies between transactions. Finally, the performance of the improved recovery algorithms is evaluated through a series of experiments.
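The contrast between strict transaction-order replay and dependency-aware replay can be illustrated with a small sketch. It is a simplified model written for this summary, not the paper's implementation or Lustre code: the names `ReplayRequest`, `strict_replay`, and `version_based_replay` are hypothetical, and it assumes each object carries a version equal to the transaction number of its last change, with every replay request recording the version its client saw before the operation. Commit on share is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class ReplayRequest:
    transno: int      # transaction number assigned before the server restarted
    obj_id: str       # object the operation modifies (hypothetical identifier)
    pre_version: int  # object version the client observed before the operation


def strict_replay(requests, expected_transnos):
    """Replay strictly in transaction-number order; the first gap stops replay."""
    available = {r.transno for r in requests}
    replayed = []
    for transno in expected_transnos:
        if transno not in available:
            break  # a missing client leaves a gap, so later requests are not replayed
        replayed.append(transno)
    return replayed


def version_based_replay(requests, versions):
    """Replay every request whose object is still at the version the client saw."""
    replayed = []
    for r in sorted(requests, key=lambda r: r.transno):
        if versions.get(r.obj_id, 0) == r.pre_version:
            versions[r.obj_id] = r.transno  # assumed rule: new version = transno
            replayed.append(r.transno)
        # otherwise the request depends on a transaction that was lost; skip it
    return replayed
```

In this model, if transactions 11 to 14 were uncommitted and the client holding transaction 12 never reconnects, strict replay stops after 11, while version-based replay can still apply 13 when it touches an object unaffected by 12 and skips only the requests that depended on the lost transaction.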
Source
《计算机科学》
CSCD
Peking University Core Journal (北大核心)
2015, No. 9, pp. 177-182 (6 pages)
Computer Science
Funding
National 973 Program of China (2009CB723803)
National Natural Science Foundation of China (60873120)