Deep learning has become the cornerstone of artificial intelligence,playing an increasingly important role in human production and lifestyle.However,as the complexity of problem-solving increases,deep learning models ...Deep learning has become the cornerstone of artificial intelligence,playing an increasingly important role in human production and lifestyle.However,as the complexity of problem-solving increases,deep learning models become increasingly intricate,resulting in a proliferation of large language models with an astonishing number of parameters.Pipeline model parallelism(PMP)has emerged as one of the mainstream approaches to addressing the significant challenge of training“big models”.This paper presents a comprehensive review of PMP.It covers the basic concepts and main challenges of PMP.It also comprehensively compares synchronous and asynchronous pipeline schedules for PMP approaches,and discusses the main techniques to achieve load balance for both intra-node and inter-node training.Furthermore,the main techniques to optimize computation,storage,and communication are presented,with potential research directions being discussed.展开更多
基金supported in part by the National Natural Science Foundation of China under Grant Nos.62025208,U21A20473,U21A20513,62076154,and 62302512the State Administration of Science,Technology,and Industry for National Defense of China under Grant No.WDZC20235250118.
文摘Deep learning has become the cornerstone of artificial intelligence,playing an increasingly important role in human production and lifestyle.However,as the complexity of problem-solving increases,deep learning models become increasingly intricate,resulting in a proliferation of large language models with an astonishing number of parameters.Pipeline model parallelism(PMP)has emerged as one of the mainstream approaches to addressing the significant challenge of training“big models”.This paper presents a comprehensive review of PMP.It covers the basic concepts and main challenges of PMP.It also comprehensively compares synchronous and asynchronous pipeline schedules for PMP approaches,and discusses the main techniques to achieve load balance for both intra-node and inter-node training.Furthermore,the main techniques to optimize computation,storage,and communication are presented,with potential research directions being discussed.