Abstract
Stream processing has emerged as a useful technology for applications that require continuous, low-latency computation on infinite streaming data. Because stream processing systems (SPSs) usually must be deployed across clusters of servers to cope with large-scale data, failures of processing nodes or communication networks are especially common, and they must be handled carefully to preserve service quality. A failed system may produce wrong results or become unavailable, degrading user experience or even causing significant financial loss. Hence, a large number of fault tolerance approaches have been proposed for SPSs. These approaches often prioritize specific performance concerns, e.g., runtime overhead and recovery efficiency. Nevertheless, a systematic overview and classification of the state-of-the-art fault tolerance approaches in SPSs is lacking, which will become an obstacle to the development of SPSs. Therefore, we survey the existing work and develop a taxonomy of fault tolerance in SPSs. Furthermore, we propose an evaluation framework tailored for fault tolerance, present experimental results on two representative open-source SPSs, and discuss possible disadvantages of current designs. Finally, we outline future research directions in this domain.
Funding
This work was supported by the National Key Research and Development Plan Project (2018YFB1003404).