Efficient Fault Tolerance for Massively Parallel Dataflow Systems

Sergey Dudoladov
supervised by Prof. Volker Markl
Technische Universität Berlin
firstname.lastname@tu-berlin.de

Proceedings of the VLDB 2016 PhD Workshop, September 9, 2016, New Delhi, India.
Copyright (c) 2016 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
Dataflow systems provide fault tolerance by combining checkpointing and lineage but leave it up to a data scientist to decide on when and how to checkpoint. This leads to job plans that are inefficient during failure-free execution or recovery, e.g., if a data scientist forgets to checkpoint expensive operators that need to be re-executed after a failure. In this work, we aim to (1) increase the efficiency of checkpointing transparently to the data scientist and (2) automate the placement of checkpoints and other fault tolerance mechanisms. First, we show how to reduce checkpoint size for machine learning algorithms using qpoints, a compressed representation of the algorithms' parameters. Qpoints enable the algorithms to run faster by spending less time on checkpointing. Second, we show how to place checkpoints optimally for a given cluster without user intervention using smartpoints, our framework for building fault tolerance optimizers. Smartpoints free data scientists from making tedious decisions about fault tolerance while retaining reasonable performance guarantees in case of failure.

1. INTRODUCTION
The interest in Big Data has advanced the development of dataflow systems for large scale analytics such as Apache Flink and Apache Spark. Users run these systems in either private or cloud-based clusters that are often virtualized. Such clusters tend to be failure prone: the commodity hardware used to build them exhibits high failure rates when used in large quantities [11].

Another source of failures for cloud-based clusters, regardless of their size, is preemption. Cloud providers offer certain instances of virtual machines with large discounts to utilize otherwise idle hardware [2, 15]. The downside is that the cloud can preempt (reclaim) such an instance at any moment when it needs the resources back. For an application, preemption looks like a failure: the preempted instance disappears, causing the loss of the application state kept in memory. In this paper, we use preemption as a running example; our research project applies to other failure models as well, for instance, classic fail-stop failures.

To handle preemption, cloud providers advise employing rollback recovery [3]. This approach to fault tolerance is conceptually simple: a system periodically checkpoints the current result of computation to durable storage, and, in the case of preemption, restarts the computation from the most recent checkpoint.

This scheme, while widely used, is not cost effective when it comes to modern applications such as machine learning (ML). We identify two problems that make cloud clients waste money.

First, checkpoints waste resources users pay for. Persisting the application state keeps preemptible instances busy performing non-productive work, which increases the expenses users want to minimize. Given that the state of an ML algorithm may contain tens of gigabytes of data [9], this may cause considerable overhead both in terms of resources and money. Avoiding checkpoints is not an option either, because recomputing application state from scratch after each preemption delays the results. Even worse, an application without checkpoints may fail to terminate under high preemption rates.

The second, arguably more general, issue is that fault tolerance requires manual tuning. For example, Apache Spark offers eleven persistence options to tweak its lineage-based version of rollback recovery [1]; examples include checkpointing to the memory of two cluster nodes or storing the dataset entirely on the disk of a single machine. Making end users manually tune numerous fault tolerance options, such as the placement or type of checkpoints in Spark, is likely to be problematic because it requires an understanding of the system's internals. In the case of Spark, a data scientist has to decide where a lineage chain becomes too expensive to recompute (e.g., because of too many CPU-intensive operators) and insert a checkpoint of a suitable type. Given that many users of data analytics systems are in fact non-experts in distributed systems [12], such performance tweaking becomes ineffective [4], i.e., it leads to job plans that take more time than necessary to (re)compute. This extra time, combined with the working time of the data scientists who tweak fault tolerance, also increases the expenses of cloud clients.
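To make the tuning burden concrete, consider the following minimal PySpark sketch (our illustration, not from the paper; the input path and the parsing function are hypothetical stand-ins):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="manual-fault-tolerance")
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")            # hypothetical directory

    parse = lambda line: line.split(",")                      # stand-in for a costly UDF
    features = sc.textFile("hdfs:///data/clicks").map(parse)  # hypothetical input

    # The data scientist must pick one of many persistence options by hand:
    features.persist(StorageLevel.MEMORY_ONLY_2)  # replicate in memory of two nodes
    # features.persist(StorageLevel.DISK_ONLY)    # ...or keep it on disk only
    # features.checkpoint()                       # ...or truncate the lineage chain
    features.count()                              # materialize under the chosen policy

Which option pays off depends on the cost of recomputing the lineage chain and on the cluster's failure rate, precisely the internals a non-expert user is unlikely to know.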
In this research project, we aim to make fault tolerance for dataflow systems more efficient and usable. We propose:
1. Qpoints, a technique for reducing checkpointing time by persisting the compressed state of ML algorithms.
2. Smartpoints, a framework for building fault tolerance optimizers that aims to automate decisions on when and how to checkpoint.

2. QPOINTS
In Section 2.1, we describe the robustness of machine learning algorithms against approximation of parameter values, a property vital for qpoints. In Section 2.2, we employ this property to reduce the size of checkpoints for ML by compressing them into qpoints.

2.1 Approximation in Machine Learning
Many machine learning algorithms are robust against approximation, that is, they can tolerate partial loss of state or some inaccuracy in parameter values [7]. In this research project, we will exploit this robustness to approximate the parameter values of several practically important classes of algorithms such as deep neural networks and generalized linear models.

For execution speed and coding convenience, practitioners encode parameters such as weights in generalized linear models as IEEE 754 double values, but the experimental evaluation of ML algorithms shows that parameter values tend to cluster near zero, so the full range of a double value is rarely used [13]. Given that large scale ML algorithms can contain billions of parameters [9], a suboptimal parameter representation can significantly increase the memory footprint of an algorithm.

Consider the recent Adam project for deep learning [9]. (Adam is technically not a dataflow system; however, current projects such as SparkNet [19] actively port deep learning to dataflow systems, so these systems can be expected to run into the problem of deep learning memory requirements in the near future.) This system is capable of training a neural network with 36 billion connections between neurons, where each connection is represented with a weight. Assuming an IEEE double representation of each weight, this amounts to 36 × 10^9 × 8 bytes of data, so 288 gigabytes are needed to represent only the parameters of a single model.

Machine learning practitioners recognize the need to use less space per parameter [10, 13, 16]. Instead of the IEEE 754 floating point format, they propose to use the Qn.m encoding (Q for quantization), which can significantly reduce the memory footprint with little loss in performance. This encoding represents a real number with n + m + 1 bits: n bits for the integral part, m bits for the fractional part, and one bit for the sign [13]. For example, in [13] the authors use the Q2.13 encoding in the training phase to halve the memory consumption of logistic regression compared with a model that represents parameters as floats. This saving costs only about 0.01% extra logistic loss in the testing phase.
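As a rough sketch of how such an encoding could work in software (our illustration; the paper does not prescribe an implementation, and a production encoder would need the rounding safeguards discussed in Section 5), a Qn.m value with n + m + 1 = 16 bits fits in a 16-bit integer:

    import numpy as np

    def to_qnm(values, n=2, m=13):
        # Qn.m: one sign bit, n integral bits, m fractional bits (16 bits total
        # here), so each parameter fits in an int16 instead of an 8-byte double.
        scale = 1 << m
        max_mag = (1 << (n + m)) - 1                    # 32767 for Q2.13
        q = np.clip(np.rint(values * scale), -max_mag - 1, max_mag)
        return q.astype(np.int16)

    def from_qnm(q, m=13):
        # Recover approximate doubles; the error per value is at most 2**-(m+1).
        return q.astype(np.float64) / (1 << m)

    weights = np.random.randn(1_000_000) * 0.1   # parameters cluster near zero [13]
    qpoint = to_qnm(weights)                     # 2 MB instead of 8 MB
    print(np.max(np.abs(weights - from_qnm(qpoint))))  # bounded by 2**-14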
2.2 Qpoints: Checkpoints in Qn.m Encoding
The robustness of ML to approximation hints that in many cases checkpointing the exact in-memory state may not be needed: an algorithm neither uses most of this memory footprint during normal operation, nor does it need the full state to recover after preemption. Since the Qn.m encoding can save a noticeable amount of memory without sacrificing the end performance, we suggest checkpointing this representation of ML model parameters instead of the standard representation with doubles. We define a qpoint as a checkpoint that employs the Qn.m encoding to compress and persist model parameters. For speed and convenience, data scientists can still use doubles while developing algorithms; a system should automatically compress parameters while checkpointing.

Qpoints come with a cost: unlike checkpoints, qpoints are not universally applicable. For qpoints to be beneficial, an ML algorithm should (1) have a large number of parameters of limited range, (2) be able to terminate from approximate state after recovery, and (3) lack additional consistency requirements.

The first condition means that the algorithm parameters do not use the full range of the IEEE 754 double data type. This condition holds for at least two classes of machine learning algorithms: generalized linear models and deep neural networks [10, 13, 16]. Generalized linear models are commonly used at scale for tasks like spam filtering or predicting ad click-through rates, e.g., [5, 14]. Deep neural networks are used for tasks such as visual object recognition, where human experts struggle to extract meaningful features from the input data [9].

The second condition selects algorithms that can reach the desired level of performance starting from approximate state. For both neural nets and linear models, a low precision parameter representation – about a quarter of the bits of the IEEE double format – is sufficient for training and running models and has little effect on the prediction accuracy [13, 16]. The guarantees, however, differ: linear models and convex optimizers in general provably reach the unique optimum from any approximation, while neural networks may converge to different solutions due to their non-convex nature.

The third condition rules out algorithms that require the approximate state to be consistent. An example of such an algorithm is PageRank. This algorithm can converge using approximate parameter weights, but it requires the weights to form a probability distribution, i.e., to sum up to one. Simply restoring the weights from a qpoint may be insufficient because, due to rounding issues, the sum of parameter values may deviate too far from one. Both neural nets and linear models have no such special consistency requirements.

Despite these restrictions, qpoints bring two significant gains. First, qpoints can reduce the checkpointing overhead transparently to a user without sacrificing fault tolerance. The standard way to decrease checkpointing cost is to adjust the frequency of checkpoints [22]. This strategy requires manual tuning from a user and has proved problematic in real world deployments of ML algorithms (e.g., [18]). Qpoints can reduce the time spent on fault tolerance by saving less data per checkpoint: data scientists can run their algorithms with some default qpoint frequency and gain sufficient fault tolerance without paying the cost of full checkpoints.

Second, qpoints enable exploring a failure model that, to our knowledge, is not currently discussed within the database community, namely a failure with a prior warning. Such failures correspond to preemption in clouds that notify a client about the upcoming preemption event. For instance, Google Cloud issues such a warning 30 seconds before the preemption [15]. After getting this notification, a client would naturally want to persist the current computation state. However, starting standard checkpointing at this moment may fail to meet the hard time limit because of the data volume to persist. Qpoints have a better chance of meeting the limit because they have less data to save. Qpoints also enable progressive checkpointing, that is, saving a coarse approximation of the parameters with as little data as possible and then gradually refining the approximation up until the preemption takes place.
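One possible reading of progressive checkpointing, sketched under the naive encoder above (our illustration; the refinement schedule and file layout are hypothetical):

    import time
    import numpy as np

    def progressive_qpoint(weights, deadline):
        # Persist increasingly fine Qn.m approximations until the preemption
        # deadline; every completed pass leaves a complete, usable qpoint behind.
        for m in (5, 9, 13):                      # coarse -> fine fractional bits
            if time.time() >= deadline:
                break
            q = np.rint(weights * (1 << m)).astype(np.int16)
            np.save(f"qpoint_m{m}.npy", q)        # each file is a full qpoint

    # e.g., the 30-second warning issued by Google Cloud before preemption [15]
    progressive_qpoint(np.random.randn(10_000_000) * 0.1, time.time() + 30)

A refinement could persist only the residual bits of each pass instead of a full array per pass; we omit such deltas for brevity.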
3. SMARTPOINTS
In Section 3.1, we outline the randomized weighted majority, the meta-algorithm we use to develop smartpoints. In Section 3.2, we describe smartpoints, our framework for building fault tolerance optimizers.

3.1 Optimal Prediction from Expert Advice
Consider the following situation: one has a pool of experts who need to predict future events, for example, whether the price of a single stock will go up or down the next day. Naturally, one would like to select the best expert, the one who makes the least number of mistakes, and follow their predictions. The problem is that it is not known beforehand which expert will perform best on a given sequence of future events, e.g., days. In such a setting, the randomized weighted majority algorithm (RWM) enables one to perform provably close to the best expert without any a priori knowledge.

The RWM assigns equal initial weights to all experts, each expert essentially being a function with the {0, 1} range (say, 0 means the price will go down and 1 means it will go up). The algorithm then proceeds in a sequence of trials. At each trial, the algorithm chooses an expert at random, with probability proportional to the current weight of the expert, and follows the prediction of this expert. Once the true answer is revealed (e.g., the price went down), the algorithm punishes all experts who predicted wrongly by multiplying their weights by the penalty β, 0 ≤ β < 1; the weights of the correct experts stay intact. By doing so, the algorithm decreases the probability of choosing a mistaken expert in the next trial. Intuitively, if an expert predicts wrongly, the algorithm trusts them less in the future. The RWM can guarantee the following property [6]:

Theorem 1. On any sequence of trials, the expected number of mistakes X made by the Randomized Weighted Majority algorithm satisfies

    X ≤ x + ln(y) + O(√(x · ln(y)))

where x is the number of mistakes of the expert who performed best so far, and y is the total number of experts.

So the expected number of mistakes of the algorithm is bounded by the number of mistakes of the expert from the pool who performs best on a given sequence of trials. Intuitively, the overall performance of the RWM is close to that of the best expert in the pool. The bound from Theorem 1 holds for the worst case of input data, without any probabilistic assumptions about the input or the experts [17].
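For concreteness, a compact Python rendering of the RWM as described above (our sketch; the two toy experts and the outcome sequence are made up):

    import random

    def rwm(experts, outcomes, beta=0.5):
        # Follow one expert chosen at random with probability proportional to
        # its weight; after the truth is revealed, multiply the weight of every
        # mistaken expert by beta (0 <= beta < 1).
        weights = [1.0] * len(experts)
        mistakes = 0
        for t, truth in enumerate(outcomes):
            predictions = [expert(t) for expert in experts]
            chosen = random.choices(range(len(experts)), weights=weights)[0]
            mistakes += predictions[chosen] != truth
            for i, p in enumerate(predictions):
                if p != truth:
                    weights[i] *= beta      # penalize every wrong expert
        return mistakes

    # toy run: expert 0 always predicts "up" (1), expert 1 always "down" (0)
    print(rwm([lambda t: 1, lambda t: 0], outcomes=[1, 1, 0, 1, 1, 0, 1, 1]))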
3.2 Smartpoints: Fault Tolerance via Randomized Weighted Majority
Fault tolerance mechanisms span a large spectrum ranging from usual checkpoints [3] to lineage [26] to less standard ideas such as qpoints. With that variety, choosing among the mechanisms becomes non-trivial even for expert users. Researchers are well aware of this problem: recent work has shown that automatically choosing the most suitable mechanism (e.g., checkpoints) and its placement (for instance, checkpointing each third operator) is possible and does improve performance [23, 25]. We observe, however, that current approaches to fault tolerance optimization, such as the ones discussed in [20, 24, 25], share common shortcomings.

First, they require significant implementation effort, often at the system runtime level. An example is the FTOpt optimizer [23], which requires a system to have a special acknowledgement protocol to track tuples' flow through the system; reimplementing this protocol would greatly complicate system design and increase development costs.

Second, current approaches tend to make assumptions that may be difficult to fulfill. For example, the optimizers proposed in [21, 23] depend on accurate cost estimates that are hard to obtain in the presence of user-defined functions [4].

In this research project, we propose an approach to building fault tolerance optimizers with the RWM that alleviates these problems. We define a fault tolerance policy to be a set of decisions on where to use which fault tolerance mechanism. Continuing our running example of preemption, a policy can be a heuristic that advises checkpointing before the end of each hour. To simplify terminology, we will henceforth use the term checkpoint to denote any fault tolerance mechanism, for instance, qpoints. Smartpoints as a framework should be capable of incorporating such mechanisms.

Intuitively, in our approach fault tolerance policies become experts who periodically vote, according to the rules of the RWM, on whether a system should checkpoint or not. For instance, the "checkpoint nothing" policy would always vote against checkpointing and rather rely on job re-execution to provide fault tolerance. The RWM ensures that the most suitable policy for a given environment will eventually retain most of the weight. For example, the "checkpoint nothing" policy should win under low preemption rates, because in such environments rare preemption events do not justify the cost of checkpointing.

Thus a dataflow system with RWM-based fault tolerance will eventually employ the fault tolerance policy most suitable for its particular cluster, without any user involvement, at the cost of a few initial mistakes. For brevity, we use the term smartpoints to refer to the idea of using the RWM to select the best fault tolerance policy.

Algorithm 1 describes smartpoints more formally. In this algorithm, fault tolerance policies effectively predict failures with their votes. That is, the decision to checkpoint can be rephrased as "the next operator will fail": if the next operator does not fail, we do not need to checkpoint. If one of the subsequent operators (e.g., the second next) fails, we would ideally like to checkpoint the immediate predecessor of the operator-to-fail to avoid re-executing any successful operators. Real failures become the true labels used by the RWM to penalize experts: if a policy voted to checkpoint, and a failure did not happen during the next operator, the algorithm reduces the weight of this policy. For the purposes of smartpoints, we define a RWM trial to consist of (a) the execution of a single operator, (b) a vote among checkpoint policies on whether the operator output should be checkpointed, and (c) observing whether a failure happens during the execution of the next operator.

Algorithm 1 Smartpoints
1: for each policy i do
2:     w_i = 1
3: for each operator t do
4:     use policy i's prediction with probability p_i = w_i / Σ_j w_j
5:     for each policy i do
6:         if policy i made a mistake then
7:             w_i = β · w_i
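The following toy simulation renders Algorithm 1 runnable (our sketch; the policy pool, the operator model, and the injected failure trace "fails_at" are hypothetical):

    import random

    def smartpoints(num_operators, policies, fails_at, beta=0.5):
        # A trial (see text): run an operator, let the weighted-random vote
        # decide whether to checkpoint its output, then observe whether the
        # next operator fails; a "checkpoint" vote predicts that it will.
        weights = [1.0] * len(policies)
        checkpointed = []
        for t in range(num_operators):
            votes = [policy(t) for policy in policies]   # True = "checkpoint now"
            chosen = random.choices(range(len(policies)), weights=weights)[0]
            if votes[chosen]:
                checkpointed.append(t)                   # persist operator t's output
            failure = (t + 1) in fails_at                # the true label of this trial
            for i, vote in enumerate(votes):
                if vote != failure:
                    weights[i] *= beta                   # penalize mistaken policies
        return checkpointed, weights

    # hypothetical pool: "checkpoint nothing", "checkpoint everything", "every 3rd"
    pool = [lambda t: False, lambda t: True, lambda t: t % 3 == 2]
    print(smartpoints(num_operators=20, policies=pool, fails_at={7, 15}))

With only two failures in twenty operators, the "checkpoint nothing" policy makes the fewest wrong predictions and ends with the highest weight, matching the low-preemption intuition above.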
This algorithm ensures three properties. First, due to the properties of the RWM (see Theorem 1), for a given pool of policies Algorithm 1 checkpoints in a way provably close to the policy optimal for a given cluster. In other words, assuming a well-designed pool of policies, smartpoints can automatically adapt to a wide range of cluster environments. Second, the algorithm makes very few assumptions: it does not require any specific knowledge about failure distributions, cluster size, or the previous history of a system. Instead, we propose to encode this domain-specific knowledge into fault tolerance policies unique to a particular system. Finally, the reuse of elements common in dataflow systems (checkpoints, blocking operators), combined with the simplicity of the algorithm itself, greatly reduces the implementation effort compared to existing fault tolerance optimizers. For example, nothing in the algorithm itself requires special support from the runtime.

4. RELATED WORK
Approximation in Machine Learning. The work of Bousquet and Bottou [7] provided the theoretical foundation for understanding this phenomenon; a series of recent papers [10, 13, 16] studied the effect of approximating parameter values with the Qn.m encoding on the prediction accuracy of deep neural networks and generalized linear models. We plan to piggyback on this approximation tolerance to decrease checkpoint size with qpoints.

Randomized Weighted Majority. Littlestone and Warmuth proposed the original idea and later summarized it in [17]; the follow-up work [8] conducted an extensive theoretical analysis and showed how to choose the penalty parameter β to minimize the expected number of mistakes of the algorithm. The paper by Blum [6] provides an overview of the key results in the area. We adapt these results to develop the smartpoints framework, which should yield a family of fault tolerance optimizers capable of intelligently choosing the optimal checkpoint policy at runtime.

Fault tolerance optimization for dataflows. The FTOpt optimizer proposed in [23] employs geometric programming to reduce the overhead of checkpoints. Smartpoints differ from it in three ways. First, they do not require a cost model and cost estimates for operators. Second, they do not require dedicated support from the runtime (FTOpt requires an ack protocol to track tuples). And third, smartpoints do not restrict job plans in any way besides forming a directed acyclic graph of operators (FTOpt handles only job plans with aggregations at the top). The more recent optimizer from [21] probabilistically models the likelihood and impact of failures using yet another cost model. Unlike [21], which assumes a Poisson distribution of failures, smartpoints do not make any assumptions about failure rates. Rather, they adjust to the actual failure rate at runtime by selecting the checkpoint policy optimal for that rate.
5. RESEARCH PLAN
We intend to implement and evaluate qpoints and smartpoints during the years 2016-2017. With qpoints, we have to address three key issues. First, current Qn.m encoding schemes rely either on custom hardware [10, 16] or on algorithms hand-crafted to represent parameters with Qn.m values [13]. To keep our approach general, we cannot assume such support and have to come up with a software encoder.

Second, to make qpoints handle failures with a prior warning, we have to meet the hard real-time requirements of the warning. Given the limited number of IO operations per second, the considerable memory footprint, and the software Qn.m encoder, qpointing in a timely manner requires finding a tradeoff between parameter memory usage and the accuracy of the final model.

Third, we have to avoid numerical issues. The Qn.m parameter representation needs (a) to have enough accuracy to represent the small parameter values or updates commonly encountered in real world machine learning deployments and (b) to avoid introducing bias into parameter values.
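One candidate remedy for (b) is randomized rounding in the spirit of [13]; a minimal sketch (our illustration, not a prescribed design) shows how it preserves small updates in expectation where round-to-nearest would silently drop them:

    import numpy as np

    rng = np.random.default_rng(0)

    def to_qnm_stochastic(values, m=13):
        # Round up with probability equal to the fractional remainder, so the
        # expected quantized value equals the input: no systematic bias.
        scaled = values * (1 << m)
        floor = np.floor(scaled)
        q = floor + (rng.random(values.shape) < (scaled - floor))
        return q.astype(np.int16)

    updates = np.full(100_000, 1e-5)     # gradient updates below Q2.13 resolution
    q = to_qnm_stochastic(updates)
    print(q.mean() / (1 << 13))          # ~1e-5 in expectation; plain rounding gives 0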
With smartpoints, we need to solve two challenges, namely (a) we have to adjust the Randomized Weighted Majority to handle the specific case of rollback recovery and (b) we have to preserve the RWM guarantees during this adjustment. With respect to modifications, three are absolutely necessary: designing the expert pool, adjusting the penalty rate β, and handling cases where persisting the data is compulsory, such as when a system cannot hold data in memory due to memory pressure.

Our baseline for both qpoints and smartpoints will be the fault tolerance policies commonly hard-coded into modern dataflow systems, such as "checkpoint everything" in Apache Hadoop. We will fix the placement and frequency of checkpoints, e.g., "checkpoint the last operator of each third iteration", and run jobs with such settings to get the baseline median (out of five identical runs) wall-clock job execution time. We will then strive to decrease this time with qpoints and smartpoints. With qpoints, the time should decrease because jobs will spend less time persisting qpoints than checkpoints due to the smaller qpoint size. With smartpoints, we expect a reduction in the execution time because the majority of jobs will checkpoint optimally for the cluster they run in, while a fixed checkpoint policy is likely to mismatch certain environments. For instance, jobs under low preemption rates should on average complete faster because smartpoints will automatically select the "checkpoint nothing" policy and remove the entire checkpoint overhead.

Acknowledgements. This work has been supported through the grant by the German Ministry for Education and Research as Berlin Big Data Center BBDC (ref. 01IS14013A).

6. REFERENCES
[1] RDD Persistence. spark.apache.org/docs/latest/programming-guide.html#rdd-persistence, 2015.
[2] Amazon Web Services. EC2 Spot Instance Termination Notices. aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/, 2015.
[3] Amazon Web Services. Managing Interruption. aws.amazon.com/ec2/spot/spot-tutorials/, 2015.
[4] S. Babu. Towards Automatic Optimization of MapReduce Programs. SoCC '10, pages 137–142.
[5] M. Bilenko and M. Richardson. Predictive Client-side Profiles for Personalized Advertising. KDD '11, pages 413–421.
[6] A. Blum. On-line Algorithms in Machine Learning. In Developments from a June 1996 Seminar on Online Algorithms: The State of the Art, pages 306–325, 1998.
[7] O. Bousquet et al. The Tradeoffs of Large Scale Learning. In Advances in Neural Information Processing Systems 20, pages 161–168, 2008.
[8] N. Cesa-Bianchi et al. How to Use Expert Advice. STOC '93, pages 382–391, 1993.
[9] T. Chilimbi et al. Project Adam: Building an Efficient and Scalable Deep Learning Training System. OSDI '14, pages 571–582.
[10] M. Courbariaux et al. Low Precision Arithmetic for Deep Learning. CoRR, abs/1412.7024, 2014.
[11] J. Dean. Lessons from Building Large Distributed Systems. www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf, 2009.
[12] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM, 51(1):107–113, 2008.
[13] D. Golovin et al. Large-Scale Learning with Less RAM via Randomization. ICML 2013, page 10.
[14] J. Goodman et al. Spam and the Ongoing Battle for the Inbox. Commun. ACM, 50(2):24–33, Feb. 2007.
[15] Google Cloud Platform. Creating a Preemptible VM Instance. https://cloud.google.com/compute/docs/instances/preemptible, 2015.
[16] S. Gupta et al. Deep Learning with Limited Numerical Precision. CoRR, abs/1502.02551, 2015.
[17] N. Littlestone et al. The Weighted Majority Algorithm. Information and Computation, 108(2):212–261, 1994.
[18] Y. Low et al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. Proc. VLDB Endow., 5(8):716–727, 2012.
[19] P. Moritz et al. SparkNet: Training Deep Networks in Spark. arXiv:1511.06051, 2016.
[20] M. Pundir et al. Zorro: Zero-Cost Reactive Failure Recovery in Distributed Graph Processing. SoCC '15, pages 195–208.
[21] A. Salama et al. Cost-based Fault-tolerance for Parallel Data Processing. SIGMOD '15, pages 285–297.
[22] S. Schelter et al. All Roads Lead to Rome: Optimistic Recovery for Distributed Iterative Data Processing. CIKM '13, pages 1919–1928.
[23] P. Upadhyaya et al. A Latency and Fault-Tolerance Optimizer for Online Parallel Query Plans. SIGMOD '11, pages 241–252.
[24] C. Xu et al. Efficient Fault-Tolerance for Iterative Graph Processing on Distributed Dataflow Systems. ICDE '16.
[25] S. Yi et al. Monetary Cost-Aware Checkpointing and Migration on Amazon Cloud Spot Instances. IEEE Transactions on Services Computing, 5(4):512–524, 2012.
[26] M. Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI '12, pages 2–2.