Efficient Fault Tolerance for Massively Parallel Dataflow Systems

Sergey Dudoladov
supervised by Prof. Volker Markl
Technische Universität Berlin
firstname.lastname@tu-berlin.de

Proceedings of the VLDB 2016 PhD Workshop, September 9, 2016, New Delhi, India.
Copyright (c) 2016 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
Dataflow systems provide fault tolerance by combining checkpointing and lineage but leave it up to a data scientist to decide on when and how to checkpoint. This leads to job plans that are inefficient during failure-free execution or recovery, e.g., if a data scientist forgets to checkpoint expensive operators that need to be re-executed after a failure. In this work, we aim to (1) increase the efficiency of checkpointing transparently to the data scientist and (2) automate the placement of checkpoints and other fault tolerance mechanisms. First, we show how to reduce checkpoint size for machine learning algorithms using qpoints, a compressed representation of the algorithms' parameters. Qpoints enable the algorithms to run faster by spending less time on checkpointing. Second, we show how to place checkpoints optimally for a given cluster without user intervention using smartpoints, our framework for building fault tolerance optimizers. Smartpoints free data scientists from making tedious decisions about fault tolerance while retaining reasonable performance guarantees in case of failure.

1. INTRODUCTION
The interest in Big Data has advanced the development of dataflow systems for large scale analytics such as Apache Flink and Apache Spark. Users run these systems in either private or cloud-based clusters that are often virtualized. Such clusters tend to be failure prone: the commodity hardware used to build them exhibits high failure rates when used in large quantities [11].

Another source of failures for cloud-based clusters, regardless of their size, is preemption. Cloud providers offer certain instances of virtual machines with large discounts to utilize otherwise idle hardware [2, 15]. The downside is that the cloud can preempt (reclaim) such an instance at any moment when it needs the resources back. For an application, preemption looks like a failure: the preempted instance disappears, causing the loss of the application state kept in memory. In this paper, we use preemption as a running example; our research project applies to other failure models as well, for instance, classic fail-stop failures.

To handle preemption, cloud providers advise employing rollback recovery [3]. This approach to fault tolerance is conceptually simple: a system periodically checkpoints the current result of computation to durable storage, and, in the case of preemption, restarts the computation from the most recent checkpoint.

This scheme, while widely used, is not cost effective when it comes to modern applications such as machine learning (ML). We identify two problems that make cloud clients waste money.

First, checkpoints waste resources users pay for. Persisting the application state keeps preemptible instances busy performing non-productive work, which increases the expenses users want to minimize. Given that the state of an ML algorithm may contain tens of gigabytes of data [9], this may cause considerable overhead both in terms of resources and money. Avoiding checkpoints is not an option either, because recomputing application state from scratch after each preemption delays the results. Even worse, an application without checkpoints may fail to terminate under high preemption rates.

The second, arguably more general, issue is that fault tolerance requires manual tuning. For example, Apache Spark offers eleven persistence options to tweak its lineage-based version of rollback recovery [1]; examples include checkpointing to the memory of two cluster nodes or storing the dataset entirely on the disk of a single machine. Making end users manually tune numerous fault tolerance options, such as the placement or type of checkpoints in Spark, is likely to be problematic because it requires an understanding of the system's internals. In the case of Spark, a data scientist has to decide where a lineage chain becomes too expensive to recompute (e.g., because of too many CPU-intensive operators) and insert a checkpoint of a suitable type. Given that many users of data analytics systems are in fact non-experts in distributed systems [12], such performance tweaking becomes ineffective [4], i.e., it leads to job plans that take more time than necessary to (re)compute. This extra time, combined with the working time of the data scientists who tweak fault tolerance, also increases the expenses of cloud clients.
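To make the tuning burden concrete, consider the following minimal PySpark sketch (our illustration, not from the paper; the input path and the parsing function are hypothetical stand-ins):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="manual-fault-tolerance")
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")            # hypothetical directory

    parse = lambda line: line.split(",")                      # stand-in for a costly UDF
    features = sc.textFile("hdfs:///data/clicks").map(parse)  # hypothetical input

    # The data scientist must pick one of many persistence options by hand:
    features.persist(StorageLevel.MEMORY_ONLY_2)  # replicate in memory of two nodes
    # features.persist(StorageLevel.DISK_ONLY)    # ...or keep it on disk only
    # features.checkpoint()                       # ...or truncate the lineage chain
    features.count()                              # materialize under the chosen policy

Which option pays off depends on the cost of recomputing the lineage chain and on the cluster's failure rate, precisely the internals a non-expert user is unlikely to know.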
In this research project, we aim to make fault tolerance for dataflow systems more efficient and usable. We propose:
1. Qpoints, a technique for reducing checkpointing time by persisting the compressed state of ML algorithms.
2. Smartpoints, a framework for building fault tolerance optimizers that aims to automate decisions on when and how to checkpoint.

2. QPOINTS
In Section 2.1, we describe the robustness of machine learning algorithms against approximation of parameter values, a property vital for qpoints. In Section 2.2, we employ this property to reduce the size of checkpoints for ML by compressing them into qpoints.

2.1 Approximation in Machine Learning
Many machine learning algorithms are robust against approximation, that is, they can tolerate partial loss of state or some inaccuracy in parameter values [7]. In this research project, we will exploit this robustness to approximate the parameter values of several practically important classes of algorithms such as deep neural networks and generalized linear models.

For execution speed and coding convenience, practitioners encode parameters such as weights in generalized linear models as IEEE 754 double values, but the experimental evaluation of ML algorithms shows that parameter values tend to cluster near zero, so the full range of a double value is rarely used [13]. Given that large scale ML algorithms can contain billions of parameters [9], a suboptimal parameter representation can significantly increase the memory footprint of an algorithm.

Consider the recent Adam project for deep learning [9]. (Adam is technically not a dataflow system; however, current projects such as SparkNet [19] actively port deep learning to dataflow systems, so these systems can be expected to run into the problem of deep learning memory requirements in the near future.) This system is capable of training a neural network with 36 billion connections between neurons, where each connection is represented with a weight. Assuming an IEEE double representation of each weight, this amounts to 36 × 10^9 × 8 bytes of data, so 288 gigabytes are needed to represent only the parameters of a single model.

Machine learning practitioners recognize the need to use less space per parameter [10, 13, 16]. Instead of the IEEE 754 floating point format, they propose to use the Qn.m encoding (Q for quantization), which can significantly reduce the memory footprint with little loss in performance. This encoding represents a real number with n + m + 1 bits: n bits for the integral part, m bits for the fractional part, and one bit for the sign [13]. For example, in [13] the authors use the Q2.13 encoding in the training phase to halve the memory consumption of logistic regression compared with a model that represents parameters as floats. This saving costs only about 0.01% extra logistic loss in the testing phase.
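As a rough sketch of how such an encoding could work in software (our illustration; the paper does not prescribe an implementation, and a production encoder would need the rounding safeguards discussed in Section 5), a Qn.m value with n + m + 1 = 16 bits fits in a 16-bit integer:

    import numpy as np

    def to_qnm(values, n=2, m=13):
        # Qn.m: one sign bit, n integral bits, m fractional bits (16 bits total
        # here), so each parameter fits in an int16 instead of an 8-byte double.
        scale = 1 << m
        max_mag = (1 << (n + m)) - 1                    # 32767 for Q2.13
        q = np.clip(np.rint(values * scale), -max_mag - 1, max_mag)
        return q.astype(np.int16)

    def from_qnm(q, m=13):
        # Recover approximate doubles; the error per value is at most 2**-(m+1).
        return q.astype(np.float64) / (1 << m)

    weights = np.random.randn(1_000_000) * 0.1   # parameters cluster near zero [13]
    qpoint = to_qnm(weights)                     # 2 MB instead of 8 MB
    print(np.max(np.abs(weights - from_qnm(qpoint))))  # bounded by 2**-14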
2.2 Qpoints: Checkpoints in Qn.m Encoding
The robustness of ML to approximation hints that in many cases checkpointing the exact in-memory state may not be needed: an algorithm neither uses most of this memory footprint during normal operation, nor does it need the full state to recover after preemption. Since the Qn.m encoding can save a noticeable amount of memory without sacrificing the end performance, we suggest checkpointing this representation of ML model parameters instead of the standard representation with doubles. We define a qpoint as a checkpoint that employs the Qn.m encoding to compress and persist model parameters. For speed and convenience, data scientists can still use doubles while developing algorithms; a system should automatically compress parameters while checkpointing.

Qpoints come with a cost: unlike checkpoints, qpoints are not universally applicable. For qpoints to be beneficial, an ML algorithm should (1) have a large number of parameters of limited range, (2) be able to terminate from approximate state after recovery, and (3) lack additional consistency requirements.

The first condition means that the algorithm parameters do not use the full range of the IEEE 754 double data type. This condition holds for at least two classes of machine learning algorithms: generalized linear models and deep neural networks [10, 13, 16]. Generalized linear models are commonly used at scale for tasks like spam filtering or predicting ad click-through rates, e.g., [5, 14]. Deep neural networks are used for tasks such as visual object recognition, where human experts struggle to extract meaningful features from the input data [9].

The second condition selects algorithms that can reach the desired level of performance starting from approximate state. For both neural nets and linear models, a low precision parameter representation – about a quarter of the bits of the IEEE double format – is sufficient for training and running models and has little effect on the prediction accuracy [13, 16]. The guarantees, however, differ: linear models and convex optimizers in general provably reach the unique optimum from any approximation, while neural networks may converge to different solutions due to their non-convex nature.

The third condition rules out algorithms that require the approximate state to be consistent. An example of such an algorithm is PageRank. This algorithm can converge using approximate parameter weights, but it requires the weights to form a probability distribution, i.e., to sum up to one. Simply restoring the weights from a qpoint may be insufficient because, due to rounding issues, the sum of parameter values may deviate too far from one. Both neural nets and linear models have no such special consistency requirements.

Despite these restrictions, qpoints bring two significant gains. First, qpoints can reduce the checkpointing overhead transparently to a user without sacrificing fault tolerance. The standard way to decrease checkpointing cost is to adjust the frequency of checkpoints [22]. This strategy requires manual tuning from a user and has proved problematic in real world deployments of ML algorithms (e.g., [18]). Qpoints can reduce the time spent on fault tolerance by saving less data per checkpoint: data scientists can run their algorithms with some default qpoint frequency and gain sufficient fault tolerance without paying the cost of full checkpoints.

Second, qpoints enable exploring a failure model that, to our knowledge, is not currently discussed within the database community, namely a failure with a prior warning. Such failures correspond to preemption in clouds that notify a client about the upcoming preemption event. For instance, Google Cloud issues such a warning 30 seconds before the preemption [15]. After getting this notification, a client would naturally want to persist the current computation state. However, starting standard checkpointing at this moment may fail to meet the hard time limit because of the data volume to persist. Qpoints have a better chance of meeting the limit because they have less data to save. Qpoints also enable progressive checkpointing, that is, saving a coarse approximation of the parameters with as little data as possible and then gradually refining the approximation up until the preemption takes place.
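One possible reading of progressive checkpointing, sketched under the naive encoder above (our illustration; the refinement schedule and file layout are hypothetical):

    import time
    import numpy as np

    def progressive_qpoint(weights, deadline):
        # Persist increasingly fine Qn.m approximations until the preemption
        # deadline; every completed pass leaves a complete, usable qpoint behind.
        for m in (5, 9, 13):                      # coarse -> fine fractional bits
            if time.time() >= deadline:
                break
            q = np.rint(weights * (1 << m)).astype(np.int16)
            np.save(f"qpoint_m{m}.npy", q)        # each file is a full qpoint

    # e.g., the 30-second warning issued by Google Cloud before preemption [15]
    progressive_qpoint(np.random.randn(10_000_000) * 0.1, time.time() + 30)

A refinement could persist only the residual bits of each pass instead of a full array per pass; we omit such deltas for brevity.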
3. SMARTPOINTS
In Section 3.1, we outline the randomized weighted majority, the meta-algorithm we use to develop smartpoints. In Section 3.2, we describe smartpoints, our framework for building fault tolerance optimizers.

3.1 Optimal Prediction from Expert Advice
Consider the following situation: one has a pool of experts who need to predict future events, for example, whether the price of a single stock will go up or down the next day. Naturally, one would like to select the best expert, the one who makes the least number of mistakes, and follow their predictions. The problem is that it is not known beforehand which expert will perform best on a given sequence of future events, e.g., days. In such a setting, the randomized weighted majority algorithm (RWM) enables one to perform provably close to the best expert without any a priori knowledge.

The RWM assigns equal initial weights to all experts, each expert essentially being a function with the {0, 1} range (say, 0 means the price will go down and 1 means it will go up). The algorithm then proceeds in a sequence of trials. At each trial, the algorithm chooses an expert at random, with probability proportional to the current weight of the expert, and follows the prediction of this expert. Once the true answer is revealed (e.g., the price went down), the algorithm punishes all experts who predicted wrongly by multiplying their weights by the penalty β, 0 ≤ β < 1; the weights of the correct experts stay intact. By doing so, the algorithm decreases the probability of choosing a mistaken expert in the next trial. Intuitively, if an expert predicts wrongly, the algorithm trusts them less in the future. The RWM can guarantee the following property [6]:

Theorem 1. On any sequence of trials, the expected number of mistakes X made by the Randomized Weighted Majority algorithm satisfies

    X ≤ x + ln(y) + O(√(x · ln(y)))

where x is the number of mistakes of the expert who performed best so far, and y is the total number of experts.

So the expected number of mistakes of the algorithm is bounded by the number of mistakes of the expert from the pool who performs best on a given sequence of trials. Intuitively, the overall performance of the RWM is close to that of the best expert in the pool. The bound from Theorem 1 holds for the worst case of input data, without any probabilistic assumptions about the input or the experts [17].
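For concreteness, a compact Python rendering of the RWM as described above (our sketch; the two toy experts and the outcome sequence are made up):

    import random

    def rwm(experts, outcomes, beta=0.5):
        # Follow one expert chosen at random with probability proportional to
        # its weight; after the truth is revealed, multiply the weight of every
        # mistaken expert by beta (0 <= beta < 1).
        weights = [1.0] * len(experts)
        mistakes = 0
        for t, truth in enumerate(outcomes):
            predictions = [expert(t) for expert in experts]
            chosen = random.choices(range(len(experts)), weights=weights)[0]
            mistakes += predictions[chosen] != truth
            for i, p in enumerate(predictions):
                if p != truth:
                    weights[i] *= beta      # penalize every wrong expert
        return mistakes

    # toy run: expert 0 always predicts "up" (1), expert 1 always "down" (0)
    print(rwm([lambda t: 1, lambda t: 0], outcomes=[1, 1, 0, 1, 1, 0, 1, 1]))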
3.2 Smartpoints: Fault Tolerance via Randomized Weighted Majority
Fault tolerance mechanisms span a large spectrum ranging from usual checkpoints [3] to lineage [26] to less standard ideas such as qpoints. With that variety, choosing among the mechanisms becomes non-trivial even for expert users. Researchers are well aware of this problem: recent work has shown that automatically choosing the most suitable mechanism (e.g., checkpoints) and its placement (for instance, checkpointing each third operator) is possible and does improve performance [23, 25]. We observe, however, that current approaches to fault tolerance optimization, such as the ones discussed in [20, 24, 25], share common shortcomings.

First, they require significant implementation effort, often at the system runtime level. An example is the FTOpt optimizer [23], which requires a system to have a special acknowledgement protocol to track tuples' flow through the system; reimplementing this protocol would greatly complicate system design and increase development costs.

Second, current approaches tend to make assumptions that may be difficult to fulfill. For example, the optimizers proposed in [21, 23] depend on accurate cost estimates that are hard to obtain in the presence of user-defined functions [4].

In this research project, we propose an approach to building fault tolerance optimizers with the RWM that alleviates these problems. We define a fault tolerance policy to be a set of decisions on where to use which fault tolerance mechanism. Continuing our running example of preemption, a policy can be a heuristic that advises checkpointing before the end of each hour. To simplify terminology, we will henceforth use the term checkpoint to denote any fault tolerance mechanism, for instance, qpoints. Smartpoints as a framework should be capable of incorporating such mechanisms.

Intuitively, in our approach fault tolerance policies become experts who periodically vote, according to the rules of the RWM, on whether a system should checkpoint or not. For instance, the "checkpoint nothing" policy would always vote against checkpointing and rather rely on job re-execution to provide fault tolerance. The RWM ensures that the most suitable policy for a given environment will eventually retain most of the weight. For example, the "checkpoint nothing" policy should win under low preemption rates, because in such environments rare preemption events do not justify the cost of checkpointing.

Thus a dataflow system with RWM-based fault tolerance will eventually employ the fault tolerance policy most suitable for its particular cluster, without any user involvement, at the cost of a few initial mistakes. For brevity, we use the term smartpoints to refer to the idea of using the RWM to select the best fault tolerance policy.

Algorithm 1 describes smartpoints more formally. In this algorithm, fault tolerance policies effectively predict failures with their votes. That is, the decision to checkpoint can be rephrased as "the next operator will fail": if the next operator does not fail, we do not need to checkpoint. If one of the subsequent operators (e.g., the second next) fails, we would ideally like to checkpoint the immediate predecessor of the operator-to-fail to avoid re-executing any successful operators. Real failures become the true labels used by the RWM to penalize experts: if a policy voted to checkpoint, and a failure did not happen during the next operator, the algorithm reduces the weight of this policy. For the purposes of smartpoints, we define a RWM trial to consist of (a) the execution of a single operator, (b) a vote among checkpoint policies on whether the operator output should be checkpointed, and (c) observing whether a failure happens during the execution of the next operator.

Algorithm 1 Smartpoints
1: for each policy i do
2:     w_i = 1
3: for each operator t do
4:     use policy i's prediction with probability p_i = w_i / Σ_j w_j
5:     for each policy i do
6:         if policy i made a mistake then
7:             w_i = β · w_i
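The following toy simulation renders Algorithm 1 runnable (our sketch; the policy pool, the operator model, and the injected failure trace "fails_at" are hypothetical):

    import random

    def smartpoints(num_operators, policies, fails_at, beta=0.5):
        # A trial (see text): run an operator, let the weighted-random vote
        # decide whether to checkpoint its output, then observe whether the
        # next operator fails; a "checkpoint" vote predicts that it will.
        weights = [1.0] * len(policies)
        checkpointed = []
        for t in range(num_operators):
            votes = [policy(t) for policy in policies]   # True = "checkpoint now"
            chosen = random.choices(range(len(policies)), weights=weights)[0]
            if votes[chosen]:
                checkpointed.append(t)                   # persist operator t's output
            failure = (t + 1) in fails_at                # the true label of this trial
            for i, vote in enumerate(votes):
                if vote != failure:
                    weights[i] *= beta                   # penalize mistaken policies
        return checkpointed, weights

    # hypothetical pool: "checkpoint nothing", "checkpoint everything", "every 3rd"
    pool = [lambda t: False, lambda t: True, lambda t: t % 3 == 2]
    print(smartpoints(num_operators=20, policies=pool, fails_at={7, 15}))

With only two failures in twenty operators, the "checkpoint nothing" policy makes the fewest wrong predictions and ends with the highest weight, matching the low-preemption intuition above.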
This algorithm ensures three properties. First, due to the properties of the RWM (see Theorem 1), for a given pool of policies Algorithm 1 checkpoints in a way provably close to the policy optimal for a given cluster. In other words, assuming a well-designed pool of policies, smartpoints can automatically adapt to a wide range of cluster environments. Second, the algorithm makes very few assumptions: it does not require any specific knowledge about failure distributions, cluster size, or the previous history of a system. Instead, we propose to encode this domain-specific knowledge into fault tolerance policies unique to a particular system. Finally, the reuse of elements common in dataflow systems (checkpoints, blocking operators), combined with the simplicity of the algorithm itself, greatly reduces the implementation effort compared to existing fault tolerance optimizers. For example, nothing in the algorithm itself requires special support from the runtime.

4. RELATED WORK
Approximation in Machine Learning. The work of Bousquet and Bottou [7] provided the theoretical foundation for understanding this phenomenon; a series of recent papers [10, 13, 16] studied the effect of approximating parameter values with the Qn.m encoding on the prediction accuracy of deep neural networks and generalized linear models. We plan to piggyback on this approximation tolerance to decrease checkpoint size with qpoints.

Randomized Weighted Majority. Littlestone and Warmuth proposed the original idea and later summarized it in [17]; the follow-up work [8] conducted an extensive theoretical analysis and showed how to choose the penalty parameter β to minimize the expected number of mistakes of the algorithm. The paper by Blum [6] provides an overview of the key results in the area. We adapt these results to develop the smartpoints framework, which should yield a family of fault tolerance optimizers capable of intelligently choosing the optimal checkpoint policy at runtime.

Fault tolerance optimization for dataflows. The FTOpt optimizer proposed in [23] employs geometric programming to reduce the overhead of checkpoints. Smartpoints differ from it in three ways. First, they do not require a cost model and cost estimates for operators. Second, they do not require dedicated support from the runtime (FTOpt requires an ack protocol to track tuples). And third, smartpoints do not restrict job plans in any way besides forming a directed acyclic graph of operators (FTOpt handles only job plans with aggregations at the top). The more recent optimizer from [21] probabilistically models the likelihood and impact of failures using yet another cost model. Unlike [21], which assumes a Poisson distribution of failures, smartpoints do not make any assumptions about failure rates. Rather, they adjust to the actual failure rate at runtime by selecting the checkpoint policy optimal for that rate.
5. RESEARCH PLAN
We intend to implement and evaluate qpoints and smartpoints during the years 2016-2017. With qpoints, we have to address three key issues. First, current Qn.m encoding schemes rely either on custom hardware [10, 16] or on algorithms hand-crafted to represent parameters with Qn.m values [13]. To keep our approach general, we cannot assume such support and have to come up with a software encoder.

Second, to make qpoints handle failures with a prior warning, we have to meet the hard real-time requirements of the warning. Given the limited number of IO operations per second, the considerable memory footprint, and the software Qn.m encoder, qpointing in a timely manner requires finding a tradeoff between parameter memory usage and the accuracy of the final model.

Third, we have to avoid numerical issues. The Qn.m parameter representation needs (a) to have enough accuracy to represent the small parameter values or updates commonly encountered in real world machine learning deployments and (b) to avoid introducing bias into parameter values.
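One candidate remedy for (b) is randomized rounding in the spirit of [13]; a minimal sketch (our illustration, not a prescribed design) shows how it preserves small updates in expectation where round-to-nearest would silently drop them:

    import numpy as np

    rng = np.random.default_rng(0)

    def to_qnm_stochastic(values, m=13):
        # Round up with probability equal to the fractional remainder, so the
        # expected quantized value equals the input: no systematic bias.
        scaled = values * (1 << m)
        floor = np.floor(scaled)
        q = floor + (rng.random(values.shape) < (scaled - floor))
        return q.astype(np.int16)

    updates = np.full(100_000, 1e-5)     # gradient updates below Q2.13 resolution
    q = to_qnm_stochastic(updates)
    print(q.mean() / (1 << 13))          # ~1e-5 in expectation; plain rounding gives 0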
With smartpoints, we need to solve two challenges, namely (a) we have to adjust the Randomized Weighted Majority to handle the specific case of rollback recovery and (b) we have to preserve the RWM guarantees during this adjustment. With respect to modifications, three are absolutely necessary: designing the expert pool, adjusting the penalty rate β, and handling cases where persisting the data is compulsory, such as when a system cannot hold data in memory due to memory pressure.

Our baseline for both qpoints and smartpoints will be the fault tolerance policies commonly hard-coded into modern dataflow systems, such as "checkpoint everything" in Apache Hadoop. We will fix the placement and frequency of checkpoints, e.g., "checkpoint the last operator of each third iteration", and run jobs with such settings to get the baseline median (out of five identical runs) wall-clock job execution time. We will then strive to decrease this time with qpoints and smartpoints. With qpoints, the time should decrease because jobs will spend less time persisting qpoints than checkpoints due to the smaller qpoint size. With smartpoints, we expect a reduction in the execution time because the majority of jobs will checkpoint optimally for the cluster they run in, while a fixed checkpoint policy is likely to mismatch certain environments. For instance, jobs under low preemption rates should on average complete faster because smartpoints will automatically select the "checkpoint nothing" policy and remove the entire checkpoint overhead.

Acknowledgements. This work has been supported through the grant by the German Ministry for Education and Research as Berlin Big Data Center BBDC (ref. 01IS14013A).

6. REFERENCES
[1] RDD Persistence. spark.apache.org/docs/latest/programming-guide.html#rdd-persistence, 2015.
[2] Amazon Web Services. EC2 Spot Instance Termination Notices. aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/, 2015.
[3] Amazon Web Services. Managing Interruption. aws.amazon.com/ec2/spot/spot-tutorials/, 2015.
[4] S. Babu. Towards Automatic Optimization of MapReduce Programs. SoCC '10, pages 137–142.
[5] M. Bilenko and M. Richardson. Predictive Client-side Profiles for Personalized Advertising. KDD '11, pages 413–421.
[6] A. Blum. On-line Algorithms in Machine Learning. In Developments from a June 1996 Seminar on Online Algorithms: The State of the Art, pages 306–325, 1998.
[7] O. Bousquet et al. The Tradeoffs of Large Scale Learning. In Advances in Neural Information Processing Systems 20, pages 161–168, 2008.
[8] N. Cesa-Bianchi et al. How to Use Expert Advice. STOC '93, pages 382–391, 1993.
[9] T. Chilimbi et al. Project Adam: Building an Efficient and Scalable Deep Learning Training System. OSDI '14, pages 571–582.
[10] M. Courbariaux et al. Low Precision Arithmetic for Deep Learning. CoRR, abs/1412.7024, 2014.
[11] J. Dean. Lessons from Building Large Distributed Systems. www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf, 2009.
[12] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM, 51(1):107–113, 2008.
[13] D. Golovin et al. Large-Scale Learning with Less RAM via Randomization. ICML 2013, page 10.
[14] J. Goodman et al. Spam and the Ongoing Battle for the Inbox. Commun. ACM, 50(2):24–33, Feb. 2007.
[15] Google Cloud Platform. Creating a Preemptible VM Instance. https://cloud.google.com/compute/docs/instances/preemptible, 2015.
[16] S. Gupta et al. Deep Learning with Limited Numerical Precision. CoRR, abs/1502.02551, 2015.
[17] N. Littlestone et al. The Weighted Majority Algorithm. Information and Computation, 108(2):212–261, 1994.
[18] Y. Low et al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. Proc. VLDB Endow., 5(8):716–727, 2012.
[19] P. Moritz et al. SparkNet: Training Deep Networks in Spark. arXiv:1511.06051, 2016.
[20] M. Pundir et al. Zorro: Zero-Cost Reactive Failure Recovery in Distributed Graph Processing. SoCC '15, pages 195–208.
[21] A. Salama et al. Cost-based Fault-tolerance for Parallel Data Processing. SIGMOD '15, pages 285–297.
[22] S. Schelter et al. All Roads Lead to Rome: Optimistic Recovery for Distributed Iterative Data Processing. CIKM '13, pages 1919–1928.
[23] P. Upadhyaya et al. A Latency and Fault-Tolerance Optimizer for Online Parallel Query Plans. SIGMOD '11, pages 241–252.
[24] C. Xu et al. Efficient Fault-Tolerance for Iterative Graph Processing on Distributed Dataflow Systems. ICDE '16.
[25] S. Yi et al. Monetary Cost-Aware Checkpointing and Migration on Amazon Cloud Spot Instances. IEEE Transactions on Services Computing, 5(4):512–524, 2012.
[26] M. Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI '12, pages 2–2.