=Paper= {{Paper |id=None |storemode=property |title=Adaptive Selective Replication for Complex Event Processing Systems |pdfUrl=https://ceur-ws.org/Vol-1018/paper4.pdf |volume=Vol-1018 |dblpUrl=https://dblp.org/rec/conf/vldb/GrunbergerHF13 }} ==Adaptive Selective Replication for Complex Event Processing Systems== https://ceur-ws.org/Vol-1018/paper4.pdf
          Adaptive Selective Replication for Complex Event
                        Processing Systems
                                                         Vision Paper
          Franz Josef Grüneberger                      Thomas Heinze                               Pascal Felber
                       SAP AG                               SAP AG                              University of Neuchâtel
                 Chemnitzer Str. 48                    Chemnitzer Str. 48
             01187 Dresden, Germany                 01187 Dresden, Germany                            Switzerland
          franz.josef.grueneberger@sap.com           thomas.heinze@sap.com                       pascal.felber@unine.ch

ABSTRACT                                                                almost zero latency, we focus on active replication. How-
As of today, active replication is used in complex event pro-           ever, actively replicating a system requires at least twice the
cessing systems to enable near zero latency take over in case of        resources. Therefore, this paper envisions techniques and
host failures. Moreover, elastic complex event processing sys-          outlines the major challenges for an elastic fault-tolerant com-
tems adapt their resource consumption to the actual system              plex event processing system that achieves high availability
load. However, active replication is a coarse-grained approach          via adaptive selective replication. Specifically:
demanding the duplication of all used resources. Therefore,                  1. We present an approach that leverages spare resources
we envision a system adopting adaptive fine-grained repli-                      to increase the availability of queries by replicating
cation strategies allowing to trade off availability and used                   selected parts, i. e. operators, of the running system.
resources.                                                                      Specifically, we present different strategies to select op-
                                                                                erators for replication. Moreover, striving for maximal
                                                                                availability, we introduce different placement strategies
1.   INTRODUCTION                                                               to deploy operators on hosts.
   The dissemination of high frequency event sources raises
the demand to extract information from high velocity data                    2. We envision an optimization component supporting the
streams. Prevalent application domains comprise automatic                       replication of queries to hit a certain availability target
stock trading, credit card fraud detection, automated home-                     while adhering to resource constraints. Moreover, the
care, as well as scientific experiments, logistics, and telecom-                component should assist minimizing the resource usage
munication. To process high velocity data (> 10.000 events                      for a certain availability goal.
per second) with low latency (< 100 ms), a new class of ap-
plications enabling the efficient analysis of data in real-time              3. We explore the challenges arising from replication in
has ermerged. Prominent examples of such complex event                          elastic distributed complex event processing systems,
processing systems comprise Borealis [1], IBM System S [9],                     where both queries and hosts are dynamically added
and Yahoo! S4 [17].                                                             to and removed from the system.
   Distributed complex event processing systems spanning                   The remainder of this paper is structured as follows: Sec-
thousands of hosts cope with very high data rates and exten-            tion 2 introduces the assumed system model. Section 3
sive computations. However, the error probability increases             presents an approach leveraging spare resources to improve
with the number of components in a system. Because in data              the availability of queries. In Section 4 we propose an opti-
centers the probability for a single host to fail at least once         mization component to minimize the resource consumption
a year is between 4 and 8 percent [6, 20] and distributed               for a certain availability target. The challenges arising from
complex event processing systems execute all computations               an application of the approaches in an elastic system are out-
in memory, fault tolerance techniques have to be leveraged              lined in Section 5. Related research is discussed in Section 6.
to circumvent unrecoverable data loss.                                  Finally, Section 7 concludes the paper.
   Various fault tolerance techniques like active replication [2]
and check pointing [11, 14] have been studied in the context
of complex event processing systems. To ensure failover with            2.     SYSTEM MODEL
                                                                        2.1      Query Model
                                                                          Queries in the system are continuous queries that can be
                                                                        added and removed at any point in time. As opposed to
                                                                        one-shot queries that are sent to the system and then produce
                                                                        a result once, continuous queries remain in the system for a
                                                                        certain period of time and produce results continuously. Let
                                                                        Q = {q1 , q2 , . . . , ql } be the set of queries in the system.
                                                                          Queries are specified by the user in an event processing
                                                                        language (EPL). In the system queries are represented as


                                                                    1
directed acyclic graphs (DAGs). Nodes represent operators               2.3    Failure Model
that are connected by unidirectional edges. A query compiler               According to the query model the system comprises several
transforms queries specified in EPL into a query graph.                 operators (timed processes) that are executed on a set of
   Each operator has one or two inputs and multiple outputs.            hosts. We assume that each host has a probability p to fail
Operators with two inputs are referred to as binary operators.          with a crash stop failure. Crash stop failures result in an
Each operator has a type defining its basic behaviour: source,          immediate crash of all operators placed on the specific host.
projection, filter, aggregation, sequence, join, and destination.       Moreover, we assume that Byzantine (value) errors caused
Besides a type all operators, except sources and sinks, have            by erroneous executions can be transformed into crash stop
a predicate refining its functionality. For example in case of          failures, e. g. by means of Software Encoded Processing [8].
a filter operator the predicate specifies the filter condition.         In the following we consider the time period of one day. Due
   The following example query calculates the average price             to the fact that in modern data centers the mean time to
of the SAP stock over the last 10 minutes.                              repair (MTTR) can be up to two days [6], we assume that
INSERT INTO outStream                                                   crashed hosts will stay down for the whole day. Moreover,
   SELECT compName, avg(tick)                                           host crashes are assumed to be uncorrelated, i. e. if some
   FROM tickStream WITHIN 600 seconds                                   hosts fail others will remain running.
   GROUP BY compName                                                       A query is considered broken, iff for at least one of its
   WHERE compName = "SAP";                                              operators neither a primary operator nor an operator replica
                                                                        is executed anymore.
   The corresponding query graph, which is depicted in Fig-
ure 2, contains a source, filter, aggregation, and destination          2.4    Load Model
operator.                                                                  We assume a load model, which is based on work in the
   To minimize the number of operators in the system, all               context of the data streaming system Borealis [23] and by
queries are optimized via a query optimization component,               Viglas et al. [19]. Each operator has one or two inputs and
which analyses the query graph of already running queries               multiple outputs that are associated with a certain event
with respect to reusable operators leveraging incremental               rate. We assume that the input event rate for source opera-
multi query optimization (MQO) techniques [12]. The opti-               tors is given and the output event rate for source operators
mizer maintains an internal global query graph subsuming                equals the input event rate. Each operator op has a certain
all currently deployed queries. When adding a new query,                load load(op). A load of 1.0 represents 100 % CPU usage
reusable parts are discovered using a breadth-first search              in a fixed time interval - usually one second. This operator
on the global query graph. Figure 1 shows an example for                load is calculated as product of the input event rate of an
multi query optimization with two queries. Operators of                 operator multiplied by the cost c, which describes the time
the already running query are depicted in the upper lane,               required by the operator to process a single tuple.
whereas operators for the newly added query are depicted in                For the example query depicted in Figure 2 the load of the
the lower lane. Assuming that the same operator name indi-              different operators can be specified as load(S1 ) = r1 ∗ c(S1 ),
cates the same operator type and predicate, the operators S1            load(F1 ) = r1 ∗ c(F1 ), load(A1 ) = r2 ∗ c(A1 ), load(D1 ) =
and F1 are part of both queries. Instead of redeploying this            r3 ∗ c(D1 ). Since each operator is associated with a se-
operators, they are reused from the already running query,              lectivity value, input event rates of operators can be cal-
which is illustrated via the grey box.                                  culated as linear combination of source stream rates and
                                                                        operator selectivity of predecessor operators. For example
                 S1     F1    A1     D1
                                           Running                      the input event rate r3 of operator D1 can be expressed as
                                             Query
                                                                        r3 = sel(A1 ) ∗ r2 = sel(A1 ) ∗ (sel(F1 ) ∗ r1 ).
                             merge
                                             New                                             r1        r2        r3
                 S1     F1    A2     D2     Query                                      S1         F1        A1        D1


  Figure 1: Incremental multi query optimization                                    Figure 2: Example query graph

                                                                           Moreover, each operator has incoming (netin ) and outgo-
2.2    Replication Model                                                ing (netout ) network traffic. The incoming traffic netin (op)
   To ensure near zero latency takeover, we assume active               is calculated as product of input rate and input tuple size
replication in the system. Each operator can have multiple              of the operator, whereas the outgoing traffic netout (op) is
replicas exhibiting exact the same behaviour as the primary             calculated by the output rate multiplied with the output
operator. However, this approach results in the consumption             tuple size.
of additional resources.                                                   For the sake of convenience in the remainder of this paper
   Query operators and replicas are deployed independently              the term load is used interchangeably with CPU resources.
on available hosts. The current placement of operators on
hosts is described via a placement function plc : O → H,                3.    HEURISTIC REPLICATION
where O = {o1 , o2 , . . . , om } is the set of operators for all         Complex event processing systems are exposed to a vary-
queries in the system and H = {h1 , h2 , . . . , hn } denotes the       ing workload caused by different event rates as well as the
set of used hosts.                                                      addition and removal of queries. Figure 3 sketches a fictitious
                                                                        workload of a system processing queries for traffic monitoring.
                                                                        Peak loads indicate rush hour traffic. To deliver results for
                                                                        the queries in real-time, the system has to be equipped with


                                                                    2
enough resources to handle those peak loads. However, since                   multiple queries more important. The calculation of the
the peak load can only be estimated, phases of underpro-                      query out degree heuristic can be expressed as
visioning may occur, which are indicated by the dark gray
                                                                                                                  deg(op)
areas. Moreover, this leads to phases of overprovisioning                                        impQOD (op) =
indicated by the dotted areas.                                                                                      |Q|
                                                                              where deg(op) is the number of queries leveraging the op-
                     70
                                                      Underprovisioning       erator op, and |Q| is the number of queries running in the
                     60                                                       system.
        Used Hosts




                     50            Overprovisioning
                     40                                                       3.1.2    Dynamic Heuristics
                     30
                                                                                 Dynamic heuristics assess runtime properties of operators
                     20
                     10
                                                                              and are, thus, dependent on models providing estimations
                      0                                                       if an operator is deployed for the first time. If for example
                                                                              the required resources are considered, the heuristics are de-
                                          Time
                                                                              pendent on the load model of the system (see Section 2.4).
                                                                              One example for a dynamic heuristic is the low utilization
Figure 3: Workload of a data stream processing sys-                           heuristic that tries to replicate as much operators as possible.
tem                                                                           Therefore, operators with a small load are preferred. The
                                                                              calculation of the heuristic can be expressed as follows:
   Heuristic replication targets the usage of such spare re-
                                                                                                                 load(op)
sources in distributed complex event processing systems to                                  impLU (op) = 1 −
increase the availability of queries. Assume a system that                                                     max(load(O))
comprises a set of n hosts H = {h1 , h2 , . . . , hn } with an over-          where max(load(O)) is the maximum operator load among
all CPU capacity of loadcap = n ∗ hostcap . Query operators                   all operators currently running in the system.
are placed according to an operator to host mapping plc.
The unused CPU resources loadrem can be calculated as                         3.1.3    Query Level Heuristics
                                   X                                             The heuristics introduced so far operate on an operator
             loadrem = loadcap −          load(op)
                                                                              level, i. e. operators are treated independent of each other.
                                                 op∈O
                                                                              Hence, operator-level heuristics might cause the selection of
Since queries can be added to and removed from the system                     only most but not all operators for a query. Not selected
at runtime the amount of remaining resources usable to im-                    operators are weak spots and decrease the achievable avail-
prove the availability of queries in the system changes over                  ability for the query tremendously. Therefore, we propose to
time. If the remaining load is smaller than the load of all                   combine operator importance values into query importance
queries in the system, not all operators can be replicated                    values, which are expressed using a normalized importance
and, thus, a subset of operators has to be selected for repli-                function for queries impQheur : Q → [0, 1]. Operator selec-
cation. Thus, we propose a three step approach. First, all                    tion algorithms operating on query level heuristics will ensure
operators are rated according to their potential to improve                   that once a query was selected for replication all operators
query availability by means of an operator importance heuris-                 of that query are replicated. Only if not enough resources
tic (see Section 3.1). Second, the set of operators as well as                are available to replicate a full query, single operators would
the replication level, i. e. the number of replica instances,                 be selected.
is determined for each operator using an operator selection
algorithm (see Section 3.2). Third, the replica instances
                                                                              3.2     Operator Selection Strategies
are deployed on hosts according to an operator placement                        Operator selection strategies take care of the actual op-
strategy (see Section 3.3). Finally, in Section 3.4 we present                erator selection based on the operator importance and the
an evaluation.                                                                remaining load loadrem in the system. The calculated opera-
                                                                              tor selection can be described as function sel : O → Z, which
3.1     Operator Importance Heuristics                                        associates each operator with a number of replicas referred
                                                                              to as replication level rep(op).
  The potential of operators to improve the availability of
queries is rated using a normalized importance function                       3.2.1    Simple Operator Selection
impheur : O → [0, 1] that associates each operator in the
                                                                                A simple operator selection strategy sorts all operators
system with a normalized importance value. Depending on
                                                                              descending based on their relative importance defined via
the source of information used for calculating this importance
                                                                              impheur . Afterwards operators are selected for replication
value, static and dynamic heuristics can be distinguished.
                                                                              until all spare resources loadrem are consumed. Selecting
                                                                              operators for replication includes the determination of the
3.1.1                Static Heuristics                                        replication level, i. e. the number of replicas to create for
  Static heuristics calculate the importance of operators                     a single operator. Different selection procedures can be
based on properties observable in the global query graph.                     established:
Due to the multi query optimization operators can be reused
by multiple queries. Hence, if a reused operator fails, all                     1. All operators can be selected for replication at least
depending queries crash. One possible heuristic, referred to                       once if enough remaining resources are available. If
as query out degree heuristic, is based on this observation                        afterwards spare resources are still available, the repli-
and rates the importance of operators that are reused by                           cation level for the already selected operators can be


                                                                          3
      increased stepwise. This procedure ensures that each                                             100




                                                                                 Running Queries (%)
                                                                                                              ●
                                                                                                                   ●

      operator is replicated at least once, if enough resources                                                          ●

                                                                                                                               ●

      are available.                                                                                    98
                                                                                                                                             ●



  2. A replication level larger than one might be set di-                                               96
                                                                                                                         100%                               ●


     rectly for an operator. Hence, some operators might                                                                 Query Out Low Utilization (50%)
     be excluded from replication, if already all resources                                             94               Low Utilization (50%)
                                                                                                                         Query Out (50%)
     are consumed.                                                                                                 ●
                                                                                                                         Random (50%)
                                                                                                        92
3.2.2    Optimized Operator Selection                                                                        0.0        0.5   1.0          2.0             3.0
                                                                                                                       Host Crash Probability (%)
   The simple operator selection strategy selects operators
for replication based on either a static or dynamic heuristic.
Thus, only one type of available information is incorporated          Figure 4: Comparison of query availability for differ-
at a time and the resulting selection of operators can lead to        ent heuristics
a non-optimal availability of queries. Therefore, we propose
to augment the selection process based on a static heuristic
                                                                      different operators as possible are placed on one host. Thus,
with runtime information by modeling the operator selection
                                                                      multiple replicas crash in case of a host failure.
problem as 0-1 knapsack problem [16].
                                                                         Therefore, we propose to use a bin packing algorithm with
   Each operator opi ∈ O has a value vi = impheur (opi )
                                                                      color constraints [5]. Besides load and network consumption,
and a weight wi = load(opi ). The maximum weight of the
                                                                      each replica will be associated with a unique color. The col-
bag equals the remaining resources loadrem in the system.
                                                                      ored bin packing algorithm ensures that each host contains
   Because the 0-1 knapsack problem is NP-hard and load
                                                                      only replicas with at most c distinct colors, where c is a
values of operators and, thus, the optimal solution changes
                                                                      pre-defined positive integer. Because the number of repli-
continuously, we suggest an approximation by means of a
                                                                      cas placed on the same host shall be minimized, an upper
combination of static and dynamic heuristics. The optimiza-
                                                                      bound C with c ≤ C is specified for the color constraint.
tion is based on the intuition that operators with the greatest
                                                                      Then the placement problem can be formulated with an
profit per weight unit have to be selected first. Thus, we
                                                                      additional constraint that strives to minimize the parame-
propose to leverage a product of the query out degree and
                                                                      ter c. If no placement can be found without violating the
low utilization heuristic
                                                                      upper bound C, the simple bin packing algorithm is used to
        impQOD∗LU (op) = impQOD (op) ∗ impLU (op)                     calculate a placement.

as basis for the operator selection.                                  3.4    Evaluation
                                                                         We have performed a preliminary simulation-based evalua-
3.3     Fault-tolerant Operator Placement                             tion of the proposed approaches. We used a query generator
  Operators selected for replication via the operator selection       to generate queries based on six query templates that differ in
algorithm have to be deployed on hosts. However, depending            structure and operator parameters. Operators were deployed
on the chosen operator placement for the same set of selected         on simulated hosts with a CPU capacity of loadcap = 1. We
operators different availabilities can result.                        assume a system comprising 100 hosts. Moreover, operators
                                                                      were replicated at most once. To guarantee statistical cor-
3.3.1    Simple Bin Packing                                           rectness, 1000 runs were conducted for each experiment and
   A bin packing algorithm [4] for fault-tolerant operator            values have been averaged.
placement minimizes the number of used hosts, so that idling             Figure 4 depicts the percentage of remaining running
hosts can be turned off to save energy. This property is in           queries after a time period of one day assuming different
line with the notion of elasticity. We propose a bin packing          host failure probabilities p = {0.0025; 0.005; 0.01; 0.02; 0.03}
algorithm, which is an extended version of a first-fit bin            and different heuristics for rating the importance of opera-
packing, which has a complexity of O(mn), where m is the              tors. Because a constant system load load(O) = 100 was
number of replicas that shall be placed and n the number of           generated, the stated percentages of resources available for
hosts.                                                                replication are equal to the actual remaining load loadrem
   Each host is modeled as bin, where the available CPU               in the system. Leveraging either the low utilization heuris-
resources form the capacity. Replicas are first sorted in             tic or the query out degree heuristic to select operators for
decreasing order according to their normalized importance             replication improves the percentage of remaining running
and then assigned using their load as weight. Moreover, two           queries by approximately 1.7 percentage points compared to
constraints are ensured: (i) the network capacity of a host           a random operator selection assuming a host failure proba-
should not be exceeded, (ii) two replicas of the same operator        bility of p = 0.02 and loadrem = 50. Using a combination
are never placed on the same host. To reduce the consumed             of both heuristics improves the query availability further by
network bandwidth, possible hosts are ordered according to            0.4 percentage points, resulting in 99.2 % remaining running
a neighboring factor representing the amount of predecessor           queries.
and successor operators deployed on the same host.                       Figure 5 shows the availability increase as a function of
                                                                      available resources for replication. A combination of query
3.3.2    Colored Bin Packing                                          out degree and low utilization heuristic results in 99.2 %
  Even though the simple bin packing algorithm is replica-            remaining running queries, if a host failure probability of
aware due to the additional placement constraint, the system          p = 0.02 and loadrem = 50 is assumed. Moreover, the
availability is not maximized because as many replicas of             achievable percentage of running queries is only diminished


                                                                  4
                                  100
                                                                                         5.     DYNAMIC REPLICATION


            Running Queries (%)
                                         ●

                                              ●

                                                    ●                                       Elastic data stream processing systems are able to cope
                                   98                      ●                             with varying query as well as event load. Mechanisms to dy-
                                                    100%                                 namically allocate and release hosts depending on the actual
                                   96
                                                    90%             ●
                                                                                         demand prevent costly overprovisioning and performance
                                                    50%
                                   94               10%
                                                                                         barriers due to underprovisioning. Ideally the system can
                                                    1%                          ●
                                                                                         scale out indefinitely to serve high event rates and scale in
                                              ●
                                                    0%                                   to lower the used resources in case of low utilization.
                                   92
                                        0.0        0.5   1.0       2.0         3.0          Processing queries in an elastic environment poses various
                                                  Host Crash Probability (%)             challenges. Once new queries are submitted to the system,
                                                                                         the encompassed operators have to be distributed to the
                                                                                         running hosts. Because of operator reuse the load of already
Figure 5: Query availability for combination of query
                                                                                         deployed operators changes. To not impair the performance
out degree and low utilization heuristic
                                                                                         of the system two reactive actions are taken: (i) overloaded
                                                                                         hosts might be relieved by automatically migrating operators
by approximately one percentage point compared to full                                   from one host to another, (ii) overload situations that cannot
replication, if 50 % less spare resources are used for replication                       be resolved by moving operators from one host to another
and the host failure probability is p = 0.03. This result can                            are resolved by splitting the operator into several operator
be explained by the fact that with loadrem = 50 available for                            instances that handle only a portion of the overall load and
replication still 80 % of the query operators are replicated.                            can be deployed independently.
Moreover, those operators are reused to a large extent or                                   However, handling replicas in a dynamic system is demand-
have small cost.                                                                         ing:
                                                                                              • The complexity of the reconfiguration caused by the
4.     REPLICATION OPTIMIZER                                                                    exoneration of overloaded hosts is increased since addi-
  Service level agreements are important in scenarios where                                     tional communication channels for replicas have to be
a certain availability is required. However, achieving a higher                                 maintained.
availability by replicating more operators requires more re-
sources. Therefore, the solicited availability and the resulting                              • If operators are split into several instances, all replicas
resource consumption should be traded off. Replication of                                       have to be split too in order to maintain a consistent
operators in a system can be optimized with respect to two                                      system state.
different optimization goals:
                                                                                              • Systems that are used in conjunction with heuristic
     1. For a given resource limit costmax , the availability of                                replication (see Section 3), have to decide in case of
        queries A(Q) can be maximized.                                                          spare resources whether to release hosts or to create
                                                                                                additional replicas.
     2. For a given availability of queries A(Q), the resource
        consumption cost(Q) can be minimized.                                                 • Reconfiguration routines have to ensure that the new
                                                                                                placement does not violate existing service level con-
  The achieved availability of queries is influenced by the set
                                                                                                straints for the queries (see Section 4).
of operators selected for replication, the number of replicas
that is created for each of the selected operators, as well
as the placement of replicas on hosts. All those influential                             6.     RELATED WORK
parameters are reflected in the placement plc and, thus, the                                Fault tolerance techniques for data stream processing sys-
placement can serve as predictor for the achievable availability.                        tems like active replication [2], check pointing [14], and a
To optimize the replication decisions according to the two                               combination of active and passive replication [24], are key
specified optimization goals, additional models have to be                               enablers for our proposed approaches.
incorporated. To estimate the costs for a placement, the load                               Moreover, operator placement algorithms are leveraged,
model for query operators can be leveraged. Moreover, an                                 which have been studies extensively by various authors. A
availability model is required that enables the estimation of                            survey is available in [15]. Repantis et al. [18] presented
the query availability resulting from a certain placement.                               a procedure for replica placement considering performance
  Assuming the availability of those models, the optimization                            constraints like end-to-end latency. The ZEN system [3]
problems can be formulated as follows:                                                   models different levels of availability in a systems and tries
                                                                                         to assign most important operators to hosts with the highest
          max{A(plc) | plc ∈ P; cost(plc) ≤ costmax }
                                                                                         availability. Another replication scheme based on graph
and                                                                                      coloring is presented in [22].
                                                                                            Optimization in the area of data stream processing systems
            min{cost(plc) | plc ∈ P; A(plc) ≥ A(Q)}
                                                                                         is done for example to achieve an optimal overall utiliza-
where P is the set of all possible placements resulting from                             tion [21], or optimal utilization with limited resources [13].
different operator selections, different replication levels, as                             An approach related to that in this paper has been studied
well as different placement strategies.                                                  by Fernandez et al. in [7]. The authors present an integrated
  Those optimization problems can be solved using well-                                  approach to scale out streaming systems while achieving fault
known optimization algorithms like genetic search [10]. How-                             tolerance via check pointing. In this paper we focus, however,
ever, to lower the effort for the optimization, heuristics to                            on active replication to ensure minimal latency and envision
restrict the search space have to be developed.                                          also scale in.


                                                                                     5
7.   SUMMARY                                                                 Data (SIGMOD), New York, NY, 06/2013 2013. ACM,
   As of today, many data stream processing systems use                      ACM.
replication to ensure high availability in case of host failures.        [8] C. Fetzer, U. Schiffel, and M. Süßkraut. An-encoding
However, to replicate a system completely, at least twice the                compiler: Building safety-critical systems with
resources have to be allocated, which is costly.                             commodity hardware. In SAFECOMP, pages 283–296,
   We proposed a heuristic replication approach enabling the                 2009.
use of remaining system resources to increase the availabil-             [9] B. Gedik, H. Andrade, K.-L. Wu, P. S. Yu, and
ity of queries. An operator selection algorithm is used to                   M. Doo. SPADE: the system s declarative stream
determine a subset of operators for replication that are then                processing engine. In SIGMOD Conference, pages
placed via an operator placement algorithm. Moreover, we                     1123–1134, 2008.
suggested a replication optimizer which allows users to guide           [10] D. E. Goldberg. Genetic Algorithms. Pearson
the replication while trading off cost and availability. Finally,            Education, 2013.
the challenges arising from a combination of these techniques           [11] J.-H. Hwang, Y. Xing, U. Çetintemel, and S. B.
with elastic data stream processing systems were discussed.                  Zdonik. A Cooperative, Self-Configuring
Using heuristic replication as well as the replication opti-                 High-Availability Solution for Stream Processing. In
mizer with a system reacting on changes, e. g. in event rate                 ICDE, pages 176–185, 2007.
and selectivities, demands the adaptation of the placement              [12] C. Jin and J. G. Carbonell. Predicate Indexing for
routines. However, the normalized importance functions as                    Incremental Multi-Query Optimization. In ISMIS,
well as the optimization routines might be reused unchanged.                 pages 339–350, 2008.
   To validate the approaches we simulated a heuristic repli-           [13] E. Kalyvianaki, W. Wiesemann, Q. H. Vu, D. Kuhn,
cation approach comprising operator selection as well as                     and P. Pietzuch. Sqpr: Stream query planning with
operator placement strategies. Given a set of remaining                      reuse. In ICDE, pages 840–851, 2011.
resources, the fault tolerance of complex event processing              [14] Y. Kwon, M. Balazinska, and A. G. Greenberg.
systems is improved. Choosing a proper heuristic can im-
                                                                             Fault-tolerant stream processing using a distributed,
prove the availability two percentage points compared to a                   replicated file system. PVLDB, 1(1):574–585, 2008.
random operator selection. Compared to full replication only
                                                                        [15] G. T. Lakshmanan, Y. Li, and R. E. Strom.
one percentage point is lost spending 50 % less resources for
                                                                             Placement strategies for internet-scale data stream
replication.
                                                                             systems. IEEE Internet Computing, 12(6):50–60, 2008.
   In the future, we plan to implement the proposed ap-
proaches with a state-of-the-art streaming system.                      [16] S. Martello and P. Toth. Knapsack problems:
                                                                             algorithms and computer implementations.
                                                                             Wiley-Interscience series in discrete mathematics and
8.   REFERENCES                                                              optimization. J. Wiley & Sons, 1990.
                                                                        [17] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4:
 [1] D. J. Abadi, Y. Ahmad, M. Balazinska, U. Çetintemel,                   Distributed Stream Computing Platform. In ICDM
     M. Cherniack, J.-H. Hwang, W. Lindner, A. Maskey,                       Workshops, pages 170–177, 2010.
     A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. B.                [18] T. Repantis and V. Kalogeraki. Replica placement for
     Zdonik. The design of the Borealis stream processing                    high availability in distributed stream processing
     engine. In CIDR, pages 277–289, 2005.                                   systems. In DEBS, pages 181–192, 2008.
 [2] M. Balazinska, H. Balakrishnan, S. Madden, and                     [19] S. Viglas and J. F. Naughton. Rate-based query
     M. Stonebraker. Fault-tolerance in the borealis                         optimization for streaming information sources. In
     distributed stream processing system. ACM Trans.                        SIGMOD Conference, pages 37–48, 2002.
     Database Syst., 33(1), 2008.                                       [20] K. V. Vishwanath and N. Nagappan. Characterizing
 [3] N. Bansal, R. Bhagwan, N. Jain, Y. Park, D. S.                          cloud computing hardware reliability. In SoCC, pages
     Turaga, and C. Venkatramani. Towards optimal                            193–204, 2010.
     resource allocation in partial-fault tolerant applications.        [21] J. L. Wolf, N. Bansal, K. Hildrum, S. Parekh,
     In INFOCOM, pages 1319–1327, 2008.                                      D. Rajan, R. Wagle, K.-L. Wu, and L. Fleischer. Soda:
 [4] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson.                     An optimizing scheduler for large-scale stream-based
     Approximation algorithms for np-hard problems.                          distributed computer systems. In Middleware, pages
     chapter Approximation algorithms for bin packing: a                     306–325, 2008.
     survey, pages 46–93. PWS Publishing Co., Boston, MA,               [22] F. Xiao, T. Kitasuka, and M. Aritsugi. Economical and
     USA, 1997.                                                              fault-tolerant load balancing in distributed stream
 [5] M. Dawande, J. Kalagnanam, and J. Sethuraman.                           processing systems. IEICE Transactions,
     Variable sized bin packing with color constraints.                      95-D(4):1062–1073, 2012.
     Electronic Notes in Discrete Mathematics, 7:154–157,               [23] Y. Xing, J.-H. Hwang, U. Çetintemel, and S. B.
     2001.                                                                   Zdonik. Providing resiliency to load variations in
 [6] J. Dean. Handling Large Datasets at Google: Current                     distributed stream processing. In VLDB, pages
     Systems and Future Directions. In Data-Intensive                        775–786, 2006.
     Computing Symposium, 2008.                                         [24] Z. Zhang, Y. Gu, F. Ye, H. Yang, M. Kim, H. Lei, and
 [7] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and                    Z. Liu. A hybrid approach to high availability in stream
     P. Pietzuch. Integrating scale out and fault tolerance in               processing systems. In ICDCS, pages 138–148, 2010.
     stream processing using operator state management. In
     ACM International Conference on Management of


                                                                    6