=Paper= {{Paper |id=None |storemode=property |title=Adaptive Selective Replication for Complex Event Processing Systems |pdfUrl=https://ceur-ws.org/Vol-1018/paper4.pdf |volume=Vol-1018 |dblpUrl=https://dblp.org/rec/conf/vldb/GrunbergerHF13 }} ==Adaptive Selective Replication for Complex Event Processing Systems== https://ceur-ws.org/Vol-1018/paper4.pdf

Adaptive Selective Replication for Complex Event
Processing Systems
Vision Paper
Franz Josef Grüneberger Thomas Heinze Pascal Felber
SAP AG SAP AG University of Neuchâtel
Chemnitzer Str. 48 Chemnitzer Str. 48
01187 Dresden, Germany 01187 Dresden, Germany Switzerland
franz.josef.grueneberger@sap.com thomas.heinze@sap.com pascal.felber@unine.ch

ABSTRACT almost zero latency, we focus on active replication. How-
As of today, active replication is used in complex event pro- ever, actively replicating a system requires at least twice the
cessing systems to enable near zero latency take over in case of resources. Therefore, this paper envisions techniques and
host failures. Moreover, elastic complex event processing sys- outlines the major challenges for an elastic fault-tolerant com-
tems adapt their resource consumption to the actual system plex event processing system that achieves high availability
load. However, active replication is a coarse-grained approach via adaptive selective replication. Specifically:
demanding the duplication of all used resources. Therefore, 1. We present an approach that leverages spare resources
we envision a system adopting adaptive fine-grained repli- to increase the availability of queries by replicating
cation strategies allowing to trade off availability and used selected parts, i. e. operators, of the running system.
resources. Specifically, we present different strategies to select op-
erators for replication. Moreover, striving for maximal
availability, we introduce different placement strategies
1. INTRODUCTION to deploy operators on hosts.
The dissemination of high frequency event sources raises
the demand to extract information from high velocity data 2. We envision an optimization component supporting the
streams. Prevalent application domains comprise automatic replication of queries to hit a certain availability target
stock trading, credit card fraud detection, automated home- while adhering to resource constraints. Moreover, the
care, as well as scientific experiments, logistics, and telecom- component should assist minimizing the resource usage
munication. To process high velocity data (> 10.000 events for a certain availability goal.
per second) with low latency (< 100 ms), a new class of ap-
plications enabling the efficient analysis of data in real-time 3. We explore the challenges arising from replication in
has ermerged. Prominent examples of such complex event elastic distributed complex event processing systems,
processing systems comprise Borealis [1], IBM System S [9], where both queries and hosts are dynamically added
and Yahoo! S4 [17]. to and removed from the system.
Distributed complex event processing systems spanning The remainder of this paper is structured as follows: Sec-
thousands of hosts cope with very high data rates and exten- tion 2 introduces the assumed system model. Section 3
sive computations. However, the error probability increases presents an approach leveraging spare resources to improve
with the number of components in a system. Because in data the availability of queries. In Section 4 we propose an opti-
centers the probability for a single host to fail at least once mization component to minimize the resource consumption
a year is between 4 and 8 percent [6, 20] and distributed for a certain availability target. The challenges arising from
complex event processing systems execute all computations an application of the approaches in an elastic system are out-
in memory, fault tolerance techniques have to be leveraged lined in Section 5. Related research is discussed in Section 6.
to circumvent unrecoverable data loss. Finally, Section 7 concludes the paper.
Various fault tolerance techniques like active replication [2]
and check pointing [11, 14] have been studied in the context
of complex event processing systems. To ensure failover with 2. SYSTEM MODEL
2.1 Query Model
Queries in the system are continuous queries that can be
added and removed at any point in time. As opposed to
one-shot queries that are sent to the system and then produce
a result once, continuous queries remain in the system for a
certain period of time and produce results continuously. Let
Q = {q1 , q2 , . . . , ql } be the set of queries in the system.
Queries are specified by the user in an event processing
language (EPL). In the system queries are represented as

1
directed acyclic graphs (DAGs). Nodes represent operators 2.3 Failure Model
that are connected by unidirectional edges. A query compiler According to the query model the system comprises several
transforms queries specified in EPL into a query graph. operators (timed processes) that are executed on a set of
Each operator has one or two inputs and multiple outputs. hosts. We assume that each host has a probability p to fail
Operators with two inputs are referred to as binary operators. with a crash stop failure. Crash stop failures result in an
Each operator has a type defining its basic behaviour: source, immediate crash of all operators placed on the specific host.
projection, filter, aggregation, sequence, join, and destination. Moreover, we assume that Byzantine (value) errors caused
Besides a type all operators, except sources and sinks, have by erroneous executions can be transformed into crash stop
a predicate refining its functionality. For example in case of failures, e. g. by means of Software Encoded Processing [8].
a filter operator the predicate specifies the filter condition. In the following we consider the time period of one day. Due
The following example query calculates the average price to the fact that in modern data centers the mean time to
of the SAP stock over the last 10 minutes. repair (MTTR) can be up to two days [6], we assume that
INSERT INTO outStream crashed hosts will stay down for the whole day. Moreover,
SELECT compName, avg(tick) host crashes are assumed to be uncorrelated, i. e. if some
FROM tickStream WITHIN 600 seconds hosts fail others will remain running.
GROUP BY compName A query is considered broken, iff for at least one of its
WHERE compName = "SAP"; operators neither a primary operator nor an operator replica
is executed anymore.
The corresponding query graph, which is depicted in Fig-
ure 2, contains a source, filter, aggregation, and destination 2.4 Load Model
operator. We assume a load model, which is based on work in the
To minimize the number of operators in the system, all context of the data streaming system Borealis [23] and by
queries are optimized via a query optimization component, Viglas et al. [19]. Each operator has one or two inputs and
which analyses the query graph of already running queries multiple outputs that are associated with a certain event
with respect to reusable operators leveraging incremental rate. We assume that the input event rate for source opera-
multi query optimization (MQO) techniques [12]. The opti- tors is given and the output event rate for source operators
mizer maintains an internal global query graph subsuming equals the input event rate. Each operator op has a certain
all currently deployed queries. When adding a new query, load load(op). A load of 1.0 represents 100 % CPU usage
reusable parts are discovered using a breadth-first search in a fixed time interval - usually one second. This operator
on the global query graph. Figure 1 shows an example for load is calculated as product of the input event rate of an
multi query optimization with two queries. Operators of operator multiplied by the cost c, which describes the time
the already running query are depicted in the upper lane, required by the operator to process a single tuple.
whereas operators for the newly added query are depicted in For the example query depicted in Figure 2 the load of the
the lower lane. Assuming that the same operator name indi- different operators can be specified as load(S1 ) = r1 ∗ c(S1 ),
cates the same operator type and predicate, the operators S1 load(F1 ) = r1 ∗ c(F1 ), load(A1 ) = r2 ∗ c(A1 ), load(D1 ) =
and F1 are part of both queries. Instead of redeploying this r3 ∗ c(D1 ). Since each operator is associated with a se-
operators, they are reused from the already running query, lectivity value, input event rates of operators can be cal-
which is illustrated via the grey box. culated as linear combination of source stream rates and
operator selectivity of predecessor operators. For example
S1 F1 A1 D1
Running the input event rate r3 of operator D1 can be expressed as
Query
r3 = sel(A1 ) ∗ r2 = sel(A1 ) ∗ (sel(F1 ) ∗ r1 ).
merge
New r1 r2 r3
S1 F1 A2 D2 Query S1 F1 A1 D1

Figure 1: Incremental multi query optimization Figure 2: Example query graph

Moreover, each operator has incoming (netin ) and outgo-
2.2 Replication Model ing (netout ) network traffic. The incoming traffic netin (op)
To ensure near zero latency takeover, we assume active is calculated as product of input rate and input tuple size
replication in the system. Each operator can have multiple of the operator, whereas the outgoing traffic netout (op) is
replicas exhibiting exact the same behaviour as the primary calculated by the output rate multiplied with the output
operator. However, this approach results in the consumption tuple size.
of additional resources. For the sake of convenience in the remainder of this paper
Query operators and replicas are deployed independently the term load is used interchangeably with CPU resources.
on available hosts. The current placement of operators on
hosts is described via a placement function plc : O → H, 3. HEURISTIC REPLICATION
where O = {o1 , o2 , . . . , om } is the set of operators for all Complex event processing systems are exposed to a vary-
queries in the system and H = {h1 , h2 , . . . , hn } denotes the ing workload caused by different event rates as well as the
set of used hosts. addition and removal of queries. Figure 3 sketches a fictitious
workload of a system processing queries for traffic monitoring.
Peak loads indicate rush hour traffic. To deliver results for
the queries in real-time, the system has to be equipped with

2
enough resources to handle those peak loads. However, since multiple queries more important. The calculation of the
the peak load can only be estimated, phases of underpro- query out degree heuristic can be expressed as
visioning may occur, which are indicated by the dark gray
deg(op)
areas. Moreover, this leads to phases of overprovisioning impQOD (op) =
indicated by the dotted areas. |Q|
where deg(op) is the number of queries leveraging the op-
70
Underprovisioning erator op, and |Q| is the number of queries running in the
60 system.
Used Hosts

50 Overprovisioning
40 3.1.2 Dynamic Heuristics
30
Dynamic heuristics assess runtime properties of operators
20
10
and are, thus, dependent on models providing estimations
0 if an operator is deployed for the first time. If for example
the required resources are considered, the heuristics are de-
Time
pendent on the load model of the system (see Section 2.4).
One example for a dynamic heuristic is the low utilization
Figure 3: Workload of a data stream processing sys- heuristic that tries to replicate as much operators as possible.
tem Therefore, operators with a small load are preferred. The
calculation of the heuristic can be expressed as follows:
Heuristic replication targets the usage of such spare re-
load(op)
sources in distributed complex event processing systems to impLU (op) = 1 −
increase the availability of queries. Assume a system that max(load(O))
comprises a set of n hosts H = {h1 , h2 , . . . , hn } with an over- where max(load(O)) is the maximum operator load among
all CPU capacity of loadcap = n ∗ hostcap . Query operators all operators currently running in the system.
are placed according to an operator to host mapping plc.
The unused CPU resources loadrem can be calculated as 3.1.3 Query Level Heuristics
X The heuristics introduced so far operate on an operator
loadrem = loadcap − load(op)
level, i. e. operators are treated independent of each other.
op∈O
Hence, operator-level heuristics might cause the selection of
Since queries can be added to and removed from the system only most but not all operators for a query. Not selected
at runtime the amount of remaining resources usable to im- operators are weak spots and decrease the achievable avail-
prove the availability of queries in the system changes over ability for the query tremendously. Therefore, we propose to
time. If the remaining load is smaller than the load of all combine operator importance values into query importance
queries in the system, not all operators can be replicated values, which are expressed using a normalized importance
and, thus, a subset of operators has to be selected for repli- function for queries impQheur : Q → [0, 1]. Operator selec-
cation. Thus, we propose a three step approach. First, all tion algorithms operating on query level heuristics will ensure
operators are rated according to their potential to improve that once a query was selected for replication all operators
query availability by means of an operator importance heuris- of that query are replicated. Only if not enough resources
tic (see Section 3.1). Second, the set of operators as well as are available to replicate a full query, single operators would
the replication level, i. e. the number of replica instances, be selected.
is determined for each operator using an operator selection
algorithm (see Section 3.2). Third, the replica instances
3.2 Operator Selection Strategies
are deployed on hosts according to an operator placement Operator selection strategies take care of the actual op-
strategy (see Section 3.3). Finally, in Section 3.4 we present erator selection based on the operator importance and the
an evaluation. remaining load loadrem in the system. The calculated opera-
tor selection can be described as function sel : O → Z, which
3.1 Operator Importance Heuristics associates each operator with a number of replicas referred
to as replication level rep(op).
The potential of operators to improve the availability of
queries is rated using a normalized importance function 3.2.1 Simple Operator Selection
impheur : O → [0, 1] that associates each operator in the
A simple operator selection strategy sorts all operators
system with a normalized importance value. Depending on
descending based on their relative importance defined via
the source of information used for calculating this importance
impheur . Afterwards operators are selected for replication
value, static and dynamic heuristics can be distinguished.
until all spare resources loadrem are consumed. Selecting
operators for replication includes the determination of the
3.1.1 Static Heuristics replication level, i. e. the number of replicas to create for
Static heuristics calculate the importance of operators a single operator. Different selection procedures can be
based on properties observable in the global query graph. established:
Due to the multi query optimization operators can be reused
by multiple queries. Hence, if a reused operator fails, all 1. All operators can be selected for replication at least
depending queries crash. One possible heuristic, referred to once if enough remaining resources are available. If
as query out degree heuristic, is based on this observation afterwards spare resources are still available, the repli-
and rates the importance of operators that are reused by cation level for the already selected operators can be

3
increased stepwise. This procedure ensures that each 100

Running Queries (%)
●
●

operator is replicated at least once, if enough resources ●

●

are available. 98
●

2. A replication level larger than one might be set di- 96
100% ●

rectly for an operator. Hence, some operators might Query Out Low Utilization (50%)
be excluded from replication, if already all resources 94 Low Utilization (50%)
Query Out (50%)
are consumed. ●
Random (50%)
92
3.2.2 Optimized Operator Selection 0.0 0.5 1.0 2.0 3.0
Host Crash Probability (%)
The simple operator selection strategy selects operators
for replication based on either a static or dynamic heuristic.
Thus, only one type of available information is incorporated Figure 4: Comparison of query availability for differ-
at a time and the resulting selection of operators can lead to ent heuristics
a non-optimal availability of queries. Therefore, we propose
to augment the selection process based on a static heuristic
different operators as possible are placed on one host. Thus,
with runtime information by modeling the operator selection
multiple replicas crash in case of a host failure.
problem as 0-1 knapsack problem [16].
Therefore, we propose to use a bin packing algorithm with
Each operator opi ∈ O has a value vi = impheur (opi )
color constraints [5]. Besides load and network consumption,
and a weight wi = load(opi ). The maximum weight of the
each replica will be associated with a unique color. The col-
bag equals the remaining resources loadrem in the system.
ored bin packing algorithm ensures that each host contains
Because the 0-1 knapsack problem is NP-hard and load
only replicas with at most c distinct colors, where c is a
values of operators and, thus, the optimal solution changes
pre-defined positive integer. Because the number of repli-
continuously, we suggest an approximation by means of a
cas placed on the same host shall be minimized, an upper
combination of static and dynamic heuristics. The optimiza-
bound C with c ≤ C is specified for the color constraint.
tion is based on the intuition that operators with the greatest
Then the placement problem can be formulated with an
profit per weight unit have to be selected first. Thus, we
additional constraint that strives to minimize the parame-
propose to leverage a product of the query out degree and
ter c. If no placement can be found without violating the
low utilization heuristic
upper bound C, the simple bin packing algorithm is used to
impQOD∗LU (op) = impQOD (op) ∗ impLU (op) calculate a placement.

as basis for the operator selection. 3.4 Evaluation
We have performed a preliminary simulation-based evalua-
3.3 Fault-tolerant Operator Placement tion of the proposed approaches. We used a query generator
Operators selected for replication via the operator selection to generate queries based on six query templates that differ in
algorithm have to be deployed on hosts. However, depending structure and operator parameters. Operators were deployed
on the chosen operator placement for the same set of selected on simulated hosts with a CPU capacity of loadcap = 1. We
operators different availabilities can result. assume a system comprising 100 hosts. Moreover, operators
were replicated at most once. To guarantee statistical cor-
3.3.1 Simple Bin Packing rectness, 1000 runs were conducted for each experiment and
A bin packing algorithm [4] for fault-tolerant operator values have been averaged.
placement minimizes the number of used hosts, so that idling Figure 4 depicts the percentage of remaining running
hosts can be turned off to save energy. This property is in queries after a time period of one day assuming different
line with the notion of elasticity. We propose a bin packing host failure probabilities p = {0.0025; 0.005; 0.01; 0.02; 0.03}
algorithm, which is an extended version of a first-fit bin and different heuristics for rating the importance of opera-
packing, which has a complexity of O(mn), where m is the tors. Because a constant system load load(O) = 100 was
number of replicas that shall be placed and n the number of generated, the stated percentages of resources available for
hosts. replication are equal to the actual remaining load loadrem
Each host is modeled as bin, where the available CPU in the system. Leveraging either the low utilization heuris-
resources form the capacity. Replicas are first sorted in tic or the query out degree heuristic to select operators for
decreasing order according to their normalized importance replication improves the percentage of remaining running
and then assigned using their load as weight. Moreover, two queries by approximately 1.7 percentage points compared to
constraints are ensured: (i) the network capacity of a host a random operator selection assuming a host failure proba-
should not be exceeded, (ii) two replicas of the same operator bility of p = 0.02 and loadrem = 50. Using a combination
are never placed on the same host. To reduce the consumed of both heuristics improves the query availability further by
network bandwidth, possible hosts are ordered according to 0.4 percentage points, resulting in 99.2 % remaining running
a neighboring factor representing the amount of predecessor queries.
and successor operators deployed on the same host. Figure 5 shows the availability increase as a function of
available resources for replication. A combination of query
3.3.2 Colored Bin Packing out degree and low utilization heuristic results in 99.2 %
Even though the simple bin packing algorithm is replica- remaining running queries, if a host failure probability of
aware due to the additional placement constraint, the system p = 0.02 and loadrem = 50 is assumed. Moreover, the
availability is not maximized because as many replicas of achievable percentage of running queries is only diminished

4
100
5. DYNAMIC REPLICATION

Running Queries (%)
●

●

● Elastic data stream processing systems are able to cope
98 ● with varying query as well as event load. Mechanisms to dy-
100% namically allocate and release hosts depending on the actual
96
90% ●
demand prevent costly overprovisioning and performance
50%
94 10%
barriers due to underprovisioning. Ideally the system can
1% ●
scale out indefinitely to serve high event rates and scale in
●
0% to lower the used resources in case of low utilization.
92
0.0 0.5 1.0 2.0 3.0 Processing queries in an elastic environment poses various
Host Crash Probability (%) challenges. Once new queries are submitted to the system,
the encompassed operators have to be distributed to the
running hosts. Because of operator reuse the load of already
Figure 5: Query availability for combination of query
deployed operators changes. To not impair the performance
out degree and low utilization heuristic
of the system two reactive actions are taken: (i) overloaded
hosts might be relieved by automatically migrating operators
by approximately one percentage point compared to full from one host to another, (ii) overload situations that cannot
replication, if 50 % less spare resources are used for replication be resolved by moving operators from one host to another
and the host failure probability is p = 0.03. This result can are resolved by splitting the operator into several operator
be explained by the fact that with loadrem = 50 available for instances that handle only a portion of the overall load and
replication still 80 % of the query operators are replicated. can be deployed independently.
Moreover, those operators are reused to a large extent or However, handling replicas in a dynamic system is demand-
have small cost. ing:
• The complexity of the reconfiguration caused by the
4. REPLICATION OPTIMIZER exoneration of overloaded hosts is increased since addi-
Service level agreements are important in scenarios where tional communication channels for replicas have to be
a certain availability is required. However, achieving a higher maintained.
availability by replicating more operators requires more re-
sources. Therefore, the solicited availability and the resulting • If operators are split into several instances, all replicas
resource consumption should be traded off. Replication of have to be split too in order to maintain a consistent
operators in a system can be optimized with respect to two system state.
different optimization goals:
• Systems that are used in conjunction with heuristic
1. For a given resource limit costmax , the availability of replication (see Section 3), have to decide in case of
queries A(Q) can be maximized. spare resources whether to release hosts or to create
additional replicas.
2. For a given availability of queries A(Q), the resource
consumption cost(Q) can be minimized. • Reconfiguration routines have to ensure that the new
placement does not violate existing service level con-
The achieved availability of queries is influenced by the set
straints for the queries (see Section 4).
of operators selected for replication, the number of replicas
that is created for each of the selected operators, as well
as the placement of replicas on hosts. All those influential 6. RELATED WORK
parameters are reflected in the placement plc and, thus, the Fault tolerance techniques for data stream processing sys-
placement can serve as predictor for the achievable availability. tems like active replication [2], check pointing [14], and a
To optimize the replication decisions according to the two combination of active and passive replication [24], are key
specified optimization goals, additional models have to be enablers for our proposed approaches.
incorporated. To estimate the costs for a placement, the load Moreover, operator placement algorithms are leveraged,
model for query operators can be leveraged. Moreover, an which have been studies extensively by various authors. A
availability model is required that enables the estimation of survey is available in [15]. Repantis et al. [18] presented
the query availability resulting from a certain placement. a procedure for replica placement considering performance
Assuming the availability of those models, the optimization constraints like end-to-end latency. The ZEN system [3]
problems can be formulated as follows: models different levels of availability in a systems and tries
to assign most important operators to hosts with the highest
max{A(plc) | plc ∈ P; cost(plc) ≤ costmax }
availability. Another replication scheme based on graph
and coloring is presented in [22].
Optimization in the area of data stream processing systems
min{cost(plc) | plc ∈ P; A(plc) ≥ A(Q)}
is done for example to achieve an optimal overall utiliza-
where P is the set of all possible placements resulting from tion [21], or optimal utilization with limited resources [13].
different operator selections, different replication levels, as An approach related to that in this paper has been studied
well as different placement strategies. by Fernandez et al. in [7]. The authors present an integrated
Those optimization problems can be solved using well- approach to scale out streaming systems while achieving fault
known optimization algorithms like genetic search [10]. How- tolerance via check pointing. In this paper we focus, however,
ever, to lower the effort for the optimization, heuristics to on active replication to ensure minimal latency and envision
restrict the search space have to be developed. also scale in.

5
7. SUMMARY Data (SIGMOD), New York, NY, 06/2013 2013. ACM,
As of today, many data stream processing systems use ACM.
replication to ensure high availability in case of host failures. [8] C. Fetzer, U. Schiffel, and M. Süßkraut. An-encoding
However, to replicate a system completely, at least twice the compiler: Building safety-critical systems with
resources have to be allocated, which is costly. commodity hardware. In SAFECOMP, pages 283–296,
We proposed a heuristic replication approach enabling the 2009.
use of remaining system resources to increase the availabil- [9] B. Gedik, H. Andrade, K.-L. Wu, P. S. Yu, and
ity of queries. An operator selection algorithm is used to M. Doo. SPADE: the system s declarative stream
determine a subset of operators for replication that are then processing engine. In SIGMOD Conference, pages
placed via an operator placement algorithm. Moreover, we 1123–1134, 2008.
suggested a replication optimizer which allows users to guide [10] D. E. Goldberg. Genetic Algorithms. Pearson
the replication while trading off cost and availability. Finally, Education, 2013.
the challenges arising from a combination of these techniques [11] J.-H. Hwang, Y. Xing, U. Çetintemel, and S. B.
with elastic data stream processing systems were discussed. Zdonik. A Cooperative, Self-Configuring
Using heuristic replication as well as the replication opti- High-Availability Solution for Stream Processing. In
mizer with a system reacting on changes, e. g. in event rate ICDE, pages 176–185, 2007.
and selectivities, demands the adaptation of the placement [12] C. Jin and J. G. Carbonell. Predicate Indexing for
routines. However, the normalized importance functions as Incremental Multi-Query Optimization. In ISMIS,
well as the optimization routines might be reused unchanged. pages 339–350, 2008.
To validate the approaches we simulated a heuristic repli- [13] E. Kalyvianaki, W. Wiesemann, Q. H. Vu, D. Kuhn,
cation approach comprising operator selection as well as and P. Pietzuch. Sqpr: Stream query planning with
operator placement strategies. Given a set of remaining reuse. In ICDE, pages 840–851, 2011.
resources, the fault tolerance of complex event processing [14] Y. Kwon, M. Balazinska, and A. G. Greenberg.
systems is improved. Choosing a proper heuristic can im-
Fault-tolerant stream processing using a distributed,
prove the availability two percentage points compared to a replicated file system. PVLDB, 1(1):574–585, 2008.
random operator selection. Compared to full replication only
[15] G. T. Lakshmanan, Y. Li, and R. E. Strom.
one percentage point is lost spending 50 % less resources for
Placement strategies for internet-scale data stream
replication.
systems. IEEE Internet Computing, 12(6):50–60, 2008.
In the future, we plan to implement the proposed ap-
proaches with a state-of-the-art streaming system. [16] S. Martello and P. Toth. Knapsack problems:
algorithms and computer implementations.
Wiley-Interscience series in discrete mathematics and
8. REFERENCES optimization. J. Wiley & Sons, 1990.
[17] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4:
[1] D. J. Abadi, Y. Ahmad, M. Balazinska, U. Çetintemel, Distributed Stream Computing Platform. In ICDM
M. Cherniack, J.-H. Hwang, W. Lindner, A. Maskey, Workshops, pages 170–177, 2010.
A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. B. [18] T. Repantis and V. Kalogeraki. Replica placement for
Zdonik. The design of the Borealis stream processing high availability in distributed stream processing
engine. In CIDR, pages 277–289, 2005. systems. In DEBS, pages 181–192, 2008.
[2] M. Balazinska, H. Balakrishnan, S. Madden, and [19] S. Viglas and J. F. Naughton. Rate-based query
M. Stonebraker. Fault-tolerance in the borealis optimization for streaming information sources. In
distributed stream processing system. ACM Trans. SIGMOD Conference, pages 37–48, 2002.
Database Syst., 33(1), 2008. [20] K. V. Vishwanath and N. Nagappan. Characterizing
[3] N. Bansal, R. Bhagwan, N. Jain, Y. Park, D. S. cloud computing hardware reliability. In SoCC, pages
Turaga, and C. Venkatramani. Towards optimal 193–204, 2010.
resource allocation in partial-fault tolerant applications. [21] J. L. Wolf, N. Bansal, K. Hildrum, S. Parekh,
In INFOCOM, pages 1319–1327, 2008. D. Rajan, R. Wagle, K.-L. Wu, and L. Fleischer. Soda:
[4] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson. An optimizing scheduler for large-scale stream-based
Approximation algorithms for np-hard problems. distributed computer systems. In Middleware, pages
chapter Approximation algorithms for bin packing: a 306–325, 2008.
survey, pages 46–93. PWS Publishing Co., Boston, MA, [22] F. Xiao, T. Kitasuka, and M. Aritsugi. Economical and
USA, 1997. fault-tolerant load balancing in distributed stream
[5] M. Dawande, J. Kalagnanam, and J. Sethuraman. processing systems. IEICE Transactions,
Variable sized bin packing with color constraints. 95-D(4):1062–1073, 2012.
Electronic Notes in Discrete Mathematics, 7:154–157, [23] Y. Xing, J.-H. Hwang, U. Çetintemel, and S. B.
2001. Zdonik. Providing resiliency to load variations in
[6] J. Dean. Handling Large Datasets at Google: Current distributed stream processing. In VLDB, pages
Systems and Future Directions. In Data-Intensive 775–786, 2006.
Computing Symposium, 2008. [24] Z. Zhang, Y. Gu, F. Ye, H. Yang, M. Kim, H. Lei, and
[7] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and Z. Liu. A hybrid approach to high availability in stream
P. Pietzuch. Integrating scale out and fault tolerance in processing systems. In ICDCS, pages 138–148, 2010.
stream processing using operator state management. In
ACM International Conference on Management of