<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fault Tolerant Distributed Hash-Join in Relational Databases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>1st Arsen Nasibullin</string-name>
          <email>nevskyarseny@yandex.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>2nd Boris Novikov</string-name>
          <email>bnovikov@hseu.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>HSE University</institution>
          ,
          <addr-line>Saint-Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Saint-Petersburg State University</institution>
          ,
          <addr-line>Saint-Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>-The business today are facing with immense challenges due to complex applications and rapid growth in data volumes. Many of applications use data for computing statistical data to make proper decision in other applications such as machine learning or social networking. Mostly, these applications assume performing sophisticated client queries with such operators as aggregation and join. State-of-the-art distributed relational databases get over these challenges. Unfortunately, distributed database management systems suffer from failures. Failures causes sophisticated queries with joining large tables are re-executed so that enormous volume of resources must be leveraged. In this paper, we introduce a new fault tolerant join algorithm for distributed RDBMS. The algorithm based on mechanisms of data replication and heartbeat messages. We compare our algorithm with traditional, unreliable distributed join algorithm. This paper demonstrates how we achieved trade-off between time required to perform tasks of failed sites and extra resources needed to carry it out. Index Terms-databases, hash-join, fault-tolerance, query processing, replication</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        Let us consider the definition of distributed database
systems. Distributed database systems are database management
systems, consisting of local database systems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These
systems with their own disks are located and dispersed over
a network of interconnected computers. In this paper, we
deal with shared-nothing architecture of distributed database
systems.
      </p>
      <p>
        Fault tolerance is the property of system to keep on carrying
out tasks in the event of the failure [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A system is able to
detect a failure and properly handle it without affecting correct
execution of tasks.
      </p>
      <p>Distributed systems, based on MapReduce framework,
provide high availability, improved performance. They also
provide fault-tolerance in case of any site is down. In contrast to
distributed systems, parallel RDBMS cannot handle occurred
fails so that entire work has to be re-executed. In our work
we will focus on how achieve fault tolerance for RDBMS.</p>
      <p>
        Modern distributed database systems are applicable for a
variety of tasks such as business intelligence [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Mostly,
these tasks assume performing sophisticated client queries
with such operators as aggregation and join.
      </p>
      <p>
        The join operation has been studied and discussed
extensively in research efforts because it is one of the most
timeconsuming and data-intensive operations in relational query
processing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]–[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Distributed database systems have the main goal is
processing an enormous amount of data in a short time,
by transmitting data, handling its and passing the final
outcome to applications. In failure-free scenarios, database
systems may achieve good results. Given that failures are
common at large-scale, distributed database systems remain
fault tolerance as built-in feature. As example, modern data
processing systems have algorithms of recovering tasks of a
failed site [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        According to studies [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the most common failure
types are divided into the following:
• Communication failure. This type of failures is
specific for distributed systems. There are many types of
communication failures - errors in messages, unordered
messages, undelivered message, and lost connection. The
first two types are responsibility of computer network.
Lost messages and/or connection can be the consequence
of failed sites or communication failure.
• System failure. A hardware of a software failure may
cause a system failure. For example, may be either CPU
crashed or the presence of bugs in application code which
cause failures occur. Once a system failure occurred,
content of main memory is loosen.
• Media failure. It refers to the failures of the secondary
storage devices that store the database. Such failures may
be due to operating system errors, as well as to hardware
faults such as head crashes or controller failures. The
important point from the perspective of DBMS reliability
is that all or part of the database that is on the secondary
storage is considered to be destroyed and inaccessible.
• Transaction failure. Transactions can fail for a number
of reasons. For instance, a failure can occur due to either
incorrect input data or potential deadlock.
      </p>
      <p>
        As state in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], hardware and software failures make up
more than 90% of all failures. This percentage has not changed
significantly since the early days of computing. What is more
interesting is the occurrence of soft failure is significantly
higher than that of hardware failures.
      </p>
      <p>
        In contrast to the occurrence of system failures, the occurrence
of transaction failures, as it states in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], is 3%. Modern
database management systems are capable of recovering
execution of abrupt interrupted transactions in a short time. In
this paper, we do not consider transactions failures.
      </p>
      <p>Unfortunately, state-of-the-art data processing systems do
not have algorithms for handling failures when a client
performs a query joining two of more tables. Currently, the
modern relational and NoSQL distributed database systems
re-execute an entire query even if a one node of systems
becomes inactive. For example, a node with stored data
becomes unavailable when transmitting data to sites where
join operation are performed. It causes re-execution of a query
with join leads to spend enormous resources and time to end
up a join query.</p>
      <p>The paper addresses the challenges of recovering tasks at
failed sites. Our solution achieves the fault tolerant execution
of join queries through making trade-off between completing
recovery tasks for the least time and extra resources to be
leveraged for this. Data replication insure to reallocate a
client query to available site with stored data. An approach
of sending heartbeat messages allows to detect a failure of
any site and undertake required steps to recover undone work.</p>
      <p>We leverage the following notations used in the paper:</p>
      <p>This paper is organized as follows. Section II describes
related work. Description of the distributed cost model, used
in the paper, is given in Section III. In Section IV, we address
to classical distributed hash-join algorithm and examine cost
of the execution. Fault tolerant distributed hash-join algorithm
and its cost is introduced in Section V. Section VI provides
the description and evaluations of how fault tolerant algorithm
gets over failures. Comparison of estimations is presented in
Section VII. This paper is concluded by Section VIII.</p>
    </sec>
    <sec id="sec-2">
      <title>II. RELATED WORK</title>
      <p>
        In this section, we briefly review works dedicated to
different approaches of tackling failures in modern framework
for processing vast amount of data in parallel called Hadoop
MapReduce [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We examined algorithms [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]–[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] of
parallel join for different kind of systems. Analyzed works do
not consider the task of handling failures. Authors supposed
algorithms work on fail-free systems.
      </p>
      <p>
        Review [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] describes the state of the art approaches to
improve the performance of parallel query processing using
MapReduce. Even more, authors discussed significant
weaknesses and limitations of the framework. A classification of
existing studies over MapReduce is provided focusing on the
optimization objective.
      </p>
      <p>
        Authors proposed a strategy of doubling each task in
execution [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. This stands for if one of the tasks fails, the second
backup task will end up on time, reducing the job completion
time by using larger amounts of resources. Intuitively, readers
may guess that doubling the tasks leads to approximately
doubling the resources.
      </p>
      <p>
        In work [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], authors have devised two strategies to improve
the failure detection in Hadoop via heartbeat messages in the
worker side. The first strategy is an adaptive interval which
will dynamically configure the expiry time adapted to the
various sizes of jobs. As authors state, the adaptive interval
is advantageous to the small jobs. The second strategy is to
evaluate the reputation of each worker according the reports of
the failed fetch-errors from each worker. If a worker failures,
it lows its reputation. Once the reputation becomes equal to
some bound, the master node marks this worker as failed.
      </p>
      <p>
        To remove single point of failure in Hadoop, a new
approach of a metadata replication based solution was
proposed in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The solution involves three major phases:
in initialization phase, each secondary node is registered to
primary node and its initial metadata such as version file and
file system image are caught up with those of active/primary
node; in replication phase, the runtime metadata (such as
outstanding operations and lease states) for failover in future
are replicated; in failover phase, standby/new elected primary
node takes over all communications.
      </p>
    </sec>
    <sec id="sec-3">
      <title>III. DISTRIBUTED COST MODEL</title>
      <p>
        One of the challenges of multi objective query optimization
in distributed database management systems is to find Pareto
set of solutions or the best possible trade-offs among the
objective functions [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Objective functions are defined as
total time of query execution, I/O operations, CPU instructions
and a number of messages to be transmitted.
      </p>
      <p>In distributed database systems the total time of query
execution is expressed through mathematical model of weighted
average. This model consists of time to perform I/O operations,
CPU instructions and time to exchange a number of messages
among involved sites. In this work we are going to find
trade-off between the least time of the execution in case of
failure occurrence and extra resources required to recover
failed tasks. We will evaluate time to perform I/O operations,
CPU instructions, and communication separately.</p>
      <p>In this paper, we consider joining two relations R and S,
and R.A = S.B is the join condition. Assume all data of each
relation evenly distributed throughout the system; each node
stores T (R) and T (S) respectively.</p>
      <p>|K| |K|
IV. CLASSICAL DISTRIBUTED HASH-JOIN ALGORITHM
In this section, we describe distributed hash-join algorithm.
We will often refer to distributed hash-join algorithm as
classical join algorithm.</p>
      <p>At the highest level, the working of distributed hash join in
distributed database systems of a shared-nothing architecture
is straightforward:
1) A coordinator receives a client query. Then, the
coordinator populates messages with a client query across all
keepers.
2) Each keeper reads its partitions of relation R, applies a
hash function h1 to the join attribute of each attribute.
Hash function h1 has its range of values 0...|K| − 1. If
a tuple hashes to value i, then it is sent to a worker Wi.
Once a keeper ends up reading its partitions of relation
R, it notifies the coordinator about the status of work.
3) Each worker Wi builds a hash table, allocated in
memory, and fills in it with tuples received from step 2. In
this step, each worker uses a different hash function h2
than the one used in the step 2.
4) Once all keepers stopped reading their partitions of
relation R, the coordinator initiates a probing phase by
sending notifications to keepers.
5) Each keeper reads its partitions of relation S, applies a
hash function h1 to the join attribute of each attribute
as it is in the step 2. If a tuple hashes to value i, then it
has to be sent to a worker Wi.
6) A worker receives a tuple of relation S, probes the hash
table built in step 2 to find out a tuple of relation S joins
with any tuple of relation R. If so, tuples join and an
outcome tuple is generated.
The direction of a red line from a keeper to the coordinator
means process of sending a notification to the coordinator with
message a keeper stopped reading its partitions on a particular
phase. The blue lines among keepers and workers denote the
process of sending tuples at two phases. Lines of green color,
directed from workers to the coordinator, signify submitting
outcome tuples.</p>
      <p>Here and further in the paper we do not consider the
single coordinator receives all client queries. Instead, a new
coordinator is assigned for each query.</p>
      <sec id="sec-3-1">
        <title>A. Cost of Classical Distributed Hash-Join</title>
        <p>1) Communication cost: the coordinator sends a message
to K keepers to take up reading their partitions of relation R.
One more message the coordinator sends to keepers to start
reading partitions of relation S. At the building phase, keepers
send T (R) messages to workers, and at the probing phase they
send T (S) messages. The total communication cost:
2) I/O cost: keepers read B(R) blocks at the building phase
and workers write B(R) blocks into their disks. At the probing
phase, keepers read B(S) blocks of relation S. To write the
result of matching, it will take
3) CPU cost: at the both building and probing phases,
the coordinator sends 2 messages to keepers. Each keeper
receives 2 messages. For reading both relations, keepers perform
B(R) + B(S) operations. To write blocks of relation R on
each worker, it is required to perform B(R) operations. At the
building and probing phases, keepers perform T (R) + T (S)
operations to submit messages to workers and workers execute
T (R) + T (S) operations to receive messages. Each worker
executes
• The coordinator became unreachable. For example, due
to a communication or a system failure. All stored data
may be loosen.
• A media or system failure occurred at a keeper or a
worker performing operations over its disk.
• A site was suddenly turned off in the middle of the
probing phase so the entire work has to be re-executed.</p>
        <p>
          To detect and properly handle the failures above, we decided
to leverage the concept of sending heartbeat messages. The
concept of heartbeat is widely used in the consensus algorithm
Raft [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. In a heartbeat message, a site may inform another
site about status of carried out tasks or pass crucial metadata
information.
        </p>
        <p>The approach of heartbeat messages benefits detecting any
failures of sites. The coordinator’s duty is to send heartbeat
messages to all sites and receive responses. Once the
coordinator receives heartbeat messages, it has to sort out inputs. If
an input contains error message, it stands for an error occurred
in a site, for example, software or media failure. If there is no
response from a site, it means either a site is unreachable due
to communication failure or hardware issues occurred.
We injected the approach of heartbeat messages into our
algorithm. Based on content of a received message, the coordinator
undertakes required steps to recover work of a failed site.</p>
        <p>
          Apart from telling about communication and system
failures, consider failures with disks. A failure with a disk can be
caused by power loss, media faults, I/O errors, or corruption
of information on the disk. Particularly, a failure with a disk
of a worker may cause the loss of intermediate join result.
To prevent the loss of data stored in sites, we leverage the
strategy of full data replication [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. Keepers form a
ring. Keeper (i + 1) mod |K| stores a full copy of data of
the previous keeper i mod |K|. To protect intermediate data
of workers from the loss, each keeper copy data for joining to
both workers. The first worker executes join operation of both
tables whereas the second worker stays on until the coordinator
signals to join reserved data.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>A. Algorithm</title>
        <p>Summarizing said all above, we introduce fault tolerant
distributed hash-join algorithm. The algorithm is similar to
classical hash-join for distributed database systems in a
sharednothing architecture.</p>
        <p>1) Building. A coordinator receives a client query. To
initiate a build phase, the coordinator populates messages
with a client query across all nodes. Once messages are
sent, the coordinator sets status of performing a client
query as processing for all keepers.
2) Each keeper reads its partitions of relation R, applies a
hash function h1 to the join attribute of each attribute.
Hash function h1 has its range of values 0...|W | − 1. If
a tuple hashes to value i, then it is sent to i mod |W |
and (i + 1) mod |W | workers. For the latter, a message
has to contain message reserved data. Once a keeper
ends up reading its partitions of relation R, it notifies
the coordinator about the status of work.
3) Each worker builds a hash table, allocated in memory,
and fills in it with tuples received from step 2. In this
step, each worker uses a different hash function h2 than
the one used in the step 2.
4) Once all keepers stopped reading their partitions of
relation R, the coordinator initiates a probing phase by
sending notifications to keepers.
5) Probing. Each keeper reads its partitions of relation S,
applies a hash function h1 to the join attribute of each
attribute as it is in the step 2. If a tuple hashes to value
i, then it is sent to i mod |W | and (i + 1) mod |W |
workers.
6) Worker i mod |W | receives a tuple of relation S, probes
the hash table built in step 2 to find out a tuple of relation
S joins with any tuple of relation R. If so, tuples join
and an outcome tuple is generated. The other worker
(i + 1) mod |W | puts reserved data into its disk.
7) Once an outcome tuple is generated, a worker sends
heartbeat message the following worker. In this
message, it points a position of the last successfully
joined tuple of relation S.</p>
        <p>C
RC</p>
        <p>K1
K2
...</p>
        <p>KM</p>
        <p>W2</p>
        <p>WN
W3
In the Figure 2 shown a scheme of working of fault tolerant
distributed join algorithm. There is the same principle of
working as it is shown in Figure 1 for the classical join
algorithm. The difference is that there is added a reserved
coordinator RC and it synchronizes with the primary coordinator
C. Workers comprise a ring of nodes. Each worker is aware of
the following node. It facilitates a worker submits info about
proceeded work during the join to the following worker. In
case of i mod |W | worker is failed, (i+1) mod |W | worker
takes over tasks of a failed worker.</p>
        <p>B. Cost of Fault Tolerant Distributed Hash-Join Algorithm
1) Communication cost: similar to the classical distributed
join algorithm, in fault tolerant algorithm the coordinator sends
one message to keepers to take up reading their partitions
of relation R and one message to start reading partitions of
relation S. At the building phase, keepers send T (R) messages
to workers, and at the probing phase they send T (S) messages.</p>
        <p>Tmsg = 2 + T (R) + T (S)
(6)
2) I/O cost: keepers read B(R) blocks at the building phase
and workers write B(R) blocks into their disks in parallel
including reserved blocks of relation R. At the probing phase,
keepers read B(S) blocks and workers write B(S) blocks
as reserved. To write the result of matching, it’s required to
perform
3) CPU cost: the coordinator carries out 2 operations to
submit messages to keepers at the both phases. All in all,
keepers perform 2|K| operations to receive messages. To read
and write both relations by keepers at two phases, it will takes</p>
        <p>IORW = 2(B(S) + B(R))
At the building and probing phases, to submit and receive
messages with tuples, it will takes</p>
        <p>CSR = 2(T (R) + T (S))
operations to submit and receive messages with tuples. Each
worker executes
operations comparing of tuples both relations during the
matching, and</p>
        <p>CP Ujoin =</p>
        <p>B(R)T (S) + B(S)T (R)
operations to write join result. After an outcome tuple is
generated and sent to the coordinator, two operations are
needed to send a message from i mod W worker to (i + 1)
mod |W | worker.</p>
        <p>The total CPU cost:</p>
        <p>TCP U = IORW + CSR + 2(|K| + 1)+</p>
        <p>CP Ucompare + CP Ujoin
(7)
(8)
(9)
(10)
(11)
(12)
(13)</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>VI. HANDLING FAILURES In this section, we examine steps to handle failures during the execution of fault tolerant hash join algorithm.</title>
      <sec id="sec-4-1">
        <title>A. Failure of The Coordinator</title>
        <p>As we said before, the coordinator may fail at any phase. For
instance, the coordinator becomes unreachable before probing
phase or even in the middle of the execution of building phase.
To eliminate any of those scenarios, the secondary coordinator
takes over a work of the failed coordinator. Both keepers and
workers have to be aware of the secondary coordinator and be
able to quickly join it.</p>
        <p>We suggest adding both coordinators’ addresses to keepers
and workers. When a heartbeat message is sent from a site to
the failed coordinator, a node should handle it by re-sending a
message to the secondary coordinator and keeping on work
with it. A database manager should be aware of a failed
coordinator and recover it.</p>
        <p>Failed
Coordinator</p>
        <p>New
Primary
Coordinator</p>
        <p>K1
K2
...</p>
        <p>KM</p>
        <p>W2</p>
        <p>W3
W1
1) Before performing a query. Once the coordinator knows
a keeper is failed, it will redirect tasks to another keeper
where reserved data of a failed keeper is stored.
2) Building or probing. Despite the phase, a keeper stops
sending heartbeat messages to the coordinator. The latter
reassigns task to another keeper with stored data of an
inactive keeper and says its to keep scanning a particular
relation.</p>
        <p>To handle the first case, all the coordinator needs to carry out
is to mark the failed coordinator as unreachable and not to
submit messages to it. We do not calculate cost of performing
this case because of we assume the coordinator is aware of
the address of the failed keeper and the cost will not affect
the query execution.</p>
        <p>Let us consider the second case. As we pointed out in Section
III, the total cost of execution is the sum of communication
cost, I/O cost and CPU cost. Communication cost is composed
of sending a message from the coordinator to a keeper and
submitting messages of a particular relation to workers.</p>
        <p>Tmsg = 1 +</p>
        <p>T (R) + T (S)
|K|
I/O cost implies scanning blocks of two relations, writing
blocks of relation R twice in parallel and once blocks of
relation S.</p>
        <p>TI/O =
2(B(R) + B(S))
|K|
As for CPU cost, it is composed of two operations to send and
receive message from the coordinator to a keeper, reading and
(14)
(15)
writing blocks of two relations and submitting and receiving
messages with tuples.</p>
        <p>B(R) + B(S) + T (R) + T (S)
TCP U = 2 + 2
|K|
The total cost of the execution of recovering work of a failed
keeper is the sum of</p>
        <p>Trecovery = (14) + (15) + (16)</p>
      </sec>
      <sec id="sec-4-2">
        <title>C. Failure of a Worker</title>
        <p>If a worker becomes unavailable at building phase, the
coordinator submits a notification in the following heartbeat
message stop communicating with an unavailable worker.
If worker i mod |W | is in unreachable state at the probing
phase, the coordinator sends out a message to worker (i + 1)
mod |W | so that it may take over task of joining tuples.
The cost of communication:</p>
        <p>Tmsg = 1
I/O cost is composed of the following assumption. During
performing join operation, once a worker i mod |W | sends
an outcome tuple to the coordinator, it submits index of a tuple
V which just joined and sent to the coordinator. Worker (i+1)
mod |W | starts reading from V + 1 tuple.</p>
        <p>T (S)
TI/O = 2 |W | − V (19)
b
CPU cost is the sum of performing sending and receiving
messages, reading untreated tuples and comparing them</p>
        <p>T (S)
TCP U = 2 + |W | − V +
b
( T| W(R|) − V )( T (S)</p>
        <p>|W | − V )
I(A)
The total cost to recover work of a failed worker is the sum
of</p>
        <p>Trecovery = (18) + (19) + (20)</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>VII. EVALUATION AND COMPARISON</title>
      <p>In this section we computed cost of both algorithms.</p>
      <p>The average time of the execution of joining both tables is
defined by introducing the probability P that the total time of
the execution is the sum of probabilities of fail-free execution
and total time of work required to recover a work at a failed
site at any phase.</p>
      <p>The formula below depicts the average time of execution
The average time = P ∗ N W + (1 − P ) ∗ RW
(22)
where NW is the total time of fail-free execution of an
algorithm whereas RW is the total time of recovering work. In
case of fault tolerant join algorithm, under RW we assume the
execution of recovery work at any phase described in Section
VI. For classical distributed join, the recovery work is to
reexecute the entire query across all sites.</p>
      <p>We consider the following cases: fail-free work, a keeper fails,
and a worker fails. For simplicity, we assume that the
probability of fault tolerant work of any site equals. As we said in
(16)
(17)
(18)
(20)
(21)
100000
Section III, we analyze different objective functions separately.
Table II provides parameters used during evaluations.</p>
      <p>Working with disks in fail-free mode of work, the execution
of fault tolerant join takes from 5 to 11% more time than
classical join takes. This is due to tuples of the second table
have to be written and marked as reserved. In the rest cases,
fault tolerant join algorithm works with disks 97% time faster.</p>
      <p>As for CPU cost, classical join wins in term of time of the
execution at fail-free mode. 81% time less requires to recover
work of the execution of client query if a keeper fails. In case
of a worker fails, it will take 93% less time to assign a task
of a failed site to another worker.</p>
      <p>The more fascinating results are got when comparing
communication costs. In fail-free case, time required to pass data
across all sites for both algorithms equals. When a keeper
fails, fault tolerant algorithm works 80% time faster than
classical algorithm works. In case of a worker fails, fault
tolerant join algorithms demonstrates impressive time of the
execution. It will 99% faster than its competitor. The results
of communication evaluations are shown in Figures 4, 5, 6.</p>
    </sec>
    <sec id="sec-6">
      <title>VIII. CONCLUSION</title>
      <p>In this paper, we introduced a fault-tolerant distributed
hash-join algorithm. Cost model has been provided, classical
and fault tolerant algorithms are compared with each other.
In case of failure-free, the unstable algorithm demonstrates
better results. In contrast to classical distributed hash join,
estimations showed that fault tolerant join algorithm takes
precedence over unreliable distributed hash join algorithm.</p>
    </sec>
    <sec id="sec-7">
      <title>Normal</title>
      <p>work
Keeper
failed</p>
    </sec>
    <sec id="sec-8">
      <title>Classical distributed join Fault tolerant join</title>
      <p>With double copying tuples of relations to servers, where join
is performed, we may omit abrupt failures of workers and keep
on joining tuples in available ones.</p>
      <p>Future work includes adapting our algorithm to data skew.
We will also consider implementing fault tolerant algorithm
and conduct experiments with other RDBMS.
1 · 10−5</p>
    </sec>
    <sec id="sec-9">
      <title>Worker failed</title>
      <p>1 · 10−5
Worker
failed
0.2
0.4</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Tanenbaum</surname>
          </string-name>
          and
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>v. Steen, Distributed Systems: Principles and Paradigms (2Nd Edition)</article-title>
          .
          <source>Upper Saddle River</source>
          , NJ, USA: Prentice-Hall, Inc.,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Rahimi</surname>
          </string-name>
          and
          <string-name>
            <given-names>F. S.</given-names>
            <surname>Haug</surname>
          </string-name>
          ,
          <article-title>Distributed Database Management Systems: A Practical Approach</article-title>
          . Wiley-IEEE Computer Society Pr,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Avizienis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Laprie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Randell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Landwehr</surname>
          </string-name>
          , “
          <article-title>Basic concepts and taxonomy of dependable and secure computing,”</article-title>
          <source>IEEE Trans. Dependable Secur. Comput.</source>
          , vol.
          <volume>1</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>33</lpage>
          , Jan.
          <year>2004</year>
          . [Online]. Available: https://doi.org/10.1109/TDSC.
          <year>2004</year>
          .2
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Catania</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <source>Advanced Query Processing: An Introduction</source>
          ,
          <year>01 2012</year>
          , vol.
          <volume>36</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Graefe</surname>
          </string-name>
          , “
          <article-title>Query evaluation techniques for large databases</article-title>
          ,
          <source>” ACM Comput. Surv.</source>
          , vol.
          <volume>25</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>73</fpage>
          -
          <lpage>169</lpage>
          , Jun.
          <year>1993</year>
          . [Online]. Available: http://doi.acm.
          <source>org/10</source>
          .1145/152610.152611
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Barthels</surname>
          </string-name>
          , I. Mu¨ller, T. Schneider,
          <string-name>
            <given-names>G.</given-names>
            <surname>Alonso</surname>
          </string-name>
          , and T. Hoefler, “
          <article-title>Distributed join algorithms on thousands of cores</article-title>
          ,
          <source>” Proc. VLDB Endow.</source>
          , vol.
          <volume>10</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>517</fpage>
          -
          <lpage>528</lpage>
          , Jan.
          <year>2017</year>
          . [Online]. Available: https://doi.org/10.14778/3055540.3055545
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kaldewey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. W.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sedlar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Satish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chhugani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Di</given-names>
            <surname>Blas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Dubey</surname>
          </string-name>
          , “
          <article-title>Sort vs. hash revisited: Fast join implementation on modern multi-core cpus</article-title>
          ,
          <source>” Proc. VLDB Endow.</source>
          , vol.
          <volume>2</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>1378</fpage>
          -
          <lpage>1389</lpage>
          , Aug.
          <year>2009</year>
          . [Online]. Available: https://doi.org/10.14778/1687553.1687564
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>A. S. Foundation.</surname>
          </string-name>
          (
          <year>2019</year>
          )
          <article-title>Apache hadoop</article-title>
          . [Online]. Available: https://hadoop.apache.org/
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M. T. zsu and P.</given-names>
            <surname>Valduriez</surname>
          </string-name>
          ,
          <source>Principles of Distributed Database Systems</source>
          , 3rd ed. Springer Publishing Company, Incorporated,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>McJones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blasgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lindsay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lorie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Price</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Putzolu</surname>
          </string-name>
          ,
          <string-name>
            <surname>and I. Traiger</surname>
          </string-name>
          , “
          <article-title>The recovery manager of the system r database manager</article-title>
          ,
          <source>” ACM Comput. Surv.</source>
          , vol.
          <volume>13</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>242</lpage>
          , Jun.
          <year>1981</year>
          . [Online]. Available: http://doi.acm.
          <source>org/10</source>
          .1145/356842.356847
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gardarin</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Valduriez</surname>
          </string-name>
          , “
          <article-title>Join and semijoin algorithms for a multiprocessor database machine,”</article-title>
          <source>ACM Transactions on Database Systems</source>
          , vol.
          <volume>9</volume>
          ,
          <issue>03</issue>
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Balkesen</surname>
          </string-name>
          , G. Alonso,
          <string-name>
            <given-names>J.</given-names>
            <surname>Teubner</surname>
          </string-name>
          , and M. T. O¨zsu, “
          <article-title>Multicore, main-memory joins: Sort vs</article-title>
          . hash revisited,
          <source>” Proc. VLDB Endow.</source>
          , vol.
          <volume>7</volume>
          , no.
          <issue>1</issue>
          , p.
          <fpage>85</fpage>
          -
          <lpage>96</lpage>
          , Sep.
          <year>2013</year>
          . [Online]. Available: https://doi.org/10.14778/2732219.2732227
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Teubner</surname>
          </string-name>
          and G. Alonso, “
          <article-title>Main-memory hash joins on modern processor architectures,” IEEE Transactions on Knowledge and Data Engineering</article-title>
          , vol.
          <volume>27</volume>
          , no.
          <issue>7</issue>
          , pp.
          <fpage>1754</fpage>
          -
          <lpage>1766</lpage>
          ,
          <year>July 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Doulkeridis</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Norvaag</surname>
          </string-name>
          ,
          <article-title>A“ survey of large-scale analytical query processing in mapreduce,” The VLDB Journal</article-title>
          , vol.
          <volume>23</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>355</fpage>
          -
          <lpage>380</lpage>
          , Jun.
          <year>2014</year>
          . [Online]. Available: http://dx.doi.org/10.1007/s00778- 013-0319-9
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P.</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pasin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bessani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Correia</surname>
          </string-name>
          , “
          <article-title>Byzantine fault-tolerant mapreduce: Faults are not just crashes</article-title>
          ,”
          <volume>11</volume>
          2011, pp.
          <fpage>32</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A“daptive failure detection via heartbeat under hadoop</article-title>
          ,”
          <volume>12</volume>
          2011, pp.
          <fpage>231</fpage>
          -
          <lpage>238</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>“Hadoop high availability through metadata replication</article-title>
          ,”
          <source>in Proceedings of the First International Workshop on Cloud Data Management</source>
          , ser.
          <source>CloudDB '09</source>
          . New York, NY, USA: ACM,
          <year>2009</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>44</lpage>
          . [Online]. Available: http://doi.acm.
          <source>org/10</source>
          .1145/1651263.1651271
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>“Multi-objective parametric query optimization for distributed database systems</article-title>
          ,”
          <source>in Proceedings of Fifth International Conference on Soft Computing for Problem Solving</source>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Deep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nagar</surname>
          </string-name>
          , and
          <string-name>
            <surname>K. N. Das</surname>
          </string-name>
          , Eds. Singapore: Springer Singapore,
          <year>2016</year>
          , pp.
          <fpage>219</fpage>
          -
          <lpage>233</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ongaro</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ousterhout</surname>
          </string-name>
          , “
          <article-title>In search of an understandable consensus algorithm</article-title>
          ,”
          <source>in Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference</source>
          , ser.
          <source>USENIX ATC'14</source>
          . Berkeley, CA, USA: USENIX Association,
          <year>2014</year>
          , pp.
          <fpage>305</fpage>
          -
          <lpage>320</lpage>
          . [Online]. Available: http://dl.acm.org/citation.cfm?id=
          <volume>2643634</volume>
          .
          <fpage>2643666</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Son</surname>
          </string-name>
          , “
          <article-title>Replicated data management in distributed database systems,” SIGMOD Rec.</article-title>
          , vol.
          <volume>17</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>62</fpage>
          -
          <lpage>69</lpage>
          , Nov.
          <year>1988</year>
          . [Online]. Available: http://doi.acm.
          <source>org/10</source>
          .1145/61733.61738
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>G.</given-names>
            <surname>Alonso</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Kemme</surname>
          </string-name>
          , “
          <article-title>How to select a replication protocol according to scalability, availability and communication overhead</article-title>
          ,”
          <volume>08</volume>
          2001.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>