=Paper= {{Paper |id=Vol-1256/paper6 |storemode=property |title=Failure Detector-Ring Paxos Based Atomic Broadcast Algorithm |pdfUrl=https://ceur-ws.org/Vol-1256/paper6.pdf |volume=Vol-1256 |dblpUrl=https://dblp.org/rec/conf/vecos/Nadjette14 }} ==Failure Detector-Ring Paxos Based Atomic Broadcast Algorithm== https://ceur-ws.org/Vol-1256/paper6.pdf
                                                                                                               63




     Failure Detector-Ring Paxos based Atomic
                Broadcast Algorithm


                                           Nadjette Rebouh
                         LAMOS, Faculty of Exact Sciences, University of Béjaı̈a
                                               Algeria
                                   nadjette rebouh34@yahoo.fr



   The Atomic broadcast problem constitutes an essential component in fault-tolerant distributed systems.
   An Atomic broadcast algorithm ensures that all processes deliver the same messages sequence. Many
   algorithms have been published. Recently, a new Paxos-based algorithm has been proposed for clustered
   systems, named Ring Paxos. It inherits many of its characteristics: safe under asynchronous assumptions,
   live under weak synchronous assumptions, and requires a majority of non faulty processes to ensure
   progress. However, the proposed algorithm relies on strong assumptions, and uses the group membership
   for faults tolerance. In this paper, we propose a new Ring Paxos-based atomic broadcast algorithm for
   distributed systems. It inherits some of Ring Paxos characteristics and uses failure detectors to tolerate
   failures. A slight comparison between Ring Paxos and the proposed algorithm shows that our algorithm
   surpasses the performance of the other algorithm.


               Agreement Problem; Atomic broadcast; Ring Paxos; Failure Detectors

1. INTRODUCTION                                                 Toueg (1996) and group membership Chockler et
                                                                al. (2001). Recently, a new algorithm that solves the
An Atomic Broadcast Algorithm is an important                   atomic broadcast problem has been proposed, Ring
building block in distributed fault-tolerant systems.           Paxos (RP) Parisa et al. (2012). RP is based on
It provides a group communication primitive that                Paxos, a well-known consensus algorithm Lamport
allows processes to deliver the same messages                   (1998), and inherits many of its characteristics: it is
sequence Hadzilacos (1993). Atomic Broadcast                    safe under asynchronous assumptions, live under
is pratically useful to implement state-machine                 weak synchronous assumptions, and resiliency-
replication Verissimo and Raynal          (1999). By            optimal, that is, it requires a majority of correct
employing this group primitive to disseminate                   processes to ensure progress. They optimize the
updates, all correct copies deliver the same set                original version of Paxos from which they derive
of updates in the same order, and consequently                  RP. In their paper, they look for an efficient
ensuring the copies consistency. The Atomic                     implementation, in terms of throughput, of RP.
Broadcast problem has been extensively studied                  For that purpose, RP uses a single ip-multicast
in the literature. Therefore, several algorithms                stream to disseminate messages and thus benefits
have been proposed to implement this primitive,                 from the throughput that ip-multicast can provide.
considering a diverse set of execution environments             Unfortunately, ip-multicast is unreliable. Hence, RP
and system models. Although many classifications                places a majority of correct processes (f+1) in a
are possible when considering algorithms designed               logical ring, where f is the number of tolerated
for distributed systems, according to the ordering              failures, to totally order messages. From one part,
mechanism or the mechanism used to tolerate                     RP achieves very high throughput while providing
failures. The first class includes protocols that               low latency, almost constant with the number of
rely on consensus to decide the order and the                   receivers. From the other part, RP relies on very
number of messages to deliver (Chandra and Toueg                strong assumptions: it supposes one coordinator
(1996),Verissimo and Raynal (1999),etc.) or on the              along the execution, impractical assumption. It, also,
token circulation protocols schiper et al. (2004).              requires a partially synchronous system, that is,
Two fundamental mechanisms define the second                    an asynchronous model until a global stabilization
class: unreliable failure detectors Chandra and                 time, when it becomes synchronous. Last and not
                                                                                                        64




least, RP relies on a group membership service            orders the messages according to their timestamps
to detect failures, a highly expensive mechanism          Lamport (1978), Bar-Joseph et al. (2002) and (5)
Ekwall and Schiper (2011). The aim of this paper          destination agreement algorithms: in this class, the
is to circumvent these inconvenient by considering        delivery order results from an agreement between
an asynchronous system, with a variable coordinator       the destination processes in different states Chandra
and ♦S failure detector for fault tolerance, the          and Toueg (1996), Rodrigues and Raynal (2000).
resulting algorithm is Failure Detector-Ring Paxos        However, the ordering mechanism is not the only
based Atomic Broadcast Algorithm (FD-RP). FD-RP           key for atomic broadcast algorithms. The mechanism
is an algorithm that solves the atomic broadcast          used for faults tolerance is another important
problem in a distributed asynchronous system. As,         characteristic of these algorithms. Two classes, in
in RP, FD-RP distinguishes three roles of processes:      asynchronous systems, are defined: (1) algorithms
proposers, acceptors and learners, indeed of the          that rely on unreliable failure detectors Chandra
coordinator (that can be one among them). Its main        and Toueg (1996) and (2) algorithms that rely on
characteristic is the use of ♦S failure detector to       group membership Chockler et al. (2001). It will be
agree on the set of acceptors forming the logical         interesting to propose an algorithm that combines
ring, hence enabling processes from waiting for           more than two classes. In Parisa et al. (2012), the
responses from crashed processes, consequently,           authors proposed a solution that is a combination
reducing the execution time. Initially, the coordinator   of several classes. The solution is based on
broadcasts a request to form the ring. Correct            the coordinator principle, similarly to the Paxos
acceptors that have not suspected the coordinator         protocol Lamport (1998), to ensure the delivery
accepts the request and respond positively. When          order by affecting unique identifiers to broadcast
the agreement about the value decided (the                messages. They also, once the value is picked by the
sequence of messages to deliver) has been reached         coordinator, use the token principle for disseminating
(between coordinator and the elected acceptors),          the proposed value until taking the decision.
the decision will be broadcast by the coordinator         However, the solution uses ip-multicast primitives,
to all correct processes (proposers, acceptors and        which is unreliable communication mechanism and
learners). The rest of the paper is structured as         needs the retransmission, for unknown times, of
follows. Section 2 describes some related work.           lost messages. The solution, also, assumes very
Section 3 is devoted to the system model and the          strong hypothesis: one coordinator and failure free
definition of related agreement problems. In section      partially synchronous model. Consequently, these
4 we give an overview of the FD-RP-based Atomic           hypotheses make the protocol impractical in some
Broadcast. The correctness proof of the algorithm is      models and increase the time execution of the
provided in section 5. Section 6 concludes the paper.     protocol.
                                                          In this paper, we propose a new solution for
                                                          the atomic broadcast problem. It circumvents the
2. RELATED WORK                                           different inconvenient faced in the RP and adds
                                                          other characteristics from other classes to obtain
Atomic broadcast algorithms have been widely
                                                          a novel algorithm for the atomic broadcast. Our
studied in the last twenty years. They can be
                                                          solution assumes a general asynchronous system.
summarized into five interesting classes schiper et
                                                          It combines two different classes: the rotating
al. (2004): (1) fixed sequencer algorithms: in this
                                                          coordinator and the token ring as the ordering
class, one unique process, during the hole execution,
                                                          mechanisms and the ♦S failure detector Chandra
is chosen as a sequencer (coordinator) and it
                                                          and Toueg (1996) for tolerating faults. In order
is responsible for ordering messages by affecting
                                                          to disseminate messages, our protocol implements
sequence numbers to the broadcast messages
                                                          the reliable broadcast primitives that is a broadcast
Garcia-Molina and Spauster (1991),Birman et al.
                                                          message is received by all correct processes or
(1991), (2) moving sequencer class: it keeps the
                                                          by none of them. These characteristics make our
same principle as the previous class but allows the
                                                          protocol general and perform good results in terms of
role of the sequencer to be transferred between
                                                          time and the number of messages needed to reach
several processes Chang and Maxemchuk (1984),
                                                          a decision.
Cristian et al. (1997), (3) privilege based class: it
is characterized by the use of the token, processes
can send their messages only when they receive the        3. SYSTEM MODEL AND DEFINITIONS
token and the order is fixed by the message sender
Ekwall and Schiper (2011), Amir et al. (1995),            3.1. System Model
(4) communication history algorithms: the messages’
order is determined by the messages’ senders by           We consider a distributed asynchronous, there is no
affecting a timestamp to each message broadcast,          assumption about the relative speed of processes
upon receiving the messages, the destination              nor on message transfer delays, system composed
                                                                                                         65




of n processes Π = {p1 , p2 , . . . , pn }. Processes     the highest-numbered round that the coordinator
communicate by message passing through reliable           has started to an arbitrary value that will be the
channel, defined by the two primitives send(m) and        greater round number picked so far and cring: the
receive(m). A process can only fail by crashing. At       proposed ring for this round. It sends, using ip-
most f (< n/2) processes are faulty, when n is the        multicast primitive, a phase1A message to a majority
total number of the processes in the system. By           quorum Qa = (Na +1)/2 of acceptors (Na is the set of
definition a correct process is a process that does       acceptors) placed in a logical directed ring including
not crash. A faulty process is one that is not correct.   the coordinator as the last acceptor, differently than
We assume that the system is augmented with               Paxos. The proposed structure reduces the incoming
♦S failure detector that provides (possibly incorrect)    messages at the coordinator and balances the
information about the processes that are suspected        communication among acceptors. The purpose of
to have crashed.                                          this message is to propose the ring that will be
                                                          used in the remaining parts of the round. Upon
3.2. Atomic Broadcast                                     receiving the message by an acceptor, it affects
                                                          the variables rnd: the highest-numbered round, in
Atomic broadcast is an agreement problem that
                                                          which the acceptor has participated and ring: the
is equivalent to consensus Chandra and Toueg
                                                          current ring accepted by those received from the
(1996), defined by two primitive abroadcast(m)
                                                          coordinator then it replays by a phase1B message
and adeliver(m). Atomic Broadcast satisfies the
                                                          accepting the coordinator proposition and proposing
following properties Hadzilacos (1993):(1) if a
                                                          the last voted value vval: the value voted by the
correct process abroadcasts a message m, then it
                                                          acceptor in round vrnd: the highest-numbered round
eventually adelivers m (validity), (2) for any message
                                                          in which the acceptor has cast a vote (it follows
m, a correct process adelivers m at most once and
                                                          that rnd ≤ vrnd always holds). Upon receiving a
only if m was previously abroadcast (integrity), (3)
                                                          majority phase1B messages for the same round,
if a correct process adelivers m, then all correct
                                                          the coordinator calculates cval: the value that the
processes eventually adeliver m (agreement), (4)
                                                          coordinator has picked for round crnd (chosen from
if a correct process adelivers m before m0 , then
                                                          the received acceptors values or a new proposed
every correct process adelivers m0 only after it has
                                                          value by a proposer), affects a cvid: a unique
adelivered m (total order ).
                                                          identifier affected to the proposed value. It, then
3.3. Failure Detector                                     executes the phase 2 of its round and ip-multicasts a
                                                          phase2A message to Qa + Nl ( the set of learners).
We refer below to the ♦S failure detector class in-       In the remaining tasks, acceptors try to agree on
troduced in Chandra and Toueg (1996) and defined          the value proposed by the coordinator. So, after
by the two following properties: (i) eventually every     ip-delivering a phase2A message by an acceptor
process that crashes is permanently suspected by          with the same sending information, it updates its
every correct process (strong completeness), and          vrnd,vval and affects its vvid: a unique identifier
(ii) there is a time after which some correct process     affected to the proposed value by cvid. If it is not
is never suspected by any correct process (Eventu-        the last acceptor in the ring, it keeps sending a
ally Weak Accuracy). Notice that every process is         phase2B message to its successor. Otherwise, if it
equipped by a local failure detector that maintains a     is the last acceptor in the ring, i.e. the coordinator,
set of processes suspected to have crashed.               then it ip-multicasts the DECISION message to all
                                                          the processes (Qa + Nl ).

4. OVERVIEW OF THE RING PAXOS
ALGORITHM                                                 5. FAILURE DETECTOR-RING PAXOS BASED
                                                          ATOMIC BROADCAST ALGORITHM
The Ring Paxos (RP) Algorithm Parisa et al. (2012)
is an atomic broadcast algorithm that is based on the     The FD-RP Atomic Broadcast Algorithm, given in
Paxos algorithm Lamport (1998). The computation           Algorithm 1, is also based on the Paxos algorithm
proceeds in asynchronous rounds. Each round is            Lamport (1998), and similarly, every process can
composed of five tasks, distributed on two phases. A      be a proposer, an acceptor or a learner. The
process in a round crnd can take one or several roles     algorithm proceeds in asynchronous rounds. Each
among: (1) proposer, (2) acceptor, (3) learner. A         round is coordinated by one coordinator, it is process
proposer initiates the protocol by proposing a value.     number (crnd mod n)+1. A coordinator can be a
Several proposers can coexist during one round, but       proposer or an acceptor. In the first task of the
one value can be chosen by a coordinator. Once            first phase and after receiving propositions from the
a coordinator (a proposer or an acceptor) receives        proposers, the coordinator launches the algorithm
the needed value, it begins the first phase of its        by broadcasting a phase1A message to the set
round. For this purpose, it updates the variable crnd:    of acceptors. The message contains the round
                                                                                                         66




number rnd and the proposed ring number cring             Algorithm 1 FD-RP based AB Algorithm
that will be used in the remaining tasks (Task1). Two      1: Task1 : (Coordinator)
cases can be distinguished. The acceptor suspects          2: upon reception of value v f rom a proposer
the coordinator to be crashed, by requesting its           3: increase crnd to an arbitrary unique value
associated failure detector ♦S, and proceeds to            4: let cring be an overlay ring with processes in Qa
the next round. Otherwise, it replays by a phase1B         5: f or all pi ∈ Qa do send (pi ,(phase1A,crnd,cring))
message informing the coordinator that it abides by        6: Task2 : (Acceptor)
the proposed ring. It attaches the message by its last     7: wait    until reception of (phase1A,crnd,cring)
voted value vval (Task2). If the coordinator hasn’t           f rom the coord or coord ∈ Di
received a majority of responses from the acceptors,       8: if coord ∈ / Di then
because of acceptors’ failures or its suspicion by         9:    if crnd > rnd then
those acceptors, it aborts the round. Otherwise, the      10:       rnd ← crnd; ring ← cring;
coordinated has received a majority of messages; it                 send (coord,(phase1B,rnd,vrnd,vval))
chooses the value to be proposed cval. This value         11:    end if
can be the voted value during the greater round           12: else
executed so far, or a new value proposed by a             13:    send (coord,(phase1B,rnd,nack))
proposer. Notice that the value proposed can be           14: end if
an update of a copy in a distributed database, a          15: Task3 : (Coordinator)
shared object, etc. Then, the coordinator affects         16: wait until reception of (phase1B,rnd,vrnd,v-
an identifier cvid to the proposed value and sends            val)f rom Qa such that crnd =rnd
a phase2A message to the acceptors (Task3). An            17: if ∃ one message (phase1B,rnd,nack) then
acceptor, that hasn’t suspected the coordinator for       18:    send (Qa ∪ Nl ,(phase2A,crnd,Abort))
the current round (by requesting its failure detector),   19: else
updates its variables vval, vrnd and vvid. It checks      20:    k ← M axvrnd received
if it is not the last acceptor in the ring. If so, it     21:    V S ← (vrnd, vval) received s.t. k = vrnd
keeps sending a phase2B message to its successor          22:    if k = 0 then
in the ring (Task4). Otherwise, it is the coordinator,    23:       cval ← v
it broadcasts a DECISION message to all the               24:    else
processes (acceptors and learners) (Task5).               25:       cval ← vval, the only vval in V S
                                                          26:       let cvid be the unique identif ier f or cval
6. PROOF OF CORRECTNESS                                   27:       send (Qa ∪ Nl ,(phase2A,crnd,cval,cvid))
                                                          28:    end if
Lemma1 if a correct process adelivers a message           29: end if
m, then all correct processes eventually adeliver m       30: Task4 : (Acceptor)
(Agreement property).                                     31: wait until reception of (phase2A,crnd,cval,cvid)
Proof This lemma is satisfied by the use of                   f rom the coordinator or coord ∈ Di
the reliable broadcast primitive to broadcast the         32: if coord ∈ / Di then
decision. A reliable broadcast primitive ensures          33:    if N o message (phase2A,crnd,Abort) has been
that all correct processes deliver the same set of               received then
messages broadcast by the coordinator. Let pc be          34:       if crnd > rnd then
the coordinator that decides at round crnd at line 52.    35:          vrnd ← crnd;vval ← cval;vvid ← cvid;
Then, it rbroadcasts this decision (including all the     36:          if f irst(ring) then
processes in Qa ∪ Nl ) at the same line 52. According     37:             send (successor,(phase2B,crnd, cvid))
to the reliable broadcast properties, all the processes   38:          end if
will, eventually, receive the coordinator’s message.      39:       end if
Hence, they adeliver the set of messages according        40:    end if
to a deterministic rule.                                  41: else
Lemma2 If a correct process adelivers a message           42:    send (successor,(phase2B,crnd,nack))
m then m has been abroadcast by a correct process         43: end if
(validity property).                                      44: Task5 : (Coordinator and acceptor)
Proof Once a majority of acceptors agree on the ring      45: upon reception of (phase2B,crnd,cvid) f rom a
cring proposed by the coordinator of the round pc             successor or (phase2B,crnd,nack)
(line 10,11), the coordinator chooses a value cval.       46: if (phase2B,crnd,cvid) has been received then
This value can be a new value v received from a           47:    if cvid= vvid then
proposer at line 2 or a voted value vval of some          48:       if not last(ring) then
processes in previous rounds calculated by pc at line     49:          send (successor,(phase2B,crnd,cvid))
26. The coordinator will propose the selected value       50:       else
to the majority of acceptors at line 28. Consequently,    51:          send (Qa ∪ Nl ,(DECISION ,cvid))
                                                          52:       end if
                                                          53:    end if
                                                          54: end if
                                                                                                         67




the decided value cval has a total agreement from          May. 1998.
the acceptors in cring (including the coordinator) and     P. Verissimo, M. Raynal (1999) Time in Distributed
it is a proposition of the coordinator pc of the current   System Models and Algorithms. In proceeding, May.
round crnd.                                                1999. 1–32.
Lemma3 For each message m , every correct
                                                           X. Defago, A. Schiper, P. Urbn (2004) Total order
process adelivers m at most once and only if m was
                                                           broadcast and multicast algorithms: Taxonomy and
previously broadcast (integrity property).
                                                           survey. ACM Computing Surveys, Vol. 36, Issue 4,
Proof The coordinator pc , after choosing the value
                                                           Page(s):372-421, Dec. 2004.
to be proposed cval at line 24 or 26 (see lemma 2),
affects to the proposed value cval a unique identifier     V. Hadzilacos and S. Toueg, Reliable broadcast and
cvid at line 27 that can be used by the processes to       related problems , In Distributed Systems, ACM
check if the message has been adelivered or not.           Press, Page(s):97-145, 1993.
Lemma4 If some correct process adelivers m before          H. Garcia-Molina, A. Spauster, Ordered and reliable
m0 , then every correct process adelivers m0 only          multicast communication, ACM Trans. Comput. Syst,
after it has adelivered m (total order property).          Vol. 9, Issue 3, Page(s):242-271, Aug. 1991.
Proof The property is satisfied by the use of the          K. P. Birman, A. Schiper, P. Stephenson Lightweight
♦S failure detector which ensures that eventually          causal and atomic group multicast, ACM Trans.
a correct process pc will not be suspected by any          Comput. Syst, Vol. 9, Issue 3, Page(s):272-314, Aug.
correct process, it is the coordinator. Once pc is         1991.
elected for the round crnd, it proposes a value cval
                                                           J. M. Chang, N. F. Maxemchuk Reliable broadcast
at line 28, affects a unique identifier cvid at line 27
                                                           protocols, ACM Trans. Comput. Syst, Vol. 2, Issue 3,
to this value (for an other round, the coordinator
                                                           Page(s):251273, Aug. 1984.
will affect cvid0 to cval0 in crnd0 . Consequently, the
coordinator will gather the votes, firstly, to cval and    F. Cristian, S. Mishra,G. Alvarez High-performance
deliver it, then to cval0 . Hence, the order between the   asynchronous atomic broadcast, Distributed System
values (broadcast messages) is guaranteed.                 Engineering Journal, Vol. 4, Issue 2, Page(s):109-
Theorem The algorithm 1 implements the Atomic              128, June. 1997.
Broadcast primitives.                                      R. Ekwall, A. Schiper, A Fault-Tolerant Token-Based
                                                           Atomic Broadcast Algorithm, IEEE Trans. Depend-
                                                           able Sec. Comput, Vol. 8, Issue 5, Page(s):625-639,
7. CONCLUSIONS
                                                           2011.
The paper has presented a new algorithm for solving        Y. Amir, L. E. Moser, P. M. Melliar-Smith, D. A.
atomic broadcast in asynchronous system. The               Agrawal, P. Ciarfella, The Totem single-ring ordering
algorithm inherits from RP its main characteristics;       and membership protocol, ACM Trans. Comput.
coordinator principle, ring structure to decide the        Syst, Vol. 13, Issue 4, Page(s):311-342, Nov. 1995.
order of messages. They make the communication             L. Lamport, Time, clocks, and the ordering of events
centralized and reduce the contention of the system,       in a distributed system, ACM, Vol. 21, Issue 7,
i.e. the number of messages transmitted. We also           Page(s):558-565, July. 1978.
introduced the ♦S failure detector to tolerate process
                                                           Z. Bar-Joseph, I. Keidar, N. Lynch, Early-delivery
failures. The use of this mechanism is considered as
                                                           dynamic atomic broadcast, In Proc. 16th Intl. Symp.
the best method for fault tolerance. Hence, unlike RP,
                                                           on Distributed Computing (DISC’02), D. Malkhi, Ed.
many steps can be omitted using failure detectors,
                                                           LNCS, vol. 2508, pp. 1-16, 2002.
especially when the coordinator is crashed, so
processes haven’t to wait, a long time, for its            L. Rodrigues, M. Raynal, ”Atomic broadcast in
message and proceed directly to the next round. The        asynchronous crash-recovery distributed systems”.
same thing for the coordinator, when its associated        In Proc. 20th IEEE Intl. Conf. on Distributed
failure detector suspects a majority of acceptors then     Computing Systems (ICDCS- 20),pp. 288-295, Juin
it proceeds directly to the next round without waiting     2000.
a long time for un-received messages from crashed          T. Chandra, S. Toueg, Unreliable failure detectors for
acceptors. Hence, a good performance in time and           reliable distributed systems, ACM, Vol. 43, Issue 2,
message number is reached, in comparison with RP.          Page(s):225-267, 1996.
These positive points will be justified by a simulation    G. Chockler, I. Keidar, R. Vitenberg, Group
results as future goals.                                   communication specifications: A comprehensive
                                                           study, ACM, Vol. 33, Issue 4, Page(s):1-43, Dec.
8. REFERENCES                                              1996.
                                                           P.J. Marandi, M. Primi, N. Schiper, F. Pedone,
L. Lamport (1998) The part-time parliament. ACM            Ring Paxos: A HighThroughput Atomic Broadcast
Transactions on Computer Systems, 16, 133–169,
                                                68




Protocol, DNS,IEEE Computer Society, pp. 112,
2012.