=Paper= {{Paper |id=Vol-1256/paper6 |storemode=property |title=Failure Detector-Ring Paxos Based Atomic Broadcast Algorithm |pdfUrl=https://ceur-ws.org/Vol-1256/paper6.pdf |volume=Vol-1256 |dblpUrl=https://dblp.org/rec/conf/vecos/Nadjette14 }} ==Failure Detector-Ring Paxos Based Atomic Broadcast Algorithm== https://ceur-ws.org/Vol-1256/paper6.pdf

Failure Detector-Ring Paxos based Atomic
Broadcast Algorithm

Nadjette Rebouh
LAMOS, Faculty of Exact Sciences, University of Béjaı̈a
Algeria
nadjette rebouh34@yahoo.fr

The Atomic broadcast problem constitutes an essential component in fault-tolerant distributed systems.
An Atomic broadcast algorithm ensures that all processes deliver the same messages sequence. Many
algorithms have been published. Recently, a new Paxos-based algorithm has been proposed for clustered
systems, named Ring Paxos. It inherits many of its characteristics: safe under asynchronous assumptions,
live under weak synchronous assumptions, and requires a majority of non faulty processes to ensure
progress. However, the proposed algorithm relies on strong assumptions, and uses the group membership
for faults tolerance. In this paper, we propose a new Ring Paxos-based atomic broadcast algorithm for
distributed systems. It inherits some of Ring Paxos characteristics and uses failure detectors to tolerate
failures. A slight comparison between Ring Paxos and the proposed algorithm shows that our algorithm
surpasses the performance of the other algorithm.

Agreement Problem; Atomic broadcast; Ring Paxos; Failure Detectors

1. INTRODUCTION Toueg (1996) and group membership Chockler et
al. (2001). Recently, a new algorithm that solves the
An Atomic Broadcast Algorithm is an important atomic broadcast problem has been proposed, Ring
building block in distributed fault-tolerant systems. Paxos (RP) Parisa et al. (2012). RP is based on
It provides a group communication primitive that Paxos, a well-known consensus algorithm Lamport
allows processes to deliver the same messages (1998), and inherits many of its characteristics: it is
sequence Hadzilacos (1993). Atomic Broadcast safe under asynchronous assumptions, live under
is pratically useful to implement state-machine weak synchronous assumptions, and resiliency-
replication Verissimo and Raynal (1999). By optimal, that is, it requires a majority of correct
employing this group primitive to disseminate processes to ensure progress. They optimize the
updates, all correct copies deliver the same set original version of Paxos from which they derive
of updates in the same order, and consequently RP. In their paper, they look for an efficient
ensuring the copies consistency. The Atomic implementation, in terms of throughput, of RP.
Broadcast problem has been extensively studied For that purpose, RP uses a single ip-multicast
in the literature. Therefore, several algorithms stream to disseminate messages and thus benefits
have been proposed to implement this primitive, from the throughput that ip-multicast can provide.
considering a diverse set of execution environments Unfortunately, ip-multicast is unreliable. Hence, RP
and system models. Although many classifications places a majority of correct processes (f+1) in a
are possible when considering algorithms designed logical ring, where f is the number of tolerated
for distributed systems, according to the ordering failures, to totally order messages. From one part,
mechanism or the mechanism used to tolerate RP achieves very high throughput while providing
failures. The first class includes protocols that low latency, almost constant with the number of
rely on consensus to decide the order and the receivers. From the other part, RP relies on very
number of messages to deliver (Chandra and Toueg strong assumptions: it supposes one coordinator
(1996),Verissimo and Raynal (1999),etc.) or on the along the execution, impractical assumption. It, also,
token circulation protocols schiper et al. (2004). requires a partially synchronous system, that is,
Two fundamental mechanisms define the second an asynchronous model until a global stabilization
class: unreliable failure detectors Chandra and time, when it becomes synchronous. Last and not
64

least, RP relies on a group membership service orders the messages according to their timestamps
to detect failures, a highly expensive mechanism Lamport (1978), Bar-Joseph et al. (2002) and (5)
Ekwall and Schiper (2011). The aim of this paper destination agreement algorithms: in this class, the
is to circumvent these inconvenient by considering delivery order results from an agreement between
an asynchronous system, with a variable coordinator the destination processes in different states Chandra
and ♦S failure detector for fault tolerance, the and Toueg (1996), Rodrigues and Raynal (2000).
resulting algorithm is Failure Detector-Ring Paxos However, the ordering mechanism is not the only
based Atomic Broadcast Algorithm (FD-RP). FD-RP key for atomic broadcast algorithms. The mechanism
is an algorithm that solves the atomic broadcast used for faults tolerance is another important
problem in a distributed asynchronous system. As, characteristic of these algorithms. Two classes, in
in RP, FD-RP distinguishes three roles of processes: asynchronous systems, are defined: (1) algorithms
proposers, acceptors and learners, indeed of the that rely on unreliable failure detectors Chandra
coordinator (that can be one among them). Its main and Toueg (1996) and (2) algorithms that rely on
characteristic is the use of ♦S failure detector to group membership Chockler et al. (2001). It will be
agree on the set of acceptors forming the logical interesting to propose an algorithm that combines
ring, hence enabling processes from waiting for more than two classes. In Parisa et al. (2012), the
responses from crashed processes, consequently, authors proposed a solution that is a combination
reducing the execution time. Initially, the coordinator of several classes. The solution is based on
broadcasts a request to form the ring. Correct the coordinator principle, similarly to the Paxos
acceptors that have not suspected the coordinator protocol Lamport (1998), to ensure the delivery
accepts the request and respond positively. When order by affecting unique identifiers to broadcast
the agreement about the value decided (the messages. They also, once the value is picked by the
sequence of messages to deliver) has been reached coordinator, use the token principle for disseminating
(between coordinator and the elected acceptors), the proposed value until taking the decision.
the decision will be broadcast by the coordinator However, the solution uses ip-multicast primitives,
to all correct processes (proposers, acceptors and which is unreliable communication mechanism and
learners). The rest of the paper is structured as needs the retransmission, for unknown times, of
follows. Section 2 describes some related work. lost messages. The solution, also, assumes very
Section 3 is devoted to the system model and the strong hypothesis: one coordinator and failure free
definition of related agreement problems. In section partially synchronous model. Consequently, these
4 we give an overview of the FD-RP-based Atomic hypotheses make the protocol impractical in some
Broadcast. The correctness proof of the algorithm is models and increase the time execution of the
provided in section 5. Section 6 concludes the paper. protocol.
In this paper, we propose a new solution for
the atomic broadcast problem. It circumvents the
2. RELATED WORK different inconvenient faced in the RP and adds
other characteristics from other classes to obtain
Atomic broadcast algorithms have been widely
a novel algorithm for the atomic broadcast. Our
studied in the last twenty years. They can be
solution assumes a general asynchronous system.
summarized into five interesting classes schiper et
It combines two different classes: the rotating
al. (2004): (1) fixed sequencer algorithms: in this
coordinator and the token ring as the ordering
class, one unique process, during the hole execution,
mechanisms and the ♦S failure detector Chandra
is chosen as a sequencer (coordinator) and it
and Toueg (1996) for tolerating faults. In order
is responsible for ordering messages by affecting
to disseminate messages, our protocol implements
sequence numbers to the broadcast messages
the reliable broadcast primitives that is a broadcast
Garcia-Molina and Spauster (1991),Birman et al.
message is received by all correct processes or
(1991), (2) moving sequencer class: it keeps the
by none of them. These characteristics make our
same principle as the previous class but allows the
protocol general and perform good results in terms of
role of the sequencer to be transferred between
time and the number of messages needed to reach
several processes Chang and Maxemchuk (1984),
a decision.
Cristian et al. (1997), (3) privilege based class: it
is characterized by the use of the token, processes
can send their messages only when they receive the 3. SYSTEM MODEL AND DEFINITIONS
token and the order is fixed by the message sender
Ekwall and Schiper (2011), Amir et al. (1995), 3.1. System Model
(4) communication history algorithms: the messages’
order is determined by the messages’ senders by We consider a distributed asynchronous, there is no
affecting a timestamp to each message broadcast, assumption about the relative speed of processes
upon receiving the messages, the destination nor on message transfer delays, system composed
65

of n processes Π = {p1 , p2 , . . . , pn }. Processes the highest-numbered round that the coordinator
communicate by message passing through reliable has started to an arbitrary value that will be the
channel, defined by the two primitives send(m) and greater round number picked so far and cring: the
receive(m). A process can only fail by crashing. At proposed ring for this round. It sends, using ip-
most f (< n/2) processes are faulty, when n is the multicast primitive, a phase1A message to a majority
total number of the processes in the system. By quorum Qa = (Na +1)/2 of acceptors (Na is the set of
definition a correct process is a process that does acceptors) placed in a logical directed ring including
not crash. A faulty process is one that is not correct. the coordinator as the last acceptor, differently than
We assume that the system is augmented with Paxos. The proposed structure reduces the incoming
♦S failure detector that provides (possibly incorrect) messages at the coordinator and balances the
information about the processes that are suspected communication among acceptors. The purpose of
to have crashed. this message is to propose the ring that will be
used in the remaining parts of the round. Upon
3.2. Atomic Broadcast receiving the message by an acceptor, it affects
the variables rnd: the highest-numbered round, in
Atomic broadcast is an agreement problem that
which the acceptor has participated and ring: the
is equivalent to consensus Chandra and Toueg
current ring accepted by those received from the
(1996), defined by two primitive abroadcast(m)
coordinator then it replays by a phase1B message
and adeliver(m). Atomic Broadcast satisfies the
accepting the coordinator proposition and proposing
following properties Hadzilacos (1993):(1) if a
the last voted value vval: the value voted by the
correct process abroadcasts a message m, then it
acceptor in round vrnd: the highest-numbered round
eventually adelivers m (validity), (2) for any message
in which the acceptor has cast a vote (it follows
m, a correct process adelivers m at most once and
that rnd ≤ vrnd always holds). Upon receiving a
only if m was previously abroadcast (integrity), (3)
majority phase1B messages for the same round,
if a correct process adelivers m, then all correct
the coordinator calculates cval: the value that the
processes eventually adeliver m (agreement), (4)
coordinator has picked for round crnd (chosen from
if a correct process adelivers m before m0 , then
the received acceptors values or a new proposed
every correct process adelivers m0 only after it has
value by a proposer), affects a cvid: a unique
adelivered m (total order ).
identifier affected to the proposed value. It, then
3.3. Failure Detector executes the phase 2 of its round and ip-multicasts a
phase2A message to Qa + Nl ( the set of learners).
We refer below to the ♦S failure detector class in- In the remaining tasks, acceptors try to agree on
troduced in Chandra and Toueg (1996) and defined the value proposed by the coordinator. So, after
by the two following properties: (i) eventually every ip-delivering a phase2A message by an acceptor
process that crashes is permanently suspected by with the same sending information, it updates its
every correct process (strong completeness), and vrnd,vval and affects its vvid: a unique identifier
(ii) there is a time after which some correct process affected to the proposed value by cvid. If it is not
is never suspected by any correct process (Eventu- the last acceptor in the ring, it keeps sending a
ally Weak Accuracy). Notice that every process is phase2B message to its successor. Otherwise, if it
equipped by a local failure detector that maintains a is the last acceptor in the ring, i.e. the coordinator,
set of processes suspected to have crashed. then it ip-multicasts the DECISION message to all
the processes (Qa + Nl ).

4. OVERVIEW OF THE RING PAXOS
ALGORITHM 5. FAILURE DETECTOR-RING PAXOS BASED
ATOMIC BROADCAST ALGORITHM
The Ring Paxos (RP) Algorithm Parisa et al. (2012)
is an atomic broadcast algorithm that is based on the The FD-RP Atomic Broadcast Algorithm, given in
Paxos algorithm Lamport (1998). The computation Algorithm 1, is also based on the Paxos algorithm
proceeds in asynchronous rounds. Each round is Lamport (1998), and similarly, every process can
composed of five tasks, distributed on two phases. A be a proposer, an acceptor or a learner. The
process in a round crnd can take one or several roles algorithm proceeds in asynchronous rounds. Each
among: (1) proposer, (2) acceptor, (3) learner. A round is coordinated by one coordinator, it is process
proposer initiates the protocol by proposing a value. number (crnd mod n)+1. A coordinator can be a
Several proposers can coexist during one round, but proposer or an acceptor. In the first task of the
one value can be chosen by a coordinator. Once first phase and after receiving propositions from the
a coordinator (a proposer or an acceptor) receives proposers, the coordinator launches the algorithm
the needed value, it begins the first phase of its by broadcasting a phase1A message to the set
round. For this purpose, it updates the variable crnd: of acceptors. The message contains the round
66

number rnd and the proposed ring number cring Algorithm 1 FD-RP based AB Algorithm
that will be used in the remaining tasks (Task1). Two 1: Task1 : (Coordinator)
cases can be distinguished. The acceptor suspects 2: upon reception of value v f rom a proposer
the coordinator to be crashed, by requesting its 3: increase crnd to an arbitrary unique value
associated failure detector ♦S, and proceeds to 4: let cring be an overlay ring with processes in Qa
the next round. Otherwise, it replays by a phase1B 5: f or all pi ∈ Qa do send (pi ,(phase1A,crnd,cring))
message informing the coordinator that it abides by 6: Task2 : (Acceptor)
the proposed ring. It attaches the message by its last 7: wait until reception of (phase1A,crnd,cring)
voted value vval (Task2). If the coordinator hasn’t f rom the coord or coord ∈ Di
received a majority of responses from the acceptors, 8: if coord ∈ / Di then
because of acceptors’ failures or its suspicion by 9: if crnd > rnd then
those acceptors, it aborts the round. Otherwise, the 10: rnd ← crnd; ring ← cring;
coordinated has received a majority of messages; it send (coord,(phase1B,rnd,vrnd,vval))
chooses the value to be proposed cval. This value 11: end if
can be the voted value during the greater round 12: else
executed so far, or a new value proposed by a 13: send (coord,(phase1B,rnd,nack))
proposer. Notice that the value proposed can be 14: end if
an update of a copy in a distributed database, a 15: Task3 : (Coordinator)
shared object, etc. Then, the coordinator affects 16: wait until reception of (phase1B,rnd,vrnd,v-
an identifier cvid to the proposed value and sends val)f rom Qa such that crnd =rnd
a phase2A message to the acceptors (Task3). An 17: if ∃ one message (phase1B,rnd,nack) then
acceptor, that hasn’t suspected the coordinator for 18: send (Qa ∪ Nl ,(phase2A,crnd,Abort))
the current round (by requesting its failure detector), 19: else
updates its variables vval, vrnd and vvid. It checks 20: k ← M axvrnd received
if it is not the last acceptor in the ring. If so, it 21: V S ← (vrnd, vval) received s.t. k = vrnd
keeps sending a phase2B message to its successor 22: if k = 0 then
in the ring (Task4). Otherwise, it is the coordinator, 23: cval ← v
it broadcasts a DECISION message to all the 24: else
processes (acceptors and learners) (Task5). 25: cval ← vval, the only vval in V S
26: let cvid be the unique identif ier f or cval
6. PROOF OF CORRECTNESS 27: send (Qa ∪ Nl ,(phase2A,crnd,cval,cvid))
28: end if
Lemma1 if a correct process adelivers a message 29: end if
m, then all correct processes eventually adeliver m 30: Task4 : (Acceptor)
(Agreement property). 31: wait until reception of (phase2A,crnd,cval,cvid)
Proof This lemma is satisfied by the use of f rom the coordinator or coord ∈ Di
the reliable broadcast primitive to broadcast the 32: if coord ∈ / Di then
decision. A reliable broadcast primitive ensures 33: if N o message (phase2A,crnd,Abort) has been
that all correct processes deliver the same set of received then
messages broadcast by the coordinator. Let pc be 34: if crnd > rnd then
the coordinator that decides at round crnd at line 52. 35: vrnd ← crnd;vval ← cval;vvid ← cvid;
Then, it rbroadcasts this decision (including all the 36: if f irst(ring) then
processes in Qa ∪ Nl ) at the same line 52. According 37: send (successor,(phase2B,crnd, cvid))
to the reliable broadcast properties, all the processes 38: end if
will, eventually, receive the coordinator’s message. 39: end if
Hence, they adeliver the set of messages according 40: end if
to a deterministic rule. 41: else
Lemma2 If a correct process adelivers a message 42: send (successor,(phase2B,crnd,nack))
m then m has been abroadcast by a correct process 43: end if
(validity property). 44: Task5 : (Coordinator and acceptor)
Proof Once a majority of acceptors agree on the ring 45: upon reception of (phase2B,crnd,cvid) f rom a
cring proposed by the coordinator of the round pc successor or (phase2B,crnd,nack)
(line 10,11), the coordinator chooses a value cval. 46: if (phase2B,crnd,cvid) has been received then
This value can be a new value v received from a 47: if cvid= vvid then
proposer at line 2 or a voted value vval of some 48: if not last(ring) then
processes in previous rounds calculated by pc at line 49: send (successor,(phase2B,crnd,cvid))
26. The coordinator will propose the selected value 50: else
to the majority of acceptors at line 28. Consequently, 51: send (Qa ∪ Nl ,(DECISION ,cvid))
52: end if
53: end if
54: end if
67

the decided value cval has a total agreement from May. 1998.
the acceptors in cring (including the coordinator) and P. Verissimo, M. Raynal (1999) Time in Distributed
it is a proposition of the coordinator pc of the current System Models and Algorithms. In proceeding, May.
round crnd. 1999. 1–32.
Lemma3 For each message m , every correct
X. Defago, A. Schiper, P. Urbn (2004) Total order
process adelivers m at most once and only if m was
broadcast and multicast algorithms: Taxonomy and
previously broadcast (integrity property).
survey. ACM Computing Surveys, Vol. 36, Issue 4,
Proof The coordinator pc , after choosing the value
Page(s):372-421, Dec. 2004.
to be proposed cval at line 24 or 26 (see lemma 2),
affects to the proposed value cval a unique identifier V. Hadzilacos and S. Toueg, Reliable broadcast and
cvid at line 27 that can be used by the processes to related problems , In Distributed Systems, ACM
check if the message has been adelivered or not. Press, Page(s):97-145, 1993.
Lemma4 If some correct process adelivers m before H. Garcia-Molina, A. Spauster, Ordered and reliable
m0 , then every correct process adelivers m0 only multicast communication, ACM Trans. Comput. Syst,
after it has adelivered m (total order property). Vol. 9, Issue 3, Page(s):242-271, Aug. 1991.
Proof The property is satisfied by the use of the K. P. Birman, A. Schiper, P. Stephenson Lightweight
♦S failure detector which ensures that eventually causal and atomic group multicast, ACM Trans.
a correct process pc will not be suspected by any Comput. Syst, Vol. 9, Issue 3, Page(s):272-314, Aug.
correct process, it is the coordinator. Once pc is 1991.
elected for the round crnd, it proposes a value cval
J. M. Chang, N. F. Maxemchuk Reliable broadcast
at line 28, affects a unique identifier cvid at line 27
protocols, ACM Trans. Comput. Syst, Vol. 2, Issue 3,
to this value (for an other round, the coordinator
Page(s):251273, Aug. 1984.
will affect cvid0 to cval0 in crnd0 . Consequently, the
coordinator will gather the votes, firstly, to cval and F. Cristian, S. Mishra,G. Alvarez High-performance
deliver it, then to cval0 . Hence, the order between the asynchronous atomic broadcast, Distributed System
values (broadcast messages) is guaranteed. Engineering Journal, Vol. 4, Issue 2, Page(s):109-
Theorem The algorithm 1 implements the Atomic 128, June. 1997.
Broadcast primitives. R. Ekwall, A. Schiper, A Fault-Tolerant Token-Based
Atomic Broadcast Algorithm, IEEE Trans. Depend-
able Sec. Comput, Vol. 8, Issue 5, Page(s):625-639,
7. CONCLUSIONS
2011.
The paper has presented a new algorithm for solving Y. Amir, L. E. Moser, P. M. Melliar-Smith, D. A.
atomic broadcast in asynchronous system. The Agrawal, P. Ciarfella, The Totem single-ring ordering
algorithm inherits from RP its main characteristics; and membership protocol, ACM Trans. Comput.
coordinator principle, ring structure to decide the Syst, Vol. 13, Issue 4, Page(s):311-342, Nov. 1995.
order of messages. They make the communication L. Lamport, Time, clocks, and the ordering of events
centralized and reduce the contention of the system, in a distributed system, ACM, Vol. 21, Issue 7,
i.e. the number of messages transmitted. We also Page(s):558-565, July. 1978.
introduced the ♦S failure detector to tolerate process
Z. Bar-Joseph, I. Keidar, N. Lynch, Early-delivery
failures. The use of this mechanism is considered as
dynamic atomic broadcast, In Proc. 16th Intl. Symp.
the best method for fault tolerance. Hence, unlike RP,
on Distributed Computing (DISC’02), D. Malkhi, Ed.
many steps can be omitted using failure detectors,
LNCS, vol. 2508, pp. 1-16, 2002.
especially when the coordinator is crashed, so
processes haven’t to wait, a long time, for its L. Rodrigues, M. Raynal, ”Atomic broadcast in
message and proceed directly to the next round. The asynchronous crash-recovery distributed systems”.
same thing for the coordinator, when its associated In Proc. 20th IEEE Intl. Conf. on Distributed
failure detector suspects a majority of acceptors then Computing Systems (ICDCS- 20),pp. 288-295, Juin
it proceeds directly to the next round without waiting 2000.
a long time for un-received messages from crashed T. Chandra, S. Toueg, Unreliable failure detectors for
acceptors. Hence, a good performance in time and reliable distributed systems, ACM, Vol. 43, Issue 2,
message number is reached, in comparison with RP. Page(s):225-267, 1996.
These positive points will be justified by a simulation G. Chockler, I. Keidar, R. Vitenberg, Group
results as future goals. communication specifications: A comprehensive
study, ACM, Vol. 33, Issue 4, Page(s):1-43, Dec.
8. REFERENCES 1996.
P.J. Marandi, M. Primi, N. Schiper, F. Pedone,
L. Lamport (1998) The part-time parliament. ACM Ring Paxos: A HighThroughput Atomic Broadcast
Transactions on Computer Systems, 16, 133–169,
68

Protocol, DNS,IEEE Computer Society, pp. 112,
2012.