=Paper=
{{Paper
|id=Vol-1256/paper6
|storemode=property
|title=Failure Detector-Ring Paxos Based Atomic Broadcast Algorithm
|pdfUrl=https://ceur-ws.org/Vol-1256/paper6.pdf
|volume=Vol-1256
|dblpUrl=https://dblp.org/rec/conf/vecos/Nadjette14
}}
==Failure Detector-Ring Paxos Based Atomic Broadcast Algorithm==
63 Failure Detector-Ring Paxos based Atomic Broadcast Algorithm Nadjette Rebouh LAMOS, Faculty of Exact Sciences, University of Béjaı̈a Algeria nadjette rebouh34@yahoo.fr The Atomic broadcast problem constitutes an essential component in fault-tolerant distributed systems. An Atomic broadcast algorithm ensures that all processes deliver the same messages sequence. Many algorithms have been published. Recently, a new Paxos-based algorithm has been proposed for clustered systems, named Ring Paxos. It inherits many of its characteristics: safe under asynchronous assumptions, live under weak synchronous assumptions, and requires a majority of non faulty processes to ensure progress. However, the proposed algorithm relies on strong assumptions, and uses the group membership for faults tolerance. In this paper, we propose a new Ring Paxos-based atomic broadcast algorithm for distributed systems. It inherits some of Ring Paxos characteristics and uses failure detectors to tolerate failures. A slight comparison between Ring Paxos and the proposed algorithm shows that our algorithm surpasses the performance of the other algorithm. Agreement Problem; Atomic broadcast; Ring Paxos; Failure Detectors 1. INTRODUCTION Toueg (1996) and group membership Chockler et al. (2001). Recently, a new algorithm that solves the An Atomic Broadcast Algorithm is an important atomic broadcast problem has been proposed, Ring building block in distributed fault-tolerant systems. Paxos (RP) Parisa et al. (2012). RP is based on It provides a group communication primitive that Paxos, a well-known consensus algorithm Lamport allows processes to deliver the same messages (1998), and inherits many of its characteristics: it is sequence Hadzilacos (1993). Atomic Broadcast safe under asynchronous assumptions, live under is pratically useful to implement state-machine weak synchronous assumptions, and resiliency- replication Verissimo and Raynal (1999). By optimal, that is, it requires a majority of correct employing this group primitive to disseminate processes to ensure progress. They optimize the updates, all correct copies deliver the same set original version of Paxos from which they derive of updates in the same order, and consequently RP. In their paper, they look for an efficient ensuring the copies consistency. The Atomic implementation, in terms of throughput, of RP. Broadcast problem has been extensively studied For that purpose, RP uses a single ip-multicast in the literature. Therefore, several algorithms stream to disseminate messages and thus benefits have been proposed to implement this primitive, from the throughput that ip-multicast can provide. considering a diverse set of execution environments Unfortunately, ip-multicast is unreliable. Hence, RP and system models. Although many classifications places a majority of correct processes (f+1) in a are possible when considering algorithms designed logical ring, where f is the number of tolerated for distributed systems, according to the ordering failures, to totally order messages. From one part, mechanism or the mechanism used to tolerate RP achieves very high throughput while providing failures. The first class includes protocols that low latency, almost constant with the number of rely on consensus to decide the order and the receivers. From the other part, RP relies on very number of messages to deliver (Chandra and Toueg strong assumptions: it supposes one coordinator (1996),Verissimo and Raynal (1999),etc.) or on the along the execution, impractical assumption. It, also, token circulation protocols schiper et al. (2004). requires a partially synchronous system, that is, Two fundamental mechanisms define the second an asynchronous model until a global stabilization class: unreliable failure detectors Chandra and time, when it becomes synchronous. Last and not 64 least, RP relies on a group membership service orders the messages according to their timestamps to detect failures, a highly expensive mechanism Lamport (1978), Bar-Joseph et al. (2002) and (5) Ekwall and Schiper (2011). The aim of this paper destination agreement algorithms: in this class, the is to circumvent these inconvenient by considering delivery order results from an agreement between an asynchronous system, with a variable coordinator the destination processes in different states Chandra and ♦S failure detector for fault tolerance, the and Toueg (1996), Rodrigues and Raynal (2000). resulting algorithm is Failure Detector-Ring Paxos However, the ordering mechanism is not the only based Atomic Broadcast Algorithm (FD-RP). FD-RP key for atomic broadcast algorithms. The mechanism is an algorithm that solves the atomic broadcast used for faults tolerance is another important problem in a distributed asynchronous system. As, characteristic of these algorithms. Two classes, in in RP, FD-RP distinguishes three roles of processes: asynchronous systems, are defined: (1) algorithms proposers, acceptors and learners, indeed of the that rely on unreliable failure detectors Chandra coordinator (that can be one among them). Its main and Toueg (1996) and (2) algorithms that rely on characteristic is the use of ♦S failure detector to group membership Chockler et al. (2001). It will be agree on the set of acceptors forming the logical interesting to propose an algorithm that combines ring, hence enabling processes from waiting for more than two classes. In Parisa et al. (2012), the responses from crashed processes, consequently, authors proposed a solution that is a combination reducing the execution time. Initially, the coordinator of several classes. The solution is based on broadcasts a request to form the ring. Correct the coordinator principle, similarly to the Paxos acceptors that have not suspected the coordinator protocol Lamport (1998), to ensure the delivery accepts the request and respond positively. When order by affecting unique identifiers to broadcast the agreement about the value decided (the messages. They also, once the value is picked by the sequence of messages to deliver) has been reached coordinator, use the token principle for disseminating (between coordinator and the elected acceptors), the proposed value until taking the decision. the decision will be broadcast by the coordinator However, the solution uses ip-multicast primitives, to all correct processes (proposers, acceptors and which is unreliable communication mechanism and learners). The rest of the paper is structured as needs the retransmission, for unknown times, of follows. Section 2 describes some related work. lost messages. The solution, also, assumes very Section 3 is devoted to the system model and the strong hypothesis: one coordinator and failure free definition of related agreement problems. In section partially synchronous model. Consequently, these 4 we give an overview of the FD-RP-based Atomic hypotheses make the protocol impractical in some Broadcast. The correctness proof of the algorithm is models and increase the time execution of the provided in section 5. Section 6 concludes the paper. protocol. In this paper, we propose a new solution for the atomic broadcast problem. It circumvents the 2. RELATED WORK different inconvenient faced in the RP and adds other characteristics from other classes to obtain Atomic broadcast algorithms have been widely a novel algorithm for the atomic broadcast. Our studied in the last twenty years. They can be solution assumes a general asynchronous system. summarized into five interesting classes schiper et It combines two different classes: the rotating al. (2004): (1) fixed sequencer algorithms: in this coordinator and the token ring as the ordering class, one unique process, during the hole execution, mechanisms and the ♦S failure detector Chandra is chosen as a sequencer (coordinator) and it and Toueg (1996) for tolerating faults. In order is responsible for ordering messages by affecting to disseminate messages, our protocol implements sequence numbers to the broadcast messages the reliable broadcast primitives that is a broadcast Garcia-Molina and Spauster (1991),Birman et al. message is received by all correct processes or (1991), (2) moving sequencer class: it keeps the by none of them. These characteristics make our same principle as the previous class but allows the protocol general and perform good results in terms of role of the sequencer to be transferred between time and the number of messages needed to reach several processes Chang and Maxemchuk (1984), a decision. Cristian et al. (1997), (3) privilege based class: it is characterized by the use of the token, processes can send their messages only when they receive the 3. SYSTEM MODEL AND DEFINITIONS token and the order is fixed by the message sender Ekwall and Schiper (2011), Amir et al. (1995), 3.1. System Model (4) communication history algorithms: the messages’ order is determined by the messages’ senders by We consider a distributed asynchronous, there is no affecting a timestamp to each message broadcast, assumption about the relative speed of processes upon receiving the messages, the destination nor on message transfer delays, system composed 65 of n processes Π = {p1 , p2 , . . . , pn }. Processes the highest-numbered round that the coordinator communicate by message passing through reliable has started to an arbitrary value that will be the channel, defined by the two primitives send(m) and greater round number picked so far and cring: the receive(m). A process can only fail by crashing. At proposed ring for this round. It sends, using ip- most f (< n/2) processes are faulty, when n is the multicast primitive, a phase1A message to a majority total number of the processes in the system. By quorum Qa = (Na +1)/2 of acceptors (Na is the set of definition a correct process is a process that does acceptors) placed in a logical directed ring including not crash. A faulty process is one that is not correct. the coordinator as the last acceptor, differently than We assume that the system is augmented with Paxos. The proposed structure reduces the incoming ♦S failure detector that provides (possibly incorrect) messages at the coordinator and balances the information about the processes that are suspected communication among acceptors. The purpose of to have crashed. this message is to propose the ring that will be used in the remaining parts of the round. Upon 3.2. Atomic Broadcast receiving the message by an acceptor, it affects the variables rnd: the highest-numbered round, in Atomic broadcast is an agreement problem that which the acceptor has participated and ring: the is equivalent to consensus Chandra and Toueg current ring accepted by those received from the (1996), defined by two primitive abroadcast(m) coordinator then it replays by a phase1B message and adeliver(m). Atomic Broadcast satisfies the accepting the coordinator proposition and proposing following properties Hadzilacos (1993):(1) if a the last voted value vval: the value voted by the correct process abroadcasts a message m, then it acceptor in round vrnd: the highest-numbered round eventually adelivers m (validity), (2) for any message in which the acceptor has cast a vote (it follows m, a correct process adelivers m at most once and that rnd ≤ vrnd always holds). Upon receiving a only if m was previously abroadcast (integrity), (3) majority phase1B messages for the same round, if a correct process adelivers m, then all correct the coordinator calculates cval: the value that the processes eventually adeliver m (agreement), (4) coordinator has picked for round crnd (chosen from if a correct process adelivers m before m0 , then the received acceptors values or a new proposed every correct process adelivers m0 only after it has value by a proposer), affects a cvid: a unique adelivered m (total order ). identifier affected to the proposed value. It, then 3.3. Failure Detector executes the phase 2 of its round and ip-multicasts a phase2A message to Qa + Nl ( the set of learners). We refer below to the ♦S failure detector class in- In the remaining tasks, acceptors try to agree on troduced in Chandra and Toueg (1996) and defined the value proposed by the coordinator. So, after by the two following properties: (i) eventually every ip-delivering a phase2A message by an acceptor process that crashes is permanently suspected by with the same sending information, it updates its every correct process (strong completeness), and vrnd,vval and affects its vvid: a unique identifier (ii) there is a time after which some correct process affected to the proposed value by cvid. If it is not is never suspected by any correct process (Eventu- the last acceptor in the ring, it keeps sending a ally Weak Accuracy). Notice that every process is phase2B message to its successor. Otherwise, if it equipped by a local failure detector that maintains a is the last acceptor in the ring, i.e. the coordinator, set of processes suspected to have crashed. then it ip-multicasts the DECISION message to all the processes (Qa + Nl ). 4. OVERVIEW OF THE RING PAXOS ALGORITHM 5. FAILURE DETECTOR-RING PAXOS BASED ATOMIC BROADCAST ALGORITHM The Ring Paxos (RP) Algorithm Parisa et al. (2012) is an atomic broadcast algorithm that is based on the The FD-RP Atomic Broadcast Algorithm, given in Paxos algorithm Lamport (1998). The computation Algorithm 1, is also based on the Paxos algorithm proceeds in asynchronous rounds. Each round is Lamport (1998), and similarly, every process can composed of five tasks, distributed on two phases. A be a proposer, an acceptor or a learner. The process in a round crnd can take one or several roles algorithm proceeds in asynchronous rounds. Each among: (1) proposer, (2) acceptor, (3) learner. A round is coordinated by one coordinator, it is process proposer initiates the protocol by proposing a value. number (crnd mod n)+1. A coordinator can be a Several proposers can coexist during one round, but proposer or an acceptor. In the first task of the one value can be chosen by a coordinator. Once first phase and after receiving propositions from the a coordinator (a proposer or an acceptor) receives proposers, the coordinator launches the algorithm the needed value, it begins the first phase of its by broadcasting a phase1A message to the set round. For this purpose, it updates the variable crnd: of acceptors. The message contains the round 66 number rnd and the proposed ring number cring Algorithm 1 FD-RP based AB Algorithm that will be used in the remaining tasks (Task1). Two 1: Task1 : (Coordinator) cases can be distinguished. The acceptor suspects 2: upon reception of value v f rom a proposer the coordinator to be crashed, by requesting its 3: increase crnd to an arbitrary unique value associated failure detector ♦S, and proceeds to 4: let cring be an overlay ring with processes in Qa the next round. Otherwise, it replays by a phase1B 5: f or all pi ∈ Qa do send (pi ,(phase1A,crnd,cring)) message informing the coordinator that it abides by 6: Task2 : (Acceptor) the proposed ring. It attaches the message by its last 7: wait until reception of (phase1A,crnd,cring) voted value vval (Task2). If the coordinator hasn’t f rom the coord or coord ∈ Di received a majority of responses from the acceptors, 8: if coord ∈ / Di then because of acceptors’ failures or its suspicion by 9: if crnd > rnd then those acceptors, it aborts the round. Otherwise, the 10: rnd ← crnd; ring ← cring; coordinated has received a majority of messages; it send (coord,(phase1B,rnd,vrnd,vval)) chooses the value to be proposed cval. This value 11: end if can be the voted value during the greater round 12: else executed so far, or a new value proposed by a 13: send (coord,(phase1B,rnd,nack)) proposer. Notice that the value proposed can be 14: end if an update of a copy in a distributed database, a 15: Task3 : (Coordinator) shared object, etc. Then, the coordinator affects 16: wait until reception of (phase1B,rnd,vrnd,v- an identifier cvid to the proposed value and sends val)f rom Qa such that crnd =rnd a phase2A message to the acceptors (Task3). An 17: if ∃ one message (phase1B,rnd,nack) then acceptor, that hasn’t suspected the coordinator for 18: send (Qa ∪ Nl ,(phase2A,crnd,Abort)) the current round (by requesting its failure detector), 19: else updates its variables vval, vrnd and vvid. It checks 20: k ← M axvrnd received if it is not the last acceptor in the ring. If so, it 21: V S ← (vrnd, vval) received s.t. k = vrnd keeps sending a phase2B message to its successor 22: if k = 0 then in the ring (Task4). Otherwise, it is the coordinator, 23: cval ← v it broadcasts a DECISION message to all the 24: else processes (acceptors and learners) (Task5). 25: cval ← vval, the only vval in V S 26: let cvid be the unique identif ier f or cval 6. PROOF OF CORRECTNESS 27: send (Qa ∪ Nl ,(phase2A,crnd,cval,cvid)) 28: end if Lemma1 if a correct process adelivers a message 29: end if m, then all correct processes eventually adeliver m 30: Task4 : (Acceptor) (Agreement property). 31: wait until reception of (phase2A,crnd,cval,cvid) Proof This lemma is satisfied by the use of f rom the coordinator or coord ∈ Di the reliable broadcast primitive to broadcast the 32: if coord ∈ / Di then decision. A reliable broadcast primitive ensures 33: if N o message (phase2A,crnd,Abort) has been that all correct processes deliver the same set of received then messages broadcast by the coordinator. Let pc be 34: if crnd > rnd then the coordinator that decides at round crnd at line 52. 35: vrnd ← crnd;vval ← cval;vvid ← cvid; Then, it rbroadcasts this decision (including all the 36: if f irst(ring) then processes in Qa ∪ Nl ) at the same line 52. According 37: send (successor,(phase2B,crnd, cvid)) to the reliable broadcast properties, all the processes 38: end if will, eventually, receive the coordinator’s message. 39: end if Hence, they adeliver the set of messages according 40: end if to a deterministic rule. 41: else Lemma2 If a correct process adelivers a message 42: send (successor,(phase2B,crnd,nack)) m then m has been abroadcast by a correct process 43: end if (validity property). 44: Task5 : (Coordinator and acceptor) Proof Once a majority of acceptors agree on the ring 45: upon reception of (phase2B,crnd,cvid) f rom a cring proposed by the coordinator of the round pc successor or (phase2B,crnd,nack) (line 10,11), the coordinator chooses a value cval. 46: if (phase2B,crnd,cvid) has been received then This value can be a new value v received from a 47: if cvid= vvid then proposer at line 2 or a voted value vval of some 48: if not last(ring) then processes in previous rounds calculated by pc at line 49: send (successor,(phase2B,crnd,cvid)) 26. The coordinator will propose the selected value 50: else to the majority of acceptors at line 28. Consequently, 51: send (Qa ∪ Nl ,(DECISION ,cvid)) 52: end if 53: end if 54: end if 67 the decided value cval has a total agreement from May. 1998. the acceptors in cring (including the coordinator) and P. Verissimo, M. Raynal (1999) Time in Distributed it is a proposition of the coordinator pc of the current System Models and Algorithms. In proceeding, May. round crnd. 1999. 1–32. Lemma3 For each message m , every correct X. Defago, A. Schiper, P. Urbn (2004) Total order process adelivers m at most once and only if m was broadcast and multicast algorithms: Taxonomy and previously broadcast (integrity property). survey. ACM Computing Surveys, Vol. 36, Issue 4, Proof The coordinator pc , after choosing the value Page(s):372-421, Dec. 2004. to be proposed cval at line 24 or 26 (see lemma 2), affects to the proposed value cval a unique identifier V. Hadzilacos and S. Toueg, Reliable broadcast and cvid at line 27 that can be used by the processes to related problems , In Distributed Systems, ACM check if the message has been adelivered or not. Press, Page(s):97-145, 1993. Lemma4 If some correct process adelivers m before H. Garcia-Molina, A. Spauster, Ordered and reliable m0 , then every correct process adelivers m0 only multicast communication, ACM Trans. Comput. Syst, after it has adelivered m (total order property). Vol. 9, Issue 3, Page(s):242-271, Aug. 1991. Proof The property is satisfied by the use of the K. P. Birman, A. Schiper, P. Stephenson Lightweight ♦S failure detector which ensures that eventually causal and atomic group multicast, ACM Trans. a correct process pc will not be suspected by any Comput. Syst, Vol. 9, Issue 3, Page(s):272-314, Aug. correct process, it is the coordinator. Once pc is 1991. elected for the round crnd, it proposes a value cval J. M. Chang, N. F. Maxemchuk Reliable broadcast at line 28, affects a unique identifier cvid at line 27 protocols, ACM Trans. Comput. Syst, Vol. 2, Issue 3, to this value (for an other round, the coordinator Page(s):251273, Aug. 1984. will affect cvid0 to cval0 in crnd0 . Consequently, the coordinator will gather the votes, firstly, to cval and F. Cristian, S. Mishra,G. Alvarez High-performance deliver it, then to cval0 . Hence, the order between the asynchronous atomic broadcast, Distributed System values (broadcast messages) is guaranteed. Engineering Journal, Vol. 4, Issue 2, Page(s):109- Theorem The algorithm 1 implements the Atomic 128, June. 1997. Broadcast primitives. R. Ekwall, A. Schiper, A Fault-Tolerant Token-Based Atomic Broadcast Algorithm, IEEE Trans. Depend- able Sec. Comput, Vol. 8, Issue 5, Page(s):625-639, 7. CONCLUSIONS 2011. The paper has presented a new algorithm for solving Y. Amir, L. E. Moser, P. M. Melliar-Smith, D. A. atomic broadcast in asynchronous system. The Agrawal, P. Ciarfella, The Totem single-ring ordering algorithm inherits from RP its main characteristics; and membership protocol, ACM Trans. Comput. coordinator principle, ring structure to decide the Syst, Vol. 13, Issue 4, Page(s):311-342, Nov. 1995. order of messages. They make the communication L. Lamport, Time, clocks, and the ordering of events centralized and reduce the contention of the system, in a distributed system, ACM, Vol. 21, Issue 7, i.e. the number of messages transmitted. We also Page(s):558-565, July. 1978. introduced the ♦S failure detector to tolerate process Z. Bar-Joseph, I. Keidar, N. Lynch, Early-delivery failures. The use of this mechanism is considered as dynamic atomic broadcast, In Proc. 16th Intl. Symp. the best method for fault tolerance. Hence, unlike RP, on Distributed Computing (DISC’02), D. Malkhi, Ed. many steps can be omitted using failure detectors, LNCS, vol. 2508, pp. 1-16, 2002. especially when the coordinator is crashed, so processes haven’t to wait, a long time, for its L. Rodrigues, M. Raynal, ”Atomic broadcast in message and proceed directly to the next round. The asynchronous crash-recovery distributed systems”. same thing for the coordinator, when its associated In Proc. 20th IEEE Intl. Conf. on Distributed failure detector suspects a majority of acceptors then Computing Systems (ICDCS- 20),pp. 288-295, Juin it proceeds directly to the next round without waiting 2000. a long time for un-received messages from crashed T. Chandra, S. Toueg, Unreliable failure detectors for acceptors. Hence, a good performance in time and reliable distributed systems, ACM, Vol. 43, Issue 2, message number is reached, in comparison with RP. Page(s):225-267, 1996. These positive points will be justified by a simulation G. Chockler, I. Keidar, R. Vitenberg, Group results as future goals. communication specifications: A comprehensive study, ACM, Vol. 33, Issue 4, Page(s):1-43, Dec. 8. REFERENCES 1996. P.J. Marandi, M. Primi, N. Schiper, F. Pedone, L. Lamport (1998) The part-time parliament. ACM Ring Paxos: A HighThroughput Atomic Broadcast Transactions on Computer Systems, 16, 133–169, 68 Protocol, DNS,IEEE Computer Society, pp. 112, 2012.