Core-based Reconfiguration for Reliable Overlay Networks Silvia Bonomi Sara Tucci-Piergiovanni Sapienza Univertità di Roma Sapienza Univertità di Roma Dipartimento di Informatica e Sistemistica Dipartimento di Informatica e Sistemistica Via Ariosto 25, 00185 Roma, Italy Via Ariosto 25, 00185 Roma, Italy bonomi@dis.uniroma1.it tucci@dis.uniroma1.it ABSTRACT correctly search for the matching results. In P2P overlay networks, the continual arrival and depar- ture of nodes leads to the necessity of running periodically To face this problem many approaches have been proposed. a specific reconfiguration algorithm able to renew the over- In structured overlay networks, a dedicated routine is in lay. This paper presents a new reconfiguration mechanism charge of revealing node departures [8]. Upon the detection for overlay networks. The algorithm exploits a character- of a node departure, this routine starts an overlay reconfig- istic observed in many overlay networks: the presence of a uration devoted to renew the overlay network, in order to core composed by reliable nodes (nodes that never leave the come back to the operative state. In unstructured overlay overlay). Even if members of the core are a-priori unknown, networks, usually there is no dedicated routine to detect de- the algorithm is able to eventually select one of them that partures [9]. The departure detection happens only when will carry out the periodic renewing of the overlay. Once the a node wants to exchange information with an already de- reconfiguration is performed by a core node, the overlay will parted one. Usually, the detection does not lead to renew never lose its reliability, letting any query, issued from this the entire overlay, but who detected the departure only re- time on, be ever satisfied. moves its pending connection. Note that the approach used in unstructured overlays can be pursued because a departure 1. INTRODUCTION affects only locally the overlay, in contrast with structured Recently, Peer-to-Peer (P2P) systems attract attention. In overlays where a departure may affect the whole overlay. Let P2P systems, each user shares and exchanges information as us note that if a departure affects the whole overlay, a possi- equals by playing the role of both the server and the client. ble burst of departures may significantly lengthens the time Because nodes that are participating to the system are con- for the reconfiguration to take effect. The longer the time nected to each others to construct an overlay network, the is for reconfiguration, the higher is the number of queries p2p paradigm is superior on scalability compared to client- that can cross a still damaged network with the possibility sever model. However, P2P networks have technical issues to return an empty (or partial) result. This could lead to a that should be solved. Generally, P2P networks are catego- very poor reliability. rized into two types, structured [6, 7, 8] and unstructured [1, 2, 3, 9]). Due to the absence of a server node that cen- The approach proposed in this paper circumvents the draw- trally stores and manages all of the information, a query backs of a continual global reconfiguration starting from the has to be routed inside the network to search the a-priori un- observation that not all nodes in the network have the same known set of nodes that maintain the matching information. lifetime. It is widely recognized that overlay networks con- Therefore, a lot of effort has been paid in order to realize ef- tains, among a huge number of nodes with very short life- ficient and reliable information search/retrieval mechanisms times, a core, i.e. a small number of very reliable nodes by properly constructing the overlay network with the aim with unlimited lifetime [5]. However, the identities of the of improving successiveness of the search, shorten of search core members are not a priori known. delay, or reducing a total resource consumption. The is- sue of reliability is particularly challenging in P2P systems In this paper we propose a core-based approach to overlay due to the continuous arrival and departure of nodes. This reconfiguration. The basic idea is to eventually elect a core phenomenon, also called churn, may seriously compromise node as a special node able to carry out the reconfiguration. overlay functionalities. For instance, overlay may degrade This node, also called supervisor, actually carries the recon- until disconnection as time passes by, avoiding queries to figuration in a proactive way: the node renews the overlay periodically, before it degrades to disconnection. This can be done deterministically, by knowing an upper bound on the departure-rate of nodes. More in details, the supervisor peri- odically gathers information about current alive nodes inside the network, computes for each node a new state (e.g. new connections to other nodes) and communicates it to these alive nodes, which update their state before the network is damaged; in such a way a query crossing the network at any time will be ever satisfied. Note that, at the beginning, the algorithm is not able to give to a core node the supervisor In our algorithm, a node, called supervisor, is elected and it role, as core nodes are unknown to the algorithm. However, will be in charge of running the reconfiguration procedure. the proposed algorithm is able to converge to the election of a core node as supervisor after a finite time. Before this In a dynamic system, continuous failures of nodes make con- time, queries can be lost, but after this time, queries will nectivity of the ON decreasing. In our approach, the ON be satisfied forever, contrarily to what happens in the struc- graph is a k-connected graph so that it will be resilient to tured approach. at least (k − 1) failures from its definition. The paper is organized as follows: Section 2 defines the sys- Fixing a desiderata degree of connectivity k and given the tem model and Section 3 presents the reconfiguration algo- failure rate fr , it is possible to know how long the ON graph rithm along with its correctness proof. will be connected. 2. SYSTEM MODEL Property 2. Let k be the degree of connectivity and let t We consider an infinite set of processes Π={p1 , p2 . . . , pn . . . }. be the time where the ON graph is defined. Let be fr the fail- Each process has a unique identifier for all its life time in the ure rate of the system, then the ON graph will be connected system. Processes are arranged in a logical network, called for a time period ∆T = (k − 1)/fr from t. Overlay Network (ON), built on top of the physical one. The ON can be seen as a graph where each node represents a process pi ∈ Π and each edge represents a logical commu- We call this period graph lifetime LG . nication channel between two elements pi , pj ∈ Π such that pi and pj can communicate. Every process is able to commu- The idea of the protocol is to use the degree of connectivity nicate with its direct neighbors in the ON by means of mes- and the failure rate to know when the overlay is becoming sages exchange on point-to-point reliable channels. There disconnected and renew it. This is the task of the supervisor exists a known bound δ on the message transmission delay that recomputes a new k-connected graph Gnew and com- on the considered channels, hence the system can be con- municates it to all the other nodes in the system before the sidered as synchronous. We define as correct a process that graph lifetime LG of the current graph G expires. The idea never fails. A faulty process fails by crashing and if it recov- of the reconfiguration protocol is the following: during the ers from the crash then it is considered as new in the system lifetime of G, the supervisor starts the reconfiguration pro- (with a new identifier). System is dynamic, i.e., nodes may cedure to compute Gnew . The reconfiguration is composed join and leave system at any time. More formally, we assume of three phases: (i) nodes health verification, (ii) graph com- that: putation, (iii) graph dissemination. The first phase consists of a verification of the alive nodes in the system. During this phase, the supervisor sends a “ping” message to all the • when a node leaves the system, it does not perform any nodes included in the current graph G and to newly joined specific task and then a voluntary leave is considered nodes, thus waits for their replies. Then the second phase as a failure. In the following we consider equally faults starts and Gnew is computed including all nodes which have and voluntary leaves and we refer them as faults; replied. In the third phase, the supervisor starts the dis- • when a node fails, it is removed from the graph to- semination of the Gnew to all the alive nodes included in gether with all its incident edges; the new graph. • a node joins the system through a join event. The over- If the supervisor does not crash, connectivity is preserved as lay is actually composed by all and only those nodes the dissemination of the new graph Gnew deterministically that have joined the system and have not left yet. terminates before LG expires. • there exists a known upper bound fr on the node fail- If the supervisor crashes, a new election will start and a new ure rate (i.e. the number of nodes that fail/leave the supervisor is selected. During the period when the supervi- system in each time unit); sor is changing, connectivity cannot be assured. However, due to the core assumption, we have that in a finite num- • there exists a-priori unknown finite set of processes, ber of elections, and then in a finite time, a core node will called core ∈ Π, that never crash or leave the system become a supervisor and then connectivity will hold forever. [4, 5]. The join to the overlay is also managed by a specific pro- 3. THE RECONFIGURATION ALGORITHM cedure that allows a new node to become part of the ON by means of an access point node selected among the ones 3.1 Algorithm Overview already connected in the ON through a graph computed by The proposed algorithm aims at keeping overlay connectiv- some supervisor. ity most of the time. Formally the problem is defined as follows: 3.2 Algorithm pseudo-code 3.2.1 Data Structures Property 1. Eventual Connectivity. Eventually and At the process start-up, all the data structures have to be permanently, all processes in the ON will be included in a initialized. In Figure 1 the pseudo-code of the initialization connected graph GP . phase is shown. Init We assume to have a bootstrap service that makes possible, 1 supervisor = nil 2 active = f alse for incoming nodes, to find an access point to the ON. In 3 myLevel = 0 Figure 2 is presented the Join protocol. 4 joined = ∅ 5 myKnowledge[] = emptyArray Join(pi ) 6 parentsLevel = 0 1 if (pi == nil) 7 parents = ∅ 2 then 8 candidates = ∅ 3 knowledge[0] = myId 4 supervisor = myId 5 active = true 6 set timerGraph = ((k − 1)/fr ) − 4δ Figure 1: Data Structure Initialization 7 else send (“Join00 , myId) to pi (a) Data structures, maintained by a node pi , collect several when (receive (“Join00 , i) from pi ) do information, as described as follows: 1 if (supervisor 6= myId) 2 then joined = joined ∪ {pi } The variable active is initially false and does not change 3 knowledge[myLevel + 1] = knowledge[myLevel + 1] ∪ {pi } until pi does not install1 the first graph and remains true 4 send (“Ack00 , myLevel, supervisor, knowledge[]) to pi until pi crashes or leaves the system. (b) The variable myLevel represents a logical “distance” be- when (receive (“Ack00 , level, s, knowledge[]) from pi ) do tween pi and the supervisor and is set during the join pro- 1 myLevel = level + 1; parentsLevel = level cedure depending from the access point used by pi to join 2 myKnowledge[] = knowledge[] the ON. 3 supervisor = s (c) The variable myKnowledge contains the information about processes currently joined to the ON. This information is actually gathered by the supervisor and it is the only one al- Figure 2: Join Protocol lowed to communicate using connections to processes stored in the knowledge variable. This communication happens The joining node pj contacts the bootstrap service that re- when the supervisor has to communicate the new graph to turns an active node pi already member of ON if there exists, current participants. This variable also maintains the in- otherwise it returns nil. formation about the distance that processes have from the current supervisor. To this end the variable myKnowledge If no process is returned by the bootstrap service, pj starts has an array structure defined as follows: at the i-th entry to build the knowledge inserting itself at the level 0 and of the array it is stored the set of nodes having level i. becoming active. At this point pj is the first node part of ON. Since pj is the only node part of the ON it becomes The variable supervisor contains the identifier of the node automatically the supervisor and then it starts to monitor that is currently considered as supervisor for the ON by the the ON. In particular it sets a timer, namely timerGraph, process pi . equal to the graph lifetime LG (minus the time needed to verify the health of the ON participants and to communicate The variable joining contains a set of nodes joined from the the new graph before the current one lose the connectivity) installation of the last graph and not yet active, including and then starts the Reconfiguration procedure. those that joined using pi as access point. If a process pi is returned, pj contacts pi sending a mes- The information used to elect a supervisor is stored in the sage of join request with attached its identifier. When pi variables parents and candidates; parents is the set con- receives the request of pj , it updates its knowledge (stor- taining all the nodes alive at the level just before pi ’s level ing pj identifier at its level plus one) and sends back to pj and represents the nodes to be monitored in order to detect an acknowledgment message containing pi ’s level, the cur- a possible crash of the supervisor while candidates is the set rent supervisor and its knowledge. Moreover, if pi is not the containing the nodes at pi level that are possible candidates current supervisor, it puts the identifier of pj in the list of for the election of the new supervisor. Due to the failures, joined nodes in order to let the supervisor aware of pj for an entry of the myKnowledge array may become empty and the next graph. When pj receives the ack of pi , it updates then there exist some nodes inside the system that have to its structure with the information received. update the level of their parents in the myKnowledge struc- ture. To this aim we store in the variable parentsLevel the 3.2.3 Reconfiguration Procedure current level not empty where the node can find its parents The Reconfiguration procedure is managed by the supervisor in the myKnowledge structure. node and it is triggered periodically. In Figure 3 it is shown the reconfiguration protocol. 3.2.2 Join Procedure This protocol is based on the usage of two timers, timerGraph 1 and timerM onitor, exploiting the synchrony of the system. A graph is installed when a node receives it and starts to communicate using its links. timerGraph measures the graph lifetime while timerM onitor measures the time needed to collect all the information needed for the detection. When the timerGraph expires, the super- visor sends a ping message to all the nodes in myKnowledge, resets its knowledge, sets the timerM onitor and then waits the replies for the maximum time needed to send and receive a message (2δ). When a node pi receives the ping, it replies with a “pong” message and attaches its level and the set joined of nodes that it knew have joined in the last graph lifetime. When the supervisor receives the pong message, it updates 1 when (timerGraph expired) do its knowledge and when the timerM onitor expires it com- 2 for each (p ∈ memberOf (myKnowledge)) do putes the new graph containing all the nodes that have 3 send (“ping 00 ) to pfor each i do replied to the ping and the nodes that have completed their 4 myKnowledge[i] = ∅ join before the reconfiguration starts. The new graph is com- 5 Set timerM onitor = 2δ puted by means of the compute graph() function that has as (a) parameters the set of nodes to be connected and the degree of connectivity k and returns the new graph. This function can build whichever type of k-connected graph because the 1 when (receive (“ping 00 ) from pi ) do 2 send (“pong 00 , joined, myLevel) to pi graph topology is irrelevant for our algorithm since it uses this function as black box. Once the new graph is computed, (b) the supervisor sends it to all the nodes it knows and then sets again the timerGraph. 1 when (receive (“pong 00 , joined, level) from pi ) do 2 myKnowledge[level + 1] = myKnowledge[level + 1] ∪ joined When a node pi receives the graph, it updates its data struc- 3 myKnowledge[level] = myKnowledge[level] ∪ pi tures; if pi was not active, it changes its state and from now on it becomes effectively part of the ON and it starts the (c) (transitive) monitoring of the supervisor. 1 when (timerM onitor expired) do 2 if (|member(myKnowledge[])| > f (k)) 3.2.4 Supervisor Election Procedures 3 then G ← compute graph(to monitor ∪ myId, k) The supervisor can crash and then it is necessary to elect 4 for each p ∈ member(myKnowledge[]) do a new one. The monitoring procedure uses the particular 5 send (“newGraph00 , G, myKnowledge[]) to p structure of the knowledge and avoids all the nodes to ping 6 Set timerGraph = ((k − 1)/fr ) − 4δ the supervisor delegating this task only to the nodes at the supervisor level or at the subsequent level. Since every node (d) can crash this local monitoring is repeated for all the lev- els of the myKnowledge structure and each node monitors 1 when (receive (“newGraph00 , G, knowledge[]) from pi ) do transitively the supervisor by means of a local monitoring. 2 if (¬active) 3 then active = true 4 parentsLevel = maxi {i < myLevel∧ In Figure 4 it is shown the monitoring procedure. 5 myKnowledge[i] 6= ∅} 6 parents = parents ∪ knowledge[parentsLevel] The monitor procedure is activated as soon as a node be- 7 candidates = candidates ∪ knowledge[myLevel] comes active and it is executed periodically. Let us consider 8 trigger monitorSupervisor() a process pi , it sends a heartbeat message to all the nodes 9 myKnowledge[] = knowledge[]; supervisor = pi ; joined = ∅ contained in the parents and candidates lists and waits for 10 install(G) their replies setting a timer, namely timerElection. (e) When a node receives the heartbeat message, it sends back a heartbeat reply and, if it did not know the sender of the Figure 3: Graph Reconstruction Procedure message, it adds the sender to the list of candidates. Receiving the heartbeat reply, every process updates its data structure according to the level of the sender node. When the timerElection expires, the election procedure starts as shown in Figure 5. The election procedure selects a new supervisor when the crash of the current one is detected. The crash of the super- visor is detected by the nodes whose level is the supervisor one and from nodes at the first level not empty following the supervisor level; when these nodes do not receive any reply by the “parents” nodes, they know that they are potentially candidates to become supervisor and then decide with a de- terministic rule the supervisor. The new supervisor is now ready to build the new graph and starts the reconfiguration procedure. monitorSupervisor() 1 while (active every 2δ) do 3.3 Correctness and Guarantees of the Algo- 2 for each (p ∈ parents ∪ candidates) do rithm 3 send (“HB Req, myLevel00 ) to p In this section we show that our algorithm satisfies eventual 4 parents = ∅ 5 candidates = ∅ connectivity. 6 set timerElection = 2δ (a) Theorem 1. The reconfiguration procedure satisfies Even- tual Connectivity. 1 when (receive (“HB Req, level00 ) from pi ) do 2 send (“HB Rep00 , myLevel) to pi 3 if (pi ∈ / knowledge[level]) Proof. (Sketch) We first show that the reconfiguration 4 then joined = joined ∪ pi procedure maintains connectivity when the supervisor does 5 knowledge[level] = knowldge[level] ∪ pi not crash and then we show that it works even with fault 6 if (level == myLevel) supervisor. 7 then candidates = candidates ∪ pi At the beginning there is only one node pi that becomes (b) supervisor and activates the reconfiguration thread. pi is the only node active in the ON and the graph G0 of the 1 when (receive (“HB Rep00 , level) from pi ) do ON is composed only by pi ; moreover, both the knowledge 2 if (level < myLevel) 3 then parents = parents ∪ pi and the election information are composed only by pi . Since 4 else candidates = candidates ∪ pi the supervisor is the only active node, the bootstrap service returns always its identifier to the incoming nodes. When (c) the reconfiguration procedure starts due to the expiration of the timerGraph, the supervisor starts to ping all the nodes Figure 4: Monitoring Supervisor Procedure who it knows about and waits for 2δ for the replies. We may have two cases (i) no node has joined the ON in LG0 ; (ii) some nodes have joined the ON in LG0 . In the first case in 2δ pi will receive only its own reply, due to the synchrony of the system, and will compute the graph containing again only itself. In the second case, the nodes who have joined the ON and are still alive will receive the ping message in δ time due to the synchrony and the perfect link, and then will reply to the supervisor that will receive the pong message after at most δ time. After 2δ from the ping messages, the supervisor will know exactly how many and who are the nodes alive to be connected in the new graph G1 . If there is enough nodes to build the k-connected graph then a graph including all the nodes is computed otherwise G1 will include only pi . when (timerElection expired) 1 if (|parents == 0|) Let us consider a graph Gi with more than one node; when 2 then supervisor = min(candidates) the timerGraph expires the supervisor repeats the proce- 3 if (supervisor = myId) dure described above. Since a k-connected graph is able to 4 then set timerGraph =  tolerate k − 1 failures, the graph lifetime is LG = (k − 1)/fr 5 else knowledge[myLevel] = candidates and the timerGraph expires after ((k − 1)/fr ) − 4δ we have 6 knowledge[parentsLevel] = parents that the connectivity is maintained when the reconfiguration 7 parentsLevel = maxi {i < myLevel∧ 8 knowledge[i] 6= ∅} starts and it is guaranteed for 4δ more time. To verify which nodes are still alive between the known ones, the supervisor (a) uses 2δ times and to spread the new graph the supervisor uses one more δ time then when the new graph is installed, the old one is still connected. Moreover the new graph is Figure 5: Election Procedure k-connected again and then connectivity is preserved. Consider now the case where the supervisor can fail. Let i be the level of the supervisor. Due to the monitorSupervisor() procedure, every 2δ time nodes at level i and i + 1 send a heartbeat message to the supervisor. Let us suppose that at some t the supervisor crashes. If there are no other nodes at level i, when the timerElection expires, nodes at level Distributed systems(OTMÕ06), October 2006., 2006. i + 1 have the set parents empty and then will execute the [6] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and line 2 of Figure 5. Since the rule used to select the super- S. Schenker. A scalable content-addressable network. In visor is deterministic, all the nodes will recognize the same SIGCOMM ’01: Proceedings of the 2001 conference on new supervisor and the new supervisor knows that it is the Applications, technologies, architectures, and protocols new one. Similarly, a new supervisor can be selected deter- for computer communications, pages 161–172, New ministically even if other nodes are still at level i; the only York, NY, USA, 2001. ACM Press. difference is that now the supervisor will be chosen between [7] A. I. T. Rowstron and P. Druschel. Pastry: Scalable, the nodes still alive at level i. The new supervisor starts Decentralized Object Location, and Routing for now the reconfiguration procedure and then the new graph Large-Scale Peer-to-Peer Systems. In Middleware ’01: could be installed. Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms Heidelberg, pages Due to the core assumption, we have that inside the sys- 329–350, London, UK, 2001. Springer-Verlag. tem there exist stable nodes and eventually one of them will [8] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, be selected to become supervisor and then it never crashes. M. F. Kaashoek, F. Dabek, and H. Balakrishnan. When a stable node is selected, we return to the case de- Chord: a scalable peer-to-peer lookup protocol for scribed above where the supervisor does not crash and then internet applications. IEEE/ACM Trans. Netw., from that point connectivity is guaranteed forever. 11(1):17–32, 2003. [9] S. Voulgaris, D. Gavidia, and M. Steen. CYCLON: Inexpensive Membership Management for Unstructured P2P Overlays. Journal of Network and Systems Management, 13(2):197–217, June 2005. 4. CONCLUSIONS This paper presented a new core-based reconfiguration al- gorithm able to build an overlay network highly reliable. In terms of deterministic guarantees, the overlay is able to let queries crossing the network be ever satisfied from some point of time on. This limits the possible continual lost of re- liability current reconfiguration approaches suffer from. In fact, our approach can lose queries only for a finite time. This finite-time unreliability comes from the fact that mem- bers of the core are a-priori unknown to any process join- ing the network. Nevertheless, all processes will be even- tually able to select a core member that will carry out all the reconfigurations. By assuming known an upper-bound on the node failure-rate, successive reconfigurations done by the same supervisor will take effect before the overlay de- grades its functionalities. 5. REFERENCES [1] A. Allavena, A. Demers, and J. E. Hopcroft. Correctness of a gossip based membership protocol. In PODC ’05: Proceedings of the twenty-fourth annual ACM symposium on Principles of distributed computing, pages 292–301, New York, NY, USA, 2005. ACM Press. [2] P. Eugster, S. Handurukande, R. Guerraoui, A. Kermarrec, and P. Kouznetsov. Lightweight probabilistic broadcast. In In Proceedings of The International Conference on Dependable Systems and Networks (DSN ’01), July 2001., 2001. [3] A. J. Ganesh, A. Kermarrec, and L. Massoulié. Peer-to-Peer Membership Management for Gossip-Based Protocols. IEEE Trans. Comput., 52(2):139–149, 2003. [4] P. B. Godfrey, S. Shenker, and I. Stoica. Minimizing churn in distributed systems. ACM SIGCOMM Computer Communication Review, 36(4):147–158, 2006. [5] V. Gramoli, A. Kermarrec, A. Mostefaouii, M. Raynal, and B. Sericola. Core persistence in peer to peer systems: Relating size to lifetime. In On The Move International Workshop on Reliability in Decentralized