                                Stateful cluster leader failover models and methods based
                                on Replica State Discovery Protocol
Serhii Toliupa1,∗,†, Maksym Kotov1,†, Serhii Buchyk1, Juliy Boiko2, and Serhii Shtanenko3
1 Taras Shevchenko National University of Kyiv, 60 Volodymyrska St., Kyiv, 01033, Ukraine
2
3 Military Institute of Telecommunication and Information Technologies named after the Heroes of Kruty, Street of Princes of Ostrozki 45/1, Kyiv, 01011, Ukraine


                                                   Abstract
High availability is a cornerstone of fault tolerance in production clusters. The following article delves into
novel methods and models for achieving rapid cluster leader failover based on the Replica State Discovery
Protocol (RSDP). Firstly, RSDP is described and evaluated as a method of achieving consensus within
homogeneous multiagent distributed systems. This paper provides a novel mathematical model that
                                                   describes the internal procedure of the said protocol. Additionally, diagrams and algorithm steps are
                                                   provided to further simplify integration of the RSDP into modern Decentralized Coordination Networks
(DCNs). Secondly, a new state reducer is developed that enables a synchronized leader election
process. Its mathematical model and code implementation written in JavaScript are provided and comply
with the established extension interface described within the confines of the RSDP.
Evaluation and implications of the newly created leader election protocol are provided to further expand
                                                   the horizons of DCN coordination. Lastly, this article explores the practical implications of the mentioned
                                                   state reducer in the context of the stateful cluster leader failover. Three different approaches and models
                                                   based on the proposed consensus algorithm to mitigate spontaneous critical events are modeled and
                                                   assessed. Based on failure probability, failover duration, and communication overhead mathematical
                                                   models, the said approaches were compared, and recommendations for their application were provided.
                                                   Overall, this article is aimed at further development of RSDP and describes novel approaches towards
                                                   relevant coordination issues inside the clusters with high demands for availability and fault tolerance.

                                                   Keywords
                                                   distributed computing; Decentralized Coordination Networks (DCNs); Replica State Discovery Protocol
                                                   (RSDP); cluster state management models; cluster failover management models; leader election protocol
based on RSDP and deterministic operations.



                                1. Introduction
Designing a modern distributed system inevitably leads to a myriad of convoluted architectural
decisions aimed at achieving high availability and fault tolerance.
Throughout the history of computer science and the Internet technology industry, the
cornerstone problem has been threefold: consistency, availability, and partition tolerance, also known as
the CAP theorem [1-7].
   At its core, the theorem asserts that a service can guarantee only two of three properties: it is either
consistent and available, available and tolerant to partitioning, or consistent and tolerant to
partitioning [1-7]. It has to be stated, though, that business-centric approaches have produced variants of
the said theorem that substitute resource utilization, complexity, and service quality as the axes of
the decision-making process.


Information Technology and Implementation (IT&I-2024), November 20-21, 2024, Kyiv, Ukraine
∗ Corresponding author.
† These authors contributed equally.
                                   tolupa@i.ua (S. Toliupa); maksym_kotov@ukr.net (M. Kotov); buchyk@knu.ua (S. Buchyk); boiko_julius@ukr.net (J.
                                Boiko); sh_sergei@ukr.net (S. Shtanenko)
0000-0002-1919-9174 (S. Toliupa); 0000-0003-1153-3198 (M. Kotov); 0000-0003-0892-3494 (S. Buchyk); 0000-0003-0603-7827 (J. Boiko); 0000-0001-9776-4653 (S. Shtanenko)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).



    Nevertheless, the crux is the same: it is assumed that no model, method, approach, or
methodology exists that could completely satisfy every property of this group. That assumption
still holds strong, since mechanisms that comply with one subset directly obstruct efforts to
achieve the other [1-7].
    It has been known throughout the history of research efforts into building reliable systems that
trying to achieve fault tolerance while relying on a single instance is futile. This can be attributed to
the following reasons:

   1.   No matter how much the source code is being tested, extremely rare race conditions could
        still happen when dealing with external devices or even replicated systems.
        Physical destruction of the datacenter will nullify any logically sound and fault-tolerant
         mechanism that was preliminarily implemented.
   3.   Even if we assume that software is logically consistent, a random radiation ray could flip
        some bits in the system and lead to a catastrophe.
   4.   The instance could lose power or network connectivity, becoming unreachable both physically and
        logically. That is especially problematic during wartime or a nation-wide crisis.
   5.   At some point you will simply have to put the running instance under maintenance, and this
        will effectively stop its operation for a while.

    While building cloud systems that require high availability, an architect has to eventually
consider replication and clusterization techniques [8-12]. Within the context of this article, we will
differentiate between these terminologies in the following way:

   •    Replication is a distributed topology of a homogeneous multiagent system, where each node
        mirrors the other participants of the said network. Each replica may have the same set of initial parameters,
        program code, logic, and ongoing state. Therefore, replicated environments provide rapid
        disaster recoveries, since every other instance could effectively take over the responsibilities of the
        one that failed [8-10].
   •    Clusterization refers to the distributed system organization, where the overall state is still
        synchronized and coordinated but the purpose and the concurrent tasks on different nodes
        are different. The common purpose of building clustered systems is state splitting. Other
        examples may include hot and warm standby servers that replicate events from the main
        machine but still follow the orders from a leader and are usually restricted in their
        functionality [13-15].

   Therefore, the purpose of this article is to model multiple approaches towards coordinating
replicated and clustered decentralized networks. As a result of the conducted evaluation, a set of
practical recommendations is proposed to simplify the decision-making process during the system
design stage.
   Additionally, it is the intent of this paper to develop a mathematical model for the Replica State
Discovery Protocol, which serves as a framework for performing cluster-wide state synchronization
and coordination. RSDP provides a basis upon which a set of logical extensions could be built to
achieve various consensus effects.
   Using the mentioned consensus basis, multiple leader election mechanisms were developed and
modeled. Each of those mechanisms is characterized by a set of unique security, efficiency, and
resilience properties, catering to the needs of a specific environment. The said properties
were compared based on probability and computational complexity assessments, and as a result,
recommendations for their application were provided.


2. Replica State Discovery Protocol
The fundamental problem in managing resilience through redundancy is coordination. Since the
managed objects are by definition separated, they usually do not have any common memory location
that would allow them to successfully establish a synchronization algorithm based on classical
concurrency control mechanisms such as locks, mutexes, or semaphores [16-18].
   There are quite a few solutions that allow for both immediate and eventually consistent
consensus-achieving. An example of the first would be total order broadcast, or, in other words,
complete replication of events in the original order. Eventually-consistent algorithms tend to take
the process in steps to reach consensus regarding a proposed value. Examples of such algorithms
include Raft, Paxos, Ring, ZooKeeper, and many others [19-21].
   In that regard, the Replica State Discovery Protocol could be called one of the eventually
consistent algorithms. RSDP provides not only the basis for achieving consensus but also serves as a
distributed coordination framework, allowing for various extensions and handling cluster events.
The difference between classical leader election or consensus-reaching protocols and RSDP is the
intention and flexibility they provide. The former usually concentrate on the process of voting for a
single common value. In the meantime, RSDP provides a foundation not only for single-state
consensus but also for synchronizing complex state setups and merges [22].

2.1. Local Area Network Simulation based on AMQP
At its core, RSDP was initially designed on top of a local area network simulation built upon
the Advanced Message Queuing Protocol (AMQP). The details of its implementation, efficiency, security,
implications, and resilience are outlined in a separate article, but for the purpose of theoretical
context, a few words have to be said to cover potential questions regarding the reliability of RSDP
and its message-passing process [22].
   First and foremost, the said network simulation is built on top of the message queueing protocol.
In that context, AMQP stands as one of the most popular solutions in coordinating and routing
complex message network topologies and is continuously gaining momentum in the field of research
and engineering [23-25].
   Figure 1 shows the conceptual operation basis of AMQP and its components:




                  Figure 1: Advanced Message Queuing Protocol.
   The fundamental idea of AMQP is the separation of client and server, which are called producer
and consumer. Instead of direct communication, the message goes through the broker and its queue,
thus allowing for alleviating direct dependency between clients and servers. Additionally, AMQP
describes the achievement of fundamental communication properties such as resilience, durability,
congestion control, security, etc. [23-25].
   Local Area Network Simulation (SLAN) leverages these capabilities to establish a secure, resilient,
and isolated LAN-like environment. In essence, SLAN describes the provisioning of two basic
communication media: direct communication links and a broadcast link.
   Figure 2 shows the interaction media of the SLAN:




                    Figure 2: Local Area Network Simulation based on AMQP.

   The SLAN operation relies on two main capabilities described within the context of AMQP:
an exchange can either fan a message out to all bound queues (i.e., broadcast the message) or send
the message to a single bound queue selected by the routing key (direct communication routing).
   AMQP defines durability, mirroring, and quorum mechanisms for its queues and is considered
to be a well-documented, tested, resilient, and flexible foundation for communication media. SLAN,
and by implication RSDP, rely on these properties to build their own abstraction layers [26, 27].

2.2. RSDP phases and consensus process
RSDP has its own dedicated article that describes every operation, state change, and consensus-
oriented process in detail [22]. This section provides a succinct overview of RSDP phases with some
amendments and clarifications for the operation sequences.
   To provide capabilities for cluster-wide state operations, RSDP describes its lifecycle in a few
distinct phases. The first phases are respectively responsible for the introduction of a new node, state
sharing, and final state derivation. The last phase is responsible for handling the shutdown lifecycle event.
   Since each distinct phase somehow interferes with the cluster state stored on each replica
individually, every instance has a concurrency control mechanism based on the
"interPhaseMutex". That mutex prevents multiple simultaneous cluster events from interfering
with each other by restricting stored-state access to a single active phase [16-18].
   Figure 3 shows the phases and interactions during the RSDP process:




                        Figure 3: Replica State Discovery Protocol phases.

                                               โ€œ๐‘–๐‘›๐‘ก๐‘’๐‘Ÿ๐‘ƒโ„Ž๐‘Ž๐‘ ๐‘’๐‘€๐‘ข๐‘ก๐‘’๐‘ฅโ€ that prevents state mutation

announcing its presence in the cluster. This message is sent through the broadcast channel and is
meant to be received by all cluster members.

                                                                                             124
replica would then buffer answers from the cluster members and perform an aggregation of the state
as dictated by the state reducers.

exchange to all the members. The final operation of the initial phase is to release the mutex and wait
for any other cluster-wide events.

cluster. Share messages would then be buffered for a configured amount of time to avoid redundant
replica reloads. After the timeout elapses, the protocol engine would acquire the mutex, and the
share messages would be validated and aggregated, giving a holistic view of the cluster-wide state.
Subsequently, the protocol engine would release the mutex and wait for any other occurring events
in the system.
                                                                              a replica goes down,
it signals the others while others in the cluster would then first acquire the mutex, perform the
necessary state updates, and release the mutex. No additional synchronization is necessary as the
protocol assumes that every operation performed on the state is deterministic and made within the
scope of a set of clean functions that do not rely on side effects.
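   As an illustration of this locking discipline, the sketch below serializes phase handlers with a
promise-based mutex in JavaScript; the Mutex class and the simplified reducer signatures are
assumptions made for this example, not the protocol's reference implementation.

// Minimal promise-chain mutex: runExclusive executes tasks one at a time.
class Mutex {
  #chain = Promise.resolve();
  runExclusive(task) {
    const result = this.#chain.then(() => task());
    this.#chain = result.catch(() => {}); // keep the chain alive on errors
    return result;
  }
}

const interPhaseMutex = new Mutex();

// Illustrative share-phase handler: the state is touched by one phase at a time.
async function onShareTimeout(shareBuffer, reducer) {
  await interPhaseMutex.runExclusive(async () => {
    const aggregated = reducer.aggregateShareState(shareBuffer);
    const sanitized = reducer.sanitizeShareState(aggregated); // signature simplified
    if (sanitized) reducer.updateState(sanitized);
  });
}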

2.3. RSDP mathematical model
Let ℛ = {R_1, R_2, …, R_n} be the set of replicas in the distributed system. The replicas communicate
over a network represented as a graph G = (ℛ, E), where E ⊆ ℛ × ℛ denotes the set of
communication channels between replicas.
   Each replica R_i maintains a local state S_i, which is an element of the global state space 𝒮. The
global state of the system is a tuple S = (S_1, S_2, …, S_n).
   The protocol defines the following message types that drive state transitions:

   •    "HELLO": a broadcast announcement of a replica's presence, denoted as m_HELLO^(i) from
        replica R_i.
   •    "STATUS": a direct response carrying the sender's current state; such a message
        is denoted as m_STATUS^(j→i)(S_j) from R_j to R_i.

   During the initiation, each replica R_i broadcasts m_HELLO^(i) to all other replicas.
Upon receiving m_HELLO^(i), a replica R_j responds with m_STATUS^(j→i)(S_j) containing its
current state S_j:

                                ∀R_i ∈ ℛ: R_i →(broadcast) m_HELLO^(i)

                  ∀R_j ∈ ℛ ∖ {R_i}, upon receiving m_HELLO^(i): R_j → m_STATUS^(j→i)(S_j)

   The initiating replica aggregates the received states into an aggregated
state S_agg. The aggregation function f_agg combines individual states:

                        S_agg = f_agg({S_j | m_STATUS^(j→i)(S_j) received})

   The replica then broadcasts a "SHARE" message carrying S_agg:

                                R_i →(broadcast) m_SHARE^(i)(S_agg)

   Each replica updates its local state upon receiving the aggregated states from every initiating replica R_i:

                   S_i ← f_update(S_i, {S_agg^(k) | m_SHARE^(k)(S_agg^(k)) received})

   When shutting down, a replica R_i broadcasts a "CLOSE" message:

                                    R_i →(broadcast) m_CLOSE^(i)

   Remaining replicas adjust their states to reflect the departure:

                             ∀R_j ∈ ℛ ∖ {R_i}: S_j ← f_close(S_j, R_i)

   Where f_close is a function that removes references to R_i from S_j.

2.4. RSDP formal definition and properties
Define ๐’ฎ as a set of possible states for a replica. Each state ๐‘†๐‘– may include:

   โ€ข   Membership list ๐‘€๐‘– โІ โ„›;
   โ€ข   Resource utilization ๐‘ˆ๐‘– โˆˆ โ„+ ;
   โ€ข   Other reducer and application-specific data.

   Define โ„ณ as the set of all possible messages:
                    ๐‘š โˆˆ โ„ณ = {๐‘šHELLO , ๐‘šSTATUS , ๐‘šSHARE , ๐‘šCLOSE } ร— Payload
   The aggregation function (๐‘“๐‘Ž๐‘”๐‘” ) combines multiple states:
                        ๐‘“๐‘Ž๐‘”๐‘” ({๐‘†1 , ๐‘†2 , . . . , ๐‘†๐‘— }) = โ‹ƒ Reducerk ({๐‘†1 , ๐‘†2 , . . . , ๐‘†๐‘— })
                                                            ๐‘˜
   This could be defined as:

   โ€ข   For membership lists: ๐‘€agg = โ‹ƒ๐‘— ๐‘€๐‘— ;
                                                    1
   โ€ข   For resource utilization ๐‘ˆagg =                            โˆ‘๐‘— ๐‘ˆ๐‘— .
                                            |{๐‘†1 ,๐‘†2 ,...,๐‘†๐‘— }|


   The state update function f_update updates the local state based on the received aggregated states:

                           S_i ← f_update(S_i, {S_agg^(1), S_agg^(2), …, S_agg^(k)})

   This may involve:

   •   Updating membership lists: M_i ← ⋃_k M_agg^(k);
   •   Adjusting resource utilization estimates;
   •   Updating any other application-specific data.

   To prevent race conditions, the "interPhaseMutex" is used during critical sections of the
protocol, particularly during state updates.
   Let μ_i be the mutex for replica R_i. Then, state updates are performed under the lock μ_i:

                            S_i ←(μ_i) f_update(S_i, {S_1, S_2, …, S_j})
   As was previously stated, RSDP is based on eventual consistency, where all replicas converge to
the same state after a finite number of message exchanges.
   For any two replicas R_i and R_j, their states S_i and S_j satisfy:

                                  lim_(t→∞) Pr(S_i(t) = S_j(t)) = 1

   The protocol ensures that messages are eventually delivered and state updates occur. If a message
m is sent from R_i to R_j, then m will be delivered to R_j after some finite delay δ. In case some messages
are lost, RSDP defines repeatable synchronization sessions as a contingency.

3. Leader Election Reducer for the RSDP

Within RSDP, the logic for storing, retrieving, validating, aggregating, and updating a state is defined in a set of preconfigured
reducers. The original article on RSDP describes the benefits of such a modular approach.
Additionally, it provides a thorough explanation of the interface that every reducer must follow to
successfully integrate with the protocol. The list of defined methods includes [22]:

   •    getCurrentState: returns the current state of the state slice;
   •    aggregateState: aggregates the buffered status messages into a new state slice;
   •    normalizeState: transforms the internal state into the format desired by a client;
   •    aggregateShareState: aggregates the buffered share messages and resolves conflicts between
        them;
   •    sanitizeShareState: verifies the validity of an aggregated state;
   •    shouldReload: receives a newly aggregated state and returns a Boolean indicating whether
        a client should be notified about the state change;
   •    updateState: updates the internal state of a reducer;
   •    aggregateCloseState: aggregates the buffered close messages and derives the resulting state.

   Out of the mentioned methods, getCurrentState is the only one that is not required. This is the case
because some state slices can be derived entirely from incoming messages rather than from locally stored
states. For example, discovery of replica members does not require a getCurrentState
implementation, since the membership could be derived from the sender addresses.
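   For orientation, a skeletal JavaScript rendering of this interface is given below; the exact
base-class contract is defined in [22], so the method bodies here are placeholders.

// Skeleton of the reducer interface described above; bodies are placeholders.
class BaseReplicaStateManagerReducer {
  getCurrentState() { return null; }          // optional: initial state to share
  aggregateState(statusMessageBuffer) {}      // derive state from STATUS messages
  normalizeState(internalState) {}            // client-facing representation
  aggregateShareState(shareMessageBuffer) {}  // merge buffered SHARE messages
  sanitizeShareState(aggregatedState) {}      // validate an aggregated state
  shouldReload(newState) { return false; }    // should the client be notified?
  updateState(newState) {}                    // commit the new internal state
  aggregateCloseState(closeMessageBuffer) {}  // handle departing replicas
}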

3.1. Mathematical model of the leader election process
The leader election reducer is an extension that performs a replicated, deterministic, holistic decision on
the trusted entity based on incoming cluster events. Having discussed the RSDP foundation and the
basis for the leader election reducer, let us define an abstract description of the consensus process.
   Assume ๐‘

   •    A = {a_1, a_2, …, a_n}: the set of all replica addresses;
   •    a_self ∈ A: the address of the current replica instance, a_self = a_c;
   •    G_c ⊆ A: the current set of replica members known to the replica;
   •    L ∈ G_c: the address of the current leader.

   Initially:

   •    G_c = ∅;
   •    L = ⊥ (temporarily undefined).

   Method aggregateState(statusMessageBuffer), as an input, accepts a list of status messages
M_status = [m_STATUS^1, m_STATUS^2, …, m_STATUS^l], where each m_STATUS^i contains a sender address
m_STATUS^i.address ∈ A. During its execution it performs the following operations:

   •    Extract addresses: G′ = {m_STATUS^i.address | i = 1, 2, …, l};
   •    Sort addresses: arrange G′ in ascending order according to a total order (≤) on the addresses A,
        G_sorted = Sort(G′). This operation could also include additional sorting criteria.
   •    Determine the leader: L′ = max(G_sorted);
   •    Initialize the state (if G_c is ∅ or L is ⊥): G_c ← G_sorted, L ← L′.

    As a result of its execution, the new state components are returned as the following tuple:
replicaMembers = G_sorted, currentLeader = L′.
    Method normalizeState(currentLeader) expects an L′ ∈ A as an input and performs the
following transformation:

                          isLeader = true if L′ = a_self; false if L′ ≠ a_self
    Method aggregateShareState(shareMessageBuffer) expects a list of share messages M_share =
[m_SHARE^1, m_SHARE^2, …, m_SHARE^l], where each m_SHARE^i contains the replicaMembers and
currentLeader state components. Below are multiple approaches to aggregating these state components.
    The first approach treats the latest received message as the source of truth:

   •    Select the last message: m_last = m_SHARE^l;
   •    Extract the state components:
                a. G′ = m_last.replicaMembers,
                b. L′ = m_last.currentLeader.

   As a result, it returns replicaMembers = G′, currentLeader = L′.

The second approach performs a popular vote over the received states rather than relying on the latest one and could be described as follows:

     •   Let ℋ = ∅ be a multiset of hashed state components.
     •   For each message m_i ∈ M_share:
         o Extract the state components: G_i = m_i.replicaMembers; L_i = m_i.currentLeader.
         o Form the state tuple: S_i = (G_i, L_i).
         o Compute the hash of the state tuple: h_i = h(S_i), where h is a hash function.
         o Add h_i to ℋ: ℋ ← ℋ ∪ {h_i}.
     •   Identify the hash h* with the highest frequency in ℋ: h* = arg max frequency(h_i, ℋ).
     •   Find S* = (G′, L′) such that h(S*) = h*.

   As a result, it returns replicaMembers = G′, currentLeader = L′.
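   A sketch of this popular-vote aggregation in JavaScript is given below; it assumes that
replicaMembers is kept in a deterministic order, so that JSON serialization of equal states
produces equal hashes.

const { createHash } = require('node:crypto');

// Popular vote: hash each proposed state and select the most frequent one.
function aggregateShareStateByVote(shareMessageBuffer) {
  const tallies = new Map(); // hash -> { count, state }
  for (const m of shareMessageBuffer) {
    const state = { replicaMembers: m.replicaMembers, currentLeader: m.currentLeader };
    const h = createHash('sha256').update(JSON.stringify(state)).digest('hex');
    const entry = tallies.get(h) ?? { count: 0, state };
    entry.count += 1;
    tallies.set(h, entry);
  }
  let winner = null;
  for (const entry of tallies.values()) {
    if (winner === null || entry.count > winner.count) winner = entry;
  }
  return winner ? winner.state : { replicaMembers: [], currentLeader: null };
}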

The third approach is based on an electoral points system, assigning points from each of the participants.
Consider a set of share messages defined as M_share = [m_SHARE^1, m_SHARE^2, …, m_SHARE^l],
where each m_i ∈ M_share contains a ranked list of replica members G_i = [a_{i,1}, a_{i,2}, …, a_{i,n_i}],
with n_i = |G_i| being the number of candidates in m_i, a_{i,1} being the most desired leader, and
a_{i,n_i} being the least desired one.

     •   For each message m_i ∈ M_share:
              o For each candidate a_{i,k} at position k in G_i, compute the weight w_{i,k} = 2^(n_i − k).
     •   Initialize a score set C_i = {c_{i,1}, c_{i,2}, …, c_{i,n_i}}, where c_{i,k} = 0 for k = 1, 2, …, n_i.
     •   For each candidate a_{i,k}:
              o Update the candidate's score: c_{i,k} ← c_{i,k} + w_{i,k};
              o Let c_j ∈ C be the total score, where C is the set of total scores for each candidate; then
                for a_{i,j} ∈ G_i, c_j = ∑_{i=1}^{l} ∑_{k=1}^{n_i} δ(a_{i,j}, a_{i,k}) × w_{i,k},
              o where δ(a_{i,j}, a_{i,k}) is the Kronecker delta function:
                    δ(a_{i,j}, a_{i,k}) = 1 if a_{i,j} = a_{i,k}, and 0 otherwise.
     •   Determine the leader:
            o Identify the candidate L′ with the highest total score: L′ = arg max_{c_j ∈ C}(c_j);
            o In case of a tie, apply a deterministic tie-breaker, such as selecting the candidate with
              the highest address according to the total order ≤ on A.

   As a result, it returns replicaMembers = G′, currentLeader = L′.
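   The same procedure can be rendered as a short JavaScript sketch; the weights follow the
2^(n_i − k) rule above, and ties are resolved in favor of the highest address.

// Electoral points: position k in a ranked list of n_i candidates earns 2^(n_i - k).
function aggregateShareStateByPoints(shareMessageBuffer) {
  const scores = new Map();
  for (const m of shareMessageBuffer) {
    const ranked = m.replicaMembers; // the most desired leader comes first
    ranked.forEach((address, index) => {
      const weight = 2 ** (ranked.length - (index + 1));
      scores.set(address, (scores.get(address) ?? 0) + weight);
    });
  }
  // Ascending address order plus ">=" makes the highest address win ties.
  const members = [...scores.keys()].sort();
  let leader = null;
  for (const address of members) {
    if (leader === null || scores.get(address) >= scores.get(leader)) leader = address;
  }
  return { replicaMembers: members, currentLeader: leader };
}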
   Method sanitizeShareState(replicaMembers, currentLeader) expects a set G′ ⊆ A and a
leader L′ ∈ A.

   •    Validate the members: areValidMembers = (G′ ≠ ∅) ∧ (a_self ∈ G′);
   •    Validate the leader: isLeaderValid = L′ ∈ G′.

   The output could then be described as:

   •    (areValidMembers ∧ isLeaderValid) ⟹ {replicaMembers: G′, currentLeader: L′};
   •    ¬(areValidMembers ∧ isLeaderValid) ⟹ ∅.

   Method shouldReload(replicaMembers, currentLeader) expects G′ ⊆ A and L′ ∈ A.
   During its execution it performs the following operations:

   •    Compare the members: membersChanged = (G′ ≠ G_c);
   •    Compare the leader: leaderChanged = (L′ ≠ L);
   •    Determine the reload necessity: shouldReload = membersChanged ∨ leaderChanged.

    It returns a Boolean that indicates whether a client should be notified about the state change.
    Method updateState(replicaMembers, currentLeader) expects G′ ⊆ A and L′ ∈ A. During its
execution it performs: if (G′ ≠ ∅ ∧ L′ ∈ G′) then (G_c ← G′, L ← L′).
    Method aggregateCloseState(closeMessageBuffer) depends on the implementation of the
leader election function and whether it is completely deterministic. If the sorting is done with the
inclusion of a locally asserted context, this method is supposed to trigger the initial phase of the
RSDP to achieve consistency.
    Otherwise, it expects a list of close messages M_close = [m_CLOSE^1, m_CLOSE^2, …, m_CLOSE^c], where each
m_CLOSE^i has a sender address m_CLOSE^i.address ∈ A.
    During its execution it performs the following operations:

   •    Extract the closing addresses: A_cl = {m_CLOSE^i.address | i = 1, 2, …, c};
   •    Update the replica members: G_c ← G_c ∖ A_cl;
   •    Recalculate the leader: L ← max(G_c) if G_c ≠ ∅, else ⊥.

   The different approaches towards the implementation of aggregateShareState(shareMessageBuffer)
outlined in this section contribute to different properties of the election process. The method based on
the latest source of truth is characterized by low computational intensity; while reasonable in
trusted and stable environments, it is susceptible to network congestion or failures. This approach is
not suitable for networks that have strict requirements for Byzantine fault tolerance.
    The method based on a popular vote provides higher resilience against both intentional and unintentional
discrepancies during the consensus process. It is a suitable approach for systems that operate in
an unstable or restricted environment. It is additionally characterized by increased computational
complexity, though this could be reduced by taking into account only scalar values to avoid hashing
overhead. This approach is the most resilient among the three against intentionally hostile behavior,
since, to perform an action, one has to gather the majority of votes.
    Finally, the last method provides a solution based on the electoral points approach. It is the most
stable and resilient option among the three in the context of unstable connections, due to
its ability to downgrade votes that have lost some portion of a state and hence have a limited view
of the global state set. However, it is not resilient against intentionally hostile actions, since the state
set cardinality could be artificially inflated.

3.2. Implementation of the leader election reducer
The following section describes a practical implementation of the leader election reducer using
JavaScript and the BaseReplicaStateManagerReducer interface defined by the RSDP. The
following algorithm assumes that every cluster member can trust the environment and implements
the first approach from the previous section. Such an assumption is a common case when building
coordinated replication between internal services to achieve high availability. From this point on, we
will refer to this reducer as the Leader Election Reducer (LER).
    Figure 4 shows the initial aggregation implementation logic:




                  Figure 4: Initial aggregation logic of the leader election reducer.

   The initial state of the LER, as defined by the model, comprises an empty set of replica
members and an undefined leader. This serves as an example of an abstract derived reducer subset,
since it does not have an initial getCurrentState to share with the cluster.
   Method aggregateState hence relies not on the data provided by the cluster members but on
the messages themselves and their metadata. Its operation is simple: the new leader is defined as the
replica that has the highest address. The sorting operation here is not redundant, since, to achieve
determinism, every node must have the same dataset and ordering. Subsequently, normalizeState
abstracts away the internal store and provides a simple answer to the client: whether it is the leader.
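   Since Figure 4 is reproduced as an image, a sketch consistent with the described logic is given
below; the original listing may differ in detail, and the status message shape ({ address }) is an
assumption.

// Sketch of the LER initial aggregation: the leader is the highest address.
class LeaderElectionReducer extends BaseReplicaStateManagerReducer {
  constructor(selfAddress) {
    super();
    this.selfAddress = selfAddress;
    this.replicaMembers = [];  // G_c, initially empty
    this.currentLeader = null; // L, initially undefined
  }

  aggregateState(statusMessageBuffer) {
    const members = [...new Set(statusMessageBuffer.map((m) => m.address))].sort();
    return {
      replicaMembers: members,
      currentLeader: members.length > 0 ? members[members.length - 1] : null,
    };
  }

  normalizeState({ currentLeader }) {
    return { isLeader: currentLeader === this.selfAddress };
  }
}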
   Figure 5 shows the share aggregation implementation logic:




               Figure 5: State share aggregation logic of the leader election reducer.

   The sanitizeShareState method is responsible for verifying the consistency of the data received
from the aggregated state. A leader is valid if it belongs to the set of known members, and the members are
valid if the set is not empty. Then, the shouldReload method decides whether the state has changed
and whether the client should be notified. At last, updateState simply sets a new state if one was
provided.
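   Continuing the sketch, the validation and update methods described for Figure 5 could look as
follows (again, an approximation rather than the original listing):

// Sketch continued: validation and update methods of the LER.
Object.assign(LeaderElectionReducer.prototype, {
  sanitizeShareState({ replicaMembers, currentLeader }) {
    const membersValid = replicaMembers.length > 0 && replicaMembers.includes(this.selfAddress);
    const leaderValid = replicaMembers.includes(currentLeader);
    return membersValid && leaderValid ? { replicaMembers, currentLeader } : null;
  },

  shouldReload({ replicaMembers, currentLeader }) {
    return currentLeader !== this.currentLeader ||
      replicaMembers.join() !== this.replicaMembers.join();
  },

  updateState(state) {
    if (state) {
      this.replicaMembers = state.replicaMembers;
      this.currentLeader = state.currentLeader;
    }
  },
});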
   Figure 6 shows the close aggregation implementation logic:




              Figure 6: Replica close aggregation logic of the leader election reducer.
    The aggregateCloseState method handles the departure of replicas from the cluster.
Its primary goal is to deterministically derive a new set of cluster members and a leader. The
leader election logic here is the same as during the initial aggregation. RSDP continuously
resynchronizes the state of the cluster, so any discrepancies caused by lost messages will
eventually be resolved due to the principle of eventual consistency.
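    A matching sketch of the close aggregation described for Figure 6 is given below; it simply
drops the departed addresses and re-elects deterministically.

// Sketch continued: close aggregation mirrors the initial election logic.
Object.assign(LeaderElectionReducer.prototype, {
  aggregateCloseState(closeMessageBuffer) {
    const closing = new Set(closeMessageBuffer.map((m) => m.address));
    const members = this.replicaMembers.filter((a) => !closing.has(a));
    return {
      replicaMembers: members,
      currentLeader: members.length > 0 ? members[members.length - 1] : null,
    };
  },
});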
    To conclude, RSDP provides a well-defined model for arbitrary logical extensions, and the LER
demonstrates how such an extension can be built on top of it.
Further research could lead to different variants of this protocol that could potentially be
applicable in decentralized environments.

3.3. Leader election reducer duration and failure probability
Failure probability and consensus duration are two of the most important metrics for the consensus-
achieving algorithm. The following section provides mathematical models that describe these
properties. We will start with a model of time required to reach a consensus. The following
assumptions are made:

   •   Let n be the total number of replicas in the system.
   •   Let d be the maximum one-way network delay between any two replicas.
   •   Let t_p be the maximum time a replica takes to process a message.
   •   Phases to achieve initial consensus include "DEBATES" and "SHARE".
   •   All message delays and processing times are bounded and known.
   •   The SLAN layer provides guarantees of message delivery.
   •   The probability of the communication media coordinator failure is negligible.

   During the "DEBATES" phase, each replica sends a "HELLO" message to all other replicas,
where the time for a "HELLO" message to reach the other replicas is d. After receiving a "HELLO" message,
each replica processes it in time (n − 1)t_p and sends back a "STATUS" message, where the time for a
"STATUS" message to reach the original sender is d.
   For a replica to receive "STATUS" messages from all the others, the time is d + (n − 1)t_p + d =
2d + (n − 1)t_p, and since there are n − 1 replicas sending "STATUS" messages, processing them
takes (n − 1)t_p.
   Subsequently, the total "DEBATES" phase time (T_DEBATES) is:

                 T_DEBATES = 2d + (n − 1)t_p + (n − 1)t_p = 2d + 2t_p(n − 1)                    (1)

     During the "SHARE" phase, after aggregating the received "STATUS" messages, each replica
broadcasts a "SHARE" message to all others. The time for a "SHARE" message to reach other replicas
is d. After that, each replica processes incoming "SHARE" messages from n − 1 replicas in time
(n − 1)t_p.
     Then the total "SHARE" phase time (T_SHARE) is:

                                   T_SHARE = d + (n − 1)t_p                                      (2)

   Hence, the total time to reach consensus (T_consensus) is:

      T_consensus = T_DEBATES + T_SHARE = (2d + 2t_p(n − 1)) + (d + (n − 1)t_p)
                  = 3d + 3t_p(n − 1) = 3(d + t_p(n − 1))                                         (3)
   It is obvious, then, that the consensus-achieving time depends linearly on the number of cluster
members. Additionally, network delay d and processing time t_p are critical factors, but since the
protocol is built on top of deterministic principles and clean functions, t_p should be negligible.
   As for the failure probability, the following assumptions are made:

   •   Let p_l be the probability that a message is lost.
   •   Let p_f be the probability that a replica fails during the consensus process.
   •   Message losses and replica failures are independent.
   •   Each replica must receive "HELLO", "STATUS", and "SHARE" messages from all the
       replicas.
   •   A replica successfully participates if it can send and receive all required messages.

   The probability that a single message is successfully transmitted is:

                                             P_msg = 1 − p_l                                      (4)

   During the consensus-achieving stages, each replica must send n − 1 "HELLO" and n − 1
"SHARE" messages. Consequently, every replica expects to receive n − 1 "HELLO", n − 1
"STATUS", and n − 1 "SHARE" messages. Then the total number of messages received per replica could be
represented as:

                                  M_total = 3(n − 1) = 3n − 3                                     (5)

   Having the total number of messages required to successfully achieve consensus, the probability
that a replica successfully sends and receives all messages is:

                             P_replica = (1 − p_f) × (P_msg)^M_total                              (6)

   Consequently, the probability that all replicas successfully participate is:

                   P_all replicas = (P_replica)^n = [(1 − p_f) × (1 − p_l)^(3n−3)]^n              (7)

   Then the probability of consensus failure for the first election method is:

                        P_last state failure = 1 − [(1 − p_f) × (1 − p_l)^(3n−3)]^n               (8)
   The probability of failure depends on key characteristics of the network and the underlying
infrastructure. It is obvious that such an approach is suitable only in cases of stable network
connections. The SLAN layer provides delivery recovery mechanisms but does not solve every issue
related to message loss. Since the vote-based election methods
do not require successful participation of every node, the probability could be reevaluated
and considered in the following way:
   Let q be the minimum number of replicas required for consensus (the quorum). For a simple
majority:

                                           q = ⌈n/2⌉                                              (9)

   Consequently, the probability that at least q replicas successfully participate is:

              P_quorum consensus = ∑_{k=q}^{n} C(n, k) × (P_replica)^k × (1 − P_replica)^(n−k)    (10)

   where C(n, k) is the binomial coefficient. Then the probability of consensus failure is:

                                P_vote failure = 1 − P_quorum consensus                           (11)
    Evidently, vote-based approaches are significantly more resilient than the method based on the
last state decision. In such systems, it is possible to withstand partial failure of participating nodes
during the consensus process.
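   To make the difference concrete, the sketch below evaluates equations (8) and (11) numerically;
the parameter values (n = 5, p_l = 0.01, p_f = 0.001) are illustrative assumptions.

// Numeric evaluation of equations (8) and (11); parameters are examples.
function binomial(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) result = (result * (n - i + 1)) / i;
  return result;
}

const pReplica = (n, pl, pf) => (1 - pf) * (1 - pl) ** (3 * n - 3);    // eq. (6)
const pLastStateFailure = (n, pl, pf) => 1 - pReplica(n, pl, pf) ** n; // eq. (8)

function pVoteFailure(n, pl, pf) {
  const p = pReplica(n, pl, pf);
  const q = Math.ceil(n / 2);                                          // eq. (9)
  let quorumOk = 0;
  for (let k = q; k <= n; k++) {
    quorumOk += binomial(n, k) * p ** k * (1 - p) ** (n - k);          // eq. (10)
  }
  return 1 - quorumOk;                                                 // eq. (11)
}

console.log(pLastStateFailure(5, 0.01, 0.001)); // ~0.46: every replica must succeed
console.log(pVoteFailure(5, 0.01, 0.001));      // ~0.01: only a quorum must succeed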
4. Stateful Cluster Failover Models
A stateful cluster in the context of this article is a distributed system where each node holds its own
subset of the system state. The subset might be either a unique portion unknown to every other
cluster member or, as a more common case, a subset of another node's state. Such clusters usually
establish a leader-follower model to achieve high consistency [13-15].
    While the entire cluster follows a single leader, the leader becomes a single point of failure. The principles
of fault tolerance in that context require establishing a failover mechanism as a contingency. During
regular operation, the cluster nodes should be probed and tested to detect
any issues promptly. As soon as a critical event on the leader node is detected, the mechanism
switches to the active phase of achieving consensus. The entire network has to agree upon a new
leader of the cluster to continue its operation [28-30].
    The following list includes common definitions used to model every subsequent failover method
and their properties:

   •    N: number of instances in the cluster.
   •    M: number of external observers (Methods 2 and 3).
   •    h: health-check interval between instances (Method 1).
   •    h_o: health-check interval used by observers (Methods 2 and 3).
   •    T_d: failure detection time.
   •    T_p: time to perform the failover procedure.
   •    T_c: consensus-achieving time (an independent parameter).
   •    p_i: probability of an instance failing during the observation window.
   •    p_o: probability of an observer failing during the observation window.
   •    P_consensus: probability of failure in achieving consensus.
   •    T_obs: total observation time.
   •    C_m: average size of a message (in bytes).

     Each node or observer sends (N − 1) or (M − 1) messages three times during consensus. In the
following subsections, each failover topology will be described in terms of failover delay, total failure
probability, and communication overhead. The close aggregation of the LER serves
as an optimization phase to avoid going through the entire consensus cycle every time a node leaves
the cluster.

4.1. Self-regulated mutual health evaluation
We will first evaluate a model based on a single logical plane. Each node in such a cluster is
responsible for operational execution, monitoring, and governance. In such topology, every node
must have a communication link with every other in the system to successfully achieve consensus
and monitor other instances [31, 32].
    The cluster could be preconfigured to initiate health probes at a specified interval but with
different initial timestamp shifts based on a node's position in the network. This allows efficient
utilization of the repetitive status probes and decreases failure detection time. Additionally, since every node
conducts the monitoring, the network can tolerate up to N − 1 failed nodes, where N is the node set
cardinality.
    In that regard, the LER provides all the data necessary to establish successful monitoring and
election in an automated way. The capabilities of the LER already provide the data for dynamic node
discovery; hence, every node has a list of the cluster members that it must probe. The LER
automatically adjusts the states of nodes in a cluster as soon as some subset leaves, but the new
leader election cycle could also be triggered in case the nodes suffered critical events.

   Figure 7 shows the topology of a self-regulated cluster of nodes:




                          Figure 7: Self-regulated mutual health evaluation.

   Each instance performs health checks on every other instance at intervals of h. The expected time
for the first instance to detect a failure is the minimum time any instance takes to detect it. Assuming
health checks are uniformly distributed, the expected detection time is:

                                            E[T_d1] = h / (2N)                                   (12)
   After detecting the failure, the instances need to reach consensus. The consensus-achieving time (T_c) is
considered an independent parameter to simplify the model and concentrate on the factors
directly tied to the topology. The total time from failure occurrence T_f1 to failover completion is the
sum of detection time, consensus time, and failover procedure time:

                             T_f1 = E[T_d1] + T_c + T_p = h / (2N) + T_c + T_p                   (13)
   The probability that all instances fail simultaneously (P_all instances) could be represented simply as
an exponent of a single-instance failure probability:

                                        P_all instances = p_i^N                                  (14)

    Then the total failure probability P_f1 is described in terms of the consensus (P_consensus) and
all-instances (P_all instances) failure probabilities:

                    P_f1 = P_consensus + P_all instances − (P_consensus × P_all instances)       (15)
   Each instance sends health checks to N − 1 other instances, so M_health = N × (N − 1) health-check
messages are exchanged per interval h. During consensus, each instance sends N − 1 messages three
times. The overall number of messages sent during consensus is:

                                   M_consensus = N × (N − 1) × 3                                 (16)

   The total number of messages exchanged during the observation period (T_obs) is:

                            M_total1 = (T_obs / h × M_health) + M_consensus                      (17)

   The total communication overhead in bytes is:

                                     Overhead1 = M_total1 × C_m                                  (18)

   The presented models concentrate on the properties tied directly to the system failover.
To achieve a holistic view, consensus time and maintenance time models must be included as well.
4.2. Centralized observer health monitoring
The topology with a centralized observer includes a set of stateful worker nodes in a cluster and a
single coordinating machine that manages the entire network. This topology introduces the division
between operational and control planes and thus fosters the single responsibility principle in the
system [33, 34].
   Figure 8 shows the topology of a cluster monitored and coordinated by a single observer:




                          Figure 8: Centralized observer health monitoring.

    Let us consider the failure detection time T_d2. Assume that the observer performs health checks on
all instances at intervals of h_o. Then the expected time to detect a failure is half the health-check
interval:

                                          E[T_d2] = h_o / 2                                       (19)

   Since there is no consensus process among instances, the total failover time is:

                                  T_f2 = E[T_d2] + T_p = h_o / 2 + T_p                            (20)
    The system relies on a single observer, so its failure directly impacts the system's ability to perform
failover. The probability of all instances failing (P_all instances) is the same as in the first method.
    The total failure probability includes the observer failure and all instances failing:

                          P_f2 = p_o + P_all instances − (p_o × P_all instances)                  (21)
   Moving on to the communication overhead model, the observer sends health checks to all N
instances. Messages per health-check interval h_o:

                                            M_health = N                                          (22)

   Then the total number of messages over the observation time T_obs is:

                                   M_total2 = T_obs / h_o × M_health                              (23)

   Consequently, the total communication overhead in bytes is:

                                    Overhead2 = M_total2 × C_m                                    (24)
   This model imposes smaller communication overhead and reduces coordination complexity but
suffers from a single point of failure.
4.3. Distributed observer health assessment
The following topology is also comprised of two distinct execution planes. In such
networks, a consensus algorithm is used between the observers themselves and decides on the
responsible node that must coordinate the worker plane [35-37].
   Figure 9 shows the topology of a cluster monitored and coordinated by a group of observers:




                             Figure 9: Distributed observer health assessment.

                                                       ๐‘‡๐‘‘3 . With ๐‘€ observers, the expected time to detect a
failure is:

                                                   ho                                             (25)
                                               E[Td3 ] =
                                                  2M
    Observers need to reach consensus after detecting a failure that takes ๐‘‡๐‘ . Then the total time from
failure occurrence to failover completion:

                                                          ho                                      (26)
                              Tf3 = E[Td3 ] + Tc + Tp =      + Tc + Tp
                                                         2M
   The probability that all observers fail simultaneously is P_all observers = p_o^M. Given that the probability of all
instances failing (P_all instances) is the same as in the previous methods, the total failure probability
(P_f3) is:

      P_f3 = P_consensus + P_all observers + P_all instances − (P_consensus × P_all observers × P_all instances)   (27)
   Each of ๐‘€ observers sends health checks to all ๐‘ instances. Then the Mhealth is:

                                             Mhealth = M ร— N                                                (28)
   Additionally, each observer sends ๐‘€ โˆ’ 1 messages three times during consensus:

                                    Mconsensus = M ร— (M โˆ’ 1) ร— 3                                            (29)
   Then the total messages during ๐‘‡๐‘œ๐‘๐‘  is:

                        M_{total3} = \left(\frac{T_{obs}}{h_o} \times M_{health}\right) + M_{consensus}       (30)
   Consequently, the communication overhead (Overhead_3) is:

                              Overhead_3 = M_{total3} \times C_m                              (31)
   This model provides a balance between communication overhead, complexity, separation of
concerns, and fault tolerance.
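   As with the previous model, the following minimal JavaScript sketch evaluates equations (25)-(31) for the distributed observer topology; the function name and all parameter values are illustrative assumptions, not part of the RSDP implementation:

// Sketch of the distributed observer failover model, equations (25)-(31).
// All parameter values below are illustrative assumptions.
function distributedObserverModel({ M, N, ho, Tc, Tp, po, pConsensus,
                                    pAllInstances, Tobs, Cm }) {
  const expectedDetection = ho / (2 * M);             // E[Td3], eq. (25)
  const failoverTime = expectedDetection + Tc + Tp;   // Tf3, eq. (26)
  const pAllObservers = Math.pow(po, M);              // Pall_observers
  const failureProbability =
    pConsensus + pAllObservers + pAllInstances -
    pConsensus * pAllObservers * pAllInstances;       // Pf3, eq. (27)
  const healthMessages = M * N;                       // Mhealth, eq. (28)
  const consensusMessages = M * (M - 1) * 3;          // Mconsensus, eq. (29)
  const totalMessages =
    (Tobs / ho) * healthMessages + consensusMessages; // Mtotal3, eq. (30)
  const overheadBytes = totalMessages * Cm;           // Overhead3, eq. (31)
  return { failoverTime, failureProbability, overheadBytes };
}

// Example: 3 observers over 10 instances; consensus takes 0.5 s.
console.log(distributedObserverModel({
  M: 3, N: 10, ho: 1, Tc: 0.5, Tp: 2, po: 0.01,
  pConsensus: 1e-4, pAllInstances: 1e-6, Tobs: 3600, Cm: 256,
}));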
5. Conclusions
Achieving consensus within the confines of a Decentralized Coordination Network based on a
homogenous multiagent system is a critical process that is gaining momentum in the research field
due to the ever-growing sizes of distributed systems and the requirements for high availability.
Within the context of this article, a novel leader election protocol was created and modeled based on
the Replica State Discovery Protocol.
    One of the primary goals of this article is to introduce a leader election state reducer as a logical
extension of RSDP to address the rapid failover problem. As a result, multiple viable approaches to
building such a reducer were proposed, based on different properties of common state aggregation.
The first proposed method of leader election rests upon the supposition that the network is
controlled and that fault tolerance against malicious action during consensus is not expected. That is
a reasonable expectation, since RSDP is built on top of a LAN simulation, which in turn is based on
an AMQP provider. Such providers often come with a set of authentication and authorization
mechanisms of their own. This method is characterized by low computational overhead and by its
finality properties even in an extremely dynamic network.
    The following two methods are based on a quorum approach to handling the leader election
process. The method based on a popular vote is the most suitable approach to ensure resilience
against both intentional and accidental failures during consensus interactions. A popular vote is a
common solution in networks that require Byzantine fault tolerance.
    The third proposed leader election method based on RSDP relies, at its foundation, on the
weighted election algorithm. The approach is characterized by a greater degree of resilience in highly
congested and unreliable networks, where random packet losses occur frequently, owing to its ability
to perform partial inclusion.
    Given the mathematical models and graphs for the three different cluster failover models and
approaches, it is fair to conclude that none of them is objectively superior in every plane of
comparison. The method involving self-regulated mutual health evaluation is most suitable when the
cost of additional infrastructure and the critical event detection time are the most influential metrics
of successful system operation. It is worth noting, though, that its communication overhead grows
quadratically with the number of cluster members.
    The second method, based on centralized observer health monitoring, is an appropriate solution
only in cases where infrastructure and communication overhead costs are the primary decision
factor. Since the stability of the entire system depends on a single centralized external observer, that
same observer becomes a single point of failure, which can lead to disaster when high availability is
a hard requirement.
    Lastly, the method based on distributed observer health assessment serves as a trade-off between
high availability, infrastructure costs, communication overhead, and failover delay by offloading the
decision-making process to a parallel distributed layer of coordination. Such an approach is best
suited to current cloud infrastructure demands due to its flexibility and clearly established
separation of concerns.
    Overall, this article aims to inspire further research in the complex, exciting, and extremely
relevant field of distributed computing and management. The implications and results of this
research make it possible to bolster the security of modern critical infrastructure by effectively
describing novel ways of achieving resilience through redundancy and the distribution of
responsibility. The provided mathematical and graphical models are intended to support the complex
decision-making process and to reduce possible risks when choosing an appropriate model and
approach for rapid cluster failover.

Declaration on Generative AI
The authors have not employed any Generative AI tools.
References
[1] S. Gilbert, N. Lynch, The CAP theorem, Computer, vol. 45, no. 2, pp. 30-36, Feb. 2012. doi:
     10.1109/MC.2011.389.
[2] M. Kleppmann, A critique of the CAP theorem, arXiv:1509.05393v2 [cs.DC], 2015. doi:
     10.48550/arXiv.1509.05393.
[3] E. Brewer, A certain freedom: thoughts on the CAP theorem, PODC '10: Proceedings of the 29th
     ACM SIGACT-SIGOPS symposium on Principles of distributed computing, p. 335, 2010. doi:
     10.1145/1835698.1835701.
[4] E. A. Lee, S. Bateni, S. Lin, M. Lohstroh, C. Menard, Quantifying and generalizing the CAP
     theorem, arXiv:2109.07771v1 [cs.DC], 2021. doi: 10.48550/arXiv.2109.07771.
[5] F. D. Muรฑoz-Escoรญ, et al, CAP theorem: revision of its related consistency models, The Computer
     Journal, vol. 62, no. 6, pp. 943-960, June 2019. doi: 10.1093/comjnl/bxy142.
[6] A. Lewis-Pye, T. Roughgarden, Resource pools and the CAP theorem, arXiv:2006.10698v1
     [cs.DC], 2020. doi: 10.48550/arXiv.2006.10698.
[7] L. Frank, R. U. Pedersen, C. H. Frank, N. J. Larsson, The CAP theorem versus databases with
     relaxed ACID properties, Proceedings of the 8th International Conference on Ubiquitous
     Information Management and Communication (ICUIMC '14), article no. 78, pp. 1-7, Jan. 2014.
     doi: 10.1145/2557977.2557981.
[8] R. Guerraoui, A. Schiper, Fault-tolerance by replication in distributed systems, Reliable Software
     Technologies Ada-Europe '96 (Ada-Europe 1996), Lecture Notes in Computer Science, Springer,
     1996, pp. 38-57.
[9] K. P. Birman, T. A. Joseph, Exploiting replication in distributed systems, NASA Contractor
     Report CR-186410, Jan. 1989.
[10] B. Ciciani, D. M. Dias, P. S. Yu, Analysis of replication in distributed database systems, IEEE
     Transactions on Knowledge and Data Engineering, vol. 2, pp. 247-261, Jun. 1990. doi:
     10.1109/69.54723.
[11] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, I.
     Stoica, Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster
     computing, Proceedings of the 9th USENIX Symposium on Networked Systems Design and
     Implementation (NSDI '12), University of California, Berkeley, 2012.
[12] F. Cristian, Understanding fault-tolerant distributed systems, Communications of the ACM, vol.
     34, no. 2, pp. 56-78, Feb. 1991. doi: 10.1145/102792.102801.
[13] A. Luntovskyy, B. Shubyn, T. Maksymuk, M. Klymash, Highly-distributed systems: What is
     inside?, Proceedings of the 2020 IEEE International Conference on Problems of
     Infocommunications. Science and Technology (PIC S&T), Kharkiv, Ukraine, Oct. 2020. doi:
     10.1109/PICST51311.2020.9467890.
[14] V. S. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, E. Nahum, Locality-
     aware request distribution in cluster-based network servers, ACM SIGOPS Operating Systems
     Review, vol. 32, no. 5, pp. 205-216, Oct. 1998. doi: 10.1145/384265.291048.
[15] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, J. Wilkes, Large-scale cluster
     management at Google with Borg, EuroSys '15: Proceedings of the Tenth European Conference
     on Computer Systems, article no. 18, pp. 1-17, Apr. 2015. doi: 10.1145/2741948.2741964.
[16] A. Holub, "The mutex and lock management," in Taming Java Threads, Apress, Berkeley, CA,
     2000. doi:10.1007/978-1-4302-1129-7_3.
[17] M. Walmsley, "Semaphores," in Multi-Threaded Programming in C++, Springer, London, 2000.
     doi:10.1007/978-1-4471-0725-5_5.
[18] S. Plagnol, "Beyond mutexes, semaphores, and critical sections," in Embedded Real Time
[19] M. Petrescu, "Replication in Raft vs Apache Zookeeper," in Soft Computing Applications (SOFA
     2020), Advances in Intelligent Systems and Computing, vol. 1438, Springer, 2023, pp. 426-435.
     doi:10.1007/978-3-030-55556-4_44.
[20] H. Howard, R. Mortier, "Paxos vs Raft: Have we reached consensus on distributed consensus?,"
     Proceedings of the 7th Workshop on Principles and Practice of Consistency for Distributed Data
     (PaPoC '20), Article No. 8, pp. 1-9, April 2020. doi:10.1145/3380787.3393681.
[21] F. Junqueira, B. Reed, ZooKeeper: Distributed Process Coordination, O'Reilly Media, Inc., 2013.
[22] M. Kotov, S. Toliupa, V. Nakonechnyi, "Replica State Discovery Protocol Based on Advanced
     Message Queuing Protocol," Cybersecurity: Education, Science, Technique, vol. 3, no. 23, 2024.
     doi:10.28925/2663-4023.2024.23.156171.
[23] N. Naik, "Choice of effective messaging protocols for IoT systems: MQTT, CoAP, AMQP and
     HTTP," in 2017 IEEE International Systems Engineering Symposium (ISSE), Vienna, Austria,
     Oct. 2017, pp. 426-435. doi:10.1109/SysEng.2017.8088251.
[24] J. L. Fernandes, I. C. Lopes, J. J. P. C. Rodrigues, and S. Ullah, "Performance evaluation of RESTful
     web services and AMQP protocol," in 2013 Fifth International Conference on Ubiquitous and
     Future Networks (ICUFN), Da Nang, Vietnam, Jul. 2013. doi:10.1109/ICUFN.2013.6614932.
[25] N. Q. Uy and V. H. Nam, "A comparison of AMQP and MQTT protocols for Internet of Things,"
     in 2019 6th NAFOSTED Conference on Information and Computer Science (NICS), Hanoi,
     Vietnam, Dec. 2019. doi:10.1109/NICS48868.2019.9023812.
[26] A. Prajapati, "AMQP and beyond," in 2021 International Conference on Smart Applications,
     Communications and Networking (SmartNets), Glasgow, United Kingdom, Sep. 2021.
     doi:10.1109/SmartNets50376.2021.9555419.
[27] I. N. McAteer, M. I. Malik, Z. Baig, and P. Hannay, "Security vulnerabilities and cyber threat
     analysis of the AMQP protocol for the Internet of Things," in Australian Information Security
     Management Conference, 2017. ISBN: 978-0-6481270-8-6.
[28] A. Stanik, M. Hรถger, and O. Kao, "Failover pattern with a self-healing mechanism for high
     availability cloud solutions," in 2013 International Conference on Cloud Computing and Big
     Data, Fuzhou, China, Dec. 2013. doi:10.1109/CLOUDCOM-ASIA.2013.63.
[29] W. Lin, H. Jiang, N. Zhao, and J. Zhang, "An optimized multi-Paxos protocol with centralized
     failover mechanism for cloud storage applications," in Collaborative Computing: Networking,
     Applications and Worksharing (CollaborateCom 2018), Lecture Notes of the Institute for
     Computer Sciences, Social Informatics and Telecommunications Engineering, vol. 268, Feb.
     2019, pp. 610-625. doi:10.1007/978-3-030-30146-8_41.
[30] P. Somasekaram, R. Calinescu, and R. Buyya, "High-availability clusters: A taxonomy, survey,
     and future directions," Journal of Systems and Software, vol. 187, May 2022, 111208.
     doi:10.1016/j.jss.2021.111208.
[31] C. K., High-availability (HA) PostgreSQL Cluster with Patroni, Medium, Jan. 14, 2024. URL:
     https://medium.com/@chriskevin_80184/high-availability-ha-postgresql-cluster-with-patroni-
     1af7a528c6be.
[32] Percona Distribution for PostgreSQL, "High availability." Percona Documentation, version 15.8.
     URL: https://docs.percona.com/postgresql/15/solutions/high-availability.html.
[33] "Introduction to pg_auto_failover." pg_auto_failover Documentation. URL: https://pg-auto-
     failover.readthedocs.io/en/main/intro.html.
[34] L. Fittl, "Introducing pg_auto_failover: Open source extension for automated failover and high-
     availability in PostgreSQL." Microsoft Azure Blog, May 6, 2019. URL:
     https://opensource.microsoft.com/blog/2019/05/06/introducing-pg_auto_failover-postgresql-
     open-source-extension-automated-failover-high-availability/.
[35] A.E. Nocentino, B. Weissman, "Kubernetes Architecture," in SQL Server on Kubernetes, Apress,
     Berkeley, CA, 2021. doi:10.1007/978-1-4842-7192-6_3.
[36] Kubernetes Documentation, "Cluster Architecture," Kubernetes Documentation. URL:
     https://kubernetes.io/docs/concepts/architecture/.
[37] L. Larsson, H. Gustafsson, C. Klein, and E. Elmroth, "Decentralized Kubernetes Federation
     Control Plane," 2020 IEEE/ACM 13th International Conference on Utility and Cloud Computing
     (UCC), Leicester, UK, 2020, pp. 354-359. doi:10.1109/UCC48980.2020.00056.