<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Availability Analysis of the ONOS Architecture</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Müller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Köhler-Bußmeier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>HAW Hamburg</institution>
          ,
          <addr-line>Berliner Tor 7, 20099 Hamburg</addr-line>
        </aff>
      </contrib-group>
      <fpage>41</fpage>
      <lpage>64</lpage>
      <abstract>
        <p>In this work, we compare two ONOS architectures, old (before v1.14) and new (v1.14 and after), in terms of their availability and answer the question of whether the new architecture outperforms the old one. ONOS is a widely used open source SDN controller that changed its architecture with version 1.14 to enable in-service software upgrades. To this end, we create a GSPN model, with which we show that the new architecture has a higher availability, especially in environments with less available hardware.</p>
      </abstract>
      <kwd-group>
        <kwd>SDN</kwd>
        <kwd>ONOS</kwd>
        <kwd>Consensus</kwd>
        <kwd>Raft</kwd>
        <kwd>Availability</kwd>
        <kwd>GSPN</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>respectively. The last two Sections 6 and 7 answer the research question, conclude the paper
and point out interesting future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Basics</title>
      <sec id="sec-2-1">
        <title>Dependability</title>
        <p>In this section, the relevant basics are presented and explained, so that the reader is well prepared for
the upcoming technicalities.</p>
        <p>
          Dependability is a group of concepts and attributes; the most relevant ones are explained in this
section [
          <xref ref-type="bibr" rid="ref2 ref24">2, 24</xref>
          ].
        </p>
        <p>
          Reliability is the continuity of correct service [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. A highly reliable system is a system that
continuously works as expected for a long period of time, ideally forever. An important metric
for this attribute is the ’Mean Time To Failure’ (MTTF); as the name suggests, it represents
the average time the service works without interruption. The average time it takes to repair
an interruption is the ’Mean Time To Repair’ (MTTR) [
          <xref ref-type="bibr" rid="ref24 ref7">7, 24</xref>
          ].
        </p>
        <p>
          Availability is the probability for correct service in a given moment. A highly available
system is a system that has a high probability to work as expected in any given moment,
at best always. This probability can be calculated as <inline-formula><tex-math>\frac{\mathit{MTTF}}{\mathit{MTTF} + \mathit{MTTR}}</tex-math></inline-formula> [
          <xref ref-type="bibr" rid="ref2 ref24 ref7">2, 7, 24</xref>
          ]. Once the
system has stabilized and its availability is roughly constant, this value is called
the ’Steady State Availability’. Previous publications use it as the basis of their
availability evaluations [
          <xref ref-type="bibr" rid="ref2 ref20 ref21">2, 20, 21</xref>
          ].
        </p>
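        <p>The availability formula above can be illustrated with a minimal sketch. The function name and the numeric values are hypothetical and chosen only for illustration; they are not parameters of the model in this paper.</p>
        <preformat>
```python
def steady_state_availability(mttf: float, mttr: float) -> float:
    """Return MTTF / (MTTF + MTTR); both arguments use the same time unit."""
    if mttf + mttr == 0:
        raise ValueError("MTTF and MTTR must not both be zero")
    return mttf / (mttf + mttr)


# Hypothetical example: on average one failure every 1000 hours,
# each taking 2 hours to repair.
availability = steady_state_availability(1000.0, 2.0)
print(f"{availability:.6f}")
```
        </preformat>
        <p>A longer MTTF or a shorter MTTR both move the result closer to 1, which is why the two metrics are usually reported together.</p>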
        <p>Threats to the dependability of a system are faults, errors and failures.</p>
        <p>
          Faults are the basis of these threats and can activate errors. They are introduced either
during development via incorrect code, by physical problems, e.g. an old hard drive that stops
working, or from outside the system, e.g. during interaction with the user [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>
          Errors are part of the system’s state and may activate failures if the error has external
consequences. Errors can be detected if they generate any kind of message or signal [
          <xref ref-type="bibr" rid="ref2 ref24">2, 24</xref>
          ].
        </p>
        <p>
          A failure is defined in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] as ’[...] an event that occurs when the delivered service deviates
from correct service’. Failures can lead to the activation of further faults. [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] describes failures
as unmet specifications. An example of a failure is an uncaught software error that leads to
a complete crash of the software.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Consensus</title>
        <p>
          Consensus protocols help to synchronize a state across distributed systems. In consensus
protocols there is one leader which will order incoming updates. It will then ask the participating
systems if a certain order is consistent with their respective state. If the majority of systems
acknowledge the update, it is committed by the leader and each system applies the update.
In the same way, consensus can be used to elect one of the participating systems as leader [
          <xref ref-type="bibr" rid="ref21 ref24 ref8">8, 21,
24</xref>
          ]. One such consensus protocol is Raft<sup>1</sup>. Raft can work if a majority (<inline-formula><tex-math>1 + \lfloor \mathit{instancecount} / 2 \rfloor</tex-math></inline-formula>)
of the participating instances is available [
          <xref ref-type="bibr" rid="ref16 ref21 ref25 ref8">16, 25, 8, 21</xref>
          ].
        </p>
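        <p>The majority requirement can be made concrete with a small sketch. The helper names are hypothetical; the only rule taken from the text is that Raft needs one more than half of the instances, rounded down, to be available.</p>
        <preformat>
```python
def raft_majority(instance_count: int) -> int:
    """Smallest number of instances that forms a majority."""
    return instance_count // 2 + 1


def tolerated_failures(instance_count: int) -> int:
    """Number of instances that may fail while a majority remains."""
    return instance_count - raft_majority(instance_count)


for count in (3, 4, 5):
    print(count, raft_majority(count), tolerated_failures(count))
```
        </preformat>
        <p>Under this rule a cluster of four instances tolerates no more failures than a cluster of three, which is one reason odd cluster sizes are common.</p>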
        <sec id="sec-2-2-1">
          <title>1 Raft Visualizations: http://thesecretlivesofdata.com/raft/, https://raft.github.io/</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Software Defined Networks (SDN)</title>
        <p>
          Networks have a control plane and a data plane. The control plane is of a logical nature, while
the data plane is of a physical nature. The control plane can insert, modify or delete
forwarding rules to influence routes through the network. The data plane forwards packets from
an ingress port to certain egress ports according to the installed forwarding rules. In traditional
networks, both planes reside in each router. SDNs split these planes: the SDN controller
controls the routers, which are left with only the data plane. Such routers are called
SDN switches. The control plane can be ’in-band’, on the same links as the data plane, or
’out-of-band’, on dedicated links. For example, where a controller in an out-of-band control plane
has a direct connection to a certain switch, the same connection in an in-band control plane
may pass through additional switches. [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ]
        </p>
        <p>
          The controller communicates in three major ways [
          <xref ref-type="bibr" rid="ref14 ref15 ref4">4, 14, 15</xref>
          ]. ’Southbound’ (e.g. via
OpenFlow) to the switches, ’Northbound’ (e.g. via HTTP) to SDN applications and
’East/Westbound’ to other SDN controllers. SDNs have the benefit that protocols and devices
are easier to update or replace. Furthermore, SDNs can improve resource optimization, ease
of maintenance and ease of operation. [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ]
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>ONOS</title>
        <p>
          The ’Open Network Operating System’ (ONOS) is, next to OpenDaylight, the largest open
source SDN controller and is widely mentioned in publications [
          <xref ref-type="bibr" rid="ref21 ref23 ref25">25, 23, 21</xref>
          ]. ONOS was created
in 2014 on the basis of the Floodlight SDN controller. Its goals were high throughput,
low latency, large network state sizes and high availability; detailed performance numbers can
be found in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. To handle this in an efficient and partition-tolerant manner, the controller
is physically distributed but logically centralized. It uses the Raft consensus algorithm to
synchronize the shared network state between instances, currently implemented via Atomix
[
          <xref ref-type="bibr" rid="ref16 ref4 ref9">9, 4, 16</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>This section presents published work with a similar context in Section 3.1 and work with
similar tools or formalisms in Section 3.2.</p>
      <sec id="sec-3-1">
        <title>Performance Evaluations with similar context</title>
        <p>
          [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] uses Markov-based reliability prediction in the context of highly adjustable software. Their
model considers the reliability of software, hardware and network. In the first case study they
have artificial failure probabilities and in the second they extend previously published work
by considering e.g. new fault tolerance mechanisms.
        </p>
        <p>
          The authors of [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] predict the steady state availability of SDNs and traditional IP
networks. Their model considers the availability of traditional IP routers, SDN switches, SDN
controllers and links. The model is split in two hierarchical layers to avoid a potential ’[...]
uncontrolled growth in model size [...]’. One contains the connections between the network
elements, based on minimal cut and path sets, and the other contains the failures and
recoveries of each network element, modeled with Markov chains.
        </p>
        <p>
          [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]’s goal is to assess the steady state availability of a generic SDN using a stochastic
availability model. Like [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], they propose a hierarchical availability model. Their model does
not consider failures of the SDN controller. The model was implemented using the Symbolic
Hierarchical Automated Reliability and Performance Evaluator.
        </p>
        <p>
          The goal of [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] is to evaluate the response time and availability of distributed SDN
clusters like the old (pre v1.14) ONOS architecture and the OpenDayLight SDN controller
with Raft as its distributed data store. They use stochastic activity networks as a model
generation framework. Their ’RAFT Recovery SAN Model’ contains hardware and software
failures that can impact the SDN controllers. To evaluate the response time they also model
failure injection to cause further failures that can be correlated. They consider a
coarse-grained static data plane reliability and propose evaluating only the worst case, to be
able to scale this performance evaluation approach to larger models.
        </p>
        <p>
          The goal of [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] is to offer mitigations on the basis of their proposed modelling framework
that considers reliability, availability and security in distributed consensus protocols like
Raft. Among others, their model considers failure probabilities from published work (including
ONOS-related ones), the detectability of failures and multiple repair rates.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Performance Evaluations with similar tooling or formalism</title>
        <p>
          The GreatSPN tool and the GSPN formalism are widely used in the context of performance
evaluation, for example in the following publications.
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] presents a case study to analyze and evaluate UML diagram types based upon Stochastic
Well-Formed Nets (SWN). GreatSPN is used as a translator with the
’GreatSPN-toPROD’ utility, as a solver with the ’algebra’, ’Multisolve’ utilities and to create images
of the models.
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] evaluates performance characteristics of a LAN with a single bus and multiple devices.
GreatSPN and GSPN are used for validation and solving.
[
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] validates a transaction protocol in a mobile environment. GreatSPN and SWNs are
used for validation and solving with the ’WNSIM’ utility.
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] analyzes the performance of query routing to create a new algorithm for needed
metadata. GreatSPN and SWNs are used for validation, evaluation and simulation with the
’WNSIM’ utility.
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] creates a translator from UML activity diagrams to GSPNs for performance
evaluations, which extends their previous work with similar goals. The created GSPNs are
recommended to be analyzed with GreatSPN.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Model</title>
      <p>In this section, we present and justify the elements and parameters of our
model.</p>
      <sec id="sec-4-1">
        <title>The ONOS Architectures</title>
        <p>
          The work on a new ONOS architecture began in 2017 with the formation of the ONOS-internal
’In Service Software Upgrade’ (ISSU) team<sup>2</sup>. An ISSU upgrades software while it is
running and answering requests. The goal is to upgrade without a loss of availability. To reach
this goal, the ISSU team decided to change ONOS’ architecture: ’[...] In past versions, ONOS
embedded Atomix nodes to form Raft clusters, replicate state, and coordinate state changes.
In ONOS 1.14, that functionality is moved into a separate Atomix cluster’ [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
        <sec id="sec-4-1-1">
          <title>2 Also called ISSU Brigade, led by Jordan Halterman</title>
          <p>The differences between the architectures are best presented with two example clusters. The
example cluster of the old ONOS architecture in Figure 1 contains three item types: ONOS
controllers (C1-C3, squares), SDN switches (S1-S3, circles) and hosts (H1-H3, hexagons).
Each controller is connected to each switch (green dotted lines), but only one controller is
the master of a given switch (bold green dotted lines). The controllers also communicate with
each other (purple dotted lines). The black lines between switches and hosts indicate the
connectivity between these items. The example cluster of the new ONOS architecture in Figure
1 is noticeably noisier. Here the cluster contains one more item type, Atomix instances
(A1-A3, triangles). In addition to the Atomix instances, there are more links, which connect
each Atomix instance with each controller (blue dotted lines). The controllers still have their
controller-to-controller communication (red dotted lines).</p>
          <p>The interesting point is the mentioned additional noise in the new example cluster. This
separation of Atomix and ONOS brings certain flexibility benefits like dynamic horizontal
scaling of the ONOS instances and separate and easier upgrade of ONOS and Atomix
instances. It also brings some costs in the form of additional links, communication, instances and
nodes, if each instance is deployed on its own node. This leads us to the question of whether the new
architecture benefits the overall availability. We want to analyze the following points:
– Do the additional elements in the cluster harm the overall availability?
– If so, do they decrease cluster availability only in certain scenarios?</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Model Elements</title>
        <p>To make the description of the model easier, we classify elements of our model as either
objects or behaviour of these objects. In the following we will present included and excluded
model elements and argue why.</p>
        <p>
          Objects are either up or failed, as for example in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Up objects can fail and failed objects
can recover. Initially, all objects are up.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Included model elements:</title>
        <p>
          - We consider five versions of ONOS instances in our work: v1.13, v1.14 and three future
placeholder versions of ONOS that are based upon v1.14. The placeholder versions are
named ’v1.14 vX’, where X is 2, 3 or 4, or ’vX’ for short; ’v1.14 v1’ would be the regular
v1.14. Each version has its own software failure rate.
- We also consider Atomix instances. Together with the ONOS instances, these elements
are the core of our model.
- We consider links between Atomix and Atomix and between ONOS and Atomix.
- As an own object we model the consensus protocol status. The consensus protocol
is up, if the majority of Atomix instances are up. Otherwise, it is failed. Reason: This
is due to the basic requirement of the Raft consensus protocol, which needs an active
participation of the majority of instances as explained in Section 2. In the following
sections we will use ’consensus’ and ’consensus protocol’ interchangeably.
- The cluster availability is also modeled as an object. It is up, if consensus and at least
one ONOS instance is up, in any other case it is failed. Reason: Only if consensus is up,
which includes that enough Atomix instances are up, and at least one ONOS instance is
up, the cluster can work as intended and so only then it is available.
- Failures of ONOS and Atomix instances can be due to hardware failures, software
failures and network partitions. Every instance has its own hardware node and all nodes
are equal. Reason: The consideration of these failures is intuitive. Besides that, we
assume that each instance has its own node to achieve a simpler model than one that
considers shared nodes. This node separation can also be found in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
- If consensus fails, the Atomix and ONOS instances become idle and can no longer
fail from software problems. Reason: This is based upon the behaviour of Atomix and
ONOS described in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]: in that case they cannot fully answer incoming requests and
execute only a limited part of their code, which in turn means fewer opportunities to execute
faulty code.
- For the old ONOS architecture we consider the combined failure of ONOS and
Atomix. If one fails, the other fails too. Reason: This is due to the deeply coupled
deployment of ONOS and Atomix instances in the old architecture as described in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
- We consider worst case network partitions by counting the number of failed links.
          Once a certain number of links has failed, one respective instance fails. Links are either
between the Atomix instances or between Atomix and ONOS instances. In the example
in Figure 1 this would mean that for every two failed Atomix links one Atomix instance fails,
or that after three failed ONOS-Atomix links one ONOS instance fails. Reason: The
consideration of link failures and partitions is interesting for this work, as it addresses
one of the main points of the architecture change, the additional links. We do not consider
the specific nodes that links connect, as this would drastically increase the model’s complexity
and would exceed the time limits of this work. As for why we see this as a failure,
with reference to [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]: Atomix / ONOS instances stop serving requests and wait for the
partition recovery once they are partitioned. This is similar to the case when consensus
fails, as explained above. The ONOS-Atomix partition does not lead to a failed Atomix
instance, as Atomix is not dependent on ONOS instances, but it is the other way around.
- In the new architecture, the cluster can be upgraded if all Atomix and ONOS instances
are up. First Atomix will be upgraded, to be compatible with the new ONOS version,
then ONOS is upgraded. The influence of the upgrades on ONOS’ availability
depends on the parameters used for each version and could theoretically improve or
worsen its availability. We perform a rolling upgrade, upgrading one instance after
another. This also means that the versions are compatible with each other. The
upgrade can fail, which leads to a longer upgrade time, but it cannot fail entirely, so a
rollback does not need to be considered. Reason: The rolling upgrade procedure described
is based upon the ONOS mailing list<sup>3</sup> and [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. That an upgrade fails only temporarily is based
upon the assumption that an expert performs the upgrade and also verifies its success,
as the ISSU team mentions in their presentation<sup>4</sup> of the upgrade procedure.
        </p>
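        <p>The status rules listed above for the new architecture can be summarized in a short sketch. The class and method names are hypothetical and only restate the rules from the list (consensus status, cluster availability and the upgrade precondition); the sketch does not model failures, links or rates.</p>
        <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterState:
    """Instance counts in the new architecture (hypothetical helper type)."""
    atomix_total: int
    atomix_up: int
    onos_total: int
    onos_up: int

    def consensus_up(self) -> bool:
        # Consensus is up if a majority of the Atomix instances is up.
        return self.atomix_up >= self.atomix_total // 2 + 1

    def cluster_up(self) -> bool:
        # The cluster is up if consensus is up and at least one ONOS instance is up.
        return self.consensus_up() and self.onos_up >= 1

    def upgrade_possible(self) -> bool:
        # An upgrade may start only if all Atomix and ONOS instances are up.
        return (self.atomix_up == self.atomix_total
                and self.onos_up == self.onos_total)


healthy = ClusterState(atomix_total=3, atomix_up=3, onos_total=3, onos_up=3)
degraded = ClusterState(atomix_total=3, atomix_up=2, onos_total=3, onos_up=1)
no_quorum = ClusterState(atomix_total=3, atomix_up=1, onos_total=3, onos_up=3)
for state in (healthy, degraded, no_quorum):
    print(state.consensus_up(), state.cluster_up(), state.upgrade_possible())
```
        </preformat>
        <p>In the third example all ONOS instances are up, yet the cluster counts as failed because only one of three Atomix instances is up.</p>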
      </sec>
      <sec id="sec-4-4">
        <title>Excluded model elements:</title>
        <p>
          - Our model focuses on the control plane, so the data plane is not considered. Reason:
As can be observed by comparing the old and the new cluster in Figure 1, the connections
(green dotted lines) between SDN controllers, SDN switches and hosts do not differ. We
reason that the architectural change in the control plane does not affect the data plane in
any way, so the impact of the data plane on the overall availability is equal in both
architectures and can be omitted, as we are only interested in the differences between the
two architectures.
- We model a failure detection probability of 100%, which in other words excludes this
from the model. Reason: This approach simplifies our model and can also be found in
[
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ].
- We do not consider load related software, hardware or link failures. Reason: In our
opinion, this would lead to a much more complex model, and creating it would cost scarce time
that we would rather spend on areas we think have a greater benefit for this work.
- The stacking of failures is not considered. A stacking failure could be, that on top of
being already partitioned, one ONOS instance also then fails due to a software failure.
Reason: An idea of how complex it is to model this detail can be seen in Section
A.1 in the Appendix. Incorporating this into the model would exceed the time limits
of this work.
- As described, Atomix instances must be upgraded before ONOS can, but this has no
impact on the overall availability. Reason: Since this upgrade is done so that Atomix is
compatible with the new ONOS version, we do not expect it to also affect
availability, for example through bug fixes. Hence, we exclude this detail from our
model.
- The old architecture can not be upgraded. Reason: We reason this exclusion with
our goal to compare between less complexity and less flexibility in the old architecture,
against more complexity and more flexibility in the new architecture, as discussed in
Section 4.1.
- The gossip protocol that is implemented in ONOS is excluded. Reason: As the gossiping
is only used for bidirectional exchange of state, it is used to detect smaller state drifts
between instances and to bring new instances up to date faster. If consensus is no longer
working as mentioned above, the gossip protocol cannot work either. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]
- Unavailability due to attacks or other security related threats to availability is not
modeled. Reason: Security is a separate topic with a depth of its own, which
would extend the scope of this work too much. This approach can also be found in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ],
as mentioned in Section 3.
        </p>
        <sec id="sec-4-4-1">
          <title>3 See https://groups.google.com/a/onosproject.org/g/onos-dev/c/iu_iP8pFs</title>
          <p>U/m/OGtYzVy_CwAJ,
https://groups.google.com/a/onosproject.org/g/onosdev/c/hzfjjEyruGo/m/xsnEiMCdAwAJ and
https://docs.google.com/document/d/1xZ3Wnr6VZS34paYVdZhF8WWo3M6VWKNJXQTmBnnM7Y</p>
        </sec>
        <sec id="sec-4-4-2">
          <title>4 See https://wiki.onosproject.org/display/ONOS/ISSU within [16]</title>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>Used Parameters</title>
        <p>According to the included elements explained above, we will now present the used parameters.
As for wording we use ’parameter’ and ’rate’ interchangeably.
28.28 per year
22.47 per year
20.10 per year
18.05 per year
2 per hour
30 per year
6 per minute
4 per year
12 per hour
4 per hour
6 per day
2 per year
2 per day
3 per year
4 per hour</p>
        <p>
          The ONOS parameters from [
          <xref ref-type="bibr" rid="ref10 ref13 ref19">10, 13, 19</xref>
          ] do not include Atomix. We can therefore use them
as purely ONOS parameters. This is important to point out as in older versions Atomix and
ONOS were very closely coupled, as described in Section 4.1.
        </p>
        <p>
          For Atomix we could not find any published work that states specific values. What
we could find was a software failure rate of one failure per week in [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], but this value is for the complete
SDN controller that implements the Raft consensus protocol. The same programmers that
wrote ONOS wrote Atomix. Therefore, we assume that both have the same reliability, so we
set the Atomix failure rate equal to the failure rate of ONOS v1.13.
        </p>
        <p>We set the Atomix software repair duration to 10 seconds. This number was
obtained from a coarse-grained test in which we started and stopped an Atomix container
within a virtual machine with 6 vCPUs and 14 GB RAM on an i7-9750H. We assume that
an ONOS deployment could be done via Kubernetes, as suggested on the Atomix website<sup>5</sup>. The
number above is composed as follows: one second is the time Kubernetes needs to perform
liveness and readiness probes to check the status of the Atomix instance. If the instance is
detected as unhealthy, it is shut down, which takes around two seconds, and a new
one is started on the same host, which takes roughly seven seconds. As we deploy the new Atomix
instance on the same node, we need to wait for the old one to be shut down.</p>
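        <p>The composition of the repair duration and its conversion into a rate can be sketched as follows. The constant and function names are hypothetical, and the conversion assumes that the rate is the reciprocal of the mean duration.</p>
        <preformat>
```python
# Components of the Atomix repair duration as described above, in seconds.
PROBE_SECONDS = 1.0      # liveness and readiness probes detect the failure
SHUTDOWN_SECONDS = 2.0   # the unhealthy instance is shut down
STARTUP_SECONDS = 7.0    # a new instance starts on the same host

SECONDS_PER_UNIT = {"minute": 60.0, "hour": 3600.0, "day": 86400.0}


def rate_per(unit: str, mean_duration_seconds: float) -> float:
    """Events per `unit`, assuming rate = 1 / mean duration."""
    return SECONDS_PER_UNIT[unit] / mean_duration_seconds


repair_seconds = PROBE_SECONDS + SHUTDOWN_SECONDS + STARTUP_SECONDS
print(repair_seconds, rate_per("minute", repair_seconds))
```
        </preformat>
        <p>Under this assumption, a repair duration of 10 seconds corresponds to a repair rate of 6 per minute.</p>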
        <p>Determining the Atomix and ONOS upgrade durations is difficult, as no
documentation exists online for either one. Furthermore, the tools that an installed ONOS version provides
do not contain hints on how to upgrade an existing cluster. As the upgrade of Atomix and</p>
        <sec id="sec-4-5-1">
          <title>5 See https://atomix.io/docs/latest/user-manual/deployment/kubernetes/</title>
          <p>ONOS instances is mainly done by shutting down the old version and starting an instance of the new version
in its place<sup>6</sup>, one could argue that we could simply take the time this requires. For
Atomix that would be nine seconds: the ten seconds described above minus the one second for failure
detection, which is not necessary in this proactive process. For ONOS a restart takes around seven
seconds, measured in the same environment. The problem is that these numbers seem rather
low for a major version upgrade. For this reason, we set the upgrade durations arbitrarily.</p>
          <p>For the Atomix upgrade this means five minutes, reflecting the easy upgrade
described in Section 4.2. We set the ONOS upgrade duration to 15 minutes for a failure-free
upgrade and to four hours when the expert encounters failures. We chose 15 minutes because
we assume that upgrading ONOS is more complex than upgrading Atomix due to the expected
data migration. The four hours are a mean that covers rather simple errors, like a wrong IP
address in the configuration, and more complex ones, like a (partly) failed data migration.</p>
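          <p>As a worked example, the three upgrade durations can be converted into rates, assuming that a rate is the reciprocal of the mean duration. The labels are descriptive only and are not the parameter names of the model.</p>
          <preformat>
```latex
\begin{gather*}
\text{Atomix upgrade: } \frac{60\ \text{min per hour}}{5\ \text{min}} = 12\ \text{per hour} \\
\text{ONOS upgrade, failure-free: } \frac{60\ \text{min per hour}}{15\ \text{min}} = 4\ \text{per hour} \\
\text{ONOS upgrade, with failures: } \frac{24\ \text{h per day}}{4\ \text{h}} = 6\ \text{per day}
\end{gather*}
```
          </preformat>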
          <p>In Table 4 we list all sourced or calculated bugs per hour per ONOS version. The bugs per
hour of v1.10, v1.12 and v1.13 are based upon published work; these are also the basis for
our calculation of v1.11, v1.14 and the upgrades after v1.14. Version 1.14 and the upgrades
are based upon a logarithmic trend, as we expect that fewer bugs are fixed with each
release because fixing them becomes more time-intensive.</p>
        </sec>
      </sec>
      <sec id="sec-4-6">
        <title>GSPN model in GreatSPN</title>
        <p>In the following section we present one part of our GSPN model in detail<sup>7</sup>. Figure 2
contains the following Atomix elements:</p>
      </sec>
      <sec id="sec-4-7">
        <title>Places</title>
        <p>A<sub>up</sub>: Holds tokens that represent up Atomix instances.</p>
        <p>A<sub>SWfailed</sub>: Holds tokens that represent Atomix instances that failed due to a software failure.
A<sub>HWfailed</sub>: Holds tokens that represent Atomix instances that failed due to a hardware failure.
CO<sub>up</sub>: Holds at most one token, representing that consensus is up.</p>
        <p>CO<sub>failed</sub>: Holds at most one token, representing that consensus is failed.</p>
      </sec>
      <sec id="sec-4-8">
        <title>Transitions</title>
        <p>F<sub>SW</sub>: This transition represents a software failure of an Atomix instance. When it fires, it
moves one token from the A<sub>up</sub> place into the A<sub>SWfailed</sub> place. Additionally, it has an
inhibitor arc to CO<sub>failed</sub>, and its rate is defined by the AtomixSWFailRate parameter.
R<sub>SW</sub>: This transition represents a software recovery of an Atomix instance. When it fires,
it moves one token from the A<sub>SWfailed</sub> place into the A<sub>up</sub> place. Its rate is defined by
the AtomixSWRecRate parameter.</p>
        <p>F<sub>HW</sub>: This transition represents a hardware failure of an Atomix instance. When it fires, it
moves one token from the A<sub>up</sub> place into the A<sub>HWfailed</sub> place. Its rate is defined by the
HWFailRate parameter.</p>
        <p>R<sub>HW</sub>: This transition represents a hardware recovery of an Atomix instance. When it fires,
it moves one token from the A<sub>HWfailed</sub> place into the A<sub>up</sub> place. Its rate is defined by
the HWRecRate parameter.</p>
        <p>F<sub>CO</sub>: This transition represents the failure of the consensus protocol. When it fires, it moves
one token from the CO<sub>up</sub> place into the CO<sub>failed</sub> place. Additionally, it has an inhibitor
arc to A<sub>up</sub> with the multiplicity atomixmajority, where atomixmajority is equal
to the rounded-down result of <inline-formula><tex-math>\frac{\mathit{atomixcount}}{2} + 1</tex-math></inline-formula>. It is an immediate transition.</p>
        <sec id="sec-4-8-1">
          <title>6 See https://wiki.onosproject.org/display/ONOS/ISSU within [16]</title>
        </sec>
        <sec id="sec-4-8-2">
          <title>7 The full model can be seen on GitHub (https://github.com/HansZimmer5000/PNSE21) or in the</title>
          <p>Appendix (Figure 11)</p>
          <p>R<sub>CO</sub>: This transition represents the recovery of the consensus protocol. When it fires, it
moves one token from the CO<sub>failed</sub> place into the CO<sub>up</sub> place. Additionally, it has one
input and one output arc to A<sub>up</sub>, each with the multiplicity atomixmajority. It is an
immediate transition.</p>
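          <p>The enabling conditions of the two immediate consensus transitions can be sketched as follows. This is a simplified illustration with hypothetical function names and a plain dictionary as marking; it is neither GreatSPN code nor the full model. It assumes the usual GSPN semantics: an inhibitor arc with multiplicity m blocks a transition while the place holds at least m tokens, and an input arc paired with an output arc of the same multiplicity only tests the place.</p>
          <preformat>
```python
def atomix_majority(atomix_count: int) -> int:
    """Rounded-down result of atomix_count / 2 + 1."""
    return atomix_count // 2 + 1


def fire_consensus_transition(marking: dict, atomix_count: int) -> dict:
    """Fire F_CO or R_CO once if one of them is enabled; return the new marking."""
    new = dict(marking)
    majority = atomix_majority(atomix_count)
    if new["CO_up"] >= 1 and majority > new["A_up"]:
        # F_CO: the inhibitor arc does not block, because A_up holds
        # fewer than `majority` tokens.
        new["CO_up"] -= 1
        new["CO_failed"] += 1
    elif new["CO_failed"] >= 1 and new["A_up"] >= majority:
        # R_CO: takes `majority` tokens from A_up and puts them back,
        # so the token count of A_up stays unchanged.
        new["CO_failed"] -= 1
        new["CO_up"] += 1
    return new


start = {"A_up": 3, "A_SWfailed": 0, "CO_up": 1, "CO_failed": 0}
two_failed = fire_consensus_transition({**start, "A_up": 1, "A_SWfailed": 2}, 3)
one_recovered = fire_consensus_transition({**two_failed, "A_up": 2, "A_SWfailed": 1}, 3)
print(two_failed["CO_failed"], one_recovered["CO_up"])
```
          </preformat>
          <p>With three Atomix instances, consensus fails once two of them are down and recovers as soon as two are up again.</p>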
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Validation</title>
      <p>
        We validated our model with simulations. GreatSPN features tools to analyze performance
aspects of a GSPN or SWN model, e.g. the ’WNSIM’ tool, which was used
by [
        <xref ref-type="bibr" rid="ref22 ref6">6, 22</xref>
        ]. The results of a WNSIM simulation are the mean number of tokens for each place
and the mean throughput for each transition throughout the simulation time. As our cluster
steady state availability, described in Section 2.1, we take the mean number of tokens in the
Clusterup place. This is possible since the number of tokens is at most one, as described
in Section 4.2, which can represent a percentage with its value between 0 and 1. To get
this value, we do not have to wait for the simulation to finish as it constantly creates
finegrained logs from which one can gather the cluster availability. This allows us to interrupt
simulations if they are running too long. For all simulations we use a confidence of 90% and
an approximation of 50%. Lower approximation leads to more precise results, but also to
immensely increased simulation times. With the current value we have simulations that do not
finish within six hours. This time is not negligible, as we execute for example 160 simulations8
8 architectures stepcount repititions = 2 16 5
for the upcoming HW F ailRate parameter sensitivity analysis alone. The simulations are
executed on an i7-9750H CPU with 12 threads and 32 GB RAM.
5.1
      </p>
      <sec id="sec-5-1">
        <title>Sensitivity Analysis</title>
        <p>For our sensitivity analysis, we interrupt each simulation after one hour. Each sensitivity analysis
analyzes one parameter. All other parameters are unaltered, see Section 4.3 for the parameter
default values. The value of the analyzed parameter is altered with each ’step’. We configured
our steps to range from -5 to 10 inclusive; each step changes the value by 10% of the default
value, which is used in step 0. For example, for the ONOSSWRecRate parameter we have
the following values per step:</p>
        <p>Step -5, -50% : 1440 × 0.5 = 720</p>
        <p>...</p>
        <p>Step -1, -10% : 1440 × 0.9 = 1296</p>
        <p>Step 0, +0% : 1440 × 1 = 1440 (default value)</p>
        <p>...</p>
        <p>Step 10, +100% : 1440 × 2 = 2880</p>
        <p>For parameters with integer values, the difference is 1 × stepno. Each step is repeated five
times. Each repetition starts with its own seed, which is the value of Bash’s $RANDOM
variable at that moment. So from a different perspective, each combination of analyzed parameter, step,
repetition and architecture is a separate simulation with its own seed. As an optimization,
not all parameters are analyzed in all architectures. For example, the upgrade parameters
are only analyzed in the new architecture, as they have no impact in the old architecture.</p>
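        <p>The step scheme can be sketched in a few lines. This is only an illustration under the assumptions stated in the text (steps from -5 to 10, 10% of the default value per step, two architectures, five repetitions); the names step_value, STEPS, ARCHITECTURES and REPETITIONS are hypothetical and do not stem from our simulation scripts.</p>

```python
def step_value(default, step):
    """Value of a real-valued parameter at a given step.

    Each step shifts the value by 10 % of the default; integer
    arithmetic in the numerator avoids rounding noise."""
    return default * (10 + step) / 10


STEPS = range(-5, 11)   # -5 to 10 inclusive, i.e. 16 steps
ARCHITECTURES = 2       # old and new architecture
REPETITIONS = 5         # repetitions per step

values = {step: step_value(1440, step) for step in STEPS}
simulations = ARCHITECTURES * len(STEPS) * REPETITIONS

print(values[-5], values[-1], values[0], values[10])  # 720.0 1296.0 1440.0 2880.0
print(simulations)                                    # 160
```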
        <p>In all simulations, the step difference did not propagate to the cluster availability. In
other words, when the value of a parameter was increased by 10%, the cluster availability
changed by less than 10%. We present one sensitivity result in more depth, as an example of
how these results have to be read and interpreted. In Figure 3 we can see the mean result
of multiple sensitivity analyses of the UpgradeClusterRate parameter. In the sensitivity
analysis we focus on how much a parameter change influences the overall result. The results
are drawn as a blue line if the sensitivity analysis was done only once, and as a green line, like in our
example, if the analysis was repeated multiple times; the green line then represents the mean.
As explained above, the sensitivity analysis has 16 steps, from -5 to 10 inclusive. Three
exemplary steps of the mentioned sensitivity results and their meaning are described next:
Step 0 : The base value of the sensitivity analysis is always step 0, as it represents the results
with the default values of our model from Section 4.3. In this case step 0 results in a
cluster availability of 0.9999872 (99.99872%).</p>
        <p>Step -5 : The parameter is set to -50% of its default value (1440 × 0.5). This results in a cluster
availability of 0.9999817. This differs from our base cluster availability by about
0.00055%, which is far from -50%.</p>
        <p>Step 4 : The parameter is set to +40% of its default value (1440 × 1.4). This results in a cluster
availability of 0.9999904. This differs from our base cluster availability by about
0.00032%, which is far from +40%.</p>
        <p>So the change of the parameter value did influence the cluster availability, but not by the
same percentage.</p>
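        <p>The differences quoted for the three steps can be recomputed as follows. This is a small sketch that takes the availabilities from the text as input; the function name availability_delta_percent is hypothetical.</p>

```python
def availability_delta_percent(base, changed):
    """Absolute difference between two availabilities, in percent."""
    return abs(changed - base) * 100


base = 0.9999872  # step 0
delta_minus5 = availability_delta_percent(base, 0.9999817)  # step -5
delta_plus4 = availability_delta_percent(base, 0.9999904)   # step 4

print(round(delta_minus5, 5))  # 0.00055
print(round(delta_plus4, 5))   # 0.00032
```

        <p>Both differences are several orders of magnitude smaller than the parameter changes of 50% and 40% that caused them.</p>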
        <p>This can be observed for all other results too. More result examples like the one just
described follow. Some of them are discussed multiple times in the upcoming sections; each
figure is placed next to the discussion that depends on it the most.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Simulation Results Plausibility</title>
        <p>Besides the sensitivity, we now look at the plausibility of some simulation results,
some of which were already produced during the sensitivity analysis.</p>
        <p>
          In contrast to the sensitivity analysis, it is very important for the plausibility analysis to
consider the context of the analyzed parameter, as different parameters have a different impact
on the cluster’s availability. For example, an increased recovery rate is expected to increase
the cluster availability, while an increased failure rate is expected to decrease it.
For the basic architecture results we take both unaltered architectures, execute each
five times and calculate the mean of all repetitions, like a sensitivity analysis without steps.
The simulation results reveal an availability of 99.9981% for the new and 99.99550% for the
old architecture. These results are just slightly above the 99.99% availability that ONOS
specified themselves [
          <xref ref-type="bibr" rid="ref16 ref4">16, 4</xref>
          ]. Keep in mind that these sources are based upon an older version
of ONOS in the old architecture. With reference to Table 4, ONOS’ availability most certainly
improved over the last few years, so we are confident that the basic architecture results are
plausible.
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>The plausibility of our sensitivity analysis results is shown in the following.</title>
        <p>As examples, Figures 4 and 6 show the link failure rate and hardware failure
rate results. The results match our expectations, as an increasing failure rate decreases
the cluster availability in both architectures.</p>
        <p>In Figure 3 we can also observe that the result line sometimes changes abruptly. This
could be a hint that our repetition count is too low and that we need more
results per step to get a smoother course of the steady state availability. However,
the differences between steps are minimal, as the values only begin to differ from the fifth decimal place
onwards. For this reason we do not investigate this problem further.</p>
        <p>The upgrade cluster rate results in Figure 3 show the mean of multiple sensitivity analyses,
hence a green instead of a blue line, because the result is counterintuitive: with a faster upgrade
rate the cluster availability decreases slightly. Even though the observed difference from the
start of the trend line to its end is minimal<sup>9</sup>, we want to give an idea why the trend line
could behave this way. With a higher upgrade rate, the cluster more often has instances that
are being upgraded, which takes the cluster one instance closer to unavailability. By that we
mean: if a cluster has three Atomix instances and one is being upgraded, the cluster is
left with only two up Atomix instances. If one of them fails, the cluster becomes unavailable, as
too few up Atomix instances are left to keep the consensus up.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Evaluation</title>
      <p>In this section we compare the availability of the two ONOS architectures. First, we
take a new look at the results of the sensitivity analysis from Section 5.1 in Section
6.1. Afterwards, we present specific deployment scenarios and interpret their results in
Section 6.2. Finally, in Section 6.3 we answer the research question whether the architecture
change was beneficial for ONOS’ availability. It will turn out that the new architecture has
an overall higher availability than the old architecture.</p>
      <sec id="sec-6-1">
        <title>Sensitivity Results interesting for architecture comparison</title>
        <p>In the following we present results of the sensitivity analysis that were originally intended
for Section 5, but fit much better here, as they let us compare the availability of both
architectures. These results might not have been gathered if we had executed the simulations
manually, as mentioned in Section 5.</p>
        <p>Link failure and recovery rate sensitivity results can be seen in Figures 4 and 5. We can
observe that the new architecture does not depend on link availability as much as the old
architecture does. It also has a higher mean availability throughout the comparison.</p>
        <p>We come to this conclusion for the link failure rate since the new architecture trend line
changes only slightly, while the old architecture trend line has a bigger difference between its start and
end value. In other words, the higher failure rate is more noticeable in the old architecture.
Additionally, the new architecture trend line stays on a high level between 0.99998 and 1.
The old architecture comes into that range in two steps, but its trend line is considerably lower,
between 0.99995 and 0.99994.
Footnote 9: With an absolute value of about 0.0000025 (0.999985 ...).</p>
        <p>For the link recovery rate we can observe and interpret the same. The new architecture
trend line is on a higher level and changes only slightly, while the old architecture trend line
is considerably lower and changes more from the start of the trend line to its end.</p>
        <p>
          Finally, one remark on [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. It concludes that link availability is of vital importance to
the SDN steady state availability. In the following, we present the hardware rates. When
we compare their availability results with the ones of the links, we can see that in the
new architecture the hardware availability has a greater impact than the link availability. We
conclude this because the hardware results are generally on a lower level and their trend lines are
steeper. The same could be said about the old architecture, although here the results of
the hardware rates and link rates are noticeably closer.
        </p>
        <p>Hardware failure and recovery rate sensitivity results can be seen in Figures 6 and 7.
Both architectures behave the same when the hardware failure and recovery rates change.
What differs is the degree to which the availability is impacted.</p>
        <p>The hardware failure rate sensitivity analysis for the new and old architecture in Figure
6 clearly shows that the new architecture is less dependent on its hardware, as its trend line
declines less steeply than that of the old architecture.</p>
        <p>In the old architecture the hardware recovery rate develops slightly worse than in the
new architecture, as with better recovery rates the availability does not rise as fast as in the
new architecture. It also does not reach cluster availabilities beyond 0.99998 as often as the
new architecture does, especially in later steps when the recovery rate is 180% (step 8) to
200% (step 10) of its default value.</p>
        <p>As observed in the link rate results, the availability of the new architecture is generally
higher and less dependent on other elements, compared to the old architecture.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Scenarios</title>
        <p>In addition to our results from Section 5, we now present scenarios that compare the availability
of both architectures from specific viewpoints for a more in-depth comparison.
For the upgrade scenarios we simulate four deployments. Two are the unaltered v1.14
and v1.13 architectures. The other two are altered v1.14 architectures. One has no upgrade
at all, which means that ONOS is never upgraded to improve its availability. The other
considers upgrades, but all ONOS versions have the same availability, so if, for example, new
bugs are introduced with a release, others are fixed. In the following this scenario is called
’stagnating’.</p>
        <p>The results of this deployment scenario can be seen in Figure 8. On the left side we
can observe that the deployment with a stagnating upgrade in the new architecture has the
worst availability. The old architecture is second worst, followed by the unaltered new
architecture and the new architecture without any upgrade.</p>
        <p>It makes sense that the new architecture has a higher availability than the old
architecture, as already presented in Sections 5.1 and 5.2. That a new architecture with unchanging
failure rates after each upgrade performs the worst can be explained by the fact that
during the upgrade of an instance we are always one instance closer
to unavailability, as described in Section 5.2. In other words, we ’buy’ the upgrade with
availability, but the availability does not improve afterwards to balance out the cost.</p>
        <p>We expected that the unaltered new architecture would result in the highest availability,
as it contains the improving ONOS availability with each upgrade. Because of this, it is
a surprise that the new architecture without any upgrade has the best availability, even
though the difference is very small. This surprise may be due to the fact that the availability
improvement after an upgrade is not high enough to balance out the mentioned cost.
To test whether this assumption is true, we simulated the new architecture in four additional
scenarios as an extension to the scenarios presented above. Each ’increase’ scenario increases
the ONOS availability by decreasing the ONOS software failure rate of the upgrades by
(2 + (increasenumber - 1) × 6)%. This is done to see how much the upgrades need to
improve the ONOS availability before we can balance out the cost of an upgrade. All decrease
percentages are listed in Table 2.</p>
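        <p>As a rough sketch, the decrease percentage of each ’increase’ scenario can be computed as shown below. The formula is read as (2 + (increasenumber - 1) × 6)%, which assumes that a minus and a multiplication sign belong in the places indicated; the function name decrease_percent is hypothetical, and Table 2 remains the authoritative list.</p>

```python
def decrease_percent(increase_number):
    """Decrease of the ONOS software failure rate in percent for the
    scenario 'increase<increase_number>' (assumed reading of the formula)."""
    return 2 + (increase_number - 1) * 6


percentages = [decrease_percent(n) for n in range(1, 5)]
print(percentages)  # [2, 8, 14, 20]
```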
        <p>The results can be seen on the right side of Figure 8. We do not further explain the
difference between the ’increase1’ and the basic new architecture results since their difference
is negligible<sup>10</sup>. We can observe that with increasing ONOS availability, the results come closer
to the ’no upgrade’ scenario and even surpass it with ’increase3’ and ’increase4’. So if
our assumed ONOS software failure rates decreased significantly, by about the values of
’increase3’ in Table 2 or more, the cluster availability would benefit from upgrades. In other
words, the ONOS software failure rate would need to decrease by around 23% for the first,
33% for the second and 41% for the fourth upgrade, compared to the v1 software failure
rate, to balance the overhead. An overview of the default and needed software failure rates
and their decreases in percent can be found within Section A.2 in Table A.2.</p>
        <p>Different Atomix and ONOS counts are set in our next deployment scenarios. The
combinations of the two counts are set arbitrarily.</p>
        <p>The results can be seen in Figure 9. Scenarios that deploy the old architecture
are named like &lt;onoscount&gt;’o’ and the new architecture deployments are named like
&lt;onoscount&gt;&lt;atomixcount&gt;’n’. We also use the term ’consensus instances’, by
which we mean ONOS instances in the old and Atomix instances in the new architecture.
We introduce this term as it shortens the explanation, and these instances are the basis
for the consensus status in the respective architecture.</p>
        <p>Interestingly, the consensus instance count is the main influence on the cluster
availability. This can be observed between 23n and 57n, as they are sorted by the second number.
One exception is 29n, which has a very high consensus instance count but a very low
ONOS count. We assume that this diminishes the availability, because the cluster is close to
unavailability with just two ONOS instances, especially during an upgrade, with reference
to Section 6.2.</p>
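        <p>The naming scheme and the notion of ’consensus instances’ can be made concrete with a short sketch. It assumes single-digit instance counts, as in 23n, 29n and 57n; the helper names and the example name 5o are hypothetical.</p>

```python
def parse_scenario(name):
    """Split a scenario name into (architecture, onos_count, atomix_count).

    Assumes single-digit counts; the old architecture has no Atomix
    instances, so its atomix_count is None."""
    digits, kind = name[:-1], name[-1]
    if kind == "o" and len(digits) == 1:
        return "old", int(digits), None
    if kind == "n" and len(digits) == 2:
        return "new", int(digits[0]), int(digits[1])
    raise ValueError(f"unrecognized scenario name: {name!r}")


def consensus_instances(name):
    """ONOS count in the old architecture, Atomix count in the new one."""
    architecture, onos_count, atomix_count = parse_scenario(name)
    return onos_count if architecture == "old" else atomix_count


for name in ("5o", "23n", "29n", "57n"):
    print(name, parse_scenario(name), consensus_instances(name))
```

        <p>For 29n this yields two ONOS and nine Atomix instances, which matches the exception discussed above: a high consensus instance count combined with a very low ONOS count.</p>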
        <p>Although this work only looks at the availability a deployment can offer and does not
consider, for example, load on the instances or operational costs, we want to point out that
an availability beyond 99.999% is achieved in the old architecture with fewer instances than
in the new architecture. To reach this availability, the old architecture needs at least five
ONOS instances, while the new architecture needs at least three ONOS and five Atomix
instances. This result could be interesting for future work, e.g. whether this is still true once the
model considers load on the instances or the availability is put into context with operational
costs.</p>
        <p>The new architecture within ONOS v1.14 or newer should especially be considered in
deployments with lower hardware and link availability, as shown in Section 6.1, in particular if
the hardware’s reliability or maintenance is worse than what we assumed in Section 4.3. This
is important, as in this case the unavailability within a year could increase from 26
minutes (99.995%) to 657 minutes (99.875%)<sup>11</sup>.</p>
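        <p>The yearly downtime figures follow from the relation (1 - availability) × 365 × 24 × 60. A minimal sketch, with the hypothetical function name downtime_minutes_per_year:</p>

```python
def downtime_minutes_per_year(availability):
    """Expected unavailability per year in minutes for a steady state
    availability given as a fraction between 0 and 1."""
    return (1 - availability) * 365 * 24 * 60


print(round(downtime_minutes_per_year(0.99995)))  # 26
print(round(downtime_minutes_per_year(0.99875)))  # 657
```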
        <p>
          That the new architecture performs better in terms of availability than the old
architecture can also be seen in Section 6.2 in our upgrade and deployment scenarios. In our upgrade
scenarios we could observe that an upgrade comes with an availability cost. So, from an
availability point of view, operators of ONOS in the new architecture should not upgrade to
every release but pick the ones that are needed. With reference to [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], one should
also wait a few months after the release before upgrading.
        </p>
        <p>When we looked at the instance counts, we could see that for the old architecture at least
five ONOS instances should be deployed. In the new architecture the operator should deploy
at least five instances of Atomix and at least three instances of ONOS from an availability
point of view. In these deployments, both architectures can reach a cluster availability beyond
99.999%.</p>
        <p>So, to answer our research question which architecture has a higher cluster steady state
availability: we strongly recommend ONOS v1.14 or newer with the new architecture over
ONOS v1.13 and older with the old architecture, especially in environments with worse
hardware or link availability than we assumed. The higher availability of the new architecture
in these environments mainly comes from the increased count of links and hardware
nodes, and thus from its higher tolerance towards single failures of such elements.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>In this work we presented and evaluated the research question, if the additional complexity
in the new architecture harms the overall availability and if so, if this is only the case in
certain scenarios.</p>
      <p>This research question was formulated and motivated in Section 4. In this section we also
defined our model elements. Our GSPN model was then validated in Section 5, with
emphasis on the results of automated simulations. With them, we could show that no parameter
overwhelmingly influences the results, and we explained why the results are plausible. Although
some results were counterintuitive at first sight, they could be explained. In Section 6 we took
another look at the simulation results and could observe that the new architecture
performs better than the old architecture in terms of availability. This was confirmed with
specific deployment scenarios. The section concluded with, among other things,
recommendations for ONOS deployments. However, the fact that the new architecture can be easily
upgraded was not the main factor for the availability gain over the old architecture in our
model. This was foreshadowed by the counterintuitive results in Section 6 mentioned above.
The availability gain over the old architecture is based upon the addition of new hardware
nodes and especially links. Single hardware or link failures can be tolerated much more easily, as
the cluster has a higher replication degree.</p>
      <p>We conclude this work with a clear recommendation for the new ONOS architecture in
ONOS version 1.14 and onwards.</p>
      <p>Our work can be extended in multiple ways. One idea is to create a more realistic model,
e.g. to include load-dependent reliability or a more sophisticated network
partitioning. Additional ideas can be found in Sections 4.2 and 6.2. Here we want to remind
the reader that some of our rates, as explained in Section 4.3, are based upon older literature
or are guessed. So our results should be taken with a grain of salt. More specific and reliable
rates are needed in future work for more realistic performance analyses.
Footnote 11: Calculated with (1 - availability) × 365 × 24 × 60.</p>
      <p>
        When the model becomes more complex, a switch from GSPN to Stochastic Well-Formed
Nets may help to improve the model’s readability and maintainability. Additionally, one could
think about creating a hierarchical model as done in [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ] as mentioned in Section 3.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Appendix</title>
      <p>In the appendix we present additional information that may interest readers. We give
further details on the stacked failure extension in Section A.1, list the ONOS software failure
rates needed to surpass the ’no upgrade’ scenario from Section 6.2 in Section A.2, and show the full GSPN
model in Section A.3.</p>
      <sec id="sec-8-1">
        <title>Stacked Failure Extension</title>
        <p>To model stacked failures, we need to consider the dependencies they have on each
other. For a start, any failure (software, hardware, network partition) can happen on top of
any other, with one exception: software failures are overwritten by a hardware failure, as
software cannot be executed on top of failed hardware. This can be seen in Figure 10. To
improve readability, the failure transitions are drawn horizontally and the recovery transitions are
immediate and drawn vertically. The orientation has no semantic meaning.</p>
        <p>This becomes more complex if we want to incorporate it into the presented model, as we
need to keep track of which instance currently has which failures, to make sure it is recovered
the correct way. For example, a partitioned node with a software failure must recover from both
before that instance is up again. Especially for ONOS, this would drastically increase the
number of transitions and places, as we would have to model each specific failure scenario for each
version.</p>
      </sec>
      <sec id="sec-8-2">
        <title>ONOS Software failure rates to surpass ’no upgrade’ scenario</title>
        <p>With reference to Section 6.2, we now list the ONOS software failure rate needed for each
version so that our model benefits from upgrades in terms of availability. These numbers
are based upon our parameters described in Section 4.3.
In Figure 11 we added the full GSPN model, so that one can better understand how the
squares are connected.</p>
        <p>Version</p>
        <p>Calculated: Based upon logarithmic trend from
Excel: f(x) = -0.005386747003005 × ln(x) + 0.017517342986504</p>
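        <p>The logarithmic trend can be evaluated with a few lines of code. This is only a sketch under loudly stated assumptions: the leading minus sign of the first coefficient is assumed (it yields a failure rate that decreases over time, in line with the improving ONOS availability discussed in Section 6.2), and x is assumed to be the ONOS version number starting at 1. The names A, B and trend are hypothetical.</p>

```python
import math

A = -0.005386747003005  # assumption: negative coefficient of ln(x)
B = 0.017517342986504   # value of the trend at x = 1, since ln(1) = 0


def trend(version):
    """Logarithmic trend f(x) = A * ln(x) + B, where x is assumed to be
    the ONOS version number (1, 2, 3, ...)."""
    return A * math.log(version) + B


for version in range(1, 5):
    rate = trend(version)
    decrease = (1 - rate / trend(1)) * 100  # decrease vs. version 1 in percent
    print(version, round(rate, 6), round(decrease, 1))
```

        <p>With the assumed sign, the trend decreases monotonically, so every later version has a lower calculated rate than version 1.</p>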
        <p>Fig. 11. Our model in GSPN and GreatSPN</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <article-title>1. Assessing the maturity of SDN controllers with reliability growth models</article-title>
          (
          <year>June 2019</year>
          ), https://wiki.onosproject.org/download/attachments/12422167/tumonfworkshop.pdf?version=1&amp;modificationDate=1561629548175&amp;api=v2
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Avizienis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laprie</surname>
            ,
            <given-names>J..</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Randell</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Landwehr</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Basic concepts and taxonomy of dependable and secure computing</article-title>
          .
          <source>IEEE Transactions on Dependable and Secure Computing</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>11</fpage>
          -
          <lpage>33</lpage>
          (
          <year>2004</year>
          ). https://doi.org/10.1109/TDSC.2004.2
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ballarini</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernardi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donatelli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Validation and evaluation of a software solution for fault tolerant distributed synchronization</article-title>
          .
          <source>In: Proceedings International Conference on Dependable Systems and Networks</source>
          . pp.
          <fpage>773</fpage>
          -
          <lpage>782</lpage>
          (
          <year>2002</year>
          ). https://doi.org/10.1109/DSN.2002.1029023
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Berde</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerola</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hart</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Higuchi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobayashi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koide</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lantz</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radoslavov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snow</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parulkar</surname>
          </string-name>
          , G.:
          <article-title>ONOS: Towards an open, distributed SDN OS</article-title>
          .
          <source>In: Proceedings of the Third Workshop on Hot Topics in Software Defined Networking</source>
          . p.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . HotSDN '14, Association for Computing Machinery, New York, NY, USA (
          <year>2014</year>
          ). https://doi.org/10.1145/2620728.2620744
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Brosch</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buhnova</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koziolek</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reussner</surname>
          </string-name>
          , R.:
          <article-title>Reliability prediction for fault-tolerant software architectures</article-title>
          . pp.
          <fpage>75</fpage>
          -
          <lpage>84</lpage>
          . QoSA-ISARCS '11, Association for Computing Machinery, New York, NY, USA (
          <year>2011</year>
          ). https://doi.org/10.1145/2000259.2000274
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Diallo</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sene</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarr</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Freshness-aware metadata management: Performance evaluation with SWN models</article-title>
          .
          <source>In: ACS/IEEE International Conference on Computer Systems and Applications - AICCSA 2010</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          (
          <year>2010</year>
          ). https://doi.org/10.1109/AICCSA.2010.5586954
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siewiorek</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          :
          <article-title>High-availability computer systems</article-title>
          .
          <source>Computer</source>
          <volume>24</volume>
          (
          <issue>9</issue>
          ),
          <fpage>39</fpage>
          -
          <lpage>48</lpage>
          (
          <year>1991</year>
          ). https://doi.org/10.1109/2.84898
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kleppmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems</article-title>
          . O'Reilly Media
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kobo</surname>
            ,
            <given-names>H.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abu-Mahfouz</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hancke</surname>
            ,
            <given-names>G.P.</given-names>
          </string-name>
          :
          <article-title>Efficient controller placement and reelection mechanism in distributed control system for software defined wireless sensor networks</article-title>
          .
          <source>Transactions on Emerging Telecommunications Technologies</source>
          <volume>30</volume>
          (
          <issue>6</issue>
          ),
          e3588
          (
          <year>2019</year>
          ). https://doi.org/10.1002/ett.3588, https://onlinelibrary.wiley.com/doi/abs/10.1002/ett.3588
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kriaa</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papillon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jagadeesan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendiratta</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Better safe than sorry: Modeling reliability and security in replicated SDN controllers</article-title>
          .
          <source>In: 2020 16th International Conference on the Design of Reliable Communication Networks DRCN 2020</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          (
          <year>2020</year>
          ). https://doi.org/10.1109/DRCN48652.2020.1570604424
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Performance modelling for the CSMA/CD protocol using GSPN</article-title>
          .
          <source>In: Proceedings of IEEE Singapore International Conference on Networks and International Conference on Information Engineering '95</source>
          . pp.
          <fpage>126</fpage>
          -
          <lpage>130</lpage>
          (
          <year>1995</year>
          ). https://doi.org/10.1109/SICON.1995.526031
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>López-Grao</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Merseguer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campos</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>On the use of formal models in software performance evaluation</article-title>
          .
          <source>Actas de las X Jornadas de Concurrencia</source>
          . pp.
          <fpage>367</fpage>
          -
          <lpage>387</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Mendiratta</surname>
            ,
            <given-names>V.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jagadeesan</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanmer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahman</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          :
          <article-title>How reliable is my software-defined network? Models and failure impacts</article-title>
          .
          <source>In: 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW)</source>
          . pp.
          <fpage>83</fpage>
          -
          <lpage>88</lpage>
          (
          <year>2018</year>
          ). https://doi.org/10.1109/ISSREW.2018.00-26
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Nencioni</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Helvik</surname>
            ,
            <given-names>B.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heegaard</surname>
            ,
            <given-names>P.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamisinski</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Availability modelling of software-defined backbone networks</article-title>
          .
          <source>In: 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W)</source>
          . pp.
          <fpage>105</fpage>
          -
          <lpage>112</lpage>
          (
          <year>2016</year>
          ). https://doi.org/10.1109/DSN-W.2016.28
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>T.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eom</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>An</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          :
          <article-title>Availability modeling and analysis for software defined networks</article-title>
          .
          <source>In: 2015 IEEE 21st Pacific Rim International Symposium on Dependable Computing (PRDC)</source>
          . pp.
          <fpage>159</fpage>
          -
          <lpage>168</lpage>
          (
          <year>2015</year>
          ). https://doi.org/10.1109/PRDC.2015.27
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16. ONOS Confluence (October
          <year>2020</year>
          ), https://wiki.onosproject.org/display/ONOS/ONOS
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <article-title>ONOS security and performance analysis</article-title>
          (December
          <year>2017</year>
          ), https://opennetworking.org/wp-content/uploads/2017/07/ONOS-security-and-performance-analysis-brigade-report-no1.pdf
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <article-title>ONOS security and performance analysis (report no. 2)</article-title>
          (November
          <year>2018</year>
          ), https://opennetworking.org/wp-content/uploads/2018/11/secperf_report_2.pdf
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <article-title>Security and performance comparison of ONOS and ODL controllers</article-title>
          (September
          <year>2019</year>
          ), https://opennetworking.org/wp-content/uploads/2019/09/ONOSvsODL-report-4.pdf
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Owre</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rushby</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shankar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>von Henke</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Formal verification for fault-tolerant architectures: prolegomena to the design of PVS</article-title>
          .
          <source>IEEE Transactions on Software Engineering</source>
          <volume>21</volume>
          (
          <issue>2</issue>
          ),
          <fpage>107</fpage>
          -
          <lpage>125</lpage>
          (
          <year>1995</year>
          ). https://doi.org/10.1109/32.345827
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Sakic</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kellerer</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Response time and availability study of Raft consensus in distributed SDN control plane</article-title>
          .
          <source>IEEE Transactions on Network and Service Management</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ),
          <fpage>304</fpage>
          -
          <lpage>318</lpage>
          (
          <year>2018</year>
          ). https://doi.org/10.1109/TNSM.2017.2775061
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Sanghare</surname>
            ,
            <given-names>O.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sene</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodrigues</surname>
            ,
            <given-names>J.J.P.C.</given-names>
          </string-name>
          :
          <article-title>Distributed transactions on mobile systems: Performance evaluation using SWN</article-title>
          .
          <source>In: 2011 IEEE International Conference on Communications (ICC)</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          (
          <year>2011</year>
          ). https://doi.org/10.1109/icc.2011.5963020
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Scott-Hayward</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Design and deployment of secure, robust, and resilient SDN controllers</article-title>
          .
          <source>In: 2015 1st IEEE Conference on Network Softwarization (NetSoft)</source>
          . Institute of Electrical and Electronics Engineers (IEEE) (Apr
          <year>2015</year>
          ). https://doi.org/10.1109/NETSOFT.2015.7258233, conference date: 13-17 April 2015
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Tanenbaum</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Steen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Verteilte Systeme - Prinzipien und Paradigmen</article-title>
          . Pearson Studium
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Vizarreta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trivedi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendiratta</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kellerer</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mas-Machuca</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>DASON: Dependability assessment framework for imperfect distributed SDN implementations</article-title>
          .
          <source>IEEE Transactions on Network and Service Management</source>
          <volume>17</volume>
          (
          <issue>2</issue>
          ),
          <fpage>652</fpage>
          -
          <lpage>667</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>