=Paper= {{Paper |id=Vol-3145/short05 |storemode=property |title=Strategy Switching: Smart Fault-tolerance for Resource-constrained Real-time Applications |pdfUrl=https://ceur-ws.org/Vol-3145/paper05.pdf |volume=Vol-3145 |authors=Lukas Miedema,Benjamin Rouxel,Clemens Grelck |dblpUrl=https://dblp.org/rec/conf/cerciras/MiedemaRG21 }} ==Strategy Switching: Smart Fault-tolerance for Resource-constrained Real-time Applications== https://ceur-ws.org/Vol-3145/paper05.pdf
Strategy Switching: Smart Fault-tolerance for
Resource-constrained Real-time Applications
Lukas Miedema¹, Benjamin Rouxel¹,² and Clemens Grelck¹
¹ University of Amsterdam (UvA), Amsterdam, Netherlands
² University of Modena and Reggio Emilia (Unimore), Modena, Italy


Abstract
Software-based fault-tolerance is an attractive alternative to hardware-based fault-tolerance, as it allows
for the use of cheap Commercial Off The Shelf hardware. However, software-based fault-tolerance
comes at a cost, requiring computing the same results multiple times to allow for the detection and
mitigation of faults. Resource-constrained real-time applications may not be able to afford this cost. At
the same time, the domain of a real-time task may allow it to tolerate a fault, provided it does not occur
in consecutive iterations of the task. In this paper, we introduce a new way to deploy fault-tolerance
called strategy switching. Our method targets Single Event Upsets by running different subsets of tasks
under fault-tolerance at different points in time. We do not bound the number of faults in a window,
nor does our method assume that tasks under fault-tolerance cannot still fail. Our technique does not
require a minimal amount of additional compute resources for fault-tolerance. Instead, our method
optimally utilizes any available compute resources for fault-tolerance for resource-constrained real-time
applications.

Keywords
Cyber-physical Systems, Resource Constraints, Fault-tolerance, Single Event Upsets, Weakly Hard Real-time




1. Introduction
As transistor density increases and gate voltages decrease, the frequency of transient faults or
single event upsets (SEUs) increases [1]. Hence, the need for fault-tolerance against these types
of faults is growing.
   Fault-tolerance techniques can either be implemented in hardware or in software. Software-
based fault-tolerance is attractive due to its ability to protect workloads on Commercial Off
The Shelf (COTS) hardware. However, providing general-purpose fault-tolerance against SEUs
typically requires redundant execution, often in the form of Triple Modular Redundancy [2]
(TMR). TMR uses two-out-of-three voting to obtain a majority and mitigate the effects of an
SEU. TMR can be implemented at different levels of granularity, e.g. at the compiler level like
SWIFT-R [3], but also at the OS task level as implemented in OS Task Level Redundancy [4].
However, the overhead remains: instrumenting a binary with SWIFT-R increases its execution
time by 99 percent. As such, constrained real-time systems may have insufficient processing
resources to allow all tasks to run with fault-tolerance. However, for applications structured as
a set of periodic tasks, software-based fault-tolerance allows the application of fault-tolerance
to only a subset of the task set.

CERCIRAS WS01: 1st Workshop on Connecting Education and Research Communities for an Innovative Resource
Aware Society
l.miedema@uva.nl (L. Miedema); benjamin.rouxel@unimore.it (B. Rouxel); c.grelck@uva.nl (C. Grelck)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

   Control tasks may be able to tolerate non-consecutive deadline misses, which has led to the
adoption of the weakly hard model [5]. A task that is unable to provide a result may not result in
catastrophe, provided that in the next period it can provide a result. Per the weakly hard model,
each task 𝑖 has an (𝑚𝑖 , 𝑘𝑖 ) constraint, indicating that the task must complete at least 𝑚𝑖 times
successfully out of every 𝑘𝑖 times. 𝑘𝑖 is said to be the window size. We use this (𝑚𝑖 , 𝑘𝑖 ) constraint
with 𝑚𝑖 < 𝑘𝑖 to deliver more effective fault-tolerance to resource-constrained systems. For
example, consider a task set with just two tasks A and B, where only one task can be run under
fault-tolerance at a time. Furthermore, both task A and B can tolerate non-consecutive deadline
misses. Running task A under fault-tolerance and task B without would leave task B vulnerable
to SEUs. However, we could make better use of the scarce fault-tolerance capacity by switching
between protecting task A and task B in successive iterations of the task set. Tasks under
fault-tolerance may still fail (e.g. TMR reaches no majority), and these cases can be detected.
When the fault-tolerance technique has failed to protect task A, task A should be protected
again in the next iteration of the task set to ensure it does not experience a consecutive fault.

Contribution We propose a new approach for improving fault-tolerance for real-time ap-
plications running on resource-constrained systems by strategy switching. We minimize the
effective unmitigated fault-rate by selecting which tasks are to be run under fault-tolerance.
Our approach recognizes that the importance of protecting a task may change over time due
to earlier faults or lack thereof, and as such runs different tasks under fault-tolerance at dif-
ferent times. By exhaustively searching all patterns in which fault-tolerance can be applied,
our method optimally utilizes limited available computational resources for fault-tolerance in
resource-constrained real-time applications.

Organization In section 2 we introduce our task and fault model. Our method uses a state
machine, which is introduced in section 3. In section 4, we formalize the construction of the state
machine. Then, in section 5, we discuss how our method can lower the consecutive fault rate
for an example task set. We explain the flexibility of our method by extending it to alternative
task and fault models in section 6. Several pieces of related work exist, which are covered in
section 7. The paper is concluded in section 8, and finally in section 9 we discuss various future
directions for this technique.


2. System models
Task model We assume a set of periodic tasks Γ = {𝜏1 ...𝜏𝑛 } with a single, global period
and deadline 𝐷 such that the period is equal to or larger than the deadline (no pipelining).
Furthermore, we initially assume that each task can afford one (non-consecutive) deadline miss
(an (𝑚𝑖 , 𝑘𝑖 ) constraint of (1,2)), and that each task is equally important. In section 6, we will
discuss how some of these assumptions can be relaxed to support a wide variety of task models.

Fault model We use the Poisson distribution as an approximation for the worst-case fault
rate of SEUs, which was argued to be a good approximation by Broster et al. [6]. We do not
assume universal fault detection: only when the task runs under a fault-tolerance scheme can a
fault be detected and mitigated. When a task does not run with fault-tolerance, it is not known
whether or not it succeeded. We use the term catastrophic fault to describe an unmitigated
fault occurring in two consecutive iterations of a task that can tolerate a single unmitigated
fault, i.e. the task 𝑖 has an (𝑚𝑖 , 𝑘𝑖 ) = (1,2) constraint. Furthermore, a catastrophic fault also
occurs when a task that cannot tolerate non-consecutive faults experiences an unmitigated
fault, i.e. the task has a (1,1) constraint. We do not consider constraints beyond 𝑘𝑖 = 2 in this
paper.

Fault mitigation We assume the presence and implementation of a particular fault-tolerance
scheme, and that any task can be run under that scheme. In this paper, we assume that
SEUs always go undetected in tasks not under fault-tolerance. Finally, we assume the scheme
implements fault-detection and fault-mitigation. Our model allows the fault mitigation to fail
(e.g. due to successive SEUs during both replicas of a task under TMR), but assumes that it is
known when fault mitigation fails.

Other definitions Given the complexity and number of symbols used in this paper, a table
of all symbols and terms has been compiled in Table 1. Each symbol or term used will be defined
prior to use, as well as being listed in the table.


3. Strategy Switching State Machine
To swiftly select a new subset of the task set to run under fault-tolerance while still making
optimal decisions, we precompute, for each situation, the next best subset of tasks to run under
fault-tolerance. The result of this is the strategy state machine, which is made available to the
online component. The strategy state machine is a bipartite state machine, consisting of strategy
states and result states. An example of such a state machine is shown in Figure 1.
   The architecture of our strategy switching approach distinguishes between an online part
that runs at runtime and an offline part that executes ahead of time and is not beholden to any
real-time constraints. The offline component prepares the state machine, which is then available for
online playback.

Online We introduce a strategy switching component, which plays back the strategy state
machine, taking transitions based on observed faults as the application runs. At runtime, this
component selects a single strategy 𝑠 ahead of every execution of the task set, which becomes
active. The strategy 𝑠 dictates which tasks run under fault-tolerance (Γ𝑠 ), and which ones do not
(Γ ∖ Γ𝑠 ). Fault-tolerance techniques are typically not a silver bullet solution, and unmitigated
faults may still occur in tasks in Γ𝑠 . Furthermore, fault-tolerance techniques can often report
the fact that they failed to mitigate a fault (e.g. no consensus in triple modular redundancy).
After executing all tasks, the online component uses this information from the execution of the
task set to select the matching result 𝑟 from the state machine. This result reflects the success
or fail state, or probability thereof, of each of the tasks. Each possible result 𝑟 directly maps to
its best successor strategy, which is applied to the next iteration of the task set.

Table 1
Definitions of used symbols and terms

 Item                     Meaning
 Task model
 Γ                        Set of all tasks, Γ = {𝜏1 ...𝜏𝑛 }
 𝜏𝑖 ∈ Γ                   Task 𝑖 ∈ Γ, e.g. 𝜏𝐴 is task A
 𝐶𝑖                       Worst Case Execution Time (WCET) of task 𝑖
 𝐷                        Global deadline (shared by all tasks)
 Fault model
 𝜆                        Fault rate (Poisson)
 (𝑚𝑖 , 𝑘𝑖 )               Constraint indicating task 𝑖 has to execute successfully for at least 𝑚𝑖 iterations out of every 𝑘𝑖 iterations
 Unmitigated fault        Fault in a task not mitigated by a fault-tolerance technique
 Catastrophic fault       Unmitigated fault that leads to the (𝑚𝑖 , 𝑘𝑖 ) constraint of the task being violated
 States in the state machine
 𝒮                        Set of all strategies
 𝑠 ∈ 𝒮                    A strategy
 Γ𝑠 ⊂ Γ                   The tasks protected under strategy 𝑠
 𝑠𝐴,𝐵                     A strategy protecting task A and B, i.e. Γ𝑠𝐴,𝐵 = {𝜏𝐴 , 𝜏𝐵 }
 ℛ                        Set of all results
 𝑟 ∈ ℛ                    A result
 $r_{A,\bar{B}}$          A result where task A (𝜏𝐴 ) succeeded and task B (𝜏𝐵 ) failed
 Transitions in the state machine
 Δ                        The transition function for the strategy state machine
 Δ(𝑠)                     The set of successors of strategy 𝑠 as per the transition function Δ. Due to the bipartite nature of the state machine, this is always a set of results.
 Δ(𝑟)                     The successor of result 𝑟 as per transition function Δ. Always a single element, and due to the bipartite nature of Δ it is always a strategy.
 Scoring function
 𝛿(...)                   Scoring function (lower is better), provides steady-state catastrophic fault rate, i.e. the average probability of a catastrophic fault for any iteration of the task set
 𝛿(Δ)                     Scoring function applied to the entire state machine
 𝛿(𝑠, Δ)                  Probability of a catastrophic fault when leaving strategy 𝑠 considering transition function Δ
 𝛿(𝑟, Δ)                  Probability of a catastrophic fault when leaving result 𝑟 considering transition function Δ
 𝛿(𝑟, 𝑠)                  Probability of a catastrophic fault when transitioning from 𝑟 to 𝑠
 Probabilities
 𝑝𝑖                       Probability of no unmitigated fault in task 𝜏𝑖 when no fault-tolerance is used
 𝑞𝑖                       Probability of no unmitigated fault in task 𝜏𝑖 when fault-tolerance is used

[Figure 1: bipartite state machine with strategy states $s_A$, $s_B$ and result states $r_A$, $r_{\bar{A}}$, $r_B$, $r_{\bar{B}}$; diagram not reproduced]
Figure 1: A strategy state machine for Γ = {𝜏𝐴 , 𝜏𝐵 }

Offline The full set of strategies 𝑠 ∈ 𝒮 is computed ahead of time, as well as the transition
relation Δ from any given result 𝑟 ∈ ℛ to the next best successor strategy Δ(𝑟) = 𝑠. Strategies
which are not schedulable are not included in 𝒮. Furthermore, strategies which are dominated
by other strategies (i.e. all tasks that are protected by one strategy are also protected by another)
are also not considered in 𝒮.
   We develop an algorithm to compute Δ, such that our choice of Δ provides the lowest
steady-state rate of catastrophic faults.

Example State Machine An example of such a state machine is shown in Figure 1, where
there are two strategies 𝒮 = {𝑠𝐴 , 𝑠𝐵 }. This state machine is not the optimal state machine for
this task set, but gives a non-trivial example of what such a state machine may look like. Such
a state machine is constructed by the offline component, and made available to the runtime.
Each strategy protects only one task (either task 𝐴 or task 𝐵). Finally, fault-tolerance may fail
and if it fails this information is available to the strategy switching component. With these
assumptions, each strategy has two potential successors: one where the task under protection
succeeds and one where it fails (e.g. $r_A$ and $r_{\bar{A}}$ for strategy 𝑠𝐴 ). Each result state 𝑟 has just one
successor strategy Δ(𝑟), e.g. Δ(𝑟𝐴 ) = 𝑠𝐵 in Figure 1. The following sequence of actions may
take place at runtime, as per Figure 1:

   1. To start, the runtime initializes by picking a random strategy, say 𝑠𝐵 , and applies fault
      tolerance accordingly. By picking 𝑠𝐵 , task 𝜏𝐵 will be executed with fault-tolerance, while
      task 𝜏𝐴 will not.

   2. The task set is executed (iteration 1).

   3. At the end of the iteration, data from the fault-tolerance applied to task 𝜏𝐵 is used to
      select a result. If 𝜏𝐵 fails, the result $r_{\bar{B}}$ is selected. However, let us assume 𝜏𝐵 succeeded,
      and select result $r_B$ accordingly. No information about the success or failure of 𝜏𝐴 is
      known.

   4. $r_B$ links to strategy 𝑠𝐴 , which is selected.

   5. By switching to 𝑠𝐴 , task 𝜏𝐴 will be executed with fault-tolerance, while task 𝜏𝐵 will not.

   6. The task set is executed (iteration 2).

   7. At the end of the iteration, data from the fault-tolerance of task 𝜏𝐴 indicates that task 𝜏𝐴
      has failed, letting us select result $r_{\bar{A}}$.

   8. $r_{\bar{A}}$ links to strategy 𝑠𝐴 , which is selected.

   *. The process continues...

  Note that there is no involved selection procedure for the initial strategy (strategy 𝑠𝐵 in the
example). Our approach is only concerned with obtaining the lowest steady-state fault rate of
the application, which is in no way impacted by the choice of initial strategy.
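
The playback loop in the walkthrough above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `DELTA`, `run_task_set` and `playback` are hypothetical, a result is encoded as a `(protected_task, succeeded)` pair, and the successor of the result in which 𝜏𝐵 fails is not stated in the text, so that entry is an assumption.

```python
import random

# Transition function of a two-strategy machine in the style of Figure 1.
# A strategy is named after the task it protects; a result is the pair
# (protected_task, succeeded).
DELTA = {
    ("A", True): "B",   # r_A     -> s_B (stated in the text)
    ("A", False): "A",  # r_A-bar -> s_A (step 8 of the walkthrough)
    ("B", True): "A",   # r_B     -> s_A (step 4 of the walkthrough)
    ("B", False): "B",  # r_B-bar -> s_B (assumption, not given in the text)
}


def run_task_set(strategy, q_fail, rng):
    """Hypothetical stand-in for one iteration of the task set.

    Only the protected task reports an outcome; it fails with probability
    q_fail, i.e. when fault-tolerance could not mitigate a fault.
    """
    succeeded = rng.random() >= q_fail
    return (strategy, succeeded)


def playback(initial, iterations, q_fail=0.001, seed=0):
    """Play back DELTA and return the sequence of active strategies."""
    rng = random.Random(seed)
    strategy = initial
    history = []
    for _ in range(iterations):
        history.append(strategy)
        result = run_task_set(strategy, q_fail, rng)
        strategy = DELTA[result]  # a single table lookup at runtime
    return history


if __name__ == "__main__":
    print(playback("B", 6))
```

Because all decisions are precomputed offline, the online work per iteration reduces to one lookup in the transition table.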


4. Evaluating State Machines
One way to find the optimal state machine transition function Δ for a given 𝒮 is to enumerate
all possible transition functions, score each transition function, and select the best one. To do
so, we define the state machine scoring function 𝛿(Δ). The output of this scoring function
is the steady-state rate of catastrophic faults. Specifically, 𝛿(Δ) is the weighted probability
of a catastrophic fault occurring during any iteration of the task set when applying strategy
switching according to the transition function Δ.
   To evaluate 𝛿(Δ), we compute the steady-state probability distribution of the transition
function Δ (e.g. using linear algebra). The steady-state probability distribution provides a
probability 𝑃 (𝑠|Δ) of finding the state machine in strategy 𝑠 at an arbitrarily chosen iteration
of the task set given Δ. Intuitively, as the number of iterations of the task set approaches infinity,
𝑃 (𝑠|Δ) is the fraction of iterations spent with strategy 𝑠 active. The sum of these fractions
over all strategies is 1, i.e. $\sum_{s \in \mathcal{S}} P(s|\Delta) = 1$.
   Let 𝛿(𝑠, Δ) be the probability of a catastrophic fault occurring in strategy 𝑠. Together with
𝑃 (𝑠|Δ), we can now compute 𝛿(Δ):

$$\delta(\Delta) = \sum_{s \in \mathcal{S}} P(s|\Delta) \cdot \delta(s, \Delta)$$

  Catastrophic faults, by their definition, only occur when an unmitigated fault occurs in two
consecutive iterations of the task set. To compute 𝛿(𝑠, Δ), we must not just consider 𝑠, but also
the successor of 𝑠. The successor of 𝑠 depends on what result 𝑟 is being chosen, which is given
by 𝑃 (𝑟|𝑠,Δ). Each 𝑟 has only one successor strategy Δ(𝑟). Let Δ(𝑟) = 𝑠to . Thus, 𝛿(𝑠,Δ) can
be expressed in terms of 𝑟 and Δ(𝑟) = 𝑠to as follows:

$$\delta(s, \Delta) = \sum_{r \in \Delta(s)} P(r|s) \cdot \delta(r, s_{\mathrm{to}})$$

  As 𝑠to is a strategy, 𝛿(𝑟, 𝑠to ) is independent of our choice of Δ. 𝛿(𝑟, 𝑠to ) is equal to the
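probability that any task experiences a fault both in 𝑟 as well as in 𝑠to :

$$\begin{aligned}
\delta(r, s_{\mathrm{to}}) &= 1 - \prod_{\tau_i \in \Gamma} \bigl(1 - P(\text{catastrophic fault in } \tau_i \mid r, s_{\mathrm{to}})\bigr) \\
&= 1 - \prod_{\tau_i \in \Gamma} \bigl(1 - P(\text{fault in } \tau_i \mid r) \cdot P(\text{fault in } \tau_i \mid s_{\mathrm{to}})\bigr)
\end{aligned}$$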

   The probabilities 𝑃 (fault in 𝜏𝑖 |𝑟) and 𝑃 (fault in 𝜏𝑖 |𝑠to ) depend on the fault-tolerance
technique used (if 𝜏𝑖 ∈ Γ𝑠to ), the rate of faults combined with the WCET of 𝜏𝑖 , and the
information available in the result. For example, a task without fault-tolerance may have a
chance of experiencing a fault of $P(\text{fault in } \tau_i) = 1 - e^{-\lambda \cdot C_i}$, where 𝐶𝑖 is the WCET of 𝜏𝑖 .
Likewise, a task which was run under fault-tolerance according to the previous strategy, and
has succeeded according to result 𝑟, will have 𝑃 (fault in 𝜏𝑖 |𝑟) = 0. We consider all SEUs to be
statistically independent events. To preserve independence, we use the WCET instead of the average
execution time, as the execution times of tasks in the same task set may not be independent.
As such, our steady-state rate of catastrophic faults 𝛿(Δ) provides an upper bound to the true
steady-state fault rate of the application.
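
As a rough illustration of how the two sums compose, the sketch below evaluates 𝛿(𝑠, Δ) and 𝛿(Δ) from lookup tables. All names (`delta_strategy`, `delta_machine`, the dictionaries) and all numbers are hypothetical and chosen only to make the arithmetic easy to follow; the 𝛿(𝑟, 𝑠to ) values are taken as given here rather than derived from a fault model.

```python
def delta_strategy(s, results_of, p_result, successor, delta_rs):
    """delta(s, Delta): sum over r in Delta(s) of P(r|s) * delta(r, Delta(r))."""
    return sum(
        p_result[(r, s)] * delta_rs[(r, successor[r])]
        for r in results_of[s]
    )


def delta_machine(steady_state, results_of, p_result, successor, delta_rs):
    """delta(Delta): sum over s of P(s|Delta) * delta(s, Delta)."""
    return sum(
        steady_state[s]
        * delta_strategy(s, results_of, p_result, successor, delta_rs)
        for s in steady_state
    )


# Made-up two-strategy machine: each strategy has an "ok" and a "bad" result.
results_of = {"s1": ["ok1", "bad1"], "s2": ["ok2", "bad2"]}
p_result = {("ok1", "s1"): 0.9, ("bad1", "s1"): 0.1,
            ("ok2", "s2"): 0.8, ("bad2", "s2"): 0.2}
successor = {"ok1": "s2", "bad1": "s1", "ok2": "s1", "bad2": "s2"}
delta_rs = {("ok1", "s2"): 0.01, ("bad1", "s1"): 0.05,
            ("ok2", "s1"): 0.02, ("bad2", "s2"): 0.10}
# Stationary distribution of the induced chain (s1 -> s2 with 0.9,
# s2 -> s1 with 0.8), solved by hand for this toy example.
steady_state = {"s1": 8 / 17, "s2": 9 / 17}

score = delta_machine(steady_state, results_of, p_result, successor, delta_rs)
print(score)
```

Since lower scores are better, a search over candidate transition functions would keep the candidate with the smallest `score`.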

Tractability The approach, as presented here, can easily become intractable for even small
task sets. Many state space reduction techniques can be realized in the way the state machines
are enumerated, which can considerably lower the number of elements. If a candidate state
machine Δ contains disconnected sub-graphs, then no unique steady state can be computed,
and as such this candidate can be pruned. Furthermore, the candidate state machine may
contain strategies that are ignored, i.e. there is no path to that strategy from that same strategy.
When this is the case, the steady-state probability 𝑃 (𝑠|Δ) is always 0. For such an ignored
strategy, the successor strategies of its results do not matter, and its many ways of connecting
to successor strategies need not be individually examined. Finally, the domain itself may allow
for significant state space reduction. For example, if a precedence relation is added between
tasks in the task set over which data is communicated, the failure of a preceding task may imply
that the succeeding task is destined to fail. These modifications impact the lattice of strategies,
and reduce the size of the schedulable and non-redundant set of strategies 𝒮.
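
One possible way to implement the pruning of candidates without a unique steady state is sketched below. The criterion used, that the strategy-to-strategy graph has exactly one closed communicating class, is a standard Markov chain fact and an assumption about how the check could be done; the text does not prescribe it. The function names and the `successors` encoding (each strategy mapped to the set of strategies it can move to once results are collapsed) are hypothetical.

```python
def reachable(start, successors):
    """Set of strategies reachable from start, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in successors[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def closed_classes(successors):
    """Closed communicating classes of the strategy-to-strategy graph."""
    reach = {s: reachable(s, successors) for s in successors}
    classes = set()
    for s in successors:
        # s is recurrent iff everything it can reach can also reach it back.
        if all(s in reach[t] for t in reach[s]):
            classes.add(frozenset(reach[s]))
    return classes


def has_unique_steady_state(successors):
    """True when the candidate has exactly one closed class."""
    return len(closed_classes(successors)) == 1


switching = {"sA": {"sA", "sBC"}, "sBC": {"sA"}}   # strategies alternate
split = {"sA": {"sA"}, "sBC": {"sBC"}}              # two disconnected parts
print(has_unique_steady_state(switching), has_unique_steady_state(split))
```

Strategies outside the single closed class are the ignored ones: their steady-state probability is 0, so the successors of their results need not be enumerated.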
   For completeness, we discuss the algorithmic complexity of the (naïve) state machine construc-
tion algorithm as presented in this paper. While the exact size of the state space depends on a
number of factors, such as the number of available strategies 𝒮, the upper bound is considerable.
   The time complexity of 𝛿(Δ) is 𝒪(|ℛ| · ss(|𝒮|)), where:

    • |ℛ| is the number of results

    • |𝒮| is the number of strategies

    • ss(𝑛) provides an upper bound to the computation of the steady state for 𝑛 strategies

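   To compute ss(𝑛), the steady-state matrix needs to be computed. For this computation,
the LAPACK driver routine DSEGD¹ may be used, with a complexity of $\mathcal{O}(n^3)$, resulting in
$\mathrm{ss}(n) = n^3$.
   The number of strategies is up to all combinations of tasks ($|\mathcal{S}| \in \mathcal{O}(|\Gamma|!)$). At the same time,
𝛿(Δ) is computed for every possible state machine. As each result can link to any successor
strategy, this yields up to $|\mathcal{S}|^{|\mathcal{R}|}$ state machines. Each strategy has $\le 2^{|\Gamma_s|}$ results, $2^{|\Gamma_s|} \in \mathcal{O}(2^{|\Gamma|})$,
and thus $|\mathcal{R}| \in \mathcal{O}(|\Gamma|! \cdot 2^{|\Gamma|})$. Let 𝑛 = |Γ|, i.e. 𝑛 is the number of tasks. Then, the final algorithmic
time complexity is given by:

$$\begin{aligned}
\mathcal{O}\bigl(|\mathcal{S}|^{|\mathcal{R}|} \cdot |\mathcal{R}| \cdot \mathrm{ss}(|\mathcal{S}|)\bigr)
&= \mathcal{O}\bigl((|\Gamma|!)^{|\Gamma|! \cdot 2^{|\Gamma|}} \cdot |\Gamma|! \cdot 2^{|\Gamma|} \cdot (|\Gamma|!)^3\bigr) \\
&= \mathcal{O}\bigl((n!)^{n! \cdot 2^n} \cdot n! \cdot 2^n \cdot (n!)^3\bigr) \\
&\subset \mathcal{O}\bigl((n!)^{n! \cdot 2^n} \cdot 2^{n^5}\bigr) \\
&\approx \mathcal{O}\bigl((n!)^{n! \cdot 2^n}\bigr)
\end{aligned}$$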

  It must be noted that this is by no means a tight upper bound. In the next section, we will
examine a task set with |Γ| = 𝑛 = 3. This example requires examination of only 64 candidate
state machines, even though such an 𝑛 value would appear to be completely intractable as per
the above formulation.

¹ http://www.netlib.org/lapack/lug/node71.html


5. Example
Let us consider an example task set Γ = {𝜏𝐴 , 𝜏𝐵 , 𝜏𝐶 }, with WCET values 𝐶𝐴 = 20, 𝐶𝐵 = 10
and 𝐶𝐶 = 10. We set $\lambda = 10^{-3}$ for this example. For brevity, we abbreviate the probability of
no fault in task 𝜏𝑖 when not using fault-tolerance as 𝑝𝑖 :

$$P(\text{no fault in } \tau_i \mid \text{no FT}) = e^{-\lambda \cdot C_i} = e^{-C_i / 10^3} = p_i$$

   As the fault-tolerance technique, we use Triple Modular Redundancy (TMR). We require a majority
for TMR to succeed, and assume there is no other way TMR can fail (e.g. assume no unmitigated
faults during voting). A majority for TMR requires two or more copies of the task to be in
agreement. Like 𝑝𝑖 , we abbreviate the probability of no fault in task 𝜏𝑖 when using fault-tolerance
as 𝑞𝑖 :

$$P(\text{no fault in } \tau_i \mid \text{TMR}) = p_i^3 + \binom{3}{2}\, p_i^2 \cdot (1 - p_i) = q_i$$

   With the WCET for each task available, we can then compute the 𝑝𝑖 and 𝑞𝑖 values for all
tasks:

$$\begin{aligned}
\tau_A &: \; p_A = e^{-20\lambda} = 0.9802, & q_A &= p_A^3 + 3(1 - p_A)\,p_A^2 = 0.9988 \\
\tau_B &: \; p_B = e^{-10\lambda} = 0.9901, & q_B &= p_B^3 + 3(1 - p_B)\,p_B^2 = 0.9997 \\
\tau_C &: \; p_C = e^{-10\lambda} = 0.9901, & q_C &= p_C^3 + 3(1 - p_C)\,p_C^2 = 0.9997
\end{aligned}$$
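
The 𝑝𝑖 and 𝑞𝑖 values can be recomputed with a few lines of Python. This is a minimal sketch; the names `LAMBDA`, `WCET`, `p_no_fault` and `q_no_fault_tmr` are illustrative and not part of the paper. Values are computed without intermediate rounding, so the last printed digit may differ from the figures quoted above.

```python
import math

LAMBDA = 1e-3                       # fault rate used in the example
WCET = {"A": 20, "B": 10, "C": 10}  # WCET values C_A, C_B, C_C


def p_no_fault(c, lam=LAMBDA):
    """Probability of no fault during c time units without fault-tolerance."""
    return math.exp(-lam * c)


def q_no_fault_tmr(p):
    """Probability that at least two of the three TMR replicas are fault-free."""
    return p**3 + 3 * p**2 * (1 - p)


p = {task: p_no_fault(c) for task, c in WCET.items()}
q = {task: q_no_fault_tmr(p[task]) for task in WCET}

for task in WCET:
    print(f"tau_{task}: p = {p[task]:.4f}, q = {q[task]:.4f}")
```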

Enumerating strategies For our chosen task set Γ, there are eight possible subsets and as
such eight possible strategies:

                  𝑠∅ :            Γ𝑠∅ = {}           Γ ∖ Γ𝑠∅ = {𝜏𝐴 , 𝜏𝐵 , 𝜏𝐶 }
                  𝑠𝐴 :          Γ𝑠𝐴 = {𝜏𝐴 }           Γ ∖ Γ𝑠𝐴 = {𝜏𝐵 , 𝜏𝐶 }
                  𝑠𝐵 :          Γ𝑠𝐵 = {𝜏𝐵 }           Γ ∖ Γ𝑠𝐵 = {𝜏𝐴 , 𝜏𝐶 }
                  𝑠𝐶 :          Γ𝑠𝐶 = {𝜏𝐶 }           Γ ∖ Γ𝑠𝐶 = {𝜏𝐴 , 𝜏𝐵 }
                  𝑠𝐴,𝐵 :     Γ𝑠𝐴,𝐵 = {𝜏𝐴 , 𝜏𝐵 }        Γ ∖ Γ𝑠𝐴,𝐵 = {𝜏𝐶 }
                  𝑠𝐴,𝐶 :     Γ𝑠𝐴,𝐶 = {𝜏𝐴 , 𝜏𝐶 }        Γ ∖ Γ𝑠𝐴,𝐶 = {𝜏𝐵 }
                  𝑠𝐵,𝐶 :     Γ𝑠𝐵,𝐶 = {𝜏𝐵 , 𝜏𝐶 }        Γ ∖ Γ𝑠𝐵,𝐶 = {𝜏𝐴 }
                  𝑠𝐴,𝐵,𝐶 : Γ𝑠𝐴,𝐵,𝐶 = {𝜏𝐴 , 𝜏𝐵 , 𝜏𝐶 }    Γ ∖ Γ𝑠𝐴,𝐵,𝐶 = {}

  Not all of these may be schedulable. If 𝑠𝐴,𝐵,𝐶 were schedulable, there would be no reason to
use strategy switching. Likewise, if only 𝑠∅ were schedulable, there would be no way to deploy limited
fault-tolerance to this task set. For the example, we assume that only strategies 𝑠∅ , 𝑠𝐴 , 𝑠𝐵 , 𝑠𝐶
and 𝑠𝐵,𝐶 are determined to be schedulable by a scheduler. The strategies form a lattice with
𝑠∅ as the greatest lower bound, and 𝑠𝐴,𝐵,𝐶 as the least upper bound. The lattice is shown in
Figure 2. Figure 2 also reveals redundant strategies, e.g. there is no reason to choose 𝑠𝐵 when
𝑠𝐵,𝐶 is schedulable. As such, there is no need to consider them. Let 𝒮 = {𝑠𝐴 , 𝑠𝐵,𝐶 }.

[Figure 2: lattice diagram with levels 𝑠𝐴,𝐵,𝐶 ; 𝑠𝐴,𝐵 , 𝑠𝐴,𝐶 , 𝑠𝐵,𝐶 ; 𝑠𝐴 , 𝑠𝐵 , 𝑠𝐶 ; 𝑠∅ from top to bottom, with regions labelled Unschedulable, Schedulable and Redundant; diagram not reproduced]
Figure 2: Lattice of strategies for Γ = {𝜏𝐴 , 𝜏𝐵 , 𝜏𝐶 }

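
The construction of 𝒮 for this example can be sketched as follows. The set `SCHEDULABLE` simply encodes the assumption made in the text; in practice it would come from a schedulability test. The names `all_strategies` and `prune` are hypothetical.

```python
from itertools import combinations

TASKS = ("A", "B", "C")

# Assumption taken from the example: the scheduler accepts exactly these.
SCHEDULABLE = {frozenset(), frozenset("A"), frozenset("B"),
               frozenset("C"), frozenset("BC")}


def all_strategies(tasks):
    """Every subset of the task set is a candidate strategy."""
    return [frozenset(c)
            for k in range(len(tasks) + 1)
            for c in combinations(tasks, k)]


def prune(strategies, is_schedulable):
    """Keep schedulable strategies not dominated by another schedulable one."""
    ok = [s for s in strategies if is_schedulable(s)]
    # s < other is a proper-subset test: other protects everything s does.
    return [s for s in ok if not any(s < other for other in ok)]


candidates = all_strategies(TASKS)
kept = prune(candidates, lambda s: s in SCHEDULABLE)
print(sorted("".join(sorted(s)) for s in kept))
```

Under the stated schedulability assumption, only the strategies protecting {𝜏𝐴 } and {𝜏𝐵 , 𝜏𝐶 } survive, matching 𝒮 = {𝑠𝐴 , 𝑠𝐵,𝐶 }.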
   TMR can fail to reach a consensus, and then this information is available to the strategy
selection component in the form of a result. As such, in this example each strategy 𝑠 has $2^{|\Gamma_s|}$
possible results. For 𝑠𝐴 these are $r_A$ and $r_{\bar{A}}$, and for 𝑠𝐵,𝐶 these are $r_{B,C}$, $r_{\bar{B},C}$, $r_{B,\bar{C}}$ and $r_{\bar{B},\bar{C}}$. These
results are shown in Figure 3a, which shows the strategy state machine without successor
relations for each result.

Evaluating transition functions There are $|\mathcal{S}|^{|\mathcal{R}|} = 2^6 = 64$ possible transition functions
Δ, as each of the six results needs to be linked to one of the two strategies. We will not enumerate
all of them, but two examples are shown in Figure 3b and Figure 3c.
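
The enumeration of candidate transition functions can be sketched with `itertools.product`. The string labels for strategies and results are illustrative only; a leading "!" marks a task whose fault-tolerance failed.

```python
from itertools import product

STRATEGIES = ("sA", "sBC")
# Two results for sA and four for sBC; "!" marks a failed task.
RESULTS = ("A", "!A", "B,C", "!B,C", "B,!C", "!B,!C")


def transition_functions(results, strategies):
    """Yield every mapping from results to successor strategies."""
    for choice in product(strategies, repeat=len(results)):
        yield dict(zip(results, choice))


candidates = list(transition_functions(RESULTS, STRATEGIES))
print(len(candidates))  # 2**6 = 64
```

Each candidate would then be scored with 𝛿(Δ), and the one with the lowest score kept.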
   In this example, we will show how to evaluate 𝛿(Δ) for the state machine in Figure 3b. Let
this be Δ3b .

   1. Computing the steady-state probability of Δ3b . The result states are “urgent” states, in
      which no time can pass. In other words, the result states themselves are not relevant for
      the steady-state (i.e. the fraction of iterations of the task set spent in state 𝑟 ∈ ℛ is 0).
      As such, we remove these states for the steady-state computation, linking each strategy
      to multiple successor strategies. The resulting state machine is a Discrete-Time Markov
      Chain, where each time step is an iteration of the task set Γ. The steady-state can be
      computed using linear algebra. Let 𝑇 be the transition matrix for Δ3b with all results
      removed:

      $$\begin{aligned}
      T &= \begin{bmatrix} P(r_{\bar{A}} \mid s_A) & P(r_A \mid s_A) \\ P(r_{B,C} \cup r_{\bar{B},C} \cup r_{B,\bar{C}} \cup r_{\bar{B},\bar{C}} \mid s_{B,C}) & 0 \end{bmatrix} \\
      &= \begin{bmatrix} 1 - q_A & q_A \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} 0.0012 & 0.9988 \\ 1 & 0 \end{bmatrix}
      \end{aligned}$$

      We can compute the steady-state vector 𝑣 by solving 𝑣 = 𝑣𝑇 .

      $$v = vT \approx \begin{pmatrix} \tfrac{1}{2} \\ \tfrac{1}{2} \end{pmatrix}$$

      Note that a unique steady-state vector 𝑣 need not exist if there are two or more parts of
      the state machine that are disconnected. For example, all results of 𝑠𝐴 may link back to 𝑠𝐴 ,
      while all results of 𝑠𝐵,𝐶 link back to 𝑠𝐵,𝐶 . In such a case, the steady-state is dependent
      on the initial state. However, we can safely disregard state machines that depend on the
      initial condition, as other state machines with identical steady-state behavior must exist
      as well. If staying in 𝑠𝐴 provides the lowest 𝛿(Δ) value, then a state machine like the one shown
      in Figure 3c would yield the exact same 𝛿(Δ) value as a state machine where 𝑠𝐴 and 𝑠𝐵,𝐶
      are disconnected with 𝑠𝐴 as the initial state.

[Figure 3: three state machine diagrams over strategies 𝑠𝐴 and 𝑠𝐵,𝐶 and their six results; diagrams not reproduced]
(a) Partial state machine without successor relations for the results
(b) Switching state machine choosing between 𝑠𝐴 and 𝑠𝐵,𝐶 based on the result
(c) Degenerate state machine always choosing 𝑠𝐴
Figure 3: Example state machines

   2. Compute, for each result → strategy transition, the 𝛿(𝑟, 𝑠to ) catastrophic fault probability.

$$\begin{aligned}
\delta(r_A, s_{B,C}) &= 1 - (1 - (0) \cdot (1 - p_A)) \cdot (1 - (1 - p_B) \cdot (1 - q_B)) \cdot (1 - (1 - p_C) \cdot (1 - q_C)) = 5.87 \cdot 10^{-6} \\
\delta(r_{\bar{A}}, s_A) &= 1 - (1 - (1) \cdot (1 - q_A)) \cdot (1 - (1 - p_B) \cdot (1 - p_B)) \cdot (1 - (1 - p_C) \cdot (1 - p_C)) = 1358 \cdot 10^{-6} \\
\delta(r_{B,C}, s_A) &= 1 - (1 - (1 - p_A) \cdot (1 - q_A)) \cdot (1 - (0) \cdot (1 - p_B)) \cdot (1 - (0) \cdot (1 - p_C)) = 22.98 \cdot 10^{-6} \\
\delta(r_{\bar{B},C}, s_A) &= 1 - (1 - (1 - p_A) \cdot (1 - q_A)) \cdot (1 - (1) \cdot (1 - p_B)) \cdot (1 - (0) \cdot (1 - p_C)) = 9973 \cdot 10^{-6} \\
\delta(r_{B,\bar{C}}, s_A) &= 1 - (1 - (1 - p_A) \cdot (1 - q_A)) \cdot (1 - (0) \cdot (1 - p_B)) \cdot (1 - (1) \cdot (1 - p_C)) = 9973 \cdot 10^{-6} \\
\delta(r_{\bar{B},\bar{C}}, s_A) &= 1 - (1 - (1 - p_A) \cdot (1 - q_A)) \cdot (1 - (1) \cdot (1 - p_B)) \cdot (1 - (1) \cdot (1 - p_C)) = 19923 \cdot 10^{-6}
\end{aligned}$$

3. Using the calculated 𝛿(𝑟, 𝑠to ) values, compute a 𝛿(𝑠 ∈ 𝒮) value for each strategy.


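      \begin{align*}
      \delta(s_A) &= P(r_A \mid s_A) \cdot \delta(r_A, s_{B,C}) + P(r_{\overline{A}} \mid s_A) \cdot \delta(r_{\overline{A}}, s_A) \\
        &= 7.6077 \cdot 10^{-6} \\
      \delta(s_{B,C}) &= P(r_{B,C} \mid s_{B,C}) \cdot \delta(r_{B,C}, s_A) + P(r_{\overline{B},C} \mid s_{B,C}) \cdot \delta(r_{\overline{B},C}, s_A) \\
        &\quad + P(r_{B,\overline{C}} \mid s_{B,C}) \cdot \delta(r_{B,\overline{C}}, s_A) + P(r_{\overline{B},\overline{C}} \mid s_{B,C}) \cdot \delta(r_{\overline{B},\overline{C}}, s_A) \\
        &= 29.69985 \cdot 10^{-6}
      \end{align*}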

   𝛿(𝑠) is the probability that a catastrophic fault occurs when strategy 𝑠 is selected. The poor
   score of 𝑠𝐵,𝐶 is not surprising: the Δ3b state machine is not particularly clever, as it
   chooses to switch to strategy 𝑠𝐴 even when it knows that 𝜏𝐵 or 𝜏𝐶 has failed.

Table 2
Shortened table of all possible Δ transition functions for Figure 3a

Δ      $r_A$   $r_{\overline{A}}$   $r_{B,C}$   $r_{\overline{B},C}$   $r_{B,\overline{C}}$   $r_{\overline{B},\overline{C}}$   $\delta(\Delta) \cdot 10^5$
Δ1     𝑠𝐴      𝑠𝐴                   𝑠𝐴          𝑠𝐴                     𝑠𝐴                     𝑠𝐴                                19.935 ⁵
Δ2     𝑠𝐵,𝐶    𝑠𝐴                   𝑠𝐴          𝑠𝐴                     𝑠𝐴                     𝑠𝐴                                1.8142 ¹
Δ3     𝑠𝐴      𝑠𝐵,𝐶                 𝑠𝐴          𝑠𝐴                     𝑠𝐴                     𝑠𝐴                                22.055
...
Δ56    𝑠𝐵,𝐶    𝑠𝐵,𝐶                 𝑠𝐵,𝐶        𝑠𝐴                     𝑠𝐵,𝐶                   𝑠𝐵,𝐶                              39.489
Δ57    𝑠𝐴      𝑠𝐴                   𝑠𝐴          𝑠𝐵,𝐶                   𝑠𝐵,𝐶                   𝑠𝐵,𝐶                              19.935
Δ58    𝑠𝐵,𝐶    𝑠𝐴                   𝑠𝐴          𝑠𝐵,𝐶                   𝑠𝐵,𝐶                   𝑠𝐵,𝐶                              1.5406 ²
Δ59    𝑠𝐴      𝑠𝐵,𝐶                 𝑠𝐴          𝑠𝐵,𝐶                   𝑠𝐵,𝐶                   𝑠𝐵,𝐶                              22.054
Δ60    𝑠𝐵,𝐶    𝑠𝐵,𝐶                 𝑠𝐴          𝑠𝐵,𝐶                   𝑠𝐵,𝐶                   𝑠𝐵,𝐶                              2.6115
Δ61    𝑠𝐴      𝑠𝐴                   𝑠𝐵,𝐶        𝑠𝐵,𝐶                   𝑠𝐵,𝐶                   𝑠𝐵,𝐶                              n/a ³
Δ62    𝑠𝐵,𝐶    𝑠𝐴                   𝑠𝐵,𝐶        𝑠𝐵,𝐶                   𝑠𝐵,𝐶                   𝑠𝐵,𝐶                              39.226
Δ63    𝑠𝐴      𝑠𝐵,𝐶                 𝑠𝐵,𝐶        𝑠𝐵,𝐶                   𝑠𝐵,𝐶                   𝑠𝐵,𝐶                              39.226
Δ64    𝑠𝐵,𝐶    𝑠𝐵,𝐶                 𝑠𝐵,𝐶        𝑠𝐵,𝐶                   𝑠𝐵,𝐶                   𝑠𝐵,𝐶                              39.226 ⁵
Δ∅                                  n/a                                                                                         59.010 ⁴

¹ Example from Figure 3b (Δ2 = Δ3b )
² Best (lowest) steady-state fault rate
³ No unique steady-state exists for this transition function
⁴ 𝛿(Δ) of the task set without any form of fault-tolerance
⁵ Does not strategy switch, i.e. always stays in the same strategy


   4. Finally, compute 𝛿(Δ3b ) by taking the weighted average of all 𝛿(𝑠) values by multiplying
      𝛿(𝑠) for each strategy by the steady-state probability 𝑃 (𝑠|Δ3b ) of being in that strategy.


      \begin{align*}
      \delta(\Delta_{3b}) &= P(s_A) \cdot \delta(s_A) + P(s_{B,C}) \cdot \delta(s_{B,C}) \\
        &= 1.814 \cdot 10^{-5}
      \end{align*}
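
For a state machine with exactly two strategies, steps 1 and 4 can be sketched in a few lines, because the steady state of a two-state Markov chain has a closed form. This is only an illustration: the function names `steady_state_2` and `delta_of_machine` are hypothetical, and only the transition matrix is taken from the example above.

```python
def steady_state_2(T):
    """Steady-state row vector v = vT of a two-state Markov chain.

    T[i][j] is the probability of moving from strategy i to strategy j.
    Returns None when no unique steady state exists, i.e. when neither
    strategy is ever left and the two parts are disconnected.
    """
    a = T[0][1]  # probability of leaving strategy 0
    b = T[1][0]  # probability of leaving strategy 1
    if a + b == 0.0:
        return None
    return (b / (a + b), a / (a + b))


def delta_of_machine(v, delta_s):
    """Step 4: weight each per-strategy delta(s) by its steady-state probability."""
    return sum(v_s * d_s for v_s, d_s in zip(v, delta_s))


# Transition matrix of the example, rows and columns ordered (s_A, s_BC).
T_3b = [[0.0012, 0.9988],
        [1.0, 0.0]]
print(steady_state_2(T_3b))  # both entries are close to 1/2

# Disconnected machine: s_A always returns to s_A, s_BC always to s_BC.
print(steady_state_2([[1.0, 0.0], [0.0, 1.0]]))  # None
```

The closed form only covers two strategies; a larger strategy set would need a general linear solve for 𝑣 = 𝑣𝑇 together with a connectivity check.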

   This process is repeated for all 64 possible transition functions. For brevity, we do not show
the evaluation of every single transition function here. Instead, Table 2 shows a subset of all
possible Δ values. The columns headed with a result show to which successor strategy that
result maps. For example, the first evaluated state machine has Δ1 (𝑟𝐴 ) = 𝑠𝐴 . The table also
reveals the best state machine, listed as Δ58 . The last item in the table, Δ∅ , is added as a
reference: it is the state machine with a single strategy 𝑠∅ which protects no tasks, i.e. the
behavior obtained when not using any form of fault-tolerance. The best state machine is a
38.3-fold improvement over this default. That is, the best state machine offers a 38.3 times
lower rate of catastrophic faults when compared to Δ∅ . Finally, the table also shows two state
machines which perform no strategy switching, but still use fault-tolerance. These are Δ1 and
Δ64 , staying in 𝑠𝐴 and 𝑠𝐵,𝐶 respectively. The strategy switching solution Δ58 offers a 12.9
times lower rate of catastrophic faults when compared to the best of these static solutions.
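
The exhaustive search over all 64 transition functions can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the probabilities in `P_OK` and `Q_OK` are made-up values (not the parameters used for Table 2), all function names are hypothetical, and the fault probabilities given a result or a strategy follow the reading of the worked example above (a protected task that succeeded has fault probability 0, one that failed has 1, and an unprotected task has 1 − 𝑝).

```python
from itertools import product

# Hypothetical probabilities, for illustration only.
# P_OK[t]: task t produces a correct result when unprotected.
# Q_OK[t]: task t produces a correct result when protected.
P_OK = {"A": 0.98, "B": 0.99, "C": 0.99}
Q_OK = {"A": 0.999, "B": 0.9997, "C": 0.9997}
TASKS = ("A", "B", "C")
STRATEGIES = (("A",), ("B", "C"))  # index 0 is s_A, index 1 is s_BC

# A result is (strategy index, one "failed?" flag per protected task).
RESULTS = [(s, failed) for s, prot in enumerate(STRATEGIES)
           for failed in product((False, True), repeat=len(prot))]


def p_result(result):
    """P(r | s): product over the protected tasks of q or 1 - q."""
    s, failed = result
    prob = 1.0
    for task, f in zip(STRATEGIES[s], failed):
        prob *= (1.0 - Q_OK[task]) if f else Q_OK[task]
    return prob


def p_fault_given_result(task, result):
    s, failed = result
    if task in STRATEGIES[s]:  # protected: the outcome is known
        return 1.0 if failed[STRATEGIES[s].index(task)] else 0.0
    return 1.0 - P_OK[task]    # unprotected: outcome unknown


def p_fault_given_strategy(task, s):
    return 1.0 - (Q_OK[task] if task in STRATEGIES[s] else P_OK[task])


def delta_transition(result, s_to):
    """delta(r, s_to): some task faults in two consecutive iterations."""
    ok = 1.0
    for task in TASKS:
        ok *= 1.0 - p_fault_given_result(task, result) * p_fault_given_strategy(task, s_to)
    return 1.0 - ok


def delta_machine(mapping):
    """delta(Delta) for one result -> strategy mapping; None if no unique steady state."""
    T = [[0.0, 0.0], [0.0, 0.0]]
    delta_s = [0.0, 0.0]
    for result, s_to in zip(RESULTS, mapping):
        s_from = result[0]
        T[s_from][s_to] += p_result(result)
        delta_s[s_from] += p_result(result) * delta_transition(result, s_to)
    a, b = T[0][1], T[1][0]
    if a + b == 0.0:
        return None
    return (b * delta_s[0] + a * delta_s[1]) / (a + b)


scores = {m: delta_machine(m) for m in product((0, 1), repeat=len(RESULTS))}
valid = {m: d for m, d in scores.items() if d is not None}
best = min(valid, key=valid.get)
print(len(scores), "machines,", len(valid), "with a unique steady state")
print("best mapping:", best, "delta:", valid[best])
```

Exactly one mapping has no unique steady state, the one in which both strategies only ever return to themselves, which corresponds to the n/a entry in Table 2.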

Figure 4: Task set Γ = {𝜏𝐴 , 𝜏𝐵 } with a precedence relation between 𝜏𝐴 and 𝜏𝐵


6. Extending the fault and task model
6.1. Adding precedence relations
Directed Acyclic Graph scheduling is an extension to our task model in which precedence relations
exist between tasks. This has no direct impact on the strategy switching algorithm, which
need not concern itself with scheduling, but it may make the success probability of a task
dependent on other tasks. For this extension, we assume that when a predecessor task fails
(produces incorrect data), all successor tasks fail as well because they operate on incorrect
data, even when they do not experience an SEU themselves.
   To support this assumption, the 𝛿(𝑟, 𝑠to ) cost function needs to be modified to consider
precedence relations. Specifically, 𝑃 (fault in 𝜏𝑖 |𝑟) and 𝑃 (fault in 𝜏𝑖 |𝑠to ) must be replaced
by a probability that considers more than whether 𝜏𝑖 failed in isolation. Let us call this
precedence-aware probability the probability of incorrect output.

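\[
P(\text{incorrect output } \tau_i \mid x) = 1 - \bigl(1 - P(\text{fault in } \tau_i \mid x)\bigr) \cdot \prod_{\tau_j \in \mathrm{pred}(\tau_i)} \bigl(1 - P(\text{fault in } \tau_j \mid x)\bigr)
\]

   Here, 𝑥 ∈ 𝒮 ∪ ℛ is either a strategy or a result, and pred (𝜏𝑖 ) is the set of all direct and
indirect predecessors of task 𝜏𝑖 . The output of 𝜏𝑖 is correct only if neither 𝜏𝑖 nor any of its
predecessors experiences a fault, so this probability also includes the behavior of all
predecessor tasks.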
   Figure 4 shows an example task set with a precedence relation. Consider the task set in
this example with two strategies: 𝒮 = {𝑠𝐴 , 𝑠𝐵 }, protecting task 𝜏𝐴 and 𝜏𝐵 respectively. Each
strategy has two outcomes, hence $\mathcal{R} = \{r_A, r_{\overline{A}}, r_B, r_{\overline{B}}\}$.
Let us consider $P(\text{fault in } \tau_B \mid r_{\overline{A}})$: the result $r_{\overline{A}}$ (𝜏𝐴 failed)
does not directly communicate anything about the state of task 𝜏𝐵 ; however, due to the
precedence relation we know it effectively failed. As such,
$P(\text{incorrect output } \tau_B \mid r_{\overline{A}}) = 1$.
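
The Figure 4 example can be checked with a short sketch. It assumes, beyond what the text states explicitly, that faults in different tasks are independent given 𝑥, so that the output of a task is correct only when the task itself and all of its predecessors are fault-free. The function name, the dictionary layout and the value 0.01 are hypothetical.

```python
def p_incorrect_output(task, p_fault, preds):
    """Probability that `task` produces incorrect output.

    p_fault: dict mapping each task to P(fault in task | x)
    preds:   dict mapping each task to the set of all its direct and
             indirect predecessors
    Assumes independent faults given x (an assumption of this sketch).
    """
    p_all_ok = 1.0 - p_fault[task]
    for pred in preds[task]:
        p_all_ok *= 1.0 - p_fault[pred]
    return 1.0 - p_all_ok


# Figure 4: tau_A precedes tau_B.
preds = {"A": set(), "B": {"A"}}

# x is the result "tau_A failed": the fault in A is certain.
# The value 0.01 for the unprotected tau_B is made up.
p_fault_given_a_failed = {"A": 1.0, "B": 0.01}
print(p_incorrect_output("B", p_fault_given_a_failed, preds))  # 1.0
```

When the predecessor is known to have succeeded, the same function falls back to the isolated fault probability of 𝜏𝐵.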

6.2. Selective fault-tolerance and criticality
The task model can be extended to support heterogeneity in the ability of tasks to tolerate
non-consecutive faults. Let Γ𝑁 be the set of tasks which cannot tolerate non-consecutive
faults. Then, we update 𝑃 (catastrophic fault in 𝜏𝑖 |𝑟,𝑠to ) to consider non-consecutive faults as
catastrophic faults when 𝜏𝑖 ∈ Γ𝑁 .

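\[
P(\text{catastrophic fault in } \tau_i \mid r, s_{to}) =
\begin{cases}
  P(\text{fault in } \tau_i \mid r) \cdot P(\text{fault in } \tau_i \mid s_{to}) & \text{if } \tau_i \notin \Gamma_N \\
  P(\text{fault in } \tau_i \mid r) & \text{otherwise}
\end{cases}
\]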
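
As a minimal sketch, the case distinction translates directly into code. The function name and the numeric values in the usage lines are hypothetical.

```python
def p_catastrophic(task, p_fault_r, p_fault_s_to, gamma_n):
    """P(catastrophic fault in task | r, s_to).

    p_fault_r:    P(fault in task | r), for the iteration just observed
    p_fault_s_to: P(fault in task | s_to), for the successor strategy
    gamma_n:      set of tasks that cannot tolerate non-consecutive faults
    """
    if task in gamma_n:
        # A single fault is already catastrophic for this task.
        return p_fault_r
    # Otherwise only two consecutive faults are catastrophic.
    return p_fault_r * p_fault_s_to


gamma_n = {"A"}  # made-up example: only tau_A cannot tolerate any fault
print(p_catastrophic("A", 0.01, 0.001, gamma_n))  # 0.01
print(p_catastrophic("B", 0.01, 0.001, gamma_n))  # about 1e-05
```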

 The goal of the strategy state machine is to minimize the rate of catastrophic faults. This
rate may become so low that even a single catastrophic fault is unlikely to occur across the
lifetime of the system. However, in soft real-time deployments, a “catastrophic fault” may be
undesirable without being truly catastrophic. In such a deployment, catastrophic faults may be
acceptable, and the impact of a task experiencing such a fault may not be the same across all
tasks. As such, a priority function 𝑝(𝜏𝑖 ∈ Γ) can be integrated into 𝛿(𝑟, 𝑠to ).


\[
\delta(r, s_{to}) = 1 - \prod_{\tau_i \in \Gamma} \bigl(1 - P(\text{catastrophic fault in } \tau_i \mid r, s_{to})\bigr)^{p(\tau_i)}
\]

  The 𝑝(𝜏𝑖 ) function assigns a relative priority, where a higher value indicates a more important
task. A task 𝜏𝑖 with 𝑝(𝜏𝑖 ) = 2 is twice as important as a task 𝜏𝑗 with 𝑝(𝜏𝑗 ) = 1. Note that this
change affects the meaning of the output 𝛿(Δ), which is no longer the rate of catastrophic faults.
If the impact of a catastrophic fault in a task with 𝑝(𝜏𝑖 ) = 2 is equivalent to two catastrophic
faults in another task with 𝑝(𝜏𝑗 ) = 1, then the final rate 𝛿(Δ) can be seen as the rate of
catastrophic faults normalized to faults in 𝜏𝑗 .
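
A first-order expansion makes this normalization explicit. The derivation below assumes that the priority enters 𝛿(𝑟, 𝑠to ) as an exponent on each task's factor and that every per-task probability is small; the shorthand $P_i$ is introduced here only for brevity.

```latex
% Shorthand: P_i = P(catastrophic fault in tau_i | r, s_to), with P_i << 1.
\begin{align*}
(1 - P_i)^{p(\tau_i)} &\approx 1 - p(\tau_i) \, P_i \\
\delta(r, s_{to}) = 1 - \prod_{\tau_i \in \Gamma} (1 - P_i)^{p(\tau_i)}
  &\approx \sum_{\tau_i \in \Gamma} p(\tau_i) \, P_i
\end{align*}
```

Under this approximation a task with priority 2 contributes twice its catastrophic fault probability to the cost, which matches the reading that one catastrophic fault in that task counts as two in a task with priority 1.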


7. Related work
(𝑚𝑖 , 𝑘𝑖 ) constraints have been used before in the domain of real-time scheduling. Choi et al. [7]
proposed a scheduler, together with an efficient schedulability algorithm, for a sporadic task
set with tasks under (𝑚𝑖 , 𝑘𝑖 ) constraints. By exploiting these (𝑚𝑖 , 𝑘𝑖 ) constraints, the
scheduler can schedule task sets that would otherwise not be schedulable.
   Chen et al. [8] proposed a solution that is similar to ours. Their method offers fault-tolerance
with the goal of reducing the effective fault rate as well as lowering energy consumption.
Chen et al. propose a static scheduling technique called Static Pattern-Based Reliable Execution,
ensuring each (𝑚𝑖 , 𝑘𝑖 ) constraint is respected in the presence of transient faults. Furthermore,
they propose delaying the execution of their static pattern if no fault is detected at runtime,
opportunistically running more unprotected instances of the task with the goal of saving energy.
However, if the static pattern is found to be unschedulable as per their schedulability test,
their implementation is unable to provide a schedule that minimizes the fault rate for a given
resource-constrained real-time system. While their approach offers more flexibility in the
task model (specifically the support for (𝑚𝑖 , 𝑘𝑖 ) constraints with 𝑘𝑖 > 2), it does not consider
that fault mitigation may fail. Our approach optimally lowers the fault rate, regardless of the
hardware constraints. Furthermore, our approach recognizes that fault mitigation may fail, and
includes this in the calculation for lowering the fault rate.
   Gujarati et al. [9] contributed a technique for measuring the fault rate of an application
with tasks under (𝑚𝑖 , 𝑘𝑖 ) constraints. Their technique provides an upper bound for the fault
probability per iteration of a Fault-tolerant Single-Input Single-Output (FT-SISO) control loop,
similar to our 𝛿(Δ) output on a task set with precedence relations. Their technique aims
to provide transparency to system designers by allowing them to analyze the impact on
reliability of changes to the hardware or software. However, while their approach is aware of
(𝑚𝑖 , 𝑘𝑖 ) constraints, it does not provide schedules that utilize such constraints. Instead, it
merely includes them in the reliability calculation.
   The domain of strategy switching shares some aspects with Mixed-Criticality (MC) systems.
In an MC system, the system switches between different levels of criticality depending on the
operating conditions of the system. Tasks are assigned a criticality level, and when the system
criticality is higher than that of a task, that task is not scheduled, in order to guarantee the
successful and timely execution of tasks with a higher criticality level. Pathan [10] combines MC with
fault-tolerance against transient faults. As is typical in MC research, as the level of criticality
increases, the pessimism increases. Pathan increases the maximum fault rate when switching
to a higher level of criticality. In our approach we do not vary the pessimism of any parameter.
Instead, we assume the 𝜆 parameter provides a suitable upper bound to the fault rate in all
conditions. Our approach offers some aspects typically not found in MC systems: while one
could argue that each strategy is really a criticality level, it is a criticality level applied to a
subset of the tasks (specifically Γ𝑠 ). Finally, the approach by Pathan requires bounding the
number of faults that can occur in any window. As such, passing their sufficient schedulability
test will (under their fault model) guarantee the system will never experience a fault.


8. Conclusion
In this paper, we have shown how strategy switching can be used to improve fault-tolerance
for resource-constrained systems. Our method makes effective use of the ability to vary which
tasks receive fault-tolerance. At the start of every iteration of the task set, it determines the
best set of tasks to protect. We have shown how our method computes the optimal strategy
state machine for any given task set, minimizing the rate of catastrophic faults. We have also
shown that our method is flexible enough to be extended to support new task and fault models.


9. Future work
In future work, we hope to improve the tractability of our algorithm through both state space
reduction algorithms and heuristics.
   Furthermore, we hope to extend the fault model to distinguish between deadline misses
and incorrect results. We also hope to integrate tasks with (𝑚𝑖 , 𝑘𝑖 ) constraints with 𝑘𝑖 > 2.
Additionally, we hope to integrate the natural ability of tasks to detect faults into our task model,
as an SEU may for example lead to a segfault. Finally, we hope to validate our approach using
simulation-based analysis.


Acknowledgments
The presentation of this paper was considerably improved in response to comments provided
by the anonymous reviewers, and we gratefully acknowledge their insights and assistance.
  This project has received funding from the European Union’s Horizon 2020 research and
innovation program under grant agreement No. 871259 (ADMORPH project). Additionally, this
work is partially supported by CERCIRAS COST Action CA19135 funded by COST Association.


References
 [1] I. Oz, S. Arslan, A survey on multithreading alternatives for soft error fault tolerance,
     ACM Computing Surveys (2019).
 [2] R. E. Lyons, W. Vanderkulk, The use of triple-modular redundancy to improve computer
     reliability, IBM Journal of Research and Development 6 (1962) 200–209.
 [3] J. Chang, G. A. Reis, D. I. August, Automatic instruction-level software-only recovery, in:
     International Conference on Dependable Systems and Networks (DSN’06), IEEE, 2006, pp.
     83–92.
 [4] S. A. Asghari, M. Binesh Marvasti, A. M. Rahmani, Enhancing transient fault tolerance in
     embedded systems through an OS task level redundancy approach, Future Generation
     Computer Systems 87 (2018) 58–65. doi:10.1016/j.future.2018.04.049.
 [5] G. Bernat, A. Burns, A. Llamosí, Weakly hard real-time systems, IEEE Transactions on
     Computers 50 (2001) 308–321.
 [6] I. Broster, A. Burns, G. Rodriguez-Navas, Timing analysis of real-time communication
     under electromagnetic interference, Real-Time Systems 30 (2005) 55–81.
 [7] H. Choi, H. Kim, Q. Zhu, Job-class-level fixed priority scheduling of weakly-hard real-time
     systems, in: 2019 IEEE Real-Time and Embedded Technology and Applications Symposium
     (RTAS), 2019, pp. 241–253. doi:10.1109/RTAS.2019.00028.
 [8] K.-H. Chen, B. Bönninghoff, J.-J. Chen, P. Marwedel, Compensate or ignore? Meeting
     control robustness requirements through adaptive soft-error handling, in: Proceedings of
     the 17th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, Tools, and Theory
     for Embedded Systems, LCTES 2016, Association for Computing Machinery, New York,
     NY, USA, 2016, pp. 82–91. doi:10.1145/2907950.2907952.
 [9] A. Gujarati, M. Nasri, B. B. Brandenburg, Quantifying the resiliency of fail-operational
     real-time networked control systems, in: S. Altmeyer (Ed.), 30th Euromicro Conference on
     Real-Time Systems (ECRTS 2018), volume 106 of Leibniz International Proceedings in Informatics
     (LIPIcs), Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 2018, pp.
     16:1–16:24. doi:10.4230/LIPIcs.ECRTS.2018.16.
[10] R. M. Pathan, Fault-tolerant and real-time scheduling for mixed-criticality systems,
     Real-Time Systems 50 (2014) 509–547.