=Paper= {{Paper |id=Vol-3145/short05 |storemode=property |title=Strategy Switching: Smart Fault-tolerance for Resource-constrained Real-time Applications |pdfUrl=https://ceur-ws.org/Vol-3145/paper05.pdf |volume=Vol-3145 |authors=Lukas Miedema,Benjamin Rouxel,Clemens Grelck |dblpUrl=https://dblp.org/rec/conf/cerciras/MiedemaRG21 }} ==Strategy Switching: Smart Fault-tolerance for Resource-constrained Real-time Applications== https://ceur-ws.org/Vol-3145/paper05.pdf
Strategy Switching: Smart Fault-tolerance for
Resource-constrained Real-time Applications
Lukas Miedema1 , Benjamin Rouxel1,2 and Clemens Grelck1
1
    University of Amsterdam (UvA), Amsterdam, Netherlands
2
    University of Modena and Reggio Emilia (Unimore), Modena, Italy


                                         Abstract
                                         Software-based fault-tolerance is an attractive alternative to hardware-based fault-tolerance, as it allows
                                         for the use of cheap Commercial Off The Shelf hardware. However, software-based fault-tolerance
                                         comes at a cost, requiring computing the same results multiple times to allow for the detection and
                                         mitigation of faults. Resource-constrained real-time applications may not be able to afford this cost. At
                                         the same time, the domain of a real-time task may allow it to tolerate a fault, provided it does not occur
                                         in consecutive iterations of the task. In this paper, we introduce a new way to deploy fault-tolerance
                                         called strategy switching. Our method targets Single Event Upsets by running different subsets of tasks
                                         under fault-tolerance at different points in time. We do not bound the number of faults in a window,
                                         nor does our method assume that tasks under fault-tolerance cannot still fail. Our technique does not
                                         require a minimal amount of additional compute resources for fault-tolerance. Instead, our method
                                         optimally utilizes any available compute resources for fault-tolerance for resource-constrained real-time
                                         applications.

                                         Keywords
                                         Cyber-physical Systems, Resource Constraints, Fault-tolerance, Single Event Upsets, Weakly Hard Real-
                                         time




1. Introduction
As transistor density increases and gate voltages decreases, the frequency of transient faults or
single event upsets (SEUs) increases [1]. Hence, the need for fault-tolerance against these types
of faults is growing.
   Fault-tolerance techniques can either be implemented in hardware or in software. Software-
based fault-tolerance is attractive due to its ability to protect workloads on Commercial Off
The Shelf (COTS) hardware. However, providing general-purpose fault-tolerance against SEUs
typically requires redundant execution, often in the form of Triple Modular Redundancy [2]
(TMR). TMR uses two-out-of-three voting to obtain a majority and mitigate the effects of a
SEU. TMR can be implemented at different levels of granularity, e.g. at the compiler level like
SWIFT-R [3], but also at the OS task level as implemented in OS Task Level Redundancy [4].
However, the overhead remains: instrumenting a binary with SWIFT-R increases its execution
time by 99 percent. As such, constrained real-time systems may have insufficient processing

CERCIRAS WS01: 1st Workshop on Connecting Education and Research Communities for an Innovative Resource
Aware Society
  l.miedema@uva.nl (L. Miedema); benjamin.rouxel@unimore.it (B. Rouxel); c.grelck@uva.nl (C. Grelck)
                                       ยฉ 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073       CEUR Workshop Proceedings (CEUR-WS.org)
resources to allow all tasks to run with fault-tolerance. However, for applications structured as
a set of periodic tasks, software-based fault-tolerance allows the application of fault-tolerance
to only a subset of the task set.
   Control tasks may be able to tolerate non-consecutive deadline misses, which has led to the
adoption of the weakly hard model [5]. A task that is unable to provide a result may not result in
catastrophe, provided that in the next period it can provide a result. Per the weakly hard model,
each task ๐‘– has an (๐‘š๐‘– , ๐‘˜๐‘– ) constraint, indicating that the task must complete at least ๐‘š๐‘– times
successfully out of every ๐‘˜๐‘– times. ๐‘˜๐‘– is said to be the window size. We use this (๐‘š๐‘– , ๐‘˜๐‘– ) constraint
with ๐‘š๐‘– < ๐‘˜๐‘– to deliver more effective fault-tolerance to resource-constrained systems. For
example, consider a task set with just two tasks A and B, where only one task can be run under
fault-tolerance at a time. Furthermore, both task A and B can tolerate non-consecutive deadline
misses. Running task A under fault-tolerance and task B without would leave task B vulnerable
to SEUs. However, we could more optimally make use of the scarce fault-tolerance by switching
between protecting task A and task B in successive iterations of the task set. Tasks under
fault-tolerance may still fail (e.g. TMR reaches no majority), and these cases can be detected.
When the fault-tolerance technique has failed to protect task A, task A should be protected
again in the next iteration of the task set to ensure it does not experience a consecutive fault.

Contribution We propose a new approach for improving fault-tolerance for real-time ap-
plications running on resource-constrained systems by strategy switching. We minimize the
effective unmitigated fault-rate by selecting which tasks are to be run under fault-tolerance.
Our approach recognizes that the importance of protecting a task may change over time due
to earlier faults or lack thereof, and as such runs different tasks under fault-tolerance at dif-
ferent times. By exhaustively searching all patterns in which fault-tolerance can be applied,
our method optimally utilizes limited available computational resources for fault-tolerance in
resource-constrained real-time applications.

Organization In section 2 we introduce our task and fault model. Our method uses a state
machine, which is introduced in section 3. In section 4, we formalize the construction of the state
machine. Then, in section 5, we discuss how our method can lower the consecutive fault rate
for an example task set. We explain the flexibility of our method by extending it to alternative
task and fault models in section 6. Several pieces of related work exist, which are covered in
section 7. The paper is concluded in section 8, and finally in section 9 we discuss various future
directions for this technique.


2. System models
Task model We assume a set of periodic tasks ฮ“ = {๐œ1 ...๐œ๐‘› } with a single, global period
and deadline ๐ท such that the period is equal to or larger than the deadline (no pipelining).
Furthermore, we initially assume that each task can afford one (non-consecutive) deadline miss
(an (๐‘š๐‘– , ๐‘˜๐‘– ) constraint of (1,2)), and that each task is equally important. In section 6, we will
discuss how some of these assumptions can be relaxed to support a wide variety of task models.
Fault model We use the Poisson distribution as an approximation for the worst-case fault
rate of SEUs, which was argued to be a good approximation by Broster et al. [6]. We do not
assume universal fault detection: only when the task runs under a fault-tolerance scheme can a
fault be detected and mitigated. When a task does not run with fault-tolerance, it is not known
whether or not it succeeded. We use the term catastrophic fault to describe an unmitigated
fault occurring in two consecutive iterations of a task that can tolerate a single unmitigated
fault, i.e. the task ๐‘– has an (๐‘š๐‘– , ๐‘˜๐‘– ) = (1,2) constraint. Furthermore, a catastrophic fault also
occurs when a task that cannot tolerate non-consecutive faults experiences an unmitigated
fault, i.e. the task has a (1,1) constraint. We do not consider constraints beyond ๐‘˜๐‘– = 2 in this
paper.

Fault mitigation We assume the presence and implementation of a particular fault-tolerance
scheme, and that any task can be run under that scheme. In this paper, we assume that
SEUs always go undetected in tasks not under fault-tolerance. Finally, we assume the scheme
implements fault-detection and fault-mitigation. Our model allows the fault mitigation to fail
(e.g. due to successive SEUs during both replicas of a task under TMR), but assumes that it is
known when fault mitigation fails.

Other definitions Given the complexity and number of symbols used in this paper, a table
of all symbols and terms has been compiled in Table 1. Each symbol or term used will be defined
prior to use, as well as being listed in the table.


3. Strategy Switching State Machine
To both swiftly select a new subset of the task set to run under fault-tolerance while also making
optimal decisions, we precompute for each situation the next best subset of tasks to run under
fault-tolerance. The result of this is the strategy state machine, which is made available to the
online component. The strategy state machine is a bipartite state machine, consisting of strategy
states and result states. An example of such a state machine is shown in Figure 1.
   The architecture of our strategy switching approach distinguishes between an online part
at runtime, as well as an offline part executing ahead-of-time not beholden to any real-time
constraints. The offline component prepares the state machine, which is then available for
online playback.

Online We introduce a strategy switching component, which plays back the strategy state
machine, taking transitions based on observed faults as the application runs. At runtime, this
component selects a single strategy ๐‘  ahead of every execution of the task set, which becomes
active. The strategy ๐‘  dictates which tasks run under fault-tolerance (ฮ“๐‘  ), and which ones do not
(ฮ“ โˆ– ฮ“๐‘  ). Fault-tolerance techniques are typically not a silver bullet solution, and unmitigated
faults may still occur in tasks in ฮ“๐‘  . Furthermore, fault-tolerance techniques can often report
the fact that they failed to mitigate a fault (e.g. no consensus in triple modular redundancy).
After executing all tasks, the online component uses this information from the execution of the
task set to select the matching result ๐‘Ÿ from the state machine. This result reflects the success
Table 1
Definitions of used symbols and terms
 Item                     Meaning
 Task model
 ฮ“                        Set of all tasks, ฮ“ = {๐œ1 ...๐œ๐‘› }
 ๐œ๐‘– โˆˆ ฮ“                   Task ๐‘– โˆˆ ฮ“, e.g. ๐œ๐ด is task A
 ๐ถ๐‘–                       Worst Case Execution Time (WCET) of task i
 ๐ท                        Global deadline (shared by all tasks)
 Fault model
 ๐œ†                        Fault rate (Poisson)
 (๐‘š๐‘– , ๐‘˜๐‘– )               Constraint indicating task ๐‘– has to execute successfully for at least ๐‘š๐‘– iterations out of every ๐‘˜๐‘–
                          iterations
 Unmitigated fault        Fault in a task not mitigated by a fault-tolerance technique
 Catastrophic fault       Unmitigated fault that leads to the (๐‘š๐‘– , ๐‘˜๐‘– ) constraint of the task being violated
 States in the state machine
 ๐’ฎ                        Set of all strategies
 ๐‘ โˆˆ๐’ฎ                      A strategy
 ฮ“๐‘  โŠ‚ ฮ“                   The tasks protected under strategy ๐‘ 
 ๐‘ ๐ด,๐ต                     A strategy protecting task A and B, i.e. ฮ“๐‘ ๐ด,๐ต = {๐œ๐ด , ๐œ๐ต }
 โ„›                        Set of all results
 ๐‘Ÿโˆˆโ„›                      A result
 ๐‘Ÿ๐ด,๐ต                     A result where task A (๐œ๐ด ) succeeded and task B (๐œ๐ต ) failed
 Transitions in the state machine
 ฮ”                        The transition function for the strategy state machine
 ฮ”(๐‘ )                     The set of successors of strategy ๐‘  as per the transition function ฮ”. Due to the bipartite nature of
                          the state machine, this is always a set of results.
 ฮ”(๐‘Ÿ)                     The successor of result ๐‘Ÿ as per transition function ฮ”. Always a single element, and due to the
                          bipartite nature of ฮ” it is always a strategy.
 Scoring function
 ๐›ฟ(...)                   Scoring function (lower is better), provides steady-state catastrophic fault rate, i.e. the average prob-
                          ability of a catastrophic fault for any iteration of the task set
 ๐›ฟ(ฮ”)                     Scoring function applied to the entire state machine
 ๐›ฟ(๐‘ , ฮ”)                  Probability of a catastrophic fault when leaving strategy ๐‘  considering transition function ฮ”
 ๐›ฟ(๐‘Ÿ, ฮ”)                  Probability of a catastrophic fault when leaving result ๐‘Ÿ considering transition function ฮ”
 ๐›ฟ(๐‘Ÿ,๐‘ )                   Probability of a catastrophic fault when transitioning from ๐‘Ÿ to ๐‘ 
 Probabilities
 ๐‘๐‘–                       Probability of an unmitigated fault in task ๐œ๐‘– when no fault-tolerance is used
 ๐‘ž๐‘–                       Probability of an unmitigated fault in task ๐œ๐‘– when fault-tolerance is used


                                                ๐‘Ÿ๐ด                             ๐‘Ÿ๐ต
                                                ๐ด                                 ๐ต


                                                ๐‘ ๐ด                             ๐‘ ๐ต
                                                ๐ด                                 ๐ต


                                                ๐‘Ÿ๐ด                             ๐‘Ÿ๐ต

Figure 1: A strategy state machine for ฮ“ = {๐œ๐ด ,๐œ๐ต }


or fail state, or probability thereof, of each of the tasks. Each possible result ๐‘Ÿ directly maps to
its best successor strategy, which is applied to the next iteration of the task set.

Offline The full set of strategies ๐‘  โˆˆ ๐’ฎ is computed ahead of time, as well as the transition
relation ฮ” from any given result ๐‘Ÿ โˆˆ โ„› to the next best successor strategy ฮ”(๐‘Ÿ) = ๐‘ . Strategies
which are not schedulable are not included in ๐’ฎ. Furthermore, strategies which are dominated
by other strategies (i.e. all tasks that are protected by one strategy are also protected by another)
are also not considered in ๐’ฎ.
   We develop an algorithm to compute ฮ”, such that our choice of ฮ” provides the lowest
steady-state rate of catastrophic faults.

Example State Machine An example of such a state machine is shown in Figure 1, where
there are two strategies ๐’ฎ = {๐‘ ๐ด , ๐‘ ๐ต }. This state machine is not the optimal state machine for
this task set, but gives a non-trivial example of what such a state machine may look like. Such
a state machine is constructed by the offline component, and made available to the runtime.
Each strategy protects only one task (either task ๐ด or task ๐ต). Finally, fault-tolerance may fail
and if it fails this information is available to the strategy switching component. With these
assumptions, each strategy has two potential successors: one where the task under protection
succeeds and one where it fails (e.g. ๐‘Ÿ๐ด and ๐‘Ÿ๐ด for strategy ๐‘ ๐ด ). Each result state ๐‘Ÿ has just one
successor strategy ฮ”(๐‘Ÿ), e.g. ฮ”(๐‘Ÿ๐ด ) = ๐‘ ๐ต in Figure 1. The following sequence of actions may
take place at runtime, as per Figure 1;

   1. To start, the runtime initializes by picking a random strategy, say ๐‘ ๐ต , and applies fault
      tolerance accordingly. By picking ๐‘ ๐ต , task ๐œ๐ต will be executed with fault-tolerance, while
      task ๐œ๐ด will not.

   2. The task set is executed (iteration 1).

   3. At the end of the iteration, data from the fault-tolerance applied to task ๐œ๐ต is used to
      select a result. If ๐œ๐ต fails, the result ๐‘Ÿ๐ต is selected. However, let us assume ๐œ๐ต succeeded,
      and select result ๐‘Ÿ๐ด accordingly. No information about the success or failure of ๐œ๐ด is
      known.

   4. ๐‘Ÿ๐ต links to strategy ๐‘ ๐ด , which is selected.

   5. By switching to ๐‘ ๐ด , task ๐œ๐ด will be executed with fault-tolerance, while task ๐œ๐ต will not.

   6. The task set is executed (iteration 2).

   7. At the end of the iteration, data from the fault-tolerance of task ๐œ๐ด indicates that task ๐œ๐ด
      has failed, letting us select result ๐‘Ÿ๐ด .

   8. ๐‘Ÿ๐ด links to strategy ๐‘ ๐ด , which is selected.

   *. The process continues...

  Note that there is no involved selection procedure for the initial strategy (strategy ๐‘ ๐ต in the
example). Our approach is only concerned with obtaining the lowest steady-state fault rate of
the application, which is in no way impacted by the choice of initial strategy.
4. Evaluating State Machines
One way to find the optimal state machine transition function ฮ” for a given ๐’ฎ is to enumerate
all possible transition functions, score each transition function, and select the best one. To do
so, we define the state machine scoring function ๐›ฟ(ฮ”). The output of this scoring function
is the steady-state rate of catastrophic faults. Specifically, ๐›ฟ(ฮ”) is the weighted probability
of a catastrophic fault occurring during any iteration of the task set when applying strategy
switching according to the transition function ฮ”.
   To evaluate ๐›ฟ(ฮ”), we compute the steady-state probability distribution of the transition
function ฮ” (e.g. using linear algebra). The steady-state probability distribution provides a
probability ๐‘ƒ (๐‘ |ฮ”) of finding the state machine in strategy ๐‘  at an arbitrarily chosen iteration
of the task set given ฮ”. Intuitively, as the number of iterations of the task set approaches infinity,
๐‘ƒ (๐‘ |ฮ”) is the fraction of โˆ‘๏ธ€iterations spent with strategy ๐‘  active. The sum of these fractions
over all strategies is 1, i.e ๐‘ โˆˆ๐’ฎ ๐‘ƒ (๐‘ |ฮ”) = 1.
   Let ๐›ฟ(๐‘ , ฮ”) be the probability of a catastrophic fault occurring in strategy ๐‘ . Together with
๐‘ƒ (๐‘ |ฮ”), we can now compute ๐›ฟ(ฮ”):
                                             โˆ‘๏ธ
                                    ๐›ฟ(ฮ”) =      ๐‘ƒ (๐‘ |ฮ”) ยท ๐›ฟ(๐‘ , ฮ”)
                                                 ๐‘ โˆˆ๐’ฎ

  Catastrophic faults, by their definition, only occur when an unmitigated fault occurs in two
consecutive iterations of the task set. To compute ๐›ฟ(๐‘ , ฮ”), we must not just consider ๐‘ , but also
the successor of ๐‘ . The successor of ๐‘  depends on what result ๐‘Ÿ is being chosen, which is given
by ๐‘ƒ (๐‘Ÿ|๐‘ ,ฮ”). Each ๐‘Ÿ has only one successor strategy ฮ”(๐‘Ÿ). Let ฮ”(๐‘Ÿ) = ๐‘ to . Thus, ๐›ฟ(๐‘ ,ฮ”) can
be expressed in terms of ๐‘Ÿ and ฮ”(๐‘Ÿ) = ๐‘ to as follows:
                                                  โˆ‘๏ธ
                                ๐›ฟ(๐‘ , ฮ”) =                 ๐‘ƒ (๐‘Ÿ|๐‘ ) ยท ๐›ฟ(๐‘Ÿ, ๐‘ ๐‘ก ๐‘œ)
                                                 ๐‘Ÿโˆˆฮ”(๐‘ )

  As ๐‘ to is a strategy, ๐›ฟ(๐‘Ÿ, ๐‘ to ) is independent of our choice of ฮ”. ๐›ฟ(๐‘Ÿ, ๐‘ to ) is equal to the
probability that any task experiences a fault both in ๐‘Ÿ as well as in ๐‘ to :

                                      โˆ๏ธ
                   ๐›ฟ(๐‘Ÿ, ๐‘ to ) = 1 โˆ’           1 โˆ’ ๐‘ƒ (catastrophic fault in ๐œ๐‘– |๐‘Ÿ,๐‘ to )
                                      ๐œ๐‘– โˆˆฮ“
                                      โˆ๏ธ
                            =1โˆ’               1 โˆ’ ๐‘ƒ (fault in ๐œ๐‘– |๐‘Ÿ) ยท ๐‘ƒ (fault in ๐œ๐‘– |๐‘ to )
                                      ๐œ๐‘– โˆˆฮ“

   The probabilities ๐‘ƒ (fault in ๐œ๐‘– |๐‘Ÿ) and ๐‘ƒ (fault in ๐œ๐‘– |๐‘ to ) are computed based on the used
fault-tolerance technique, if ๐œ๐‘– โˆˆ ฮ“๐‘ to , the rate of faults combined with the WCET of ๐œ๐‘– , and
information available in the result. For example, a task without fault-tolerance may have a
chance of experiencing a fault with ๐‘ƒ (fault in ๐œ๐‘– ) = 1 โˆ’ ๐‘’โˆ’๐œ†ยท๐ถ๐‘– , where ๐ถ๐‘– is the WCET of ๐œ๐‘– .
Likewise, a task which was run under fault-tolerance according to the previous strategy, and
has succeeded according to result ๐‘Ÿ, will have a ๐‘ƒ (fault in ๐œ๐‘– |๐‘Ÿ) = 0. We consider all SEUs to be
statistically independent events. To keep independence, use the WCET instead of the average
execution time, as the execution times of tasks in the same task set may not be independent.
As such, our steady-state rate of catastrophic faults ๐›ฟ(ฮ”) provides an upper bound to the true
steady-state fault rate of the application.

Tractability The approach, as presented here, can easily become intractable for even small
task sets. Many state space reduction techniques can be realized in the way the state machines
are enumerated, which can considerably lower the number of elements. If a candidate state
machine ฮ” contains disconnected sub-graphs, then no unique steady state can be computed,
and as such this candidate can be pruned. Furthermore, the candidate state machine may
contain strategies that are ignored, i.e. there is no path to that strategy from that same strategy.
When this is the case, the steady-state probability ๐‘ƒ (๐‘ |ฮ”) is always 0. For such an ignored
strategy, the successor strategies of its results do not matter, and its many ways of connecting
to successor strategies need not be individually examined. Finally, the domain itself my allow
for significant state space reduction. For example, if a precedence relation is added between
tasks in the task set over which data is communicated, the failure of a preceding task may imply
that the succeeding task is destined to fail. These modifications impact the lattice of strategies,
and reduce the size of the schedulable and non-redundant set of strategies ๐’ฎ.
   For completeness, we discuss the algorithmic complexity of the (naรฏve) state machine construc-
tion algorithm as presented in this paper. While the exact size of the state space depends on a
number of factors, such as the number of available strategies ๐’ฎ, the upper bound is considerable.
   The time complexity of ๐›ฟ(ฮ”) is ๐’ช(|โ„›| ยท ss(|๐’ฎ|)), where:

    โ€ข |โ„›| is the number of results

    โ€ข |๐’ฎ| is the number of strategies

    โ€ข ss(๐‘›) provides an upper bound to the computation of the steady state for ๐‘› strategies

   To compute ss(๐‘›), the steady-state matrix needs to be computed. For this computation,
the LAPACK driver routine DSEGD1 may be used, with a complexity of ๐’ช(๐‘›3 ), resulting in
ss(๐‘›) = ๐‘›3 .
   The number of strategies is up to all combinations of tasks (|๐‘†| โˆˆ ๐’ช(|ฮ“|!)). At the same time,
๐›ฟ(ฮ”) is computed for every possible state machine. As each result can link to any successor
strategy, this yields up to |๐‘†||๐‘…| state machines. Each strategy has โ‰ค 2|ฮ“๐‘  | results, 2|ฮ“๐‘  | โˆˆ ๐’ช(2|ฮ“| )
and thus |๐‘…| โˆˆ ๐’ช(|ฮ“|!ยท2|ฮ“| ). Let ๐‘› = |ฮ“|, i.e. ๐‘› is the number of tasks. Then, the final algorithmic
time complexity is given:

                                     ๐’ช(|๐‘†||๐‘…| ยท |โ„›| ยท ss(|๐’ฎ|)) =
                                                 |ฮ“|
                                     ๐’ช(|ฮ“|!|ฮ“|!ยท2 | ยท |ฮ“|! ยท 2|ฮ“| ยท |ฮ“|!3 ) =
                                             ๐‘›
                                     ๐’ช(๐‘›!๐‘›!ยท2 ยท ๐‘›! ยท 2๐‘› ยท ๐‘›!3 ) โŠ‚
                                             ๐‘›
                                     ๐’ช(๐‘›!๐‘›!ยท2 ยท 2๐‘›5 ) โ‰ˆ
                                             ๐‘›
                                     ๐’ช(๐‘›!๐‘›!ยท2 )

  It must be noted that this is by no means a tight upper bound. In the next section, we will
examine a task set with |ฮ“| = ๐‘› = 3. This example requires examination of only 64 candidate
    1
        http://www.netlib.org/lapack/lug/node71.html
state machines, even though such an ๐‘› value would appear to be completely intractable as per
the above formulation.


5. Example
Let us consider an example task set ฮ“ = {๐œ๐ด , ๐œ๐ต , ๐œ๐ถ }, with WCET values ๐ถ๐ด = 20, ๐ถ๐ต = 10
and ๐ถ๐ถ = 10. We set ๐œ† = 10โˆ’3 for this example. For brevity, we abbreviate the probability of
no fault in task ๐œ๐‘– when not using fault-tolerance as ๐‘๐‘– :

                                                                      ๐ถ๐‘–
                           ๐‘ƒ (no fault in ๐œ๐‘– |no FT) = ๐‘’โˆ’๐œ†ยท๐ถ๐‘– = ๐‘’โˆ’ 103
                                                     = ๐‘๐‘–

   As fault-tolerance technique, we use Triple Modular Redundancy (TMR). We require a majority
for TMR to succeed, and assume there is no other way TMR can fail (e.g. assume no unmitigated
faults during voting). A majority for TMR requires two or more copies of the task to be in
agreement. Like ๐‘๐‘– , we abbreviate the probability of no fault in task ๐œ๐‘– when using fault-tolerance
as ๐‘ž๐‘– :

                                                          (๏ธ‚ )๏ธ‚
                                                            3 2
                        ๐‘ƒ (no fault in ๐œ๐‘– |TMR) = ๐‘3๐‘– +        ๐‘ ยท (1 โˆ’ ๐‘๐‘– )
                                                            2 ๐‘–
                                                = ๐‘ž๐‘–

   With the WCET for each task available, we can then compute the ๐‘๐‘– and ๐‘ž๐‘– values for all
tasks:

                 ๐œ๐ด : ๐‘๐ด = ๐‘’โˆ’20๐œ† = 0.9802      ๐‘ž๐ด = ๐‘3๐ด + 3(1 โˆ’ ๐‘๐ด )๐‘2๐ด = 0.9988
                 ๐œ๐ต : ๐‘๐ต = ๐‘’โˆ’10๐œ† = 0.9901      ๐‘ž๐ต = ๐‘3๐ต + 3(1 โˆ’ ๐‘๐ต )๐‘2๐ต = 0.9997
                 ๐œ๐ถ : ๐‘๐ถ = ๐‘’โˆ’10๐œ† = 0.9901      ๐‘ž๐ถ = ๐‘3๐ถ + 3(1 โˆ’ ๐‘๐ถ )๐‘2๐ถ = 0.9997

Enumerating strategies For our chosen task set ฮ“, there are eight possible subsets and as
such eight possible strategies:

                  ๐‘ โˆ… :            ฮ“๐‘ โˆ… = {}           ฮ“ โˆ– ฮ“๐‘ โˆ… = {๐œ๐ด , ๐œ๐ต , ๐œ๐ถ }
                  ๐‘ ๐ด :          ฮ“๐‘ ๐ด = {๐œ๐ด }           ฮ“ โˆ– ฮ“๐‘ ๐ด = {๐œ๐ต , ๐œ๐ถ }
                  ๐‘ ๐ต :          ฮ“๐‘ ๐ต = {๐œ๐ต }           ฮ“ โˆ– ฮ“๐‘ ๐ต = {๐œ๐ด , ๐œ๐ถ }
                  ๐‘ ๐ถ :          ฮ“๐‘ ๐ถ = {๐œ๐ถ }           ฮ“ โˆ– ฮ“๐‘ ๐ถ = {๐œ๐ด , ๐œ๐ต }
                  ๐‘ ๐ด,๐ต :     ฮ“๐‘ ๐ด,๐ต = {๐œ๐ด , ๐œ๐ต }        ฮ“ โˆ– ฮ“๐‘ ๐ด,๐ต = {๐œ๐ถ }
                  ๐‘ ๐ด,๐ถ :     ฮ“๐‘ ๐ด,๐ถ = {๐œ๐ด , ๐œ๐ถ }        ฮ“ โˆ– ฮ“๐‘ ๐ด,๐ถ = {๐œ๐ต }
                  ๐‘ ๐ต,๐ถ :     ฮ“๐‘ ๐ต,๐ถ = {๐œ๐ต , ๐œ๐ถ }        ฮ“ โˆ– ฮ“๐‘ ๐ต,๐ถ = {๐œ๐ด }
                  ๐‘ ๐ด,๐ต,๐ถ : ฮ“๐‘ ๐ด,๐ต,๐ถ = {๐œ๐ด , ๐œ๐ต , ๐œ๐ถ }    ฮ“ โˆ– ฮ“๐‘ ๐ด,๐ต,๐ถ = {}

  Not all of these may be schedulable. If ๐‘ ๐ด,๐ต,๐ถ was schedulable there would be no reason to
use strategy switching. Likewise, if only ๐‘ โˆ… was schedulable, there is no way to deploy limited
                              Unschedulable
                                                   ๐‘ ๐ด,๐ต,๐ถ


                                            ๐‘ ๐ด,๐ต   ๐‘ ๐ด,๐ถ     ๐‘ ๐ต,๐ถ

                              Schedulable

                                            ๐‘ ๐ด      ๐‘ ๐ต       ๐‘ ๐ถ

                               Redundant

                                                     ๐‘ โˆ…



Figure 2: Lattice of strategies for ฮ“ = {๐œ๐ด , ๐œ๐ต , ๐œ๐ถ }


fault-tolerance to this task set. For the example, we assume that only strategies ๐‘ โˆ… , ๐‘ ๐ด , ๐‘ ๐ต , ๐‘ ๐ถ
and ๐‘ ๐ต,๐ถ are determined to be schedulable by a scheduler. The strategies form a lattice with
๐‘ โˆ… as the greatest lower bound, and ๐‘ ๐ด,๐ต,๐ถ as the least upper bound. The lattice is shown in
Figure 2. Figure 2 also reveals redundant strategies, e.g. thereโ€™s no reason to choose ๐‘ ๐ต when
๐‘ ๐ต,๐ถ is schedulable. As such, there is no need to consider them. Let ๐’ฎ = {๐‘ ๐ด , ๐‘ ๐ต,๐ถ }.
   TMR can fail to reach a consensus, and then this information is available to the strategy
selection component in the form of a result. As such, in this example each strategy ๐‘  has 2|ฮ“๐‘  |
possible results. For ๐‘ ๐ด this is ๐‘Ÿ๐ด and ๐‘Ÿ๐ด , and for ๐‘ ๐ต,๐ถ this is ๐‘Ÿ๐ต,๐ถ , ๐‘Ÿ๐ต,๐ถ , ๐‘Ÿ๐ต,๐ถ and ๐‘Ÿ๐ต,๐ถ . These
results are shown in Figure 3a, which shows the strategy state machine without successor
relations for each result.

Evaluating transitions functions There are |๐’ฎ||โ„›| = 26 = 64 possible transition functions
ฮ”, as all of the six results needs to be linked to one of the two strategies. We will not enumerate
all of them, but two examples are shown in Figure 3b and Figure 3c.
   In this example, we will show how to evaluate ๐›ฟ(ฮ”) for the state machine in Figure 3b. Let
this be ฮ”3b .

   1. Computing the steady-state probability of ฮ”3b . The result states are โ€œurgentโ€ states, in
      which no time can pass. In other words, the result states themselves are not relevant for
      the steady-state (i.e. the fraction of iterations of the task set spent in state ๐‘Ÿ โˆˆ โ„› is 0).
      As such, we remove these states for the steady-state computation, linking each strategy
      to multiple successor strategies. The resulting state machine is a Discrete-Time Markov
      Chain, where each time step is an iteration of the task set ฮ“. The steady-state can be
      computed using linear algebra. Let ๐‘‡ be the transition matrix for ฮ”3b with all results
      removed:

                               [๏ธ‚                                                 ]๏ธ‚
                                          ๐‘ƒ (๐‘Ÿ๐ด |๐‘ ๐ด )             ๐‘ƒ (๐‘Ÿ๐ด |๐‘ ๐ด )
                         ๐‘‡ =
                             ๐‘ƒ (๐‘Ÿ๐ต,๐ถ โˆช ๐‘Ÿ๐ต,๐ถ โˆช ๐‘Ÿ๐ต,๐ถ โˆช ๐‘Ÿ๐ต,๐ถ |๐‘ ๐ต,๐ถ )     0
                              ๐‘Ÿ๐ด                   ๐‘Ÿ๐ต,๐ถ               ๐‘Ÿ๐ต,๐ถ
                                ๐ด                     ๐ต,๐ถ             ๐ต,๐ถ

                              ๐‘ ๐ด                             ๐‘ ๐ต,๐ถ
                                ๐ด                     ๐ต,๐ถ             ๐ต,๐ถ

                              ๐‘Ÿ๐ด                   ๐‘Ÿ๐ต,๐ถ               ๐‘Ÿ๐ต,๐ถ

                (a) Partial state machine without successor relations for the results

                              ๐‘Ÿ๐ด                   ๐‘Ÿ๐ต,๐ถ               ๐‘Ÿ๐ต,๐ถ
                                ๐ด                     ๐ต,๐ถ             ๐ต,๐ถ

                              ๐‘ ๐ด                             ๐‘ ๐ต,๐ถ
                                ๐ด                     ๐ต,๐ถ             ๐ต,๐ถ

                              ๐‘Ÿ๐ด                   ๐‘Ÿ๐ต,๐ถ               ๐‘Ÿ๐ต,๐ถ


          (b) Switching state machine choosing between ๐‘ ๐ด and ๐‘ ๐ต,๐ถ based on the result

                              ๐‘Ÿ๐ด                   ๐‘Ÿ๐ต,๐ถ               ๐‘Ÿ๐ต,๐ถ
                                ๐ด                     ๐ต,๐ถ             ๐ต,๐ถ

                              ๐‘ ๐ด                             ๐‘ ๐ต,๐ถ
                                ๐ด                     ๐ต,๐ถ             ๐ต,๐ถ

                              ๐‘Ÿ๐ด                   ๐‘Ÿ๐ต,๐ถ               ๐‘Ÿ๐ต,๐ถ


                         (c) Degenerate state machine always choosing ๐‘ ๐ด
Figure 3: Example state machines

                           [๏ธ‚          ]๏ธ‚ [๏ธ‚             ]๏ธ‚
                             1 โˆ’ ๐‘ž๐ด ๐‘ž๐ด      0.0012 0.9988
                         =               =
                                1    0         1      0



      We can compute the steady-state vector ๐‘ฃ by solving ๐‘ฃ = ๐‘ฃ๐‘‡ .
                                                            (๏ธ‚ 1 )๏ธ‚
                                            ๐‘ฃ = ๐‘ฃ๐‘‡ โ‰ˆ          2
                                                              1
                                                              2

      Note that a unique steady-state vector ๐‘ฃ need not exist if there are two or more parts of
      the state machine that are disconnected. For example, all results of ๐‘ ๐ด may link back to ๐‘ ๐ด ,
      while all results of ๐‘ ๐ต,๐ถ link back to ๐‘ ๐ต,๐ถ . In such a case, the steady-state is dependent
      on the initial state. However, we can safely disregard state machines that depend on the
      initial condition, as other state machines with identical steady-state behavior must exist
      as well. If staying in ๐‘ ๐ด provides the lowest ๐›ฟ(ฮ”) value, then a state machine like shown
      in Figure 3c would yield the exact same ๐›ฟ(ฮ”) value as a state machine where ๐‘ ๐ด and ๐‘ ๐ต,๐ถ
      is disconnected with ๐‘ ๐ด as the initial state.
2. Compute, for each result โ†’ strategy transition, the ๐›ฟ(๐‘Ÿ, ๐‘ to ) catastrophic fault probability.

                         ๐›ฟ(๐‘Ÿ๐ด , ๐‘ ๐ต,๐ถ )     = 1โˆ’     (1 โˆ’ (0) ยท (1 โˆ’ ๐‘๐ด ))ยท
                                                    (1 โˆ’ (1 โˆ’ ๐‘๐ต ) ยท (1 โˆ’ ๐‘ž๐ต ))ยท
                                                    (1 โˆ’ (1 โˆ’ ๐‘๐ถ ) ยท (1 โˆ’ ๐‘๐ถ ))
                                           =        5.87 ยท 10โˆ’6
                           ๐›ฟ(๐‘Ÿ๐ด , ๐‘ ๐ด )    = 1โˆ’    (1 โˆ’ (1) ยท (1 โˆ’ ๐‘ž๐ด ))ยท
                                                  (1 โˆ’ (1 โˆ’ ๐‘๐ต ) ยท (1 โˆ’ ๐‘๐ต ))ยท
                                                  (1 โˆ’ (1 โˆ’ ๐‘๐ถ ) ยท (1 โˆ’ ๐‘๐ถ ))
                                          =       1358 ยท 10โˆ’6
                          ๐›ฟ(๐‘Ÿ๐ต,๐ถ , ๐‘ ๐ด )    = 1โˆ’ (1 โˆ’ (1 โˆ’ ๐‘๐ด ) ยท (1 โˆ’ ๐‘ž๐ด ))ยท
                                                (1 โˆ’ (0) ยท (1 โˆ’ ๐‘๐ต ))ยท
                                                (1 โˆ’ (0) ยท (1 โˆ’ ๐‘๐ถ ))
                                           =    22.98 ยท 10โˆ’6
                          ๐›ฟ(๐‘Ÿ๐ต,๐ถ , ๐‘ ๐ด )    = 1โˆ’ (1 โˆ’ (1 โˆ’ ๐‘๐ด ) ยท (1 โˆ’ ๐‘ž๐ด ))ยท
                                                (1 โˆ’ (1) ยท (1 โˆ’ ๐‘๐ต ))ยท
                                                (1 โˆ’ (0) ยท (1 โˆ’ ๐‘๐ถ ))
                                           =    9973 ยท 10โˆ’6
                          ๐›ฟ(๐‘Ÿ๐ต,๐ถ , ๐‘ ๐ด )    = 1โˆ’ (1 โˆ’ (1 โˆ’ ๐‘๐ด ) ยท (1 โˆ’ ๐‘ž๐ด ))ยท
                                                (1 โˆ’ (0) ยท (1 โˆ’ ๐‘๐ต ))ยท
                                                (1 โˆ’ (1) ยท (1 โˆ’ ๐‘๐ถ ))
                                           =    9973 ยท 10โˆ’6
                          ๐›ฟ(๐‘Ÿ๐ต,๐ด , ๐‘ ๐ด )    = 1โˆ’     (1 โˆ’ (1 โˆ’ ๐‘๐ด ) ยท (1 โˆ’ ๐‘ž๐ด ))ยท
                                                    (1 โˆ’ (1) ยท (1 โˆ’ ๐‘๐ต ))ยท
                                                    (1 โˆ’ (1) ยท (1 โˆ’ ๐‘๐ถ ))
                                           =        19923 ยท 10โˆ’6

3. Using the calculated ๐›ฟ(๐‘Ÿ, ๐‘ to ) values, compute a ๐›ฟ(๐‘  โˆˆ ๐’ฎ) value for each strategy.


                             ๐›ฟ(๐‘ ๐ด ) =๐‘ƒ (๐‘Ÿ๐ด |๐‘ ๐ด ) ยท ๐›ฟ(๐‘Ÿ๐ด , ๐‘ ๐ต,๐ถ )+
                                          ๐‘ƒ (๐‘Ÿ๐ด |๐‘ ๐ด ) ยท ๐›ฟ(๐‘Ÿ๐ด , ๐‘ ๐ด ) =
                                          7.6077 ยท 10โˆ’6
                           ๐›ฟ(๐‘ ๐ต,๐ถ ) =๐‘ƒ (๐‘Ÿ๐ต,๐ถ |๐‘ ๐ต,๐ถ ) ยท ๐›ฟ(๐‘Ÿ๐ต,๐ถ , ๐‘ ๐ด )+
                                          ๐‘ƒ (๐‘Ÿ๐ต,๐ถ |๐‘ ๐ต,๐ถ ) ยท ๐›ฟ(๐‘Ÿ๐ต,๐ถ , ๐‘ ๐ต,๐ถ )+
                                          ๐‘ƒ (๐‘Ÿ๐ต,๐ถ |๐‘ ๐ต,๐ถ ) ยท ๐›ฟ(๐‘Ÿ๐ต,๐ถ , ๐‘ ๐ต,๐ถ )+
                                          ๐‘ƒ (๐‘Ÿ๐ต,๐ถ |๐‘ ๐ต,๐ถ ) ยท ๐›ฟ(๐‘Ÿ๐ต,๐ถ , ๐‘ ๐ต,๐ถ ) =
                                          29.69985 ยท 10โˆ’6

   ๐›ฟ(๐‘ ) is the probability that a catastrophic fault occurs by selecting strategy ๐‘ . The bad
   score of ๐‘ ๐ต,๐ถ is not surprising: the ฮ”3b state machine is not particularly clever as it
   chooses to switch to strategy ๐‘ ๐ด even with the knowledge that ๐œ๐ต or ๐œ๐ถ has failed.
                 Table 2
                 Shortened table of all possible ฮ” transition functions for Figure 3a
                  ฮ”      ๐‘Ÿ๐ด      ๐‘Ÿ๐ด      ๐‘Ÿ๐ต,๐ถ  ๐‘Ÿ๐ต,๐ถ      ๐‘Ÿ๐ต,๐ถ    ๐‘Ÿ๐ต,๐ถ    ๐›ฟ(ฮ”) ยท 105
                  ฮ”1     ๐‘ ๐ด      ๐‘ ๐ด      ๐‘ ๐ด    ๐‘ ๐ด        ๐‘ ๐ด      ๐‘ ๐ด      19.9355
                  ฮ”2     ๐‘ ๐ต๐ถ     ๐‘ ๐ด      ๐‘ ๐ด    ๐‘ ๐ด        ๐‘ ๐ด      ๐‘ ๐ด      1.81421
                  ฮ”3     ๐‘ ๐ด      ๐‘ ๐ต๐ถ     ๐‘ ๐ด    ๐‘ ๐ด        ๐‘ ๐ด      ๐‘ ๐ด      22.055
                                                 ...
                  ฮ”56    ๐‘ ๐ต๐ถ     ๐‘ ๐ต๐ถ     ๐‘ ๐ต๐ถ   ๐‘ ๐ด        ๐‘ ๐ต๐ถ     ๐‘ ๐ต๐ถ     39.489
                  ฮ”57    ๐‘ ๐ด      ๐‘ ๐ด      ๐‘ ๐ด    ๐‘ ๐ต๐ถ       ๐‘ ๐ต๐ถ     ๐‘ ๐ต๐ถ     19.935
                  ฮ”58    ๐‘ ๐ต๐ถ     ๐‘ ๐ด      ๐‘ ๐ด    ๐‘ ๐ต๐ถ       ๐‘ ๐ต๐ถ     ๐‘ ๐ต๐ถ     1.54062
                  ฮ”59    ๐‘ ๐ด      ๐‘ ๐ต๐ถ     ๐‘ ๐ด    ๐‘ ๐ต๐ถ       ๐‘ ๐ต๐ถ     ๐‘ ๐ต๐ถ     22.054
                  ฮ”60    ๐‘ ๐ต๐ถ     ๐‘ ๐ต๐ถ     ๐‘ ๐ด    ๐‘ ๐ต๐ถ       ๐‘ ๐ต๐ถ     ๐‘ ๐ต๐ถ     2.6115
                  ฮ”61    ๐‘ ๐ด      ๐‘ ๐ด      ๐‘ ๐ต๐ถ   ๐‘ ๐ต๐ถ       ๐‘ ๐ต๐ถ     ๐‘ ๐ต๐ถ     n/a3
                  ฮ”62    ๐‘ ๐ต๐ถ     ๐‘ ๐ด      ๐‘ ๐ต๐ถ   ๐‘ ๐ต๐ถ       ๐‘ ๐ต๐ถ     ๐‘ ๐ต๐ถ     39.226
                  ฮ”63    ๐‘ ๐ด      ๐‘ ๐ต๐ถ     ๐‘ ๐ต๐ถ   ๐‘ ๐ต๐ถ       ๐‘ ๐ต๐ถ     ๐‘ ๐ต๐ถ     39.226
                  ฮ”64    ๐‘ ๐ต๐ถ     ๐‘ ๐ต๐ถ     ๐‘ ๐ต๐ถ   ๐‘ ๐ต๐ถ       ๐‘ ๐ต๐ถ     ๐‘ ๐ต๐ถ     39.2265
                  ฮ”โˆ…                         n/a                         59.0104
                   1
                     Example from Figure 3b (ฮ”2 = ฮ”3b )
                   2
                     Best (lowest) steady-state fault rate
                   3
                     No unique steady-state exists for this transition function
                   4
                     ๐›ฟ(ฮ”) of the task set without any form of fault-tolerance
                   5
                     Does not strategy switch, i.e. always stays in the same strategy


   4. Finally, compute ๐›ฟ(ฮ”3b ) by taking the weighted average of all ๐›ฟ(๐‘ ) values by multiplying
      ๐›ฟ(๐‘ ) for each strategy by the steady-state probability ๐‘ƒ (๐‘ |ฮ”3b ) of being in that strategy.


                             ๐›ฟ(ฮ”3b ) = ๐‘ƒ (๐‘ ๐ด ) ยท ๐›ฟ(๐‘ ๐ด ) + ๐‘ƒ (๐‘ ๐ต,๐ถ ) ยท ๐›ฟ(๐‘ ๐ต,๐ถ )
                                       = 1.814 ยท 10โˆ’5

   This process is repeated for all 64 possible transition functions. For brevity, we will not show
the evaluation of every single transition function here. Instead, Table 2 shows a subset of all
possible ฮ” values. The columns headed with a result show to which successor strategy that
result maps. For example, for the first evaluated state machine ฮ”1 (๐‘Ÿ๐ด ) = ๐‘ ๐ด . The best state
machine is also revealed, listed as ฮ”58 . Finally, the last item in the table is ฮ”โˆ… , added as a
reference. ฮ”โˆ… is the state machine with a single strategy ๐‘ โˆ… which protects no tasks, i.e. the
behavior obtained when not using any form of fault-tolerance. The best state machine is a
38.3-fold improvement over this default. That is, the best state machine offers a 38.3 times
lower rate of catastrophic failure when compared to ฮ”โˆ… . Finally, the table also shows two state
machines which perform no strategy switching, but still use fault-tolerance. These are ฮ”1 and
ฮ”64 , staying in ๐‘ ๐ด and ๐‘ ๐ต,๐ถ respectively. The strategy switching solution ฮ”58 offers a 12.9
times lower rate of catastrophic faults when compared to the best of these static solutions.
                                            A             B

Figure 4: Task set ฮ“ = {๐œ๐ด , ๐œ๐ต } with a precedence relation between ๐œ๐ด and ๐œ๐ต


6. Extending the fault and task model
6.1. Adding precedence relations
Directed Acyclic Graph scheduling is an extension to our task model where precedence relations
exist between tasks. While this has no impact on the strategy switching algorithm directly as it
need not concern itself with scheduling, it may add dependence to the success probability of a
task. For this extension, we assume that when a predecessor task fails (produces incorrect data),
all successor tasks fail as well due to operating on incorrect data, even when not experiencing a
SEU.
   To support this assumption, the ๐›ฟ(๐‘Ÿ, ๐‘ to ) cost function needs to be modified to consider
precedence relations. Specifically, ๐‘ƒ (fault in ๐œ๐‘– |๐‘Ÿ) and ๐‘ƒ (fault in ๐œ๐‘– |๐‘ to ) must be replaced to not
just consider if ๐œ๐‘– failed in isolation. Let us call this precedence-aware probability the probability
of incorrect output.

                       ๐‘ƒ (incorrect output ๐œ๐‘– |๐‘ฅ) = ๐‘ƒ (fault in ๐œ๐‘– |๐‘ฅ)ยท
                                                    โˆ๏ธ
                                                            ๐‘ƒ (fault in ๐œ๐‘— |๐‘ฅ)
                                                   ๐œ๐‘— โˆˆpred(๐œ๐‘– )

   Here, ๐‘ฅ โˆˆ ๐’ฎ โˆช โ„› is either a strategy or a result. pred (๐œ๐‘– ) is the set of all direct and indirect
predecessors of task ๐œ๐‘– . This probability also includes the behavior of any predecessor tasks.
   Figure 4 shows an example task set with a precedence relation. Consider the task set in
this example with two strategies: ๐’ฎ = {๐‘ ๐ด , ๐‘ ๐ต }, protecting task ๐œ๐ด and ๐œ๐ต respectively. Each
strategy has two outcomes, hence โ„› = {๐‘Ÿ๐ด , ๐‘Ÿ๐ด , ๐‘Ÿ๐ต , ๐‘Ÿ๐ต }. Let us consider ๐‘ƒ (fault in ๐œ๐ต |๐‘Ÿ๐ด ):
the result ๐‘Ÿ๐ด does not directly communicate anything about the state of task ๐ต, however due to
the precedence relation we know it effectively failed. As such, ๐‘ƒ (incorrect output ๐œ๐ต |๐‘Ÿ๐ด ) = 1.

6.2. Selective fault-tolerance and criticality
The task model can be extended to support heterogeneity in the ability of tasks to tolerate
non-consecutive faults. Let ฮ“๐‘ be the set of tasks which cannot tolerate non-consecutive
faults. Then, we update ๐‘ƒ (catastrophic fault in ๐œ๐‘– |๐‘Ÿ,๐‘ to ) to consider non-consecutive faults as
catastrophic faults when ๐œ๐‘– โˆˆ ฮ“๐‘ .

                        ๐‘ƒ (catastrophic fault in ๐œ๐‘– |๐‘Ÿ,๐‘ to =)
                        {๏ธƒ
                          ๐‘ƒ (fault in ๐œ๐‘– |๐‘Ÿ) ยท ๐‘ƒ (fault in ๐œ๐‘– |๐‘ to ) if ๐œ๐‘– โˆˆ
                                                                           / ฮ“๐‘
                          ๐‘ƒ (fault in ๐œ๐‘– |๐‘Ÿ)                         else

 The goal of the strategy state machine is to minimize the rate of catastrophic faults. This
may mean that the rate of catastrophic faults is so low that even a single fault is unlikely to
occur across the lifetime of the system. However, when used in soft real-time deployments,
a โ€œcatastrophic faultโ€ may not be catastrophic while still undesirable. In such a deployment,
catastrophic faults may be acceptable. The impact of a task experiencing such a catastrophic
fault may not be the same across all tasks. As such, a priority function ๐‘(๐œ๐‘– โˆˆ ฮ“) can be
integrated into ๐›ฟ(๐‘Ÿ, ๐‘ to ).


                       ๐›ฟ(๐‘Ÿ, ๐‘ to ) =
                            โˆ๏ธ
                       1โˆ’        (1 โˆ’ ๐‘ƒ (catastrophic fault in ๐œ๐‘– |๐‘Ÿ,๐‘ to ))๐‘(๐œ๐‘– )
                           ๐œ๐‘– โˆˆฮ“

  The ๐‘(๐œ๐‘– ) function assigns relative priority, where higher is better. A task ๐œ๐‘– with ๐‘(๐œ๐‘– ) = 2
has twice the importance as a task ๐œ๐‘— with ๐‘(๐œ๐‘— ) = 1. Note that this change affects the meaning of
the output ๐›ฟ(ฮ”), which is no longer the rate of catastrophic faults. If the impact of a catastrophic
fault in a task with ๐‘(๐œ๐‘– ) = 2 is equivalent to two catastrophic faults in another task with
๐‘(๐œ๐‘— ) = 1, then the final rate ๐›ฟ(ฮ”) can be seen as the rate of catastrophic faults normalized to
faults in ๐œ๐‘— .


7. Related work
(๐‘š๐‘– , ๐‘˜๐‘– ) constraints have been used before in the domain of real-time scheduling. Choi et al. [7]
proposed a scheduler, together with an efficient schedulability algorithm for a sporadic task
set with tasks under (๐‘š๐‘– , ๐‘˜๐‘– ) constraints. This scheduler allows for scheduling task sets that
would normally not be schedulable, but utilizing their (๐‘š๐‘– , ๐‘˜๐‘– ) constraints allows them to be
scheduled.
   Chen et al. [8] proposed a solution that similar to ours. Their method offers fault-tolerance
with the goal of reducing the effective fault rate as well as lowering energy consumption.
Chen et al. proposes a static scheduling technique called Static Pattern-Based Reliable Execution,
ensuring each (๐‘š๐‘– , ๐‘˜๐‘– ) constraint is respected in the presence of transient faults. Furthermore,
they propose delaying the execution of their static pattern if no fault is detected at runtime,
opportunistically running more unprotected instances of the task with the goal of saving energy.
However, if the static pattern is found to be unschedulable as per their schedulability test,
their implementation is unable to provide a schedule that minimizes the fault rate for a given
resource-constrained real-time system. While their approach offers more flexibility in the
task model (specifically the support for (๐‘š๐‘– , ๐‘˜๐‘– ) constraints with ๐‘˜๐‘– > 2), it does not consider
that fault mitigation may fail. Our approach optimally lowers the fault rate, regardless of the
hardware constrains. Furthermore, our approach recognizes that fault mitigation may fail, and
includes this in the calculation for lowering the fault rate.
   Gujarati et al. [9] contributed a technique for measuring the fault rate of an application
with tasks under (๐‘š๐‘– , ๐‘˜๐‘– ) constraints. Their technique provides an upper bound for the fault
probability per iteration of a Fault-tolerant Single-Input Single-Output (FT-SISO) control loop,
similar to our ๐›ฟ(ฮ”) output on a task set with precedence relations. Their technique hopes
to provide transparency to system designers, allowing analyzing the impact on the reliablity
when changing the hardware or software. However, while their approach is aware of (๐‘š๐‘– , ๐‘˜๐‘– )
constraints, it does not provide schedules that utilize such constraints. Instead, it merely includes
them in the reliability calculation.
   The domain of strategy switching shares some aspects with Mixed-Criticality (MC) systems.
In an MC system, the system switches between different levels of criticality depending on the
operating conditions of the system. Tasks are assigned a criticality level, and when the system
criticality is higher than that of the task, the task is not scheduled to guarantee the successful
and timely execution of tasks with a higher criticality level. Pathan [10] combines MC with
fault-tolerance against transient faults. As is typical in MC research, as the level of criticality
increases, the pessimism increases. Pathan increases the maximum fault rate when switching
to a higher level of criticality. In our approach we do not vary the pessimism of any parameter.
Instead, we assume the ๐œ† parameter provides a suitable upper bound to the fault rate in all
conditions. Our approach offers some aspects typically not found in MC systems: while one
could argue that each strategy is really a criticality level, it is a criticality level applied to a
subset of the tasks (specifically ฮ“๐‘  ). Finally, the approach by Pathan requires bounding the
number of faults that can occur in any window. As such, passing their sufficient schedulability
test will (under their fault model) guarantee the system will never experience a fault.


8. Conclusion
In this paper, we have shown how strategy switching can be used to improve fault-tolerance
for resource-constrained systems. Our method makes effective use of the ability to vary which
tasks receive fault-tolerance. It considers at the start of every iteration of the task set what the
best set of tasks is to protect. We have shown how our method computes the optimal strategy
state machine for any given task set, minimizing the rate of catastrophic faults. We have also
shown the flexibility of our method to be extended to support new task and fault models.


9. Future work
In future work, we hope to improve the tractability of our algorithm by both state space reduction
algorithms as well as by using heuristics.
   Furthermore, we hope to extend the fault model to distinguish between deadline misses
and incorrect results. We also hope to integrate tasks with (๐‘š๐‘– , ๐‘˜๐‘– ) constraints with ๐‘˜๐‘– > 2.
Additionally, we hope to integrate the natural ability of tasks to detect faults into our task model,
as a SEU may for example lead to a segfault. Finally, we hope to validate our approach using
simulation-based analysis.


Acknowledgments
The presentation of this paper was considerably improved in response to comments provided
by the anonymous reviewers, and we gratefully acknowledge their insights and assistance.
  This project has received funding from the European Unionโ€™s Horizon 2020 research and
innovation program under grant agreement No. 871259 (ADMORPH project). Additionally, this
work is partially supported by CERCIRAS COST Action CA19135 funded by COST Association.
References
 [1] I. Oz, S. Arslan, A survey on multithreading alternatives for soft error fault tolerance,
     ACM Computing Surveys (2019).
 [2] R. E. Lyons, W. Vanderkulk, The use of triple-modular redundancy to improve computer
     reliability, IBM journal of research and development 6 (1962) 200โ€“209.
 [3] J. Chang, G. A. Reis, D. I. August, Automatic instruction-level software-only recovery, in:
     International Conference on Dependable Systems and Networks (DSNโ€™06), IEEE, 2006, pp.
     83โ€“92.
 [4] S. A. Asghari, M. Binesh Marvasti, A. M. Rahmani, Enhancing transient fault tolerance in
     embedded systems through an OS task level redundancy approach, Future Generation
     Computer Systems 87 (2018) 58โ€“65. doi:10.1016/j.future.2018.04.049.
 [5] G. Bernat, A. Burns, A. Liamosi, Weakly hard real-time systems, IEEE transactions on
     Computers 50 (2001) 308โ€“321.
 [6] I. Broster, A. Burns, G. Rodriguez-Navas, Timing analysis of real-time communication
     under electromagnetic interference, Real-Time Systems 30 (2005) 55โ€“81.
 [7] H. Choi, H. Kim, Q. Zhu, Job-class-level fixed priority scheduling of weakly-hard real-time
     systems, in: 2019 IEEE Real-Time and Embedded Technology and Applications Symposium
     (RTAS), 2019, pp. 241โ€“253. doi:10.1109/RTAS.2019.00028.
 [8] K.-H. Chen, B. Bรถnninghoff, J.-J. Chen, P. Marwedel, Compensate or ignore? Meeting
     control robustness requirements through adaptive soft-error handling, in: Proceedings of
     the 17th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, Tools, and Theory
     for Embedded Systems, LCTES 2016, Association for Computing Machinery, New York,
     NY, USA, 2016, pp. 82โ€“91. doi:10.1145/2907950.2907952.
 [9] A. Gujarati, M. Nasri, B. B. Brandenburg, Quantifying the resiliency of fail-operational real-
     time networked control systems, in: S. Altmeyer (Ed.), 30th Euromicro Conference on Real-
     Time Systems (ECRTS 2018), volume 106 of Leibniz International Proceedings in Informatics
     (LIPIcs), Schloss Dagstuhlโ€“Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 2018, pp.
     16:1โ€“16:24. doi:10.4230/LIPIcs.ECRTS.2018.16.
[10] R. M. Pathan, Fault-tolerant and real-time scheduling for mixed-criticality systems, Real-
     Time Systems 50 (2014) 509โ€“547.