<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Strategy Switching: Smart Fault-tolerance for Resource-constrained Real-time Applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lukas Miedema</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Rouxel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Clemens Grelck</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Amsterdam (UvA)</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Modena and Reggio Emilia (Unimore)</institution>
          ,
          <addr-line>Modena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Software-based fault-tolerance is an attractive alternative to hardware-based fault-tolerance, as it allows for the use of cheap Commercial Of The Shelf hardware. However, software-based fault-tolerance comes at a cost, requiring computing the same results multiple times to allow for the detection and mitigation of faults. Resource-constrained real-time applications may not be able to aford this cost. At the same time, the domain of a real-time task may allow it to tolerate a fault, provided it does not occur in consecutive iterations of the task. In this paper, we introduce a new way to deploy fault-tolerance called strategy switching. Our method targets Single Event Upsets by running diferent subsets of tasks under fault-tolerance at diferent points in time. We do not bound the number of faults in a window, nor does our method assume that tasks under fault-tolerance cannot still fail. Our technique does not require a minimal amount of additional compute resources for fault-tolerance. Instead, our method optimally utilizes any available compute resources for fault-tolerance for resource-constrained real-time applications.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Cyber-physical Systems</kwd>
        <kwd>Resource Constraints</kwd>
        <kwd>Fault-tolerance</kwd>
        <kwd>Single Event Upsets</kwd>
        <kwd>Weakly Hard Realtime</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>resources to allow all tasks to run with fault-tolerance. However, for applications structured as
a set of periodic tasks, software-based fault-tolerance allows the application of fault-tolerance
to only a subset of the task set.</p>
      <p>Control tasks may be able to tolerate non-consecutive deadline misses, which has led to the
adoption of the weakly hard model [5]. A task that is unable to provide a result may not result in
catastrophe, provided that in the next period it can provide a result. Per the weakly hard model,
each task  has an (, ) constraint, indicating that the task must complete at least  times
successfully out of every  times.  is said to be the window size. We use this (, ) constraint
with  &lt;  to deliver more efective fault-tolerance to resource-constrained systems. For
example, consider a task set with just two tasks A and B, where only one task can be run under
fault-tolerance at a time. Furthermore, both task A and B can tolerate non-consecutive deadline
misses. Running task A under fault-tolerance and task B without would leave task B vulnerable
to SEUs. However, we could more optimally make use of the scarce fault-tolerance by switching
between protecting task A and task B in successive iterations of the task set. Tasks under
fault-tolerance may still fail (e.g. TMR reaches no majority), and these cases can be detected.
When the fault-tolerance technique has failed to protect task A, task A should be protected
again in the next iteration of the task set to ensure it does not experience a consecutive fault.
Contribution We propose a new approach for improving fault-tolerance for real-time
applications running on resource-constrained systems by strategy switching. We minimize the
efective unmitigated fault-rate by selecting which tasks are to be run under fault-tolerance.
Our approach recognizes that the importance of protecting a task may change over time due
to earlier faults or lack thereof, and as such runs diferent tasks under fault-tolerance at
different times. By exhaustively searching all patterns in which fault-tolerance can be applied,
our method optimally utilizes limited available computational resources for fault-tolerance in
resource-constrained real-time applications.</p>
      <p>Organization In section 2 we introduce our task and fault model. Our method uses a state
machine, which is introduced in section 3. In section 4, we formalize the construction of the state
machine. Then, in section 5, we discuss how our method can lower the consecutive fault rate
for an example task set. We explain the flexibility of our method by extending it to alternative
task and fault models in section 6. Several pieces of related work exist, which are covered in
section 7. The paper is concluded in section 8, and finally in section 9 we discuss various future
directions for this technique.</p>
    </sec>
    <sec id="sec-2">
      <title>2. System models</title>
      <p>
        Task model We assume a set of periodic tasks Γ = { 1... } with a single, global period
and deadline  such that the period is equal to or larger than the deadline (no pipelining).
Furthermore, we initially assume that each task can aford one (non-consecutive) deadline miss
(an (, ) constraint of (
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        )), and that each task is equally important. In section 6, we will
discuss how some of these assumptions can be relaxed to support a wide variety of task models.
Fault model We use the Poisson distribution as an approximation for the worst-case fault
rate of SEUs, which was argued to be a good approximation by Broster et al. [6]. We do not
assume universal fault detection: only when the task runs under a fault-tolerance scheme can a
fault be detected and mitigated. When a task does not run with fault-tolerance, it is not known
whether or not it succeeded. We use the term catastrophic fault to describe an unmitigated
fault occurring in two consecutive iterations of a task that can tolerate a single unmitigated
fault, i.e. the task  has an (, ) = (
        <xref ref-type="bibr" rid="ref1 ref2">1,2</xref>
        ) constraint. Furthermore, a catastrophic fault also
occurs when a task that cannot tolerate non-consecutive faults experiences an unmitigated
fault, i.e. the task has a (
        <xref ref-type="bibr" rid="ref1 ref1">1,1</xref>
        ) constraint. We do not consider constraints beyond  = 2 in this
paper.
      </p>
      <p>Fault mitigation We assume the presence and implementation of a particular fault-tolerance
scheme, and that any task can be run under that scheme. In this paper, we assume that
SEUs always go undetected in tasks not under fault-tolerance. Finally, we assume the scheme
implements fault-detection and fault-mitigation. Our model allows the fault mitigation to fail
(e.g. due to successive SEUs during both replicas of a task under TMR), but assumes that it is
known when fault mitigation fails.</p>
      <p>Other definitions Given the complexity and number of symbols used in this paper, a table
of all symbols and terms has been compiled in Table 1. Each symbol or term used will be defined
prior to use, as well as being listed in the table.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Strategy Switching State Machine</title>
      <p>To both swiftly select a new subset of the task set to run under fault-tolerance while also making
optimal decisions, we precompute for each situation the next best subset of tasks to run under
fault-tolerance. The result of this is the strategy state machine, which is made available to the
online component. The strategy state machine is a bipartite state machine, consisting of strategy
states and result states. An example of such a state machine is shown in Figure 1.</p>
      <p>The architecture of our strategy switching approach distinguishes between an online part
at runtime, as well as an ofline part executing ahead-of-time not beholden to any real-time
constraints. The ofline component prepares the state machine, which is then available for
online playback.</p>
      <p>Online We introduce a strategy switching component, which plays back the strategy state
machine, taking transitions based on observed faults as the application runs. At runtime, this
component selects a single strategy  ahead of every execution of the task set, which becomes
active. The strategy  dictates which tasks run under fault-tolerance (Γ), and which ones do not
(Γ ∖ Γ). Fault-tolerance techniques are typically not a silver bullet solution, and unmitigated
faults may still occur in tasks in Γ. Furthermore, fault-tolerance techniques can often report
the fact that they failed to mitigate a fault (e.g. no consensus in triple modular redundancy).
After executing all tasks, the online component uses this information from the execution of the
task set to select the matching result  from the state machine. This result reflects the success
or fail state, or probability thereof, of each of the tasks. Each possible result  directly maps to
its best successor strategy, which is applied to the next iteration of the task set.</p>
      <sec id="sec-3-1">
        <title>Ofline</title>
        <p>relation Δ from any given result  ∈ ℛ to the next best successor strategy Δ() = . Strategies</p>
        <p>The full set of strategies  ∈  is computed ahead of time, as well as the transition
which are not schedulable are not included in . Furthermore, strategies which are dominated
by other strategies (i.e. all tasks that are protected by one strategy are also protected by another)
are also not considered in .</p>
        <p>We develop an algorithm to compute Δ, such that our choice of Δ provides the lowest
steady-state rate of catastrophic faults.</p>
        <p>Example State Machine An example of such a state machine is shown in Figure 1, where
there are two strategies  = {, }. This state machine is not the optimal state machine for
this task set, but gives a non-trivial example of what such a state machine may look like. Such
a state machine is constructed by the ofline component, and made available to the runtime.
Each strategy protects only one task (either task  or task ). Finally, fault-tolerance may fail
and if it fails this information is available to the strategy switching component. With these
assumptions, each strategy has two potential successors: one where the task under protection
succeeds and one where it fails (e.g.  and  for strategy ). Each result state  has just one
successor strategy Δ(), e.g. Δ() =  in Figure 1. The following sequence of actions may
take place at runtime, as per Figure 1;
1. To start, the runtime initializes by picking a random strategy, say , and applies fault
tolerance accordingly. By picking , task   will be executed with fault-tolerance, while
task   will not.
2. The task set is executed (iteration 1).
3. At the end of the iteration, data from the fault-tolerance applied to task   is used to
select a result. If   fails, the result  is selected. However, let us assume   succeeded,
and select result  accordingly. No information about the success or failure of   is
known.
4.  links to strategy , which is selected.
5. By switching to , task   will be executed with fault-tolerance, while task   will not.
6. The task set is executed (iteration 2).
7. At the end of the iteration, data from the fault-tolerance of task   indicates that task  
has failed, letting us select result .
8.  links to strategy , which is selected.
* . The process continues...</p>
        <p>Note that there is no involved selection procedure for the initial strategy (strategy  in the
example). Our approach is only concerned with obtaining the lowest steady-state fault rate of
the application, which is in no way impacted by the choice of initial strategy.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluating State Machines</title>
      <p>One way to find the optimal state machine transition function Δ for a given  is to enumerate
all possible transition functions, score each transition function, and select the best one. To do
so, we define the state machine scoring function  (Δ). The output of this scoring function
is the steady-state rate of catastrophic faults. Specifically,  (Δ) is the weighted probability
of a catastrophic fault occurring during any iteration of the task set when applying strategy
switching according to the transition function Δ.</p>
      <p>To evaluate  (Δ), we compute the steady-state probability distribution of the transition
function Δ (e.g. using linear algebra). The steady-state probability distribution provides a
probability  (|Δ) of finding the state machine in strategy  at an arbitrarily chosen iteration
of the task set given Δ. Intuitively, as the number of iterations of the task set approaches infinity,
 (|Δ) is the fraction of iterations spent with strategy  active. The sum of these fractions
over all strategies is 1, i.e ∑︀∈  (|Δ) = 1.</p>
      <p>Let  (, Δ) be the probability of a catastrophic fault occurring in strategy . Together with
 (|Δ), we can now compute  (Δ):</p>
      <p>(Δ) = ∑︁  (|Δ) ·  (, Δ)</p>
      <p>Catastrophic faults, by their definition, only occur when an unmitigated fault occurs in two
consecutive iterations of the task set. To compute  (, Δ), we must not just consider , but also
the successor of . The successor of  depends on what result  is being chosen, which is given
by  (|,Δ). Each  has only one successor strategy Δ(). Let Δ() = to . Thus,  (,Δ) can
be expressed in terms of  and Δ() = to as follows:
 (, Δ) =</p>
      <p>∑︁  (|) ·  (, )</p>
      <p>As to is a strategy,  (, to ) is independent of our choice of Δ.  (, to ) is equal to the
probability that any task experiences a fault both in  as well as in to :
∈
∈Δ()
 (, to ) = 1 −
= 1 −
∏︁ 1 −  (catastrophic fault in  |,to )
 ∈Γ
∏︁ 1 −  (fault in  |) ·  (fault in  |to )</p>
      <p>The probabilities  (fault in  |) and  (fault in  |to ) are computed based on the used
fault-tolerance technique, if   ∈ Γto , the rate of faults combined with the WCET of  , and
information available in the result. For example, a task without fault-tolerance may have a
chance of experiencing a fault with  (fault in  ) = 1 − −  ·  , where  is the WCET of  .
Likewise, a task which was run under fault-tolerance according to the previous strategy, and
has succeeded according to result , will have a  (fault in  |) = 0. We consider all SEUs to be
statistically independent events. To keep independence, use the WCET instead of the average
execution time, as the execution times of tasks in the same task set may not be independent.
As such, our steady-state rate of catastrophic faults  (Δ) provides an upper bound to the true
steady-state fault rate of the application.</p>
      <p>Tractability The approach, as presented here, can easily become intractable for even small
task sets. Many state space reduction techniques can be realized in the way the state machines
are enumerated, which can considerably lower the number of elements. If a candidate state
machine Δ contains disconnected sub-graphs, then no unique steady state can be computed,
and as such this candidate can be pruned. Furthermore, the candidate state machine may
contain strategies that are ignored, i.e. there is no path to that strategy from that same strategy.
When this is the case, the steady-state probability  (|Δ) is always 0. For such an ignored
strategy, the successor strategies of its results do not matter, and its many ways of connecting
to successor strategies need not be individually examined. Finally, the domain itself my allow
for significant state space reduction. For example, if a precedence relation is added between
tasks in the task set over which data is communicated, the failure of a preceding task may imply
that the succeeding task is destined to fail. These modifications impact the lattice of strategies,
and reduce the size of the schedulable and non-redundant set of strategies .</p>
      <p>For completeness, we discuss the algorithmic complexity of the (naïve) state machine
construction algorithm as presented in this paper. While the exact size of the state space depends on a
number of factors, such as the number of available strategies , the upper bound is considerable.</p>
      <p>The time complexity of  (Δ) is (|ℛ| · ss(||)), where:
• |ℛ| is the number of results
• || is the number of strategies
• ss() provides an upper bound to the computation of the steady state for  strategies
To compute ss(), the steady-state matrix needs to be computed. For this computation,
the LAPACK driver routine DSEGD1 may be used, with a complexity of (3), resulting in
ss() = 3.</p>
      <p>The number of strategies is up to all combinations of tasks (|| ∈ (|Γ|!)). At the same time,
 (Δ) is computed for every possible state machine. As each result can link to any successor
strategy, this yields up to |||| state machines. Each strategy has ≤ 2|Γ| results, 2|Γ| ∈ (2|Γ|)
and thus || ∈ (|Γ|!· 2|Γ|). Let  = |Γ|, i.e.  is the number of tasks. Then, the final algorithmic
time complexity is given:
(|||| · |ℛ| · ss(||)) =
(|Γ|!|Γ|!· 2|Γ|| · | Γ|! · 2|Γ| · | Γ|!3) =
(!!· 2 · ! · 2 · !3) ⊂
(!!· 2 · 25) ≈
(!!· 2 )</p>
      <p>It must be noted that this is by no means a tight upper bound. In the next section, we will
examine a task set with |Γ| =  = 3. This example requires examination of only 64 candidate
1http://www.netlib.org/lapack/lug/node71.html
state machines, even though such an  value would appear to be completely intractable as per
the above formulation.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Example</title>
      <p>Let us consider an example task set Γ = { ,  ,   }, with WCET values  = 20,  = 10
and  = 10. We set  = 10− 3 for this example. For brevity, we abbreviate the probability of
no fault in task   when not using fault-tolerance as :

 (no fault in  |no FT) = −  ·  = − 103
=</p>
      <p>
        As fault-tolerance technique, we use Triple Modular Redundancy (TMR). We require a majority
for TMR to succeed, and assume there is no other way TMR can fail (e.g. assume no unmitigated
faults during voting). A majority for TMR requires two or more copies of the task to be in
agreement. Like , we abbreviate the probability of no fault in task   when using fault-tolerance
as :
 (no fault in  |TMR) = 3 +
︂(
        <xref ref-type="bibr" rid="ref3">3</xref>
        )︂
      </p>
      <p>2 2 · (1 − )
=</p>
      <p>With the WCET for each task available, we can then compute the  and  values for all
tasks:
  :  = − 20 = 0.9802  = 3 + 3(1 − )2 = 0.9988
  :  = − 10 = 0.9901  = 3 + 3(1 − )2 = 0.9997
  :  = − 10 = 0.9901  = 3 + 3(1 −  )2 = 0.9997
Enumerating strategies For our chosen task set Γ, there are eight possible subsets and as
such eight possible strategies:
∅ : Γ∅ = {}
 : Γ = { }
 : Γ = { }
 : Γ = {  }
, : Γ, = { ,  }
, : Γ, = { ,   }
, : Γ, = { ,   }
,, : Γ,, = { ,  ,   }
Γ ∖ Γ∅ = { ,  ,   }
Γ ∖ Γ = { ,   }
Γ ∖ Γ = { ,   }
Γ ∖ Γ = { ,  }
Γ ∖ Γ, = {  }
Γ ∖ Γ, = { }
Γ ∖ Γ, = { }
Γ ∖ Γ,, = {}</p>
      <p>Not all of these may be schedulable. If ,, was schedulable there would be no reason to
use strategy switching. Likewise, if only ∅ was schedulable, there is no way to deploy limited
Unschedulable
Schedulable
Redundant
fault-tolerance to this task set. For the example, we assume that only strategies  , , , 
∅
and , are determined to be schedulable by a scheduler. The strategies form a lattice with
 as the greatest lower bound, and ,, as the least upper bound. The lattice is shown in
, is schedulable. As such, there is no need to consider them. Let  = {, , }.</p>
      <p>TMR can fail to reach a consensus, and then this information is available to the strategy
selection component in the form of a result. As such, in this example each strategy  has 2|Γ|
possible results. For  this is  and , and for , this is , , , , , and , . These
results are shown in Figure 3a, which shows the strategy state machine without successor
relations for each result.</p>
      <sec id="sec-5-1">
        <title>Evaluating transitions functions</title>
        <p>There are |||ℛ| = 26 = 64 possible transition functions
Δ, as all of the six results needs to be linked to one of the two strategies. We will not enumerate
all of them, but two examples are shown in Figure 3b and Figure 3c.</p>
        <p>In this example, we will show how to evaluate  (Δ) for the state machine in Figure 3b. Let
this be Δ3b.</p>
        <p>1. Computing the steady-state probability of Δ3b. The result states are “urgent” states, in
which no time can pass. In other words, the result states themselves are not relevant for
the steady-state (i.e. the fraction of iterations of the task set spent in state  ∈ ℛ is 0).
As such, we remove these states for the steady-state computation, linking each strategy
to multiple successor strategies. The resulting state machine is a Discrete-Time Markov
Chain, where each time step is an iteration of the task set Γ. The steady-state can be
computed using linear algebra. Let  be the transition matrix for Δ3b with all results
removed:
 =
︂[</p>
        <p>(|)
 (, ∪ , ∪ , ∪ , |, )
 (|)︂]
(b) Switching state machine choosing between  and , based on the result</p>
        <p>(c) Degenerate state machine always choosing 
We can compute the steady-state vector  by solving  =  .</p>
        <p>=  ≈
Note that a unique steady-state vector  need not exist if there are two or more parts of
the state machine that are disconnected. For example, all results of  may link back to ,
while all results of , link back to , . In such a case, the steady-state is dependent
on the initial state. However, we can safely disregard state machines that depend on the
initial condition, as other state machines with identical steady-state behavior must exist
as well. If staying in  provides the lowest  (Δ) value, then a state machine like shown
in Figure 3c would yield the exact same  (Δ) value as a state machine where  and ,
is disconnected with  as the initial state.
2. Compute, for each result → strategy transition, the  (,to) catastrophic fault probability.
3. Using the calculated  (,to) values, compute a  ( ∈ ) value for each strategy.
 () =(|) ·  (,,)+
(|) ·  (,) =
 (,) =(,|,) ·  (,,)+
(,|,) ·  (,,,)+
(,|,) ·  (,,,)+
(,|,) ·  (,,,) =
29.69985 · 10− 6
 () is the probability that a catastrophic fault occurs by selecting strategy . The bad
score of , is not surprising: the Δ3b state machine is not particularly clever as it
chooses to switch to strategy  even with the knowledge that   or   has failed.
,</p>
        <p>...









n/a
ΔΔΔΔΔΔΔΔΔΔ666665555∅324108796      335321213n..9999299/56a.......2202094413221253801660645965254
1 Example from Figure 3b (Δ2 = Δ3b)
2 Best (lowest) steady-state fault rate
3 No unique steady-state exists for this transition function
4  (Δ) of the task set without any form of fault-tolerance
5 Does not strategy switch, i.e. always stays in the same strategy
4. Finally, compute  (Δ3b) by taking the weighted average of all  () values by multiplying
 () for each strategy by the steady-state probability  (|Δ3b) of being in that strategy.
 (Δ3b) =  () ·  () +  (, ) ·  (, )</p>
        <p>This process is repeated for all 64 possible transition functions. For brevity, we will not show
the evaluation of every single transition function here. Instead, Table 2 shows a subset of all
possible Δ values. The columns headed with a result show to which successor strategy that
result maps. For example, for the first evaluated state machine Δ1() = . The best state
machine is also revealed, listed as Δ58. Finally, the last item in the table is Δ∅, added as a
reference. Δ∅ is the state machine with a single strategy ∅ which protects no tasks, i.e. the
behavior obtained when not using any form of fault-tolerance. The best state machine is a
38.3-fold improvement over this default. That is, the best state machine ofers a 38.3 times
lower rate of catastrophic failure when compared to Δ∅. Finally, the table also shows two state
machines which perform no strategy switching, but still use fault-tolerance. These are Δ1 and
Δ64, staying in  and , respectively. The strategy switching solution Δ58 ofers a 12.9
times lower rate of catastrophic faults when compared to the best of these static solutions.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Extending the fault and task model</title>
      <sec id="sec-6-1">
        <title>6.1. Adding precedence relations</title>
        <p>Directed Acyclic Graph scheduling is an extension to our task model where precedence relations
exist between tasks. While this has no impact on the strategy switching algorithm directly as it
need not concern itself with scheduling, it may add dependence to the success probability of a
task. For this extension, we assume that when a predecessor task fails (produces incorrect data),
all successor tasks fail as well due to operating on incorrect data, even when not experiencing a
SEU.</p>
        <p>To support this assumption, the  (, to) cost function needs to be modified to consider
precedence relations. Specifically,  (fault in  |) and  (fault in  |to) must be replaced to not
just consider if   failed in isolation. Let us call this precedence-aware probability the probability
of incorrect output.</p>
        <p>(incorrect output  |) =  (fault in  |)·
∏︁</p>
        <p>(fault in   |)
 ∈pred( )</p>
        <p>Here,  ∈  ∪ ℛ is either a strategy or a result. pred ( ) is the set of all direct and indirect
predecessors of task  . This probability also includes the behavior of any predecessor tasks.</p>
        <p>Figure 4 shows an example task set with a precedence relation. Consider the task set in
this example with two strategies:  = {, }, protecting task   and   respectively. Each
strategy has two outcomes, hence ℛ = {, , , }. Let us consider  (fault in  |):
the result  does not directly communicate anything about the state of task , however due to
the precedence relation we know it efectively failed. As such,  (incorrect output  |) = 1.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Selective fault-tolerance and criticality</title>
        <p>The task model can be extended to support heterogeneity in the ability of tasks to tolerate
non-consecutive faults. Let Γ be the set of tasks which cannot tolerate non-consecutive
faults. Then, we update  (catastrophic fault in  |,to ) to consider non-consecutive faults as
catastrophic faults when   ∈ Γ .</p>
        <p>(catastrophic fault in  |,to =)</p>
        <p>(fault in  |)</p>
        <p>The goal of the strategy state machine is to minimize the rate of catastrophic faults. This
may mean that the rate of catastrophic faults is so low that even a single fault is unlikely to
occur across the lifetime of the system. However, when used in soft real-time deployments,
a “catastrophic fault” may not be catastrophic while still undesirable. In such a deployment,
catastrophic faults may be acceptable. The impact of a task experiencing such a catastrophic
fault may not be the same across all tasks. As such, a priority function (  ∈ Γ) can be
integrated into  (, to ).</p>
        <p>(, to ) =
1 −
∏︁ (1 −  (catastrophic fault in  |,to ))( )</p>
        <p>The ( ) function assigns relative priority, where higher is better. A task   with ( ) = 2
has twice the importance as a task   with (  ) = 1. Note that this change afects the meaning of
the output  (Δ), which is no longer the rate of catastrophic faults. If the impact of a catastrophic
fault in a task with ( ) = 2 is equivalent to two catastrophic faults in another task with
(  ) = 1, then the final rate  (Δ) can be seen as the rate of catastrophic faults normalized to
faults in   .</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Related work</title>
      <p>(, ) constraints have been used before in the domain of real-time scheduling. Choi et al. [7]
proposed a scheduler, together with an eficient schedulability algorithm for a sporadic task
set with tasks under (, ) constraints. This scheduler allows for scheduling task sets that
would normally not be schedulable, but utilizing their (, ) constraints allows them to be
scheduled.</p>
      <p>Chen et al. [8] proposed a solution that similar to ours. Their method ofers fault-tolerance
with the goal of reducing the efective fault rate as well as lowering energy consumption.
Chen et al. proposes a static scheduling technique called Static Pattern-Based Reliable Execution,
ensuring each (, ) constraint is respected in the presence of transient faults. Furthermore,
they propose delaying the execution of their static pattern if no fault is detected at runtime,
opportunistically running more unprotected instances of the task with the goal of saving energy.
However, if the static pattern is found to be unschedulable as per their schedulability test,
their implementation is unable to provide a schedule that minimizes the fault rate for a given
resource-constrained real-time system. While their approach ofers more flexibility in the
task model (specifically the support for (, ) constraints with  &gt; 2), it does not consider
that fault mitigation may fail. Our approach optimally lowers the fault rate, regardless of the
hardware constrains. Furthermore, our approach recognizes that fault mitigation may fail, and
includes this in the calculation for lowering the fault rate.</p>
      <p>Gujarati et al. [9] contributed a technique for measuring the fault rate of an application
with tasks under (, ) constraints. Their technique provides an upper bound for the fault
probability per iteration of a Fault-tolerant Single-Input Single-Output (FT-SISO) control loop,
similar to our  (Δ) output on a task set with precedence relations. Their technique hopes
to provide transparency to system designers, allowing analyzing the impact on the reliablity
when changing the hardware or software. However, while their approach is aware of (, )
constraints, it does not provide schedules that utilize such constraints. Instead, it merely includes
them in the reliability calculation.</p>
      <p>The domain of strategy switching shares some aspects with Mixed-Criticality (MC) systems.
In an MC system, the system switches between diferent levels of criticality depending on the
operating conditions of the system. Tasks are assigned a criticality level, and when the system
criticality is higher than that of the task, the task is not scheduled to guarantee the successful
and timely execution of tasks with a higher criticality level. Pathan [10] combines MC with
fault-tolerance against transient faults. As is typical in MC research, as the level of criticality
increases, the pessimism increases. Pathan increases the maximum fault rate when switching
to a higher level of criticality. In our approach we do not vary the pessimism of any parameter.
Instead, we assume the  parameter provides a suitable upper bound to the fault rate in all
conditions. Our approach ofers some aspects typically not found in MC systems: while one
could argue that each strategy is really a criticality level, it is a criticality level applied to a
subset of the tasks (specifically Γ). Finally, the approach by Pathan requires bounding the
number of faults that can occur in any window. As such, passing their suficient schedulability
test will (under their fault model) guarantee the system will never experience a fault.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>In this paper, we have shown how strategy switching can be used to improve fault-tolerance
for resource-constrained systems. Our method makes efective use of the ability to vary which
tasks receive fault-tolerance. It considers at the start of every iteration of the task set what the
best set of tasks is to protect. We have shown how our method computes the optimal strategy
state machine for any given task set, minimizing the rate of catastrophic faults. We have also
shown the flexibility of our method to be extended to support new task and fault models.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Future work</title>
      <p>In future work, we hope to improve the tractability of our algorithm by both state space reduction
algorithms as well as by using heuristics.</p>
      <p>Furthermore, we hope to extend the fault model to distinguish between deadline misses
and incorrect results. We also hope to integrate tasks with (, ) constraints with  &gt; 2.
Additionally, we hope to integrate the natural ability of tasks to detect faults into our task model,
as a SEU may for example lead to a segfault. Finally, we hope to validate our approach using
simulation-based analysis.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>The presentation of this paper was considerably improved in response to comments provided
by the anonymous reviewers, and we gratefully acknowledge their insights and assistance.</p>
      <p>This project has received funding from the European Union’s Horizon 2020 research and
innovation program under grant agreement No. 871259 (ADMORPH project). Additionally, this
work is partially supported by CERCIRAS COST Action CA19135 funded by COST Association.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Oz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Arslan</surname>
          </string-name>
          ,
          <article-title>A survey on multithreading alternatives for soft error fault tolerance</article-title>
          ,
          <source>ACM Computing Surveys</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Lyons</surname>
          </string-name>
          , W. Vanderkulk,
          <article-title>The use of triple-modular redundancy to improve computer reliability</article-title>
          ,
          <source>IBM journal of research and development 6</source>
          (
          <year>1962</year>
          )
          <fpage>200</fpage>
          -
          <lpage>209</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Reis</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. I. August</surname>
          </string-name>
          ,
          <article-title>Automatic instruction-level software-only recovery</article-title>
          ,
          <source>in: International Conference on Dependable Systems and Networks (DSN'06)</source>
          , IEEE,
          <year>2006</year>
          , pp.
          <fpage>83</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Asghari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Binesh</given-names>
            <surname>Marvasti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rahmani</surname>
          </string-name>
          ,
          <article-title>Enhancing transient fault tolerance in embedded systems through an OS task level redundancy approach</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          <volume>87</volume>
          (
          <year>2018</year>
          )
          <fpage>58</fpage>
          -
          <lpage>65</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.future.
          <year>2018</year>
          .
          <volume>04</volume>
          .049.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bernat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liamosi</surname>
          </string-name>
          ,
          <article-title>Weakly hard real-time systems</article-title>
          ,
          <source>IEEE transactions on Computers</source>
          <volume>50</volume>
          (
          <year>2001</year>
          )
          <fpage>308</fpage>
          -
          <lpage>321</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Broster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rodriguez-Navas</surname>
          </string-name>
          ,
          <article-title>Timing analysis of real-time communication under electromagnetic interference</article-title>
          ,
          <source>Real-Time Systems</source>
          <volume>30</volume>
          (
          <year>2005</year>
          )
          <fpage>55</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Job-class-level fixed priority scheduling of weakly-hard real-time systems</article-title>
          ,
          <source>in: 2019 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>241</fpage>
          -
          <lpage>253</lpage>
          . doi:
          <volume>10</volume>
          .1109/RTAS.
          <year>2019</year>
          .
          <volume>00028</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>K.-H. Chen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Bönninghof</surname>
            ,
            <given-names>J.-J.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Marwedel</surname>
          </string-name>
          ,
          <article-title>Compensate or ignore? Meeting control robustness requirements through adaptive soft-error handling</article-title>
          ,
          <source>in: Proceedings of the 17th ACM SIGPLAN/SIGBED Conference on Languages, Compilers</source>
          , Tools, and
          <article-title>Theory for Embedded Systems</article-title>
          , LCTES 2016,
          <article-title>Association for Computing Machinery</article-title>
          , New York, NY, USA,
          <year>2016</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>91</lpage>
          . doi:
          <volume>10</volume>
          .1145/2907950.2907952.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gujarati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nasri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. B.</given-names>
            <surname>Brandenburg</surname>
          </string-name>
          ,
          <article-title>Quantifying the resiliency of fail-operational realtime networked control systems</article-title>
          , in: S. Altmeyer (Ed.),
          <source>30th Euromicro Conference on RealTime Systems (ECRTS</source>
          <year>2018</year>
          ), volume
          <volume>106</volume>
          of Leibniz International Proceedings in Informatics (LIPIcs),
          <source>Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik</source>
          , Dagstuhl, Germany,
          <year>2018</year>
          , pp.
          <volume>16</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          :
          <fpage>24</fpage>
          . doi:
          <volume>10</volume>
          .4230/LIPIcs.ECRTS.
          <year>2018</year>
          .
          <volume>16</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>R. M. Pathan</surname>
          </string-name>
          ,
          <article-title>Fault-tolerant and real-time scheduling for mixed-criticality systems</article-title>
          ,
          <source>RealTime Systems</source>
          <volume>50</volume>
          (
          <year>2014</year>
          )
          <fpage>509</fpage>
          -
          <lpage>547</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>