=Paper=
{{Paper
|id=Vol-3145/short05
|storemode=property
|title=Strategy Switching: Smart Fault-tolerance for Resource-constrained Real-time Applications
|pdfUrl=https://ceur-ws.org/Vol-3145/paper05.pdf
|volume=Vol-3145
|authors=Lukas Miedema,Benjamin Rouxel,Clemens Grelck
|dblpUrl=https://dblp.org/rec/conf/cerciras/MiedemaRG21
}}
==Strategy Switching: Smart Fault-tolerance for Resource-constrained Real-time Applications==
Strategy Switching: Smart Fault-tolerance for
Resource-constrained Real-time Applications
Lukas Miedema1 , Benjamin Rouxel1,2 and Clemens Grelck1
1
University of Amsterdam (UvA), Amsterdam, Netherlands
2
University of Modena and Reggio Emilia (Unimore), Modena, Italy
Abstract
Software-based fault-tolerance is an attractive alternative to hardware-based fault-tolerance, as it allows
for the use of cheap Commercial Off The Shelf hardware. However, software-based fault-tolerance
comes at a cost, requiring computing the same results multiple times to allow for the detection and
mitigation of faults. Resource-constrained real-time applications may not be able to afford this cost. At
the same time, the domain of a real-time task may allow it to tolerate a fault, provided it does not occur
in consecutive iterations of the task. In this paper, we introduce a new way to deploy fault-tolerance
called strategy switching. Our method targets Single Event Upsets by running different subsets of tasks
under fault-tolerance at different points in time. We do not bound the number of faults in a window,
nor does our method assume that tasks under fault-tolerance cannot still fail. Our technique does not
require a minimal amount of additional compute resources for fault-tolerance. Instead, our method
optimally utilizes any available compute resources for fault-tolerance for resource-constrained real-time
applications.
Keywords
Cyber-physical Systems, Resource Constraints, Fault-tolerance, Single Event Upsets, Weakly Hard Real-
time
1. Introduction
As transistor density increases and gate voltages decreases, the frequency of transient faults or
single event upsets (SEUs) increases [1]. Hence, the need for fault-tolerance against these types
of faults is growing.
Fault-tolerance techniques can either be implemented in hardware or in software. Software-
based fault-tolerance is attractive due to its ability to protect workloads on Commercial Off
The Shelf (COTS) hardware. However, providing general-purpose fault-tolerance against SEUs
typically requires redundant execution, often in the form of Triple Modular Redundancy [2]
(TMR). TMR uses two-out-of-three voting to obtain a majority and mitigate the effects of a
SEU. TMR can be implemented at different levels of granularity, e.g. at the compiler level like
SWIFT-R [3], but also at the OS task level as implemented in OS Task Level Redundancy [4].
However, the overhead remains: instrumenting a binary with SWIFT-R increases its execution
time by 99 percent. As such, constrained real-time systems may have insufficient processing
CERCIRAS WS01: 1st Workshop on Connecting Education and Research Communities for an Innovative Resource
Aware Society
l.miedema@uva.nl (L. Miedema); benjamin.rouxel@unimore.it (B. Rouxel); c.grelck@uva.nl (C. Grelck)
ยฉ 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR
Workshop
Proceedings
http://ceur-ws.org
ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org)
resources to allow all tasks to run with fault-tolerance. However, for applications structured as
a set of periodic tasks, software-based fault-tolerance allows the application of fault-tolerance
to only a subset of the task set.
Control tasks may be able to tolerate non-consecutive deadline misses, which has led to the
adoption of the weakly hard model [5]. A task that is unable to provide a result may not result in
catastrophe, provided that in the next period it can provide a result. Per the weakly hard model,
each task ๐ has an (๐๐ , ๐๐ ) constraint, indicating that the task must complete at least ๐๐ times
successfully out of every ๐๐ times. ๐๐ is said to be the window size. We use this (๐๐ , ๐๐ ) constraint
with ๐๐ < ๐๐ to deliver more effective fault-tolerance to resource-constrained systems. For
example, consider a task set with just two tasks A and B, where only one task can be run under
fault-tolerance at a time. Furthermore, both task A and B can tolerate non-consecutive deadline
misses. Running task A under fault-tolerance and task B without would leave task B vulnerable
to SEUs. However, we could more optimally make use of the scarce fault-tolerance by switching
between protecting task A and task B in successive iterations of the task set. Tasks under
fault-tolerance may still fail (e.g. TMR reaches no majority), and these cases can be detected.
When the fault-tolerance technique has failed to protect task A, task A should be protected
again in the next iteration of the task set to ensure it does not experience a consecutive fault.
Contribution We propose a new approach for improving fault-tolerance for real-time ap-
plications running on resource-constrained systems by strategy switching. We minimize the
effective unmitigated fault-rate by selecting which tasks are to be run under fault-tolerance.
Our approach recognizes that the importance of protecting a task may change over time due
to earlier faults or lack thereof, and as such runs different tasks under fault-tolerance at dif-
ferent times. By exhaustively searching all patterns in which fault-tolerance can be applied,
our method optimally utilizes limited available computational resources for fault-tolerance in
resource-constrained real-time applications.
Organization In section 2 we introduce our task and fault model. Our method uses a state
machine, which is introduced in section 3. In section 4, we formalize the construction of the state
machine. Then, in section 5, we discuss how our method can lower the consecutive fault rate
for an example task set. We explain the flexibility of our method by extending it to alternative
task and fault models in section 6. Several pieces of related work exist, which are covered in
section 7. The paper is concluded in section 8, and finally in section 9 we discuss various future
directions for this technique.
2. System models
Task model We assume a set of periodic tasks ฮ = {๐1 ...๐๐ } with a single, global period
and deadline ๐ท such that the period is equal to or larger than the deadline (no pipelining).
Furthermore, we initially assume that each task can afford one (non-consecutive) deadline miss
(an (๐๐ , ๐๐ ) constraint of (1,2)), and that each task is equally important. In section 6, we will
discuss how some of these assumptions can be relaxed to support a wide variety of task models.
Fault model We use the Poisson distribution as an approximation for the worst-case fault
rate of SEUs, which was argued to be a good approximation by Broster et al. [6]. We do not
assume universal fault detection: only when the task runs under a fault-tolerance scheme can a
fault be detected and mitigated. When a task does not run with fault-tolerance, it is not known
whether or not it succeeded. We use the term catastrophic fault to describe an unmitigated
fault occurring in two consecutive iterations of a task that can tolerate a single unmitigated
fault, i.e. the task ๐ has an (๐๐ , ๐๐ ) = (1,2) constraint. Furthermore, a catastrophic fault also
occurs when a task that cannot tolerate non-consecutive faults experiences an unmitigated
fault, i.e. the task has a (1,1) constraint. We do not consider constraints beyond ๐๐ = 2 in this
paper.
Fault mitigation We assume the presence and implementation of a particular fault-tolerance
scheme, and that any task can be run under that scheme. In this paper, we assume that
SEUs always go undetected in tasks not under fault-tolerance. Finally, we assume the scheme
implements fault-detection and fault-mitigation. Our model allows the fault mitigation to fail
(e.g. due to successive SEUs during both replicas of a task under TMR), but assumes that it is
known when fault mitigation fails.
Other definitions Given the complexity and number of symbols used in this paper, a table
of all symbols and terms has been compiled in Table 1. Each symbol or term used will be defined
prior to use, as well as being listed in the table.
3. Strategy Switching State Machine
To both swiftly select a new subset of the task set to run under fault-tolerance while also making
optimal decisions, we precompute for each situation the next best subset of tasks to run under
fault-tolerance. The result of this is the strategy state machine, which is made available to the
online component. The strategy state machine is a bipartite state machine, consisting of strategy
states and result states. An example of such a state machine is shown in Figure 1.
The architecture of our strategy switching approach distinguishes between an online part
at runtime, as well as an offline part executing ahead-of-time not beholden to any real-time
constraints. The offline component prepares the state machine, which is then available for
online playback.
Online We introduce a strategy switching component, which plays back the strategy state
machine, taking transitions based on observed faults as the application runs. At runtime, this
component selects a single strategy ๐ ahead of every execution of the task set, which becomes
active. The strategy ๐ dictates which tasks run under fault-tolerance (ฮ๐ ), and which ones do not
(ฮ โ ฮ๐ ). Fault-tolerance techniques are typically not a silver bullet solution, and unmitigated
faults may still occur in tasks in ฮ๐ . Furthermore, fault-tolerance techniques can often report
the fact that they failed to mitigate a fault (e.g. no consensus in triple modular redundancy).
After executing all tasks, the online component uses this information from the execution of the
task set to select the matching result ๐ from the state machine. This result reflects the success
Table 1
Definitions of used symbols and terms
Item Meaning
Task model
ฮ Set of all tasks, ฮ = {๐1 ...๐๐ }
๐๐ โ ฮ Task ๐ โ ฮ, e.g. ๐๐ด is task A
๐ถ๐ Worst Case Execution Time (WCET) of task i
๐ท Global deadline (shared by all tasks)
Fault model
๐ Fault rate (Poisson)
(๐๐ , ๐๐ ) Constraint indicating task ๐ has to execute successfully for at least ๐๐ iterations out of every ๐๐
iterations
Unmitigated fault Fault in a task not mitigated by a fault-tolerance technique
Catastrophic fault Unmitigated fault that leads to the (๐๐ , ๐๐ ) constraint of the task being violated
States in the state machine
๐ฎ Set of all strategies
๐ โ๐ฎ A strategy
ฮ๐ โ ฮ The tasks protected under strategy ๐
๐ ๐ด,๐ต A strategy protecting task A and B, i.e. ฮ๐ ๐ด,๐ต = {๐๐ด , ๐๐ต }
โ Set of all results
๐โโ A result
๐๐ด,๐ต A result where task A (๐๐ด ) succeeded and task B (๐๐ต ) failed
Transitions in the state machine
ฮ The transition function for the strategy state machine
ฮ(๐ ) The set of successors of strategy ๐ as per the transition function ฮ. Due to the bipartite nature of
the state machine, this is always a set of results.
ฮ(๐) The successor of result ๐ as per transition function ฮ. Always a single element, and due to the
bipartite nature of ฮ it is always a strategy.
Scoring function
๐ฟ(...) Scoring function (lower is better), provides steady-state catastrophic fault rate, i.e. the average prob-
ability of a catastrophic fault for any iteration of the task set
๐ฟ(ฮ) Scoring function applied to the entire state machine
๐ฟ(๐ , ฮ) Probability of a catastrophic fault when leaving strategy ๐ considering transition function ฮ
๐ฟ(๐, ฮ) Probability of a catastrophic fault when leaving result ๐ considering transition function ฮ
๐ฟ(๐,๐ ) Probability of a catastrophic fault when transitioning from ๐ to ๐
Probabilities
๐๐ Probability of an unmitigated fault in task ๐๐ when no fault-tolerance is used
๐๐ Probability of an unmitigated fault in task ๐๐ when fault-tolerance is used
๐๐ด ๐๐ต
๐ด ๐ต
๐ ๐ด ๐ ๐ต
๐ด ๐ต
๐๐ด ๐๐ต
Figure 1: A strategy state machine for ฮ = {๐๐ด ,๐๐ต }
or fail state, or probability thereof, of each of the tasks. Each possible result ๐ directly maps to
its best successor strategy, which is applied to the next iteration of the task set.
Offline The full set of strategies ๐ โ ๐ฎ is computed ahead of time, as well as the transition
relation ฮ from any given result ๐ โ โ to the next best successor strategy ฮ(๐) = ๐ . Strategies
which are not schedulable are not included in ๐ฎ. Furthermore, strategies which are dominated
by other strategies (i.e. all tasks that are protected by one strategy are also protected by another)
are also not considered in ๐ฎ.
We develop an algorithm to compute ฮ, such that our choice of ฮ provides the lowest
steady-state rate of catastrophic faults.
Example State Machine An example of such a state machine is shown in Figure 1, where
there are two strategies ๐ฎ = {๐ ๐ด , ๐ ๐ต }. This state machine is not the optimal state machine for
this task set, but gives a non-trivial example of what such a state machine may look like. Such
a state machine is constructed by the offline component, and made available to the runtime.
Each strategy protects only one task (either task ๐ด or task ๐ต). Finally, fault-tolerance may fail
and if it fails this information is available to the strategy switching component. With these
assumptions, each strategy has two potential successors: one where the task under protection
succeeds and one where it fails (e.g. ๐๐ด and ๐๐ด for strategy ๐ ๐ด ). Each result state ๐ has just one
successor strategy ฮ(๐), e.g. ฮ(๐๐ด ) = ๐ ๐ต in Figure 1. The following sequence of actions may
take place at runtime, as per Figure 1;
1. To start, the runtime initializes by picking a random strategy, say ๐ ๐ต , and applies fault
tolerance accordingly. By picking ๐ ๐ต , task ๐๐ต will be executed with fault-tolerance, while
task ๐๐ด will not.
2. The task set is executed (iteration 1).
3. At the end of the iteration, data from the fault-tolerance applied to task ๐๐ต is used to
select a result. If ๐๐ต fails, the result ๐๐ต is selected. However, let us assume ๐๐ต succeeded,
and select result ๐๐ด accordingly. No information about the success or failure of ๐๐ด is
known.
4. ๐๐ต links to strategy ๐ ๐ด , which is selected.
5. By switching to ๐ ๐ด , task ๐๐ด will be executed with fault-tolerance, while task ๐๐ต will not.
6. The task set is executed (iteration 2).
7. At the end of the iteration, data from the fault-tolerance of task ๐๐ด indicates that task ๐๐ด
has failed, letting us select result ๐๐ด .
8. ๐๐ด links to strategy ๐ ๐ด , which is selected.
*. The process continues...
Note that there is no involved selection procedure for the initial strategy (strategy ๐ ๐ต in the
example). Our approach is only concerned with obtaining the lowest steady-state fault rate of
the application, which is in no way impacted by the choice of initial strategy.
4. Evaluating State Machines
One way to find the optimal state machine transition function ฮ for a given ๐ฎ is to enumerate
all possible transition functions, score each transition function, and select the best one. To do
so, we define the state machine scoring function ๐ฟ(ฮ). The output of this scoring function
is the steady-state rate of catastrophic faults. Specifically, ๐ฟ(ฮ) is the weighted probability
of a catastrophic fault occurring during any iteration of the task set when applying strategy
switching according to the transition function ฮ.
To evaluate ๐ฟ(ฮ), we compute the steady-state probability distribution of the transition
function ฮ (e.g. using linear algebra). The steady-state probability distribution provides a
probability ๐ (๐ |ฮ) of finding the state machine in strategy ๐ at an arbitrarily chosen iteration
of the task set given ฮ. Intuitively, as the number of iterations of the task set approaches infinity,
๐ (๐ |ฮ) is the fraction of โ๏ธiterations spent with strategy ๐ active. The sum of these fractions
over all strategies is 1, i.e ๐ โ๐ฎ ๐ (๐ |ฮ) = 1.
Let ๐ฟ(๐ , ฮ) be the probability of a catastrophic fault occurring in strategy ๐ . Together with
๐ (๐ |ฮ), we can now compute ๐ฟ(ฮ):
โ๏ธ
๐ฟ(ฮ) = ๐ (๐ |ฮ) ยท ๐ฟ(๐ , ฮ)
๐ โ๐ฎ
Catastrophic faults, by their definition, only occur when an unmitigated fault occurs in two
consecutive iterations of the task set. To compute ๐ฟ(๐ , ฮ), we must not just consider ๐ , but also
the successor of ๐ . The successor of ๐ depends on what result ๐ is being chosen, which is given
by ๐ (๐|๐ ,ฮ). Each ๐ has only one successor strategy ฮ(๐). Let ฮ(๐) = ๐ to . Thus, ๐ฟ(๐ ,ฮ) can
be expressed in terms of ๐ and ฮ(๐) = ๐ to as follows:
โ๏ธ
๐ฟ(๐ , ฮ) = ๐ (๐|๐ ) ยท ๐ฟ(๐, ๐ ๐ก ๐)
๐โฮ(๐ )
As ๐ to is a strategy, ๐ฟ(๐, ๐ to ) is independent of our choice of ฮ. ๐ฟ(๐, ๐ to ) is equal to the
probability that any task experiences a fault both in ๐ as well as in ๐ to :
โ๏ธ
๐ฟ(๐, ๐ to ) = 1 โ 1 โ ๐ (catastrophic fault in ๐๐ |๐,๐ to )
๐๐ โฮ
โ๏ธ
=1โ 1 โ ๐ (fault in ๐๐ |๐) ยท ๐ (fault in ๐๐ |๐ to )
๐๐ โฮ
The probabilities ๐ (fault in ๐๐ |๐) and ๐ (fault in ๐๐ |๐ to ) are computed based on the used
fault-tolerance technique, if ๐๐ โ ฮ๐ to , the rate of faults combined with the WCET of ๐๐ , and
information available in the result. For example, a task without fault-tolerance may have a
chance of experiencing a fault with ๐ (fault in ๐๐ ) = 1 โ ๐โ๐ยท๐ถ๐ , where ๐ถ๐ is the WCET of ๐๐ .
Likewise, a task which was run under fault-tolerance according to the previous strategy, and
has succeeded according to result ๐, will have a ๐ (fault in ๐๐ |๐) = 0. We consider all SEUs to be
statistically independent events. To keep independence, use the WCET instead of the average
execution time, as the execution times of tasks in the same task set may not be independent.
As such, our steady-state rate of catastrophic faults ๐ฟ(ฮ) provides an upper bound to the true
steady-state fault rate of the application.
Tractability The approach, as presented here, can easily become intractable for even small
task sets. Many state space reduction techniques can be realized in the way the state machines
are enumerated, which can considerably lower the number of elements. If a candidate state
machine ฮ contains disconnected sub-graphs, then no unique steady state can be computed,
and as such this candidate can be pruned. Furthermore, the candidate state machine may
contain strategies that are ignored, i.e. there is no path to that strategy from that same strategy.
When this is the case, the steady-state probability ๐ (๐ |ฮ) is always 0. For such an ignored
strategy, the successor strategies of its results do not matter, and its many ways of connecting
to successor strategies need not be individually examined. Finally, the domain itself my allow
for significant state space reduction. For example, if a precedence relation is added between
tasks in the task set over which data is communicated, the failure of a preceding task may imply
that the succeeding task is destined to fail. These modifications impact the lattice of strategies,
and reduce the size of the schedulable and non-redundant set of strategies ๐ฎ.
For completeness, we discuss the algorithmic complexity of the (naรฏve) state machine construc-
tion algorithm as presented in this paper. While the exact size of the state space depends on a
number of factors, such as the number of available strategies ๐ฎ, the upper bound is considerable.
The time complexity of ๐ฟ(ฮ) is ๐ช(|โ| ยท ss(|๐ฎ|)), where:
โข |โ| is the number of results
โข |๐ฎ| is the number of strategies
โข ss(๐) provides an upper bound to the computation of the steady state for ๐ strategies
To compute ss(๐), the steady-state matrix needs to be computed. For this computation,
the LAPACK driver routine DSEGD1 may be used, with a complexity of ๐ช(๐3 ), resulting in
ss(๐) = ๐3 .
The number of strategies is up to all combinations of tasks (|๐| โ ๐ช(|ฮ|!)). At the same time,
๐ฟ(ฮ) is computed for every possible state machine. As each result can link to any successor
strategy, this yields up to |๐||๐
| state machines. Each strategy has โค 2|ฮ๐ | results, 2|ฮ๐ | โ ๐ช(2|ฮ| )
and thus |๐
| โ ๐ช(|ฮ|!ยท2|ฮ| ). Let ๐ = |ฮ|, i.e. ๐ is the number of tasks. Then, the final algorithmic
time complexity is given:
๐ช(|๐||๐
| ยท |โ| ยท ss(|๐ฎ|)) =
|ฮ|
๐ช(|ฮ|!|ฮ|!ยท2 | ยท |ฮ|! ยท 2|ฮ| ยท |ฮ|!3 ) =
๐
๐ช(๐!๐!ยท2 ยท ๐! ยท 2๐ ยท ๐!3 ) โ
๐
๐ช(๐!๐!ยท2 ยท 2๐5 ) โ
๐
๐ช(๐!๐!ยท2 )
It must be noted that this is by no means a tight upper bound. In the next section, we will
examine a task set with |ฮ| = ๐ = 3. This example requires examination of only 64 candidate
1
http://www.netlib.org/lapack/lug/node71.html
state machines, even though such an ๐ value would appear to be completely intractable as per
the above formulation.
5. Example
Let us consider an example task set ฮ = {๐๐ด , ๐๐ต , ๐๐ถ }, with WCET values ๐ถ๐ด = 20, ๐ถ๐ต = 10
and ๐ถ๐ถ = 10. We set ๐ = 10โ3 for this example. For brevity, we abbreviate the probability of
no fault in task ๐๐ when not using fault-tolerance as ๐๐ :
๐ถ๐
๐ (no fault in ๐๐ |no FT) = ๐โ๐ยท๐ถ๐ = ๐โ 103
= ๐๐
As fault-tolerance technique, we use Triple Modular Redundancy (TMR). We require a majority
for TMR to succeed, and assume there is no other way TMR can fail (e.g. assume no unmitigated
faults during voting). A majority for TMR requires two or more copies of the task to be in
agreement. Like ๐๐ , we abbreviate the probability of no fault in task ๐๐ when using fault-tolerance
as ๐๐ :
(๏ธ )๏ธ
3 2
๐ (no fault in ๐๐ |TMR) = ๐3๐ + ๐ ยท (1 โ ๐๐ )
2 ๐
= ๐๐
With the WCET for each task available, we can then compute the ๐๐ and ๐๐ values for all
tasks:
๐๐ด : ๐๐ด = ๐โ20๐ = 0.9802 ๐๐ด = ๐3๐ด + 3(1 โ ๐๐ด )๐2๐ด = 0.9988
๐๐ต : ๐๐ต = ๐โ10๐ = 0.9901 ๐๐ต = ๐3๐ต + 3(1 โ ๐๐ต )๐2๐ต = 0.9997
๐๐ถ : ๐๐ถ = ๐โ10๐ = 0.9901 ๐๐ถ = ๐3๐ถ + 3(1 โ ๐๐ถ )๐2๐ถ = 0.9997
Enumerating strategies For our chosen task set ฮ, there are eight possible subsets and as
such eight possible strategies:
๐ โ
: ฮ๐ โ
= {} ฮ โ ฮ๐ โ
= {๐๐ด , ๐๐ต , ๐๐ถ }
๐ ๐ด : ฮ๐ ๐ด = {๐๐ด } ฮ โ ฮ๐ ๐ด = {๐๐ต , ๐๐ถ }
๐ ๐ต : ฮ๐ ๐ต = {๐๐ต } ฮ โ ฮ๐ ๐ต = {๐๐ด , ๐๐ถ }
๐ ๐ถ : ฮ๐ ๐ถ = {๐๐ถ } ฮ โ ฮ๐ ๐ถ = {๐๐ด , ๐๐ต }
๐ ๐ด,๐ต : ฮ๐ ๐ด,๐ต = {๐๐ด , ๐๐ต } ฮ โ ฮ๐ ๐ด,๐ต = {๐๐ถ }
๐ ๐ด,๐ถ : ฮ๐ ๐ด,๐ถ = {๐๐ด , ๐๐ถ } ฮ โ ฮ๐ ๐ด,๐ถ = {๐๐ต }
๐ ๐ต,๐ถ : ฮ๐ ๐ต,๐ถ = {๐๐ต , ๐๐ถ } ฮ โ ฮ๐ ๐ต,๐ถ = {๐๐ด }
๐ ๐ด,๐ต,๐ถ : ฮ๐ ๐ด,๐ต,๐ถ = {๐๐ด , ๐๐ต , ๐๐ถ } ฮ โ ฮ๐ ๐ด,๐ต,๐ถ = {}
Not all of these may be schedulable. If ๐ ๐ด,๐ต,๐ถ was schedulable there would be no reason to
use strategy switching. Likewise, if only ๐ โ
was schedulable, there is no way to deploy limited
Unschedulable
๐ ๐ด,๐ต,๐ถ
๐ ๐ด,๐ต ๐ ๐ด,๐ถ ๐ ๐ต,๐ถ
Schedulable
๐ ๐ด ๐ ๐ต ๐ ๐ถ
Redundant
๐ โ
Figure 2: Lattice of strategies for ฮ = {๐๐ด , ๐๐ต , ๐๐ถ }
fault-tolerance to this task set. For the example, we assume that only strategies ๐ โ
, ๐ ๐ด , ๐ ๐ต , ๐ ๐ถ
and ๐ ๐ต,๐ถ are determined to be schedulable by a scheduler. The strategies form a lattice with
๐ โ
as the greatest lower bound, and ๐ ๐ด,๐ต,๐ถ as the least upper bound. The lattice is shown in
Figure 2. Figure 2 also reveals redundant strategies, e.g. thereโs no reason to choose ๐ ๐ต when
๐ ๐ต,๐ถ is schedulable. As such, there is no need to consider them. Let ๐ฎ = {๐ ๐ด , ๐ ๐ต,๐ถ }.
TMR can fail to reach a consensus, and then this information is available to the strategy
selection component in the form of a result. As such, in this example each strategy ๐ has 2|ฮ๐ |
possible results. For ๐ ๐ด this is ๐๐ด and ๐๐ด , and for ๐ ๐ต,๐ถ this is ๐๐ต,๐ถ , ๐๐ต,๐ถ , ๐๐ต,๐ถ and ๐๐ต,๐ถ . These
results are shown in Figure 3a, which shows the strategy state machine without successor
relations for each result.
Evaluating transitions functions There are |๐ฎ||โ| = 26 = 64 possible transition functions
ฮ, as all of the six results needs to be linked to one of the two strategies. We will not enumerate
all of them, but two examples are shown in Figure 3b and Figure 3c.
In this example, we will show how to evaluate ๐ฟ(ฮ) for the state machine in Figure 3b. Let
this be ฮ3b .
1. Computing the steady-state probability of ฮ3b . The result states are โurgentโ states, in
which no time can pass. In other words, the result states themselves are not relevant for
the steady-state (i.e. the fraction of iterations of the task set spent in state ๐ โ โ is 0).
As such, we remove these states for the steady-state computation, linking each strategy
to multiple successor strategies. The resulting state machine is a Discrete-Time Markov
Chain, where each time step is an iteration of the task set ฮ. The steady-state can be
computed using linear algebra. Let ๐ be the transition matrix for ฮ3b with all results
removed:
[๏ธ ]๏ธ
๐ (๐๐ด |๐ ๐ด ) ๐ (๐๐ด |๐ ๐ด )
๐ =
๐ (๐๐ต,๐ถ โช ๐๐ต,๐ถ โช ๐๐ต,๐ถ โช ๐๐ต,๐ถ |๐ ๐ต,๐ถ ) 0
๐๐ด ๐๐ต,๐ถ ๐๐ต,๐ถ
๐ด ๐ต,๐ถ ๐ต,๐ถ
๐ ๐ด ๐ ๐ต,๐ถ
๐ด ๐ต,๐ถ ๐ต,๐ถ
๐๐ด ๐๐ต,๐ถ ๐๐ต,๐ถ
(a) Partial state machine without successor relations for the results
๐๐ด ๐๐ต,๐ถ ๐๐ต,๐ถ
๐ด ๐ต,๐ถ ๐ต,๐ถ
๐ ๐ด ๐ ๐ต,๐ถ
๐ด ๐ต,๐ถ ๐ต,๐ถ
๐๐ด ๐๐ต,๐ถ ๐๐ต,๐ถ
(b) Switching state machine choosing between ๐ ๐ด and ๐ ๐ต,๐ถ based on the result
๐๐ด ๐๐ต,๐ถ ๐๐ต,๐ถ
๐ด ๐ต,๐ถ ๐ต,๐ถ
๐ ๐ด ๐ ๐ต,๐ถ
๐ด ๐ต,๐ถ ๐ต,๐ถ
๐๐ด ๐๐ต,๐ถ ๐๐ต,๐ถ
(c) Degenerate state machine always choosing ๐ ๐ด
Figure 3: Example state machines
[๏ธ ]๏ธ [๏ธ ]๏ธ
1 โ ๐๐ด ๐๐ด 0.0012 0.9988
= =
1 0 1 0
We can compute the steady-state vector ๐ฃ by solving ๐ฃ = ๐ฃ๐ .
(๏ธ 1 )๏ธ
๐ฃ = ๐ฃ๐ โ 2
1
2
Note that a unique steady-state vector ๐ฃ need not exist if there are two or more parts of
the state machine that are disconnected. For example, all results of ๐ ๐ด may link back to ๐ ๐ด ,
while all results of ๐ ๐ต,๐ถ link back to ๐ ๐ต,๐ถ . In such a case, the steady-state is dependent
on the initial state. However, we can safely disregard state machines that depend on the
initial condition, as other state machines with identical steady-state behavior must exist
as well. If staying in ๐ ๐ด provides the lowest ๐ฟ(ฮ) value, then a state machine like shown
in Figure 3c would yield the exact same ๐ฟ(ฮ) value as a state machine where ๐ ๐ด and ๐ ๐ต,๐ถ
is disconnected with ๐ ๐ด as the initial state.
2. Compute, for each result โ strategy transition, the ๐ฟ(๐, ๐ to ) catastrophic fault probability.
๐ฟ(๐๐ด , ๐ ๐ต,๐ถ ) = 1โ (1 โ (0) ยท (1 โ ๐๐ด ))ยท
(1 โ (1 โ ๐๐ต ) ยท (1 โ ๐๐ต ))ยท
(1 โ (1 โ ๐๐ถ ) ยท (1 โ ๐๐ถ ))
= 5.87 ยท 10โ6
๐ฟ(๐๐ด , ๐ ๐ด ) = 1โ (1 โ (1) ยท (1 โ ๐๐ด ))ยท
(1 โ (1 โ ๐๐ต ) ยท (1 โ ๐๐ต ))ยท
(1 โ (1 โ ๐๐ถ ) ยท (1 โ ๐๐ถ ))
= 1358 ยท 10โ6
๐ฟ(๐๐ต,๐ถ , ๐ ๐ด ) = 1โ (1 โ (1 โ ๐๐ด ) ยท (1 โ ๐๐ด ))ยท
(1 โ (0) ยท (1 โ ๐๐ต ))ยท
(1 โ (0) ยท (1 โ ๐๐ถ ))
= 22.98 ยท 10โ6
๐ฟ(๐๐ต,๐ถ , ๐ ๐ด ) = 1โ (1 โ (1 โ ๐๐ด ) ยท (1 โ ๐๐ด ))ยท
(1 โ (1) ยท (1 โ ๐๐ต ))ยท
(1 โ (0) ยท (1 โ ๐๐ถ ))
= 9973 ยท 10โ6
๐ฟ(๐๐ต,๐ถ , ๐ ๐ด ) = 1โ (1 โ (1 โ ๐๐ด ) ยท (1 โ ๐๐ด ))ยท
(1 โ (0) ยท (1 โ ๐๐ต ))ยท
(1 โ (1) ยท (1 โ ๐๐ถ ))
= 9973 ยท 10โ6
๐ฟ(๐๐ต,๐ด , ๐ ๐ด ) = 1โ (1 โ (1 โ ๐๐ด ) ยท (1 โ ๐๐ด ))ยท
(1 โ (1) ยท (1 โ ๐๐ต ))ยท
(1 โ (1) ยท (1 โ ๐๐ถ ))
= 19923 ยท 10โ6
3. Using the calculated ๐ฟ(๐, ๐ to ) values, compute a ๐ฟ(๐ โ ๐ฎ) value for each strategy.
๐ฟ(๐ ๐ด ) =๐ (๐๐ด |๐ ๐ด ) ยท ๐ฟ(๐๐ด , ๐ ๐ต,๐ถ )+
๐ (๐๐ด |๐ ๐ด ) ยท ๐ฟ(๐๐ด , ๐ ๐ด ) =
7.6077 ยท 10โ6
๐ฟ(๐ ๐ต,๐ถ ) =๐ (๐๐ต,๐ถ |๐ ๐ต,๐ถ ) ยท ๐ฟ(๐๐ต,๐ถ , ๐ ๐ด )+
๐ (๐๐ต,๐ถ |๐ ๐ต,๐ถ ) ยท ๐ฟ(๐๐ต,๐ถ , ๐ ๐ต,๐ถ )+
๐ (๐๐ต,๐ถ |๐ ๐ต,๐ถ ) ยท ๐ฟ(๐๐ต,๐ถ , ๐ ๐ต,๐ถ )+
๐ (๐๐ต,๐ถ |๐ ๐ต,๐ถ ) ยท ๐ฟ(๐๐ต,๐ถ , ๐ ๐ต,๐ถ ) =
29.69985 ยท 10โ6
๐ฟ(๐ ) is the probability that a catastrophic fault occurs by selecting strategy ๐ . The bad
score of ๐ ๐ต,๐ถ is not surprising: the ฮ3b state machine is not particularly clever as it
chooses to switch to strategy ๐ ๐ด even with the knowledge that ๐๐ต or ๐๐ถ has failed.
Table 2
Shortened table of all possible ฮ transition functions for Figure 3a
ฮ ๐๐ด ๐๐ด ๐๐ต,๐ถ ๐๐ต,๐ถ ๐๐ต,๐ถ ๐๐ต,๐ถ ๐ฟ(ฮ) ยท 105
ฮ1 ๐ ๐ด ๐ ๐ด ๐ ๐ด ๐ ๐ด ๐ ๐ด ๐ ๐ด 19.9355
ฮ2 ๐ ๐ต๐ถ ๐ ๐ด ๐ ๐ด ๐ ๐ด ๐ ๐ด ๐ ๐ด 1.81421
ฮ3 ๐ ๐ด ๐ ๐ต๐ถ ๐ ๐ด ๐ ๐ด ๐ ๐ด ๐ ๐ด 22.055
...
ฮ56 ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ด ๐ ๐ต๐ถ ๐ ๐ต๐ถ 39.489
ฮ57 ๐ ๐ด ๐ ๐ด ๐ ๐ด ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ต๐ถ 19.935
ฮ58 ๐ ๐ต๐ถ ๐ ๐ด ๐ ๐ด ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ต๐ถ 1.54062
ฮ59 ๐ ๐ด ๐ ๐ต๐ถ ๐ ๐ด ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ต๐ถ 22.054
ฮ60 ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ด ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ต๐ถ 2.6115
ฮ61 ๐ ๐ด ๐ ๐ด ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ต๐ถ n/a3
ฮ62 ๐ ๐ต๐ถ ๐ ๐ด ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ต๐ถ 39.226
ฮ63 ๐ ๐ด ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ต๐ถ 39.226
ฮ64 ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ต๐ถ ๐ ๐ต๐ถ 39.2265
ฮโ
n/a 59.0104
1
Example from Figure 3b (ฮ2 = ฮ3b )
2
Best (lowest) steady-state fault rate
3
No unique steady-state exists for this transition function
4
๐ฟ(ฮ) of the task set without any form of fault-tolerance
5
Does not strategy switch, i.e. always stays in the same strategy
4. Finally, compute ๐ฟ(ฮ3b ) by taking the weighted average of all ๐ฟ(๐ ) values by multiplying
๐ฟ(๐ ) for each strategy by the steady-state probability ๐ (๐ |ฮ3b ) of being in that strategy.
๐ฟ(ฮ3b ) = ๐ (๐ ๐ด ) ยท ๐ฟ(๐ ๐ด ) + ๐ (๐ ๐ต,๐ถ ) ยท ๐ฟ(๐ ๐ต,๐ถ )
= 1.814 ยท 10โ5
This process is repeated for all 64 possible transition functions. For brevity, we will not show
the evaluation of every single transition function here. Instead, Table 2 shows a subset of all
possible ฮ values. The columns headed with a result show to which successor strategy that
result maps. For example, for the first evaluated state machine ฮ1 (๐๐ด ) = ๐ ๐ด . The best state
machine is also revealed, listed as ฮ58 . Finally, the last item in the table is ฮโ
, added as a
reference. ฮโ
is the state machine with a single strategy ๐ โ
which protects no tasks, i.e. the
behavior obtained when not using any form of fault-tolerance. The best state machine is a
38.3-fold improvement over this default. That is, the best state machine offers a 38.3 times
lower rate of catastrophic failure when compared to ฮโ
. Finally, the table also shows two state
machines which perform no strategy switching, but still use fault-tolerance. These are ฮ1 and
ฮ64 , staying in ๐ ๐ด and ๐ ๐ต,๐ถ respectively. The strategy switching solution ฮ58 offers a 12.9
times lower rate of catastrophic faults when compared to the best of these static solutions.
A B
Figure 4: Task set ฮ = {๐๐ด , ๐๐ต } with a precedence relation between ๐๐ด and ๐๐ต
6. Extending the fault and task model
6.1. Adding precedence relations
Directed Acyclic Graph scheduling is an extension to our task model where precedence relations
exist between tasks. While this has no impact on the strategy switching algorithm directly as it
need not concern itself with scheduling, it may add dependence to the success probability of a
task. For this extension, we assume that when a predecessor task fails (produces incorrect data),
all successor tasks fail as well due to operating on incorrect data, even when not experiencing a
SEU.
To support this assumption, the ๐ฟ(๐, ๐ to ) cost function needs to be modified to consider
precedence relations. Specifically, ๐ (fault in ๐๐ |๐) and ๐ (fault in ๐๐ |๐ to ) must be replaced to not
just consider if ๐๐ failed in isolation. Let us call this precedence-aware probability the probability
of incorrect output.
๐ (incorrect output ๐๐ |๐ฅ) = ๐ (fault in ๐๐ |๐ฅ)ยท
โ๏ธ
๐ (fault in ๐๐ |๐ฅ)
๐๐ โpred(๐๐ )
Here, ๐ฅ โ ๐ฎ โช โ is either a strategy or a result. pred (๐๐ ) is the set of all direct and indirect
predecessors of task ๐๐ . This probability also includes the behavior of any predecessor tasks.
Figure 4 shows an example task set with a precedence relation. Consider the task set in
this example with two strategies: ๐ฎ = {๐ ๐ด , ๐ ๐ต }, protecting task ๐๐ด and ๐๐ต respectively. Each
strategy has two outcomes, hence โ = {๐๐ด , ๐๐ด , ๐๐ต , ๐๐ต }. Let us consider ๐ (fault in ๐๐ต |๐๐ด ):
the result ๐๐ด does not directly communicate anything about the state of task ๐ต, however due to
the precedence relation we know it effectively failed. As such, ๐ (incorrect output ๐๐ต |๐๐ด ) = 1.
6.2. Selective fault-tolerance and criticality
The task model can be extended to support heterogeneity in the ability of tasks to tolerate
non-consecutive faults. Let ฮ๐ be the set of tasks which cannot tolerate non-consecutive
faults. Then, we update ๐ (catastrophic fault in ๐๐ |๐,๐ to ) to consider non-consecutive faults as
catastrophic faults when ๐๐ โ ฮ๐ .
๐ (catastrophic fault in ๐๐ |๐,๐ to =)
{๏ธ
๐ (fault in ๐๐ |๐) ยท ๐ (fault in ๐๐ |๐ to ) if ๐๐ โ
/ ฮ๐
๐ (fault in ๐๐ |๐) else
The goal of the strategy state machine is to minimize the rate of catastrophic faults. This
may mean that the rate of catastrophic faults is so low that even a single fault is unlikely to
occur across the lifetime of the system. However, when used in soft real-time deployments,
a โcatastrophic faultโ may not be catastrophic while still undesirable. In such a deployment,
catastrophic faults may be acceptable. The impact of a task experiencing such a catastrophic
fault may not be the same across all tasks. As such, a priority function ๐(๐๐ โ ฮ) can be
integrated into ๐ฟ(๐, ๐ to ).
๐ฟ(๐, ๐ to ) =
โ๏ธ
1โ (1 โ ๐ (catastrophic fault in ๐๐ |๐,๐ to ))๐(๐๐ )
๐๐ โฮ
The ๐(๐๐ ) function assigns relative priority, where higher is better. A task ๐๐ with ๐(๐๐ ) = 2
has twice the importance as a task ๐๐ with ๐(๐๐ ) = 1. Note that this change affects the meaning of
the output ๐ฟ(ฮ), which is no longer the rate of catastrophic faults. If the impact of a catastrophic
fault in a task with ๐(๐๐ ) = 2 is equivalent to two catastrophic faults in another task with
๐(๐๐ ) = 1, then the final rate ๐ฟ(ฮ) can be seen as the rate of catastrophic faults normalized to
faults in ๐๐ .
7. Related work
(๐๐ , ๐๐ ) constraints have been used before in the domain of real-time scheduling. Choi et al. [7]
proposed a scheduler, together with an efficient schedulability algorithm for a sporadic task
set with tasks under (๐๐ , ๐๐ ) constraints. This scheduler allows for scheduling task sets that
would normally not be schedulable, but utilizing their (๐๐ , ๐๐ ) constraints allows them to be
scheduled.
Chen et al. [8] proposed a solution that similar to ours. Their method offers fault-tolerance
with the goal of reducing the effective fault rate as well as lowering energy consumption.
Chen et al. proposes a static scheduling technique called Static Pattern-Based Reliable Execution,
ensuring each (๐๐ , ๐๐ ) constraint is respected in the presence of transient faults. Furthermore,
they propose delaying the execution of their static pattern if no fault is detected at runtime,
opportunistically running more unprotected instances of the task with the goal of saving energy.
However, if the static pattern is found to be unschedulable as per their schedulability test,
their implementation is unable to provide a schedule that minimizes the fault rate for a given
resource-constrained real-time system. While their approach offers more flexibility in the
task model (specifically the support for (๐๐ , ๐๐ ) constraints with ๐๐ > 2), it does not consider
that fault mitigation may fail. Our approach optimally lowers the fault rate, regardless of the
hardware constrains. Furthermore, our approach recognizes that fault mitigation may fail, and
includes this in the calculation for lowering the fault rate.
Gujarati et al. [9] contributed a technique for measuring the fault rate of an application
with tasks under (๐๐ , ๐๐ ) constraints. Their technique provides an upper bound for the fault
probability per iteration of a Fault-tolerant Single-Input Single-Output (FT-SISO) control loop,
similar to our ๐ฟ(ฮ) output on a task set with precedence relations. Their technique hopes
to provide transparency to system designers, allowing analyzing the impact on the reliablity
when changing the hardware or software. However, while their approach is aware of (๐๐ , ๐๐ )
constraints, it does not provide schedules that utilize such constraints. Instead, it merely includes
them in the reliability calculation.
The domain of strategy switching shares some aspects with Mixed-Criticality (MC) systems.
In an MC system, the system switches between different levels of criticality depending on the
operating conditions of the system. Tasks are assigned a criticality level, and when the system
criticality is higher than that of the task, the task is not scheduled to guarantee the successful
and timely execution of tasks with a higher criticality level. Pathan [10] combines MC with
fault-tolerance against transient faults. As is typical in MC research, as the level of criticality
increases, the pessimism increases. Pathan increases the maximum fault rate when switching
to a higher level of criticality. In our approach we do not vary the pessimism of any parameter.
Instead, we assume the ๐ parameter provides a suitable upper bound to the fault rate in all
conditions. Our approach offers some aspects typically not found in MC systems: while one
could argue that each strategy is really a criticality level, it is a criticality level applied to a
subset of the tasks (specifically ฮ๐ ). Finally, the approach by Pathan requires bounding the
number of faults that can occur in any window. As such, passing their sufficient schedulability
test will (under their fault model) guarantee the system will never experience a fault.
8. Conclusion
In this paper, we have shown how strategy switching can be used to improve fault-tolerance
for resource-constrained systems. Our method makes effective use of the ability to vary which
tasks receive fault-tolerance. It considers at the start of every iteration of the task set what the
best set of tasks is to protect. We have shown how our method computes the optimal strategy
state machine for any given task set, minimizing the rate of catastrophic faults. We have also
shown the flexibility of our method to be extended to support new task and fault models.
9. Future work
In future work, we hope to improve the tractability of our algorithm by both state space reduction
algorithms as well as by using heuristics.
Furthermore, we hope to extend the fault model to distinguish between deadline misses
and incorrect results. We also hope to integrate tasks with (๐๐ , ๐๐ ) constraints with ๐๐ > 2.
Additionally, we hope to integrate the natural ability of tasks to detect faults into our task model,
as a SEU may for example lead to a segfault. Finally, we hope to validate our approach using
simulation-based analysis.
Acknowledgments
The presentation of this paper was considerably improved in response to comments provided
by the anonymous reviewers, and we gratefully acknowledge their insights and assistance.
This project has received funding from the European Unionโs Horizon 2020 research and
innovation program under grant agreement No. 871259 (ADMORPH project). Additionally, this
work is partially supported by CERCIRAS COST Action CA19135 funded by COST Association.
References
[1] I. Oz, S. Arslan, A survey on multithreading alternatives for soft error fault tolerance,
ACM Computing Surveys (2019).
[2] R. E. Lyons, W. Vanderkulk, The use of triple-modular redundancy to improve computer
reliability, IBM journal of research and development 6 (1962) 200โ209.
[3] J. Chang, G. A. Reis, D. I. August, Automatic instruction-level software-only recovery, in:
International Conference on Dependable Systems and Networks (DSNโ06), IEEE, 2006, pp.
83โ92.
[4] S. A. Asghari, M. Binesh Marvasti, A. M. Rahmani, Enhancing transient fault tolerance in
embedded systems through an OS task level redundancy approach, Future Generation
Computer Systems 87 (2018) 58โ65. doi:10.1016/j.future.2018.04.049.
[5] G. Bernat, A. Burns, A. Liamosi, Weakly hard real-time systems, IEEE transactions on
Computers 50 (2001) 308โ321.
[6] I. Broster, A. Burns, G. Rodriguez-Navas, Timing analysis of real-time communication
under electromagnetic interference, Real-Time Systems 30 (2005) 55โ81.
[7] H. Choi, H. Kim, Q. Zhu, Job-class-level fixed priority scheduling of weakly-hard real-time
systems, in: 2019 IEEE Real-Time and Embedded Technology and Applications Symposium
(RTAS), 2019, pp. 241โ253. doi:10.1109/RTAS.2019.00028.
[8] K.-H. Chen, B. Bรถnninghoff, J.-J. Chen, P. Marwedel, Compensate or ignore? Meeting
control robustness requirements through adaptive soft-error handling, in: Proceedings of
the 17th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, Tools, and Theory
for Embedded Systems, LCTES 2016, Association for Computing Machinery, New York,
NY, USA, 2016, pp. 82โ91. doi:10.1145/2907950.2907952.
[9] A. Gujarati, M. Nasri, B. B. Brandenburg, Quantifying the resiliency of fail-operational real-
time networked control systems, in: S. Altmeyer (Ed.), 30th Euromicro Conference on Real-
Time Systems (ECRTS 2018), volume 106 of Leibniz International Proceedings in Informatics
(LIPIcs), Schloss DagstuhlโLeibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 2018, pp.
16:1โ16:24. doi:10.4230/LIPIcs.ECRTS.2018.16.
[10] R. M. Pathan, Fault-tolerant and real-time scheduling for mixed-criticality systems, Real-
Time Systems 50 (2014) 509โ547.