1. Introduction

Strategy Switching: Smart Fault-tolerance for Resource-constrained Real-time Applications

Lukas Miedema

Benjamin Rouxel

0 1

Clemens Grelck

0 0 University of Amsterdam (UvA) , Amsterdam , Netherlands 1 University of Modena and Reggio Emilia (Unimore) , Modena , Italy

Software-based fault-tolerance is an attractive alternative to hardware-based fault-tolerance, as it allows for the use of cheap Commercial Of The Shelf hardware. However, software-based fault-tolerance comes at a cost, requiring computing the same results multiple times to allow for the detection and mitigation of faults. Resource-constrained real-time applications may not be able to aford this cost. At the same time, the domain of a real-time task may allow it to tolerate a fault, provided it does not occur in consecutive iterations of the task. In this paper, we introduce a new way to deploy fault-tolerance called strategy switching. Our method targets Single Event Upsets by running diferent subsets of tasks under fault-tolerance at diferent points in time. We do not bound the number of faults in a window, nor does our method assume that tasks under fault-tolerance cannot still fail. Our technique does not require a minimal amount of additional compute resources for fault-tolerance. Instead, our method optimally utilizes any available compute resources for fault-tolerance for resource-constrained real-time applications.

eol>Cyber-physical Systems Resource Constraints Fault-tolerance Single Event Upsets Weakly Hard Realtime

1. Introduction

resources to allow all tasks to run with fault-tolerance. However, for applications structured as a set of periodic tasks, software-based fault-tolerance allows the application of fault-tolerance to only a subset of the task set.

Control tasks may be able to tolerate non-consecutive deadline misses, which has led to the adoption of the weakly hard model [5]. A task that is unable to provide a result may not result in catastrophe, provided that in the next period it can provide a result. Per the weakly hard model, each task has an (, ) constraint, indicating that the task must complete at least times successfully out of every times. is said to be the window size. We use this (, ) constraint with < to deliver more efective fault-tolerance to resource-constrained systems. For example, consider a task set with just two tasks A and B, where only one task can be run under fault-tolerance at a time. Furthermore, both task A and B can tolerate non-consecutive deadline misses. Running task A under fault-tolerance and task B without would leave task B vulnerable to SEUs. However, we could more optimally make use of the scarce fault-tolerance by switching between protecting task A and task B in successive iterations of the task set. Tasks under fault-tolerance may still fail (e.g. TMR reaches no majority), and these cases can be detected. When the fault-tolerance technique has failed to protect task A, task A should be protected again in the next iteration of the task set to ensure it does not experience a consecutive fault. Contribution We propose a new approach for improving fault-tolerance for real-time applications running on resource-constrained systems by strategy switching. We minimize the efective unmitigated fault-rate by selecting which tasks are to be run under fault-tolerance. Our approach recognizes that the importance of protecting a task may change over time due to earlier faults or lack thereof, and as such runs diferent tasks under fault-tolerance at different times. By exhaustively searching all patterns in which fault-tolerance can be applied, our method optimally utilizes limited available computational resources for fault-tolerance in resource-constrained real-time applications.

Organization In section 2 we introduce our task and fault model. Our method uses a state machine, which is introduced in section 3. In section 4, we formalize the construction of the state machine. Then, in section 5, we discuss how our method can lower the consecutive fault rate for an example task set. We explain the flexibility of our method by extending it to alternative task and fault models in section 6. Several pieces of related work exist, which are covered in section 7. The paper is concluded in section 8, and finally in section 9 we discuss various future directions for this technique.

2. System models

Task model We assume a set of periodic tasks Γ = { 1... } with a single, global period and deadline such that the period is equal to or larger than the deadline (no pipelining). Furthermore, we initially assume that each task can aford one (non-consecutive) deadline miss (an (, ) constraint of ( 1,2 )), and that each task is equally important. In section 6, we will discuss how some of these assumptions can be relaxed to support a wide variety of task models. Fault model We use the Poisson distribution as an approximation for the worst-case fault rate of SEUs, which was argued to be a good approximation by Broster et al. [6]. We do not assume universal fault detection: only when the task runs under a fault-tolerance scheme can a fault be detected and mitigated. When a task does not run with fault-tolerance, it is not known whether or not it succeeded. We use the term catastrophic fault to describe an unmitigated fault occurring in two consecutive iterations of a task that can tolerate a single unmitigated fault, i.e. the task has an (, ) = ( 1,2 ) constraint. Furthermore, a catastrophic fault also occurs when a task that cannot tolerate non-consecutive faults experiences an unmitigated fault, i.e. the task has a ( 1,1 ) constraint. We do not consider constraints beyond = 2 in this paper.

Fault mitigation We assume the presence and implementation of a particular fault-tolerance scheme, and that any task can be run under that scheme. In this paper, we assume that SEUs always go undetected in tasks not under fault-tolerance. Finally, we assume the scheme implements fault-detection and fault-mitigation. Our model allows the fault mitigation to fail (e.g. due to successive SEUs during both replicas of a task under TMR), but assumes that it is known when fault mitigation fails.

Other definitions Given the complexity and number of symbols used in this paper, a table of all symbols and terms has been compiled in Table 1. Each symbol or term used will be defined prior to use, as well as being listed in the table.

3. Strategy Switching State Machine

To both swiftly select a new subset of the task set to run under fault-tolerance while also making optimal decisions, we precompute for each situation the next best subset of tasks to run under fault-tolerance. The result of this is the strategy state machine, which is made available to the online component. The strategy state machine is a bipartite state machine, consisting of strategy states and result states. An example of such a state machine is shown in Figure 1.

The architecture of our strategy switching approach distinguishes between an online part at runtime, as well as an ofline part executing ahead-of-time not beholden to any real-time constraints. The ofline component prepares the state machine, which is then available for online playback.

Online We introduce a strategy switching component, which plays back the strategy state machine, taking transitions based on observed faults as the application runs. At runtime, this component selects a single strategy ahead of every execution of the task set, which becomes active. The strategy dictates which tasks run under fault-tolerance (Γ), and which ones do not (Γ ∖ Γ). Fault-tolerance techniques are typically not a silver bullet solution, and unmitigated faults may still occur in tasks in Γ. Furthermore, fault-tolerance techniques can often report the fact that they failed to mitigate a fault (e.g. no consensus in triple modular redundancy). After executing all tasks, the online component uses this information from the execution of the task set to select the matching result from the state machine. This result reflects the success or fail state, or probability thereof, of each of the tasks. Each possible result directly maps to its best successor strategy, which is applied to the next iteration of the task set.

Ofline

relation Δ from any given result ∈ ℛ to the next best successor strategy Δ() = . Strategies

The full set of strategies ∈ is computed ahead of time, as well as the transition which are not schedulable are not included in . Furthermore, strategies which are dominated by other strategies (i.e. all tasks that are protected by one strategy are also protected by another) are also not considered in .

We develop an algorithm to compute Δ, such that our choice of Δ provides the lowest steady-state rate of catastrophic faults.

Example State Machine An example of such a state machine is shown in Figure 1, where there are two strategies = {, }. This state machine is not the optimal state machine for this task set, but gives a non-trivial example of what such a state machine may look like. Such a state machine is constructed by the ofline component, and made available to the runtime. Each strategy protects only one task (either task or task ). Finally, fault-tolerance may fail and if it fails this information is available to the strategy switching component. With these assumptions, each strategy has two potential successors: one where the task under protection succeeds and one where it fails (e.g. and for strategy ). Each result state has just one successor strategy Δ(), e.g. Δ() = in Figure 1. The following sequence of actions may take place at runtime, as per Figure 1; 1. To start, the runtime initializes by picking a random strategy, say , and applies fault tolerance accordingly. By picking , task will be executed with fault-tolerance, while task will not. 2. The task set is executed (iteration 1). 3. At the end of the iteration, data from the fault-tolerance applied to task is used to select a result. If fails, the result is selected. However, let us assume succeeded, and select result accordingly. No information about the success or failure of is known. 4. links to strategy , which is selected. 5. By switching to , task will be executed with fault-tolerance, while task will not. 6. The task set is executed (iteration 2). 7. At the end of the iteration, data from the fault-tolerance of task indicates that task has failed, letting us select result . 8. links to strategy , which is selected. * . The process continues...

Note that there is no involved selection procedure for the initial strategy (strategy in the example). Our approach is only concerned with obtaining the lowest steady-state fault rate of the application, which is in no way impacted by the choice of initial strategy.

4. Evaluating State Machines

One way to find the optimal state machine transition function Δ for a given is to enumerate all possible transition functions, score each transition function, and select the best one. To do so, we define the state machine scoring function (Δ). The output of this scoring function is the steady-state rate of catastrophic faults. Specifically, (Δ) is the weighted probability of a catastrophic fault occurring during any iteration of the task set when applying strategy switching according to the transition function Δ.

To evaluate (Δ), we compute the steady-state probability distribution of the transition function Δ (e.g. using linear algebra). The steady-state probability distribution provides a probability (|Δ) of finding the state machine in strategy at an arbitrarily chosen iteration of the task set given Δ. Intuitively, as the number of iterations of the task set approaches infinity, (|Δ) is the fraction of iterations spent with strategy active. The sum of these fractions over all strategies is 1, i.e ∑︀∈ (|Δ) = 1.

Let (, Δ) be the probability of a catastrophic fault occurring in strategy . Together with (|Δ), we can now compute (Δ):

(Δ) = ∑︁ (|Δ) · (, Δ)

Catastrophic faults, by their definition, only occur when an unmitigated fault occurs in two consecutive iterations of the task set. To compute (, Δ), we must not just consider , but also the successor of . The successor of depends on what result is being chosen, which is given by (|,Δ). Each has only one successor strategy Δ(). Let Δ() = to . Thus, (,Δ) can be expressed in terms of and Δ() = to as follows: (, Δ) =

∑︁ (|) · (, )

As to is a strategy, (, to ) is independent of our choice of Δ. (, to ) is equal to the probability that any task experiences a fault both in as well as in to : ∈ ∈Δ() (, to ) = 1 − = 1 − ∏︁ 1 − (catastrophic fault in |,to ) ∈Γ ∏︁ 1 − (fault in |) · (fault in |to )

The probabilities (fault in |) and (fault in |to ) are computed based on the used fault-tolerance technique, if ∈ Γto , the rate of faults combined with the WCET of , and information available in the result. For example, a task without fault-tolerance may have a chance of experiencing a fault with (fault in ) = 1 − − · , where is the WCET of . Likewise, a task which was run under fault-tolerance according to the previous strategy, and has succeeded according to result , will have a (fault in |) = 0. We consider all SEUs to be statistically independent events. To keep independence, use the WCET instead of the average execution time, as the execution times of tasks in the same task set may not be independent. As such, our steady-state rate of catastrophic faults (Δ) provides an upper bound to the true steady-state fault rate of the application.

Tractability The approach, as presented here, can easily become intractable for even small task sets. Many state space reduction techniques can be realized in the way the state machines are enumerated, which can considerably lower the number of elements. If a candidate state machine Δ contains disconnected sub-graphs, then no unique steady state can be computed, and as such this candidate can be pruned. Furthermore, the candidate state machine may contain strategies that are ignored, i.e. there is no path to that strategy from that same strategy. When this is the case, the steady-state probability (|Δ) is always 0. For such an ignored strategy, the successor strategies of its results do not matter, and its many ways of connecting to successor strategies need not be individually examined. Finally, the domain itself my allow for significant state space reduction. For example, if a precedence relation is added between tasks in the task set over which data is communicated, the failure of a preceding task may imply that the succeeding task is destined to fail. These modifications impact the lattice of strategies, and reduce the size of the schedulable and non-redundant set of strategies .

For completeness, we discuss the algorithmic complexity of the (naïve) state machine construction algorithm as presented in this paper. While the exact size of the state space depends on a number of factors, such as the number of available strategies , the upper bound is considerable.

The time complexity of (Δ) is (|ℛ| · ss(||)), where: • |ℛ| is the number of results • || is the number of strategies • ss() provides an upper bound to the computation of the steady state for strategies To compute ss(), the steady-state matrix needs to be computed. For this computation, the LAPACK driver routine DSEGD1 may be used, with a complexity of (3), resulting in ss() = 3.

The number of strategies is up to all combinations of tasks (|| ∈ (|Γ|!)). At the same time, (Δ) is computed for every possible state machine. As each result can link to any successor strategy, this yields up to |||| state machines. Each strategy has ≤ 2|Γ| results, 2|Γ| ∈ (2|Γ|) and thus || ∈ (|Γ|!· 2|Γ|). Let = |Γ|, i.e. is the number of tasks. Then, the final algorithmic time complexity is given: (|||| · |ℛ| · ss(||)) = (|Γ|!|Γ|!· 2|Γ|| · | Γ|! · 2|Γ| · | Γ|!3) = (!!· 2 · ! · 2 · !3) ⊂ (!!· 2 · 25) ≈ (!!· 2 )

It must be noted that this is by no means a tight upper bound. In the next section, we will examine a task set with |Γ| = = 3. This example requires examination of only 64 candidate 1http://www.netlib.org/lapack/lug/node71.html state machines, even though such an value would appear to be completely intractable as per the above formulation.

5. Example

Let us consider an example task set Γ = { , , }, with WCET values = 20, = 10 and = 10. We set = 10− 3 for this example. For brevity, we abbreviate the probability of no fault in task when not using fault-tolerance as : (no fault in |no FT) = − · = − 103 =

As fault-tolerance technique, we use Triple Modular Redundancy (TMR). We require a majority for TMR to succeed, and assume there is no other way TMR can fail (e.g. assume no unmitigated faults during voting). A majority for TMR requires two or more copies of the task to be in agreement. Like , we abbreviate the probability of no fault in task when using fault-tolerance as : (no fault in |TMR) = 3 + ︂( 3 )︂

2 2 · (1 − ) =

With the WCET for each task available, we can then compute the and values for all tasks: : = − 20 = 0.9802 = 3 + 3(1 − )2 = 0.9988 : = − 10 = 0.9901 = 3 + 3(1 − )2 = 0.9997 : = − 10 = 0.9901 = 3 + 3(1 − )2 = 0.9997 Enumerating strategies For our chosen task set Γ, there are eight possible subsets and as such eight possible strategies: ∅ : Γ∅ = {} : Γ = { } : Γ = { } : Γ = { } , : Γ, = { , } , : Γ, = { , } , : Γ, = { , } ,, : Γ,, = { , , } Γ ∖ Γ∅ = { , , } Γ ∖ Γ = { , } Γ ∖ Γ = { , } Γ ∖ Γ = { , } Γ ∖ Γ, = { } Γ ∖ Γ, = { } Γ ∖ Γ, = { } Γ ∖ Γ,, = {}

Not all of these may be schedulable. If ,, was schedulable there would be no reason to use strategy switching. Likewise, if only ∅ was schedulable, there is no way to deploy limited Unschedulable Schedulable Redundant fault-tolerance to this task set. For the example, we assume that only strategies , , , ∅ and , are determined to be schedulable by a scheduler. The strategies form a lattice with as the greatest lower bound, and ,, as the least upper bound. The lattice is shown in , is schedulable. As such, there is no need to consider them. Let = {, , }.

TMR can fail to reach a consensus, and then this information is available to the strategy selection component in the form of a result. As such, in this example each strategy has 2|Γ| possible results. For this is and , and for , this is , , , , , and , . These results are shown in Figure 3a, which shows the strategy state machine without successor relations for each result.

Evaluating transitions functions

There are |||ℛ| = 26 = 64 possible transition functions Δ, as all of the six results needs to be linked to one of the two strategies. We will not enumerate all of them, but two examples are shown in Figure 3b and Figure 3c.

In this example, we will show how to evaluate (Δ) for the state machine in Figure 3b. Let this be Δ3b.

1. Computing the steady-state probability of Δ3b. The result states are “urgent” states, in which no time can pass. In other words, the result states themselves are not relevant for the steady-state (i.e. the fraction of iterations of the task set spent in state ∈ ℛ is 0). As such, we remove these states for the steady-state computation, linking each strategy to multiple successor strategies. The resulting state machine is a Discrete-Time Markov Chain, where each time step is an iteration of the task set Γ. The steady-state can be computed using linear algebra. Let be the transition matrix for Δ3b with all results removed: = ︂[

(|) (, ∪ , ∪ , ∪ , |, ) (|)︂] (b) Switching state machine choosing between and , based on the result

= ≈ Note that a unique steady-state vector need not exist if there are two or more parts of the state machine that are disconnected. For example, all results of may link back to , while all results of , link back to , . In such a case, the steady-state is dependent on the initial state. However, we can safely disregard state machines that depend on the initial condition, as other state machines with identical steady-state behavior must exist as well. If staying in provides the lowest (Δ) value, then a state machine like shown in Figure 3c would yield the exact same (Δ) value as a state machine where and , is disconnected with as the initial state. 2. Compute, for each result → strategy transition, the (,to) catastrophic fault probability. 3. Using the calculated (,to) values, compute a ( ∈ ) value for each strategy. () =(|) · (,,)+ (|) · (,) = (,) =(,|,) · (,,)+ (,|,) · (,,,)+ (,|,) · (,,,)+ (,|,) · (,,,) = 29.69985 · 10− 6 () is the probability that a catastrophic fault occurs by selecting strategy . The bad score of , is not surprising: the Δ3b state machine is not particularly clever as it chooses to switch to strategy even with the knowledge that or has failed. ,

... n/a ΔΔΔΔΔΔΔΔΔΔ666665555∅324108796 335321213n..9999299/56a.......2202094413221253801660645965254 1 Example from Figure 3b (Δ2 = Δ3b) 2 Best (lowest) steady-state fault rate 3 No unique steady-state exists for this transition function 4 (Δ) of the task set without any form of fault-tolerance 5 Does not strategy switch, i.e. always stays in the same strategy 4. Finally, compute (Δ3b) by taking the weighted average of all () values by multiplying () for each strategy by the steady-state probability (|Δ3b) of being in that strategy. (Δ3b) = () · () + (, ) · (, )

This process is repeated for all 64 possible transition functions. For brevity, we will not show the evaluation of every single transition function here. Instead, Table 2 shows a subset of all possible Δ values. The columns headed with a result show to which successor strategy that result maps. For example, for the first evaluated state machine Δ1() = . The best state machine is also revealed, listed as Δ58. Finally, the last item in the table is Δ∅, added as a reference. Δ∅ is the state machine with a single strategy ∅ which protects no tasks, i.e. the behavior obtained when not using any form of fault-tolerance. The best state machine is a 38.3-fold improvement over this default. That is, the best state machine ofers a 38.3 times lower rate of catastrophic failure when compared to Δ∅. Finally, the table also shows two state machines which perform no strategy switching, but still use fault-tolerance. These are Δ1 and Δ64, staying in and , respectively. The strategy switching solution Δ58 ofers a 12.9 times lower rate of catastrophic faults when compared to the best of these static solutions.

6. Extending the fault and task model 6.1. Adding precedence relations

Directed Acyclic Graph scheduling is an extension to our task model where precedence relations exist between tasks. While this has no impact on the strategy switching algorithm directly as it need not concern itself with scheduling, it may add dependence to the success probability of a task. For this extension, we assume that when a predecessor task fails (produces incorrect data), all successor tasks fail as well due to operating on incorrect data, even when not experiencing a SEU.

To support this assumption, the (, to) cost function needs to be modified to consider precedence relations. Specifically, (fault in |) and (fault in |to) must be replaced to not just consider if failed in isolation. Let us call this precedence-aware probability the probability of incorrect output.

(incorrect output |) = (fault in |)· ∏︁

(fault in |) ∈pred( )

Here, ∈ ∪ ℛ is either a strategy or a result. pred ( ) is the set of all direct and indirect predecessors of task . This probability also includes the behavior of any predecessor tasks.

Figure 4 shows an example task set with a precedence relation. Consider the task set in this example with two strategies: = {, }, protecting task and respectively. Each strategy has two outcomes, hence ℛ = {, , , }. Let us consider (fault in |): the result does not directly communicate anything about the state of task , however due to the precedence relation we know it efectively failed. As such, (incorrect output |) = 1.

6.2. Selective fault-tolerance and criticality

The task model can be extended to support heterogeneity in the ability of tasks to tolerate non-consecutive faults. Let Γ be the set of tasks which cannot tolerate non-consecutive faults. Then, we update (catastrophic fault in |,to ) to consider non-consecutive faults as catastrophic faults when ∈ Γ .

(catastrophic fault in |,to =)

(fault in |)

The goal of the strategy state machine is to minimize the rate of catastrophic faults. This may mean that the rate of catastrophic faults is so low that even a single fault is unlikely to occur across the lifetime of the system. However, when used in soft real-time deployments, a “catastrophic fault” may not be catastrophic while still undesirable. In such a deployment, catastrophic faults may be acceptable. The impact of a task experiencing such a catastrophic fault may not be the same across all tasks. As such, a priority function ( ∈ Γ) can be integrated into (, to ).

(, to ) = 1 − ∏︁ (1 − (catastrophic fault in |,to ))( )

The ( ) function assigns relative priority, where higher is better. A task with ( ) = 2 has twice the importance as a task with ( ) = 1. Note that this change afects the meaning of the output (Δ), which is no longer the rate of catastrophic faults. If the impact of a catastrophic fault in a task with ( ) = 2 is equivalent to two catastrophic faults in another task with ( ) = 1, then the final rate (Δ) can be seen as the rate of catastrophic faults normalized to faults in .

7. Related work

(, ) constraints have been used before in the domain of real-time scheduling. Choi et al. [7] proposed a scheduler, together with an eficient schedulability algorithm for a sporadic task set with tasks under (, ) constraints. This scheduler allows for scheduling task sets that would normally not be schedulable, but utilizing their (, ) constraints allows them to be scheduled.

Chen et al. [8] proposed a solution that similar to ours. Their method ofers fault-tolerance with the goal of reducing the efective fault rate as well as lowering energy consumption. Chen et al. proposes a static scheduling technique called Static Pattern-Based Reliable Execution, ensuring each (, ) constraint is respected in the presence of transient faults. Furthermore, they propose delaying the execution of their static pattern if no fault is detected at runtime, opportunistically running more unprotected instances of the task with the goal of saving energy. However, if the static pattern is found to be unschedulable as per their schedulability test, their implementation is unable to provide a schedule that minimizes the fault rate for a given resource-constrained real-time system. While their approach ofers more flexibility in the task model (specifically the support for (, ) constraints with > 2), it does not consider that fault mitigation may fail. Our approach optimally lowers the fault rate, regardless of the hardware constrains. Furthermore, our approach recognizes that fault mitigation may fail, and includes this in the calculation for lowering the fault rate.

Gujarati et al. [9] contributed a technique for measuring the fault rate of an application with tasks under (, ) constraints. Their technique provides an upper bound for the fault probability per iteration of a Fault-tolerant Single-Input Single-Output (FT-SISO) control loop, similar to our (Δ) output on a task set with precedence relations. Their technique hopes to provide transparency to system designers, allowing analyzing the impact on the reliablity when changing the hardware or software. However, while their approach is aware of (, ) constraints, it does not provide schedules that utilize such constraints. Instead, it merely includes them in the reliability calculation.

The domain of strategy switching shares some aspects with Mixed-Criticality (MC) systems. In an MC system, the system switches between diferent levels of criticality depending on the operating conditions of the system. Tasks are assigned a criticality level, and when the system criticality is higher than that of the task, the task is not scheduled to guarantee the successful and timely execution of tasks with a higher criticality level. Pathan [10] combines MC with fault-tolerance against transient faults. As is typical in MC research, as the level of criticality increases, the pessimism increases. Pathan increases the maximum fault rate when switching to a higher level of criticality. In our approach we do not vary the pessimism of any parameter. Instead, we assume the parameter provides a suitable upper bound to the fault rate in all conditions. Our approach ofers some aspects typically not found in MC systems: while one could argue that each strategy is really a criticality level, it is a criticality level applied to a subset of the tasks (specifically Γ). Finally, the approach by Pathan requires bounding the number of faults that can occur in any window. As such, passing their suficient schedulability test will (under their fault model) guarantee the system will never experience a fault.

8. Conclusion

In this paper, we have shown how strategy switching can be used to improve fault-tolerance for resource-constrained systems. Our method makes efective use of the ability to vary which tasks receive fault-tolerance. It considers at the start of every iteration of the task set what the best set of tasks is to protect. We have shown how our method computes the optimal strategy state machine for any given task set, minimizing the rate of catastrophic faults. We have also shown the flexibility of our method to be extended to support new task and fault models.

9. Future work

In future work, we hope to improve the tractability of our algorithm by both state space reduction algorithms as well as by using heuristics.

Furthermore, we hope to extend the fault model to distinguish between deadline misses and incorrect results. We also hope to integrate tasks with (, ) constraints with > 2. Additionally, we hope to integrate the natural ability of tasks to detect faults into our task model, as a SEU may for example lead to a segfault. Finally, we hope to validate our approach using simulation-based analysis.

Acknowledgments

The presentation of this paper was considerably improved in response to comments provided by the anonymous reviewers, and we gratefully acknowledge their insights and assistance.

This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 871259 (ADMORPH project). Additionally, this work is partially supported by CERCIRAS COST Action CA19135 funded by COST Association.

[1]

Oz ,

Arslan , A survey on multithreading alternatives for soft error fault tolerance , ACM Computing Surveys ( 2019 ).

[2]

R. E.

Lyons , W. Vanderkulk, The use of triple-modular redundancy to improve computer reliability , IBM journal of research and development 6 ( 1962 ) 200 - 209 .

[3]

Chang ,

G. A.

Reis , D. I. August , Automatic instruction-level software-only recovery , in: International Conference on Dependable Systems and Networks (DSN'06) , IEEE, 2006 , pp. 83 - 92 .

[4]

S. A.

Asghari ,

M. Binesh

Marvasti ,

A. M.

Rahmani , Enhancing transient fault tolerance in embedded systems through an OS task level redundancy approach , Future Generation Computer Systems 87 ( 2018 ) 58 - 65 . doi: 10 .1016/j.future. 2018 . 04 .049.

[5]

Bernat ,

Burns ,

Liamosi , Weakly hard real-time systems , IEEE transactions on Computers 50 ( 2001 ) 308 - 321 .

[6]

Broster ,

Burns ,

Rodriguez-Navas , Timing analysis of real-time communication under electromagnetic interference , Real-Time Systems 30 ( 2005 ) 55 - 81 .

[7]

Choi ,

Kim ,

Zhu , Job-class-level fixed priority scheduling of weakly-hard real-time systems , in: 2019 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS) , 2019 , pp. 241 - 253 . doi: 10 .1109/RTAS. 2019 . 00028 .

[8] K.-H. Chen , B.

Bönninghof , J.-J.

Chen , P.

Marwedel , Compensate or ignore? Meeting control robustness requirements through adaptive soft-error handling , in: Proceedings of the 17th ACM SIGPLAN/SIGBED Conference on Languages, Compilers , Tools, and Theory for Embedded Systems , LCTES 2016, Association for Computing Machinery , New York, NY, USA, 2016 , pp. 82 - 91 . doi: 10 .1145/2907950.2907952.

[9]

Gujarati ,

Nasri ,

B. B.

Brandenburg , Quantifying the resiliency of fail-operational realtime networked control systems , in: S. Altmeyer (Ed.), 30th Euromicro Conference on RealTime Systems (ECRTS 2018 ), volume 106 of Leibniz International Proceedings in Informatics (LIPIcs), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik , Dagstuhl, Germany, 2018 , pp. 16 : 1 - 16 : 24 . doi: 10 .4230/LIPIcs.ECRTS. 2018 . 16 .

[10] R. M. Pathan , Fault-tolerant and real-time scheduling for mixed-criticality systems , RealTime Systems 50 ( 2014 ) 509 - 547 .