=Paper=
{{Paper
|id=Vol-1689/paper5
|storemode=property
|title=Dynamic Feasibility Windows Reconfiguration For a Failure-Tolerant PFair Scheduling
|pdfUrl=https://ceur-ws.org/Vol-1689/paper5.pdf
|volume=Vol-1689
|authors=Yves Mouafo Tchinda,Annie Geniet-Choquet,Gaëlle Largeteau-Skapin
|dblpUrl=https://dblp.org/rec/conf/vecos/TchindaGL16
}}
==Dynamic Feasibility Windows Reconfiguration For a Failure-Tolerant PFair Scheduling==
Dynamic Feasibility Window Reconfiguration for a Failure-Tolerant PFair Scheduling

Yves Mouafo Tchinda, Annie Choquet-Geniet, and Gaëlle Largeteau-Skapin

LIAS-ENSMA, Téléport 2, 1 av. Clément Ader, BP 40109, 86961 Futuroscope-Chasseneuil, France
XLIM, Université de Poitiers, Téléport 2, 11 bd Marie et Pierre Curie, BP 30179, 86962 Futuroscope-Chasseneuil, France
yves.mouafo@ensma.fr, annie.geniet@univ-poitiers.fr, gaelle.largeteau.skapin@univ-poitiers.fr
http://www.lias-lab.fr; http://www.xlim.fr/sic

Abstract. This work addresses the problem of failure tolerance for real-time applications. We propose a technique to schedule a system of tasks running under PD2 on a multicore architecture subject to the failure of one of its cores at a given time, with re-execution of the task running on it. The technique is based on limited hardware redundancy and on the reconfiguration of subtask feasibility windows at runtime. We limit the study to the re-execution of one time unit.

Keywords: Failure Tolerance, Multicore Architecture, Task Re-execution, PFair Scheduling, Dynamic Reconfiguration

1 Introduction

In recent decades, there has been a significant increase in the use of multicore architectures in real-time embedded systems [1]. A multicore processor gathers two or more independent cores in a single package. The drawback of such an architecture is its fragility: a core may fail at any time, requiring an adaptation of the system [2]. Therefore, designers of systems running on such architectures attach significant importance to failure tolerance.

We consider systems of critical periodic tasks running on a multicore architecture, and we use the classical temporal modelling of tasks. Each task $\tau_i$ is characterized by four temporal parameters: the first release date or offset $r_i$, the worst-case execution time (WCET) $C_i$, the period $T_i$ and the relative deadline $D_i$, which is the maximum delay allowed from the release of any instance of the task to its completion. If all the first release dates are equal, the tasks are said to have simultaneous first releases. When the deadlines verify $D_i \le T_i$, we say that they are constrained; when $D_i = T_i$, they are implicit. A system S is characterised by its processor utilisation factor, which represents the processor workload due to the system; it is defined by $U = \sum_{i=1}^{n}(C_i/T_i)$. Finally, $H = LCM(T_1, \ldots, T_n)$ denotes the hyper-period of the system.

In the sequel, we consider a system composed of n periodic independent tasks with simultaneous first releases and implicit deadlines. A task $\tau_i$ is denoted by a triplet $<C_i, D_i, T_i>$ when the deadlines are constrained, or by a pair $<C_i, T_i>$ when they are implicit.

We consider here global scheduling with the algorithm PD2 (see Section 2.1), which is optimal in our context. In addition, there exists a necessary and sufficient feasibility condition which is easy to compute: the system is feasible on k cores iff $\sum_{i=1}^{n}(C_i/T_i) \le k$ [5]. PD2 requires dividing tasks into unitary subtasks. If one of the cores fails, it affects the subtasks running on it. In [3], we proved that if the affected subtasks are simply ignored, limited hardware redundancy (see Section 4.1), which consists of the addition of a single further core, provides a valid schedule (i.e. a schedule which meets all its temporal constraints). In the case where the affected subtask must be re-executed, the main issue is to ensure the validity of the schedule, even if the re-execution causes delays.
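As a minimal illustration of these definitions, the utilisation factor, the hyper-period and the PFair feasibility test translate directly into code. The sketch below is ours; the representation of tasks as (C, T) pairs is an assumption for the example, not the paper's tooling:

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def utilization(tasks):
    """U = sum(Ci / Ti), computed exactly with rationals."""
    return sum(Fraction(c, t) for c, t in tasks)

def hyperperiod(tasks):
    """H = LCM(T1, ..., Tn)."""
    return reduce(lambda a, b: a * b // gcd(a, b), (t for _, t in tasks))

def feasible_on(tasks, k):
    """Necessary and sufficient PFair feasibility condition [5]: U <= k."""
    return utilization(tasks) <= k

# The system of Section 4.4: U = 31/12, H = 12, feasible on 3 cores.
S = [(1, 3), (3, 6), (3, 4), (5, 12), (7, 12)]
assert feasible_on(S, 3) and hyperperiod(S) == 12
```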
We propose a technique based on a dynamic reconfiguration of the subtask feasibility windows at runtime. The execution starts with constrained feasibility windows; after the failure detection, the feasibility windows are relaxed. The constrained windows are computed by means of a task system with constrained deadlines, using a method named the ghost subtask method (see Section 4.2). We prove that, together with limited hardware redundancy, this technique of contraction and relaxation of feasibility windows provides a valid schedule.

The remainder of this paper is organized as follows: in Section 2, we present the scheduling algorithm PD2 and the failure models of interest. Then, a state of the art is given (Section 3). It is followed by the presentation of our approach (Section 4) and the priority inversion issue (Section 5). The paper concludes with the presentation of experimental results and some elements of proof of our central feasibility result.

2 Context and problem statement

2.1 Scheduling algorithm

PD2 [4] is a PFair (Proportionate Fair) algorithm. PFair algorithms require the execution of tasks at a regular rate: the objective is to approach an ideal schedule in which each task $\tau_i$ receives exactly $U_i \times t$ processor time units in the range $[0, t)$. The construction of a PFair schedule requires dividing each task $\tau_i$ into unitary subtasks $\tau_i^j$ ($j \ge 0$). Each subtask has a pseudo-release date $r_i^j = \lfloor j/U_i \rfloor$ and a pseudo-deadline $d_i^j = \lceil (j+1)/U_i \rceil$, with $j \ge 0$ and $U_i = C_i/T_i$. The interval $[r_i^j, d_i^j)$ represents the feasibility window of the subtask $\tau_i^j$. Scheduling a subtask $\tau_i^j$ in its feasibility window means that $\tau_i$ runs for one time unit within the time interval $[r_i^j, d_i^j)$. Here the notion of subtask refers to a time division, not to a software division.
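These window formulas can be computed mechanically. The following sketch is ours, reusing the (C, T) pair representation assumed earlier:

```python
from fractions import Fraction
from math import ceil, floor

def window(j, C, T):
    """Feasibility window [r_i^j, d_i^j) of subtask j (j >= 0) of a task
    with WCET C and period T, from the PFair definitions above."""
    U = Fraction(C, T)
    r = floor(j / U)        # pseudo-release date  r_i^j = floor(j / U_i)
    d = ceil((j + 1) / U)   # pseudo-deadline      d_i^j = ceil((j+1) / U_i)
    return r, d

# Subtasks of tau_3 <3, 4> (Section 4.4): windows [0,2), [1,3), [2,4).
assert [window(j, 3, 4) for j in range(3)] == [(0, 2), (1, 3), (2, 4)]
```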
PD2 is recognized as the most efficient of the PFair algorithms because of its low decision-making complexity. A subtask has priority over another if it has the smaller pseudo-deadline. In the case of equal deadlines, PD2 uses two additional criteria, based on the successor bit $b_i^j$ and the group deadline $D_i^j$. If $\tau_i^j$ and $\tau_i^{j+1}$ denote two successive subtasks of the task $\tau_i$, then $b_i^j = 1$ if their feasibility windows overlap and $b_i^j = 0$ otherwise. If $U_i \ge 0.5$, then $D_i^j = \lceil (d_i^j - j - 1)/(1 - U_i) \rceil$; otherwise $D_i^j = 0$ ($j \ge 0$). Then, a subtask $\tau_i^j$ has priority over the subtask $\tau_q^k$ if $(d_i^j < d_q^k)$ OR $(d_i^j = d_q^k \wedge b_i^j > b_q^k)$ OR $(d_i^j = d_q^k \wedge b_i^j = b_q^k = 1 \wedge D_i^j > D_q^k)$ OR $(d_i^j = d_q^k \wedge b_i^j = b_q^k = 1 \wedge D_i^j = D_q^k \wedge i < q)$. The last condition guarantees determinism. An example is presented in Section 5.
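This priority order can be encoded as a sort key (lower key = higher priority). The sketch below is our illustration, not the paper's implementation; as a simplifying assumption, ties between subtasks whose successor bits are both 0 are also broken by task index here, and a task with $U_i = 1$ would need a dedicated convention for the group deadline:

```python
from fractions import Fraction
from math import ceil, floor

def pd2_key(i, j, C, T):
    """Priority key of subtask tau_i^j under PD2: pseudo-deadline first,
    then successor bit b (1 wins), then group deadline D (larger wins),
    then the task index (smaller wins, for determinism)."""
    U = Fraction(C, T)
    d = ceil((j + 1) / U)           # pseudo-deadline d_i^j
    r_next = floor((j + 1) / U)     # release of the successor tau_i^{j+1}
    b = 1 if r_next < d else 0      # overlapping windows => b_i^j = 1
    if Fraction(1, 2) <= U < 1:
        D = ceil((d - j - 1) / (1 - U))   # group deadline D_i^j
    else:
        D = 0                             # light task (U < 0.5)
    return (d, -b, -D, i)

# At each time slot, run the m pending subtasks with the smallest keys.
```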
2.2 Failure modeling

All computer systems are subject to hardware and software failures. Hardware failures can be categorised into permanent, transient and intermittent failures [6]. A permanent failure, such as the wear-out of some part, does not disappear by itself with time and requires the replacement of the faulty part to restore the system functionality. When a core cycles between working and out of work, the failure is said to be intermittent. Transient failures are short-term failures; they may occur because of external noise or a temporary disturbance from an unidentified source, and the core then recovers after a while.

We consider here permanent failures, and we assume that only one core fails during the processing. In this context there are two possible scenarios, depending on the detection delay:
1. Either the failure is detected instantly, by means of a mechanism such as the ones presented in [7]. This means that no execution is lost, and thus the problem of giving additional time to some tasks does not arise.
2. Or the failure is detected after x time units. In this case, we can consider three cases:
(a) The current instances of the affected tasks must be fully executed: additional time is allocated to regain the lost execution and complete the tasks;
(b) It is not mandatory to fully execute the current instances of the impacted tasks: there is thus no further time allocation. The affected tasks will use their remaining time to re-execute what has been lost and then continue execution until full use of the allocated time. This is for example the case for tasks which compute a result by means of successive iterations: a shorter execution means a less precise result, but does not lead to malfunctions;
(c) The current instances of the affected tasks can be discarded. The scheduling continues with the unaffected task instances.

2.3 Outline of the study

Failure treatment consists of failure diagnosis and failure passivation [8]. Failure diagnosis consists of locating the source of the failure, i.e. the affected (hardware or software) component(s), and determining the nature of the failure (transient, permanent or intermittent). Passivation consists in providing an emergency response to the identified failure; it is our concern in this paper.

We focus on permanent failures. We consider that the diagnosis stage has been completed, that the failure is detected one time unit after its occurrence, and that the failing core has been identified. Thus, only one task is affected, and we assume that its current instance must be fully completed. We are therefore in case 2-a of the previous description, with x = 1. According to the task decomposition into unitary subtasks, only one subtask of the affected task will have to be re-executed. Since the tasks are periodic with simultaneous first releases, the system is cyclic with period H; thus, we limit the study to the first hyper-period [0, H). Before presenting our solution, based on limited hardware redundancy combined with feasibility window reconfiguration, we first give an overview of existing related results in the literature.

3 State of the art

3.1 Related works

The aim of failure tolerance is to schedule tasks in such a way that deadlines are still met despite processor or software failures. Several studies have been devoted to that issue; in this section, we are interested in those on tolerance to hardware failures on a multicore or multiprocessor architecture. The classical way to provide fault tolerance on multicore platforms is to use redundancy [9]. The idea is to introduce redundant copies of the elements to be protected (processors or other components) and to exploit them in the case of a fault. Hardware, time, information and software redundancies are used for fault tolerance [10]; of these, time and hardware redundancy are the most frequently used. There are two main techniques for hardware redundancy: Triple Modular Redundancy (TMR) and Primary/Backup (PB). In TMR, three processors run redundant copies of the same workload and mask errors by voting on their outputs. In PB, two copies of each task are scheduled on different processors: the primary copy is executed and its output is checked for correctness by an acceptance test; if the acceptance test is negative, the execution of the backup copy is initiated.

While there are some variations from one approach to another, the general method to respond to a failure can be described as follows:
– Transient failures: if the system is designed only to withstand transients that disappear quickly, re-execution of the failed task, or of a shorter, simpler version of that task, is carried out. The scheduling problem reduces to ensuring that there is always enough time to carry out such an execution before the deadline.
– Permanent failures: backup versions of the tasks assigned to the failing core must be invoked. The steps are: provide each task with a backup copy; place the backups in the schedule; if the processor fails, activate one backup for each task that has been affected.

Most papers found in the literature concern transient failures; the main techniques are given in [2]. In [6], the authors propose an algorithm combining TMR and DMR (Double Modular Redundancy) with a hybrid scheduling, based on adding an architectural vulnerability factor (AVF) to the task parameters. [11] introduces checkpointing optimization: while running, each task records its state at checkpoints, and when a failure occurs, the backup can resume execution from the latest checkpoint. Other authors address the use of the PB technique for a fault-tolerant EDF scheduling [12]. Notice that the papers cited above concern partitioned scheduling.

For global scheduling there is little work. We can mention [13], which proposes an algorithm (named FT-PF) for a tolerant PFair scheduling. This algorithm increases the utilization of a task to ensure that it completes execution $V_i$ time units before its deadline; so, when a task must recover, it can be allocated $V_i$ time units for recovery. FT-PF also uses a spare core to ensure that recovery will not fail for lack of resources. This technique is similar to what we propose for a permanent failure, in that it modifies task temporal parameters and uses one more core. The difference is that, in our approach, this core is not kept idle at the beginning of the scheduling, and after the failure, only the affected task keeps its parameters modified. Moreover, in the FT-PF approach, the additional time for a task recovery is created by increasing the WCET while keeping implicit deadlines, whereas in our approach, the WCET is kept at its initial value and the deadlines are constrained.

There is still some work on permanent failures in the context of partitioned scheduling. We can cite [14], whose idea is to provide each core with a twin and to assign to that twin all the tasks assigned to the initial core. Then each pair of cores can suffer a failure without any deadline being missed. However, such a pairing approach can require more cores than necessary and leads to system overload. [15] enriches the PB technique by distinguishing two types of backup copy: active backups and passive backups. An active backup is released when we know that there is not enough remaining time left to complete the primary copy, so it can start running even before we know that the primary has failed; a passive backup starts after the failure of the primary. In summary, the active backup is used as a preventive solution and the passive one as a curative solution.
3.2 Originality of our contribution

PB, the main technique used to manage a permanent failure, is not suitable in a global scheduling context, because it assumes that primary copies and backups are assigned statically to the cores and that no task migration is allowed. Due to backup scheduling, the PB technique increases the system load, and therefore the number of redundant cores required to guarantee feasibility. In light of the previous remarks, building a tolerant scheduling with PFair requires a different approach. We thus propose a technique based on the modification of the temporal parameters. Since tasks are divided into subtasks, there is no need to use checkpointing for the recovery. The originality of this contribution is thus at two levels: 1) it covers a domain where there is little work, namely tolerance to a permanent failure in a context of global scheduling, specifically with PFair algorithms; 2) it offers a new approach that limits hardware redundancy while preserving the full functionality of the system.

4 Our approach

To guarantee a valid behavior despite the failure of a core, we propose to act on the one hand at the architecture level, through limited hardware redundancy, and on the other hand at the system level, through dynamic reconfiguration of feasibility windows. The main problem of the latter is that it can produce priority modifications (inversions) that must be managed.

4.1 Limited hardware redundancy

The minimal number of cores required for the application to be feasible is $m = \lceil U \rceil$, according to the feasibility condition. Now, if $U = m$, the system cannot support an additional execution time unit on m cores, since S is fully loaded. Thus, if U is an integer, we set $m = U + 1$. Therefore, we finally state $m = \lfloor U \rfloor + 1$. The limited hardware redundancy consists in providing one more core than necessary: S will run on m + 1 cores. When a failure occurs, after recovery, the system will remain feasible on the m remaining cores. We denote by $t_p$ the time at which the failure occurs. The system thus switches to m cores at time $t_p + 1$.
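A one-line sketch of this rule (ours), which covers the integer and non-integer cases of U at once:

```python
from fractions import Fraction
from math import floor

def cores_needed(U):
    """m = floor(U) + 1: enough cores for feasibility, with strictly
    positive slack even when U is an integer (Section 4.1)."""
    return floor(U) + 1

assert cores_needed(Fraction(31, 12)) == 3   # example of Section 4.4
assert cores_needed(3) == 4                  # fully loaded integer case
# The application runs on m + 1 cores; after the failure it falls back to m.
```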
4.2 Dynamic reconfiguration of windows: the ghost subtask method

The objective is to create for each task a feasibility window, called the tolerance window, which will be used as additional time to permit the rescheduling of the affected subtask. For each task, we replace the implicit deadline with a constrained deadline. To compute these deadlines, we use the ghost subtask method, which simulates the addition of a subtask to each task: for each task $\tau_i$, we consider a WCET equal to $C_i + 1$ in the calculation of the subtask feasibility windows, so that each instance is assumed to have one more subtask. The pseudo-deadline of the penultimate subtask $\tau_i^{C_i - 1}$ of the first instance is taken as the relative deadline: $D_i' = d_i^{C_i - 1}$ (see Fig. 1).

Fig. 1. The ghost subtask method.

The subtask feasibility windows of the first instance of a task $\tau_i'$ with constrained deadlines are computed by [16]: $r_i'^j = \lfloor j/CH_i \rfloor$ and $d_i'^j = \lceil (j+1)/CH_i \rceil$, $j \ge 0$, where $CH_i = C_i/D_i'$ denotes the load factor of the task. To obtain the corresponding parameters for the h-th (h > 0) instance of $\tau_i'$, we just add $h \times T_i$ to the first-instance values. Before the failure, the feasibility windows are calculated with constrained deadlines. When the failure is detected, the subtasks of the unaffected tasks switch to their windows with implicit deadlines, whereas the affected task $\tau_{i_0}$ keeps on using the constrained windows until the end of the current hyper-period. In addition, the WCET of $\tau_{i_0}$ switches from $C_{i_0}$ to $C_{i_0} + 1$ to integrate the tolerance window needed for the re-execution of the affected subtask. We thus use the following systems:
– The initial system S: $\tau_i <C_i, T_i>$, whose tasks have implicit deadlines;
– The ghost system $S^+$: $\tau_i <C_i + 1, T_i>$, in which each task has one more subtask (the ghost subtask used to determine the task deadline);
– The constrained system $S'$: $\tau_i' <C_i, D_i', T_i>$, with constrained deadlines deduced from the $S^+$ feasibility windows;
– The intermediate system $S_{i_0}$: $\tau_{i \ne i_0} <C_i, T_i>$, $\tau_{i_0} <C_{i_0} + 1, T_{i_0}>$.

Our assumptions on these systems are the following:
1. S is feasible, which is guaranteed by the choice of m cores.
2. Each task in S has at least one time unit left between its WCET and its period, i.e. $\forall \tau_i, T_i - C_i \ge 1$.
3. $S'$ is feasible on m + 1 cores: $\sum_{i=1}^{n}(C_i/D_i') < m + 1$ [16].
4. $S_{i_0}$ is feasible on m cores for any $i_0$, i.e. $\max_{i_0}\left(\sum_{i \ne i_0}(C_i/T_i) + (C_{i_0} + 1)/T_{i_0}\right) \le m$.
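A sketch of the two computations defined above (ours, with the same assumed (C, T) task representation): the constrained deadline obtained from the ghost subtask, and the constrained windows of $S'$:

```python
from fractions import Fraction
from math import ceil, floor

def ghost_deadline(C, T):
    """Constrained deadline D_i' = d_i^{C-1}: pseudo-deadline of the
    penultimate subtask once a ghost subtask is added (WCET C + 1)."""
    U_plus = Fraction(C + 1, T)
    return ceil(C / U_plus)

def constrained_window(j, C, D):
    """Window [r'_i^j, d'_i^j) in S' [16], with load factor CH_i = C_i / D_i'."""
    CH = Fraction(C, D)
    return floor(j / CH), ceil((j + 1) / CH)
```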
The subtasks use their constrained feasibility windows before the failure, then the intermediate windows from the time of failure detection $t_p + 1$ until the next hyper-period of the system $NextH$, and finally their initial windows after this hyper-period. Formally, the reconfiguration is performed as follows:

Non-affected tasks $\tau_i$ ($i \ne i_0$): the not yet scheduled subtasks switch to their windows in S: $[r_i'^j, d_i'^j) \mapsto [r_i^j, d_i^j)$.

Affected task $\tau_{i_0}$: three steps. Before H, the remaining subtasks keep their constrained feasibility windows: $[r_{i_0}'^j, d_{i_0}'^j) \mapsto [r_{i_0}'^j, d_{i_0}'^j)$. After H, the remaining subtasks switch to their windows in S: $[r_{i_0}'^j, d_{i_0}'^j) \mapsto [r_{i_0}^j, d_{i_0}^j)$. The affected subtask $\tau_{i_0}^{j_0}$ is rescheduled in the tolerance window of the current instance: $[r_{i_0}'^{j_0}, d_{i_0}'^{j_0}) \mapsto [r_{i_0}^{t_h}, d_{i_0}^{t_h})$.

In the sequel, we denote by $S' \rightarrow_{t_p} S_{i_0}$ the system running from time 0 to $NextH$, with a failure at time $t_p$.

4.3 Justification of the approach

Consider first the architecture. For the sake of weight, cost and energy consumption, we decided to use the lowest possible number of cores; thus, we only add one core. We could have decided to use it as a spare core: the application runs on m cores and, when one of them fails, the spare one takes over. But since we wanted to be able to give additional time to the affected task, we preferred to fully use the m + 1 cores, so as to get free slots for the additional processing time unit. At the system level, we wanted to constrain the system as little as possible, in order to handle the largest possible number of feasible systems. Therefore, we considered the three different systems. The system $S'$ has constrained deadlines but the initial WCETs: starting with the increased-WCET system $S^+$ would have increased the load factor, and some actually feasible systems would have become infeasible. The intermediate system $S_{i_0}$, whose feasibility windows are constrained only for one task and relaxed for the other ones, provides more flexibility to reschedule the affected subtask. Finally, the return to the initial system at the hyper-period ensures the feasibility of the system on the remaining cores after recovery.

4.4 Illustration of the approach

Consider the following system of tasks: S: $\tau_1 <1, 3>$, $\tau_2 <3, 6>$, $\tau_3 <3, 4>$, $\tau_4 <5, 12>$, $\tau_5 <7, 12>$. We have $U = 31/12$, thus m = 3 and H = 12. We first add a 4th core. We then compute the constrained deadlines: each task is supposed to have a WCET increased by 1, and we consider the first instance of each task, obtaining ($D_i' = d_i^{C_i - 1}$):
$D_1' = d_1^0 = \lceil \frac{0+1}{2/3} \rceil = 2$; $D_2' = d_2^2 = \lceil \frac{2+1}{4/6} \rceil = 5$; $D_3' = d_3^2 = \lceil \frac{2+1}{4/4} \rceil = 3$; $D_4' = d_4^4 = \lceil \frac{4+1}{6/12} \rceil = 10$; $D_5' = d_5^6 = \lceil \frac{6+1}{8/12} \rceil = 11$.
We obtain the following constrained system: $S'$: $\tau_1' <1, 2, 3>$, $\tau_2' <3, 5, 6>$, $\tau_3' <3, 3, 4>$, $\tau_4' <5, 10, 12>$, $\tau_5' <7, 11, 12>$.

We assume that a failure occurs on the core C2 at time 1, and is thus detected at time 2. The affected task is $\tau_3$ (i.e. $i_0 = 3$). We therefore have $S_{i_0}$: $\tau_1 <1, 3>$, $\tau_2 <3, 6>$, $\tau_3' <4, 4>$, $\tau_4 <5, 12>$, $\tau_5 <7, 12>$. Figure 2 shows the feasibility windows of each subtask in the different systems S, $S'$ and $S_3$, together with the actual feasibility windows used by each subtask during the scheduling (see column R). In $S_3$, for each instance h of $\tau_3$, an additional subtask $\tau_3^{t_h}$ is provided, whose feasibility window can be used as a tolerance window for re-execution. Thus, the following windows are reserved: [3, 4) in the first instance, [7, 8) in the second and [11, 12) in the third. Since the affected subtask belongs to the first instance of $\tau_3$, the first tolerance window [3, 4) is used for its rescheduling.

Fig. 2. Subtask feasibility windows in the different systems.

Figure 3 shows the tolerant scheduling of system S where the failure on the core C2 is detected at time $t_p + 1 = 2$. An additional subtask is scheduled in the tolerance window [3, 4) to fully complete $\tau_3$. From time 2, the other tasks use their relaxed feasibility windows.

Fig. 3. A failure-tolerant scheduling of system S.
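Reusing the ghost_deadline sketch from Section 4.2, the constrained deadlines of this example can be checked mechanically:

```python
S = [(1, 3), (3, 6), (3, 4), (5, 12), (7, 12)]
assert [ghost_deadline(C, T) for C, T in S] == [2, 5, 3, 10, 11]
```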
5 An issue: priority inversion

5.1 The problem

According to PD2, priorities among subtasks are based on their feasibility windows. Because these windows are not the same in the systems S, $S'$ and $S_{i_0}$, our approach may induce changes of priorities between subtasks when switching from one system to another. Moreover, a subtask may be released later in $S_{i_0}$ than in $S'$. This priority inversion between subtasks has consequences on the scheduling after the failure. Even if, after reconfiguration, the subtasks have switched to their windows in the system $S_{i_0}$, $S' \rightarrow_{t_p} S_{i_0}$ does not necessarily behave identically to $S_{i_0}$, for three main reasons. Firstly, at a time $t > t_p$ there may exist subtasks pending in $S_{i_0}$ which have already been scheduled in $S'$. We call them anticipated subtasks: these subtasks are pending in $S_{i_0}$ but not in $S' \rightarrow_{t_p} S_{i_0}$. For example, $\tau_5^1$ and $\tau_2^1$ in Fig. 3, and $\tau_0^0$, $\tau_1^0$ and $\tau_2^0$ in Fig. 5. Secondly, there may exist subtasks scheduled in $S_{i_0}$ before the failure that were planned after the failure in $S'$. We name them the staggered subtasks: these are pending in $S' \rightarrow_{t_p} S_{i_0}$, but not in $S_{i_0}$. It is the case of $\tau_8^0$ and $\tau_9^0$ in Fig. 5. Finally, a subtask can be scheduled at some time t in $S_{i_0}$ but later in $S' \rightarrow_{t_p} S_{i_0}$, because it was replaced by a higher-priority subtask. We call it a postponed subtask; in Fig. 5, $\tau_{17}^0$ and $\tau_{18}^0$ are postponed subtasks. Note that a subtask is said to be anticipated, staggered or postponed by reference to the time planned for its execution in $S_{i_0}$.

5.2 Illustration

Consider the behavior of the systems $S' \rightarrow_1 S_{i_0}$ (Fig. 3) and $S_{i_0}$ (Fig. 4). Due to the reconfiguration, the subtasks have the same feasibility windows in both systems. Moreover, in this example, there is no staggered subtask, since at time t = 1 all subtasks already scheduled in $S_{i_0}$ have already been scheduled in $S' \rightarrow_1 S_{i_0}$.

Fig. 4. $S_{i_0}$ scheduling on 3 cores.

This leads to a first remark: if there is no staggered subtask, a subtask $\tau_i^j$ is scheduled in $S' \rightarrow_{t_p} S_{i_0}$ at the same time as in $S_{i_0}$ (e.g. $\tau_3^2$ and $\tau_5^2$) or earlier (e.g. $\tau_4^1$ and $\tau_1^1$).

Let us now study an example with staggered subtasks. Consider the following system composed of 48 tasks, where $\tau_{i-j} <C, T>$ denotes a list of tasks $\tau_i, \tau_{i+1}, \ldots, \tau_j$ with common temporal parameters: $S_1$: $\tau_{0-3} <1, 20>$, $\tau_{4-7} <1, 36>$, $\tau_{8-47} <2, 38>$. We have $U = 2.41$ and $m = \lfloor 2.41 \rfloor + 1 = 3$; the system is thus feasible on 3 cores. Applying the ghost subtask method, we get the constrained system $S_1'$: $\tau_{0-3} <1, 10>$, $\tau_{4-7} <1, 18>$, $\tau_{8-47} <2, 26>$, which runs on 4 cores.

We suppose that a failure occurs at time $t_p = 0$ on core C4, affecting the subtask of $\tau_3'$. The feasibility windows of the first instances of the tasks in the different systems are given below:
$S_1$: $\tau_{0-3}^0 (0, 20)$, $\tau_{4-7}^0 (0, 36)$, $\tau_{8-47}^0 (0, 19)$, $\tau_{8-47}^1 (19, 38)$.
$S_1'$: $\tau_{0-3}^0 (0, 10)$, $\tau_{4-7}^0 (0, 18)$, $\tau_{8-47}^0 (0, 13)$, $\tau_{8-47}^1 (13, 26)$.
$S_{1_3}$: $\tau_{0-2}^0 (0, 20)$, $\tau_3^0 (0, 10)$, $\tau_3^1 (10, 20)$, $\tau_{4-7}^0 (0, 36)$, $\tau_{8-47}^0 (0, 19)$, $\tau_{8-47}^1 (19, 38)$.

We can note the priority inversion between $\tau_{0-3}^0$ and $\tau_{8-47}^0$: in $S_1$ and $S_{1_3}$, $\tau_{8-47}^0$ has priority over $\tau_{0-3}^0$, whereas $\tau_{0-3}^0$ has priority over $\tau_{8-47}^0$ in $S_1'$. Figure 5 shows the PD2 schedules for the systems $S_1' \rightarrow_0 S_{1_3}$ and $S_{1_3}$.

Fig. 5. A tolerant scheduling with staggered subtasks.

The comparison of the two figures leads to a second remark: at time $t_p$, the 3 anticipated subtasks $\tau_0^0$, $\tau_1^0$ and $\tau_2^0$ correspond to the 2 staggered subtasks $\tau_8^0$ and $\tau_9^0$. In $S_1' \rightarrow_0 S_{1_3}$, the staggered subtasks are scheduled at time $t_p + 1 = 1$, causing the postponement of the subtasks ($\tau_{14}^0$ and $\tau_{15}^0$) scheduled at this time in $S_{1_3}$. The 2 postponed subtasks are scheduled at time 2, and 2 other subtasks ($\tau_{17}^0$ and $\tau_{18}^0$) are in their turn postponed. The subtask postponement ends at time t = 15; from this time, subtasks are scheduled in $S_1' \rightarrow_0 S_{1_3}$ at the same time as in $S_{1_3}$ or earlier.

These remarks give the key arguments for the proof of the validity of our approach, specifically for Proposition 1 (the first remark) and for Proposition 2 (the second remark).
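These three families of subtasks can be identified mechanically by comparing the two schedules. The sketch below is our illustration, under the simplifying assumption that each schedule is given as a map from subtask to its (single) execution slot within the studied window:

```python
def classify(exec_ref, exec_tol, tp):
    """Classify subtasks as in Section 5.1, given the execution slots
    in Sched(S_i0) (exec_ref) and in Sched(S' ->tp S_i0) (exec_tol)."""
    anticipated = {s for s, t in exec_ref.items()
                   if t > tp and exec_tol[s] <= tp}   # done early in S'->S_i0
    staggered = {s for s, t in exec_ref.items()
                 if t <= tp and exec_tol[s] > tp}     # still pending after tp
    postponed = {s for s, t in exec_ref.items()
                 if t > tp and exec_tol[s] > t}       # displaced by priorities
    return anticipated, staggered, postponed
```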
6 Results

6.1 Experiments

To test our approach, we designed a software prototype named FTA (Failure Tolerance Analyser) to simulate PD2 scheduling with failures. This prototype has three main modules: a random system generator, a scheduler (which randomly determines the time of the failure and the affected core, and schedules the system according to the proposed approach) and a diagnostic tool (which analyzes the sequence produced by the scheduler and determines whether it is valid and fair). Several parameters can be set before the analysis: the system load, the number of heavy tasks (i.e. tasks with $U_i \ge 0.5$), the time of occurrence of the failure and the affected core. 550 random systems were generated and submitted to the simulation, by means of 11 random generations of 50 systems each, containing different numbers of heavy tasks. The simulation was repeated many times, letting the parameters vary so as to observe their impact. All the obtained schedules were valid. In the following section, we give some elements of proof of this result.
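The structure of such a campaign can be sketched as follows. This is only a skeleton reflecting our reading of the FTA description, not published FTA code: generate_system, pd2_schedule_with_failure and is_valid_and_fair are hypothetical stand-ins for the three modules, while utilization, hyperperiod and cores_needed come from the earlier sketches:

```python
import random

def fta_campaign(n_systems=50, n_heavy=0, seed=0):
    """One generation of an FTA-style campaign: schedule each random
    system with a random failure and check validity and fairness."""
    rng = random.Random(seed)
    verdicts = []
    for _ in range(n_systems):
        S = generate_system(rng, n_heavy=n_heavy)        # hypothetical generator
        tp = rng.randrange(hyperperiod(S))               # random failure time
        core = rng.randrange(cores_needed(utilization(S)) + 1)
        sched = pd2_schedule_with_failure(S, tp, core)   # hypothetical scheduler
        verdicts.append(is_valid_and_fair(sched, S))     # hypothetical diagnostic
    return all(verdicts)
```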
6.2 Validity of our approach

We first define some terms and notations:
– $Sched_{nb}^{Sys}$: PD2 schedule of the system Sys on nb cores.
– $Sched_{(m+1)\rightarrow m}^{S' \rightarrow_{t_p} S_{i_0}}$: PD2 schedule of the system $S'$ running on m + 1 cores, with a failure of one core at time $t_p$ and a switch to the system $S_{i_0}$ at time $t_p + 1$.
– $Pending(S, t)$: list of pending subtasks in system S at time t. In our context, a subtask $\tau_i^j$ is pending at time t if it is released and not yet scheduled and, if j > 0, the previous subtask $\tau_i^{j-1}$ has already been scheduled.
– $Exec(\tau_i^j, Sched)$: execution time of the subtask $\tau_i^j$ in the schedule Sched.
– $t_p$: failure time; $\tau_{i_0}^{j_0}$: the affected subtask.
– $NextH$: next hyper-period after $t_p$.
– $r_i^{j(S)}$ and $d_i^{j(S)}$: pseudo-release date and pseudo-deadline of $\tau_i^j$ in system S.

We assume that $Sched_m^S$, $Sched_{m+1}^{S'}$ and $Sched_m^{S_{i_0}}$ are valid and fair. Our main result is that if S, $S'$ and $S_{i_0}$ are feasible on m, m + 1 and m cores respectively, then $S' \rightarrow_{t_p} S_{i_0}$ is feasible. The proof of the validity of $S' \rightarrow_{t_p} S_{i_0}$ is very technical; thus, we only present some elements in two cases, as stated in the following theorem:

Theorem 1. If there is no staggered subtask, or if there are staggered subtasks and the failure occurs at the time of a hyper-period (i.e. $t_p \equiv 0 \pmod H$), then $Sched_{(m+1)\rightarrow m}^{S' \rightarrow_{t_p} S_{i_0}}$ is valid and fair.

To prove this theorem, we first consider the case where no priority inversion takes place, i.e. there is no staggered subtask. This can be expressed by the following condition [H1]:
[H1]: $\forall \tau_i^j \ne \tau_{i_0}^{j_0}$, $Exec(\tau_i^j, Sched_m^{S_{i_0}}) \le t_p \Rightarrow Exec(\tau_i^j, Sched_{m+1}^{S'}) \le t_p$.

We then state the following proposition:

Proposition 1 (R(t)). If there is no staggered subtask, then at any time $t > t_p$, a subtask $\tau_i^j$ is not scheduled later in $Sched_{(m+1)\rightarrow m}^{S' \rightarrow_{t_p} S_{i_0}}$ than in $Sched_m^{S_{i_0}}$.

To prove this proposition, we proceed by induction on t, using two further properties:

Property 1 (Prop1(t)). A subtask pending at time t ($t \ge t_p + 1$) in $S_{i_0}$ is either also pending in $S' \rightarrow_{t_p} S_{i_0}$, or has already been processed: $\tau_i^j \in Pending(S_{i_0}, t) \Rightarrow \tau_i^j \in Pending(S' \rightarrow_{t_p} S_{i_0}, t)$ or $Exec(\tau_i^j, Sched_{(m+1)\rightarrow m}^{S' \rightarrow_{t_p} S_{i_0}}) < t$.

Property 2 (Prop2(t), t > 0). If all the subtasks already scheduled in $S_{i_0}$ have already been scheduled in $S' \rightarrow_{t_p} S_{i_0}$, then from this time on, each subtask is scheduled in $S' \rightarrow_{t_p} S_{i_0}$ no later than in $S_{i_0}$: if $Exec(\tau_i^j, Sched_m^{S_{i_0}}) = t_0$ with $t_0 > t_p$, then $Exec(\tau_i^j, Sched_{(m+1)\rightarrow m}^{S' \rightarrow_{t_p} S_{i_0}}) \le t_0$.

With these two properties, we now prove the theorem.

Proof (Theorem 1).
First case: there is no staggered subtask. According to R(t), after the failure we have $Exec(\tau_i^j, Sched_{(m+1)\rightarrow m}^{S' \rightarrow_{t_p} S_{i_0}}) \le Exec(\tau_i^j, Sched_m^{S_{i_0}})$. Since $Sched_m^{S_{i_0}}$ is valid and fair by assumption, Theorem 1 holds.

Second case: there are k staggered subtasks at $t_p = 0$. Using the second remark of Section 5.2 and Prop2(t), we prove that:
(1) At time $t_p = 0$, to the k staggered subtasks $\tau_{a_1}^0, \ldots, \tau_{a_k}^0$ correspond k + 1 anticipated subtasks $\tau_{b_1}^0, \ldots, \tau_{b_{k+1}}^0$ (in decreasing priority order) such that $t_{b_i} = Exec(\tau_{b_i}^0, Sched_m^{S_{i_0}}) > t_p$ and $Exec(\tau_{b_i}^0, Sched_{m+1}^{S'}) = t_p$, with $t_{b_1} \le t_{b_2} \le \ldots \le t_{b_{k+1}}$ and $t_{max} = \max_i(t_{b_i})$.
(2) At time t = 1, the staggered subtasks $\tau_{a_1}^0, \ldots, \tau_{a_k}^0$ are among the m highest-priority subtasks in $S' \rightarrow_{t_p} S_{i_0}$ and are thus executed while meeting their deadlines: in fact, $Exec(\tau_{a_i}^0, Sched_{m+1}^{S'}) > 0 \Rightarrow d_{a_i}^{0(S')} > 1$ and thus $d_{a_i}^{0(S)} > 1$. Their execution can lead to the postponement of other planned subtasks.
(3) From time t = 2 to time $t = t_{max} - 1$, the subtasks postponed at t - 1 are scheduled at t. We thus have a cascade of postponements of subtasks, which are shifted by one time unit, and we prove that they still meet their pseudo-deadlines, because of the assumption $U(S_{i_0}) \le m$ (for reasons of space, we cannot detail the proof here).
(4) At time $t_{max}$, all the staggered and postponed subtasks have been scheduled in place of anticipated subtasks. Now all the subtasks already scheduled in $S_{i_0}$ are already scheduled in $S' \rightarrow_{t_p} S_{i_0}$, and thus Prop2($t_{max}$) can be applied.

Since Prop2(t) holds and $Sched_m^{S_{i_0}}$ is valid by assumption [A3], all the pseudo-deadlines $d_i^{j(S)}$ are met in $S' \rightarrow_{t_p} S_{i_0}$. Thus $Sched_{(m+1)\rightarrow m}^{S' \rightarrow_{t_p} S_{i_0}}$ is valid and Theorem 1 holds.

7 Conclusion and perspectives

Failure tolerance is a fundamental aspect of embedded system design. Today, many such systems run on multicore architectures. The advent of multicore improves timeliness, thanks to the increased number of processing units, but these architectures also raise issues, such as hardware failure risks. In this paper, we considered a permanent failure of one core, detected within the time unit following its occurrence, for systems scheduled under PD2. In this setting, one subtask is affected and its execution must be resumed. We proposed an approach to failure tolerance based on limited hardware redundancy and dynamic reconfiguration of the subtask feasibility windows. We tested this method experimentally, and the resulting schedules were always valid and fair, whatever the system load, the time of the failure and the percentage of heavy tasks. The approach raises the problem of priority inversion between subtasks during the scheduling. When there is no staggered subtask, we proved the validity of the resulting schedule; we also proved it when there are staggered subtasks and the failure occurs at the time of a hyper-period (i.e. $t_p \equiv 0 \pmod H$). We are currently completing the proof for the last case, where there are staggered subtasks and the failure occurs at a time which is not a hyper-period; this part of the proof is more technical than the one presented in this paper. In future work, we will study the case where the failure detection delay is larger, so that several subtasks, possibly belonging to different tasks, must be re-executed. We will also study the case where several cores may fail.

References
1. C. Lee, H. Kim, H. Park, S. Kim, H. Oh, S. Ha: A Task Remapping Technique for Reliable Multicore Embedded Systems. International Conference on Hardware/Software Codesign and System Synthesis, pages 307-316 (2015)
2. S. Malhotra, P. Narkhede, K. Shah, S. Makaraju, M. Shanmugasundaram: A Review of Fault Tolerant Scheduling in Multicore Systems.
International Journal of Scientific and Technology Research, Volume 4, pages 132-136 (2015)
3. Y. Mouafo, A. Choquet-Geniet, G. Largeteau-Skapin: Failure Tolerance for a Multicore Real-Time System Scheduled by PD2. Proceedings of the 9th Junior Researcher Workshop on Real-Time Computing, pages 1-4 (2015)
4. J. Anderson, A. Srinivasan: A New Look at Pfair Priorities. Tech. Rep. TR00-023, University of North Carolina at Chapel Hill (1999)
5. S.K. Baruah, N.K. Cohen, C.G. Plaxton, D.A. Varvel: A Notion of Fairness in Resource Allocation. Algorithmica 15(6), pages 600-625 (1996)
6. S. Shiravi, M.E. Salehi: Fault Tolerant Task Scheduling Algorithm for Multicore Systems. The 22nd Iranian Conference on Electrical Engineering (ICEE), pages 885-890 (2014)
7. E. Chantery, Y. Pencole: Modélisation et Intégration du Diagnostic Actif dans une Architecture Embarquée. Journal Européen des Systèmes Automatisés, MSR 2009, pages 789-803 (2009)
8. A. Bondavalli, S. Chiaradonna, F. Di Giandomenico: Efficient Fault Tolerance: An Approach to Deal With Transient Faults in Multiprocessor Architectures. International Conference on Parallel and Distributed Systems, pages 354-359 (1994)
9. M. Cirinei, E. Bini, G. Lipari, A. Ferrari: A Flexible Scheme for Scheduling Fault-Tolerant Real-Time Tasks on Multiprocessors. IEEE International Parallel and Distributed Processing Symposium, pages 1-8 (2007)
10. C.M. Krishna: Fault-Tolerant Scheduling in Homogeneous Real-Time Systems. ACM Computing Surveys (CSUR), Volume 46, Issue 4, Article No. 48 (2014)
11. K. Ahn, J. Kim, S. Hong: Fault-Tolerant Real-Time Scheduling Using Passive Replicas. Pacific Rim International Symposium on Fault-Tolerance, pages 98-103 (1997)
12. A. Christy, T.R. Gopalakrishnan Nair: Fault-Tolerant Real Time Systems. International Conference on Managing Next Generation Software Application (2008)
13. F. Liberato, S. Lauzac, R. Melhem, D. Mosse: Fault-Tolerant Real-Time Global Scheduling on Multiprocessors. Proceedings of the 11th Euromicro Conference on Real-Time Systems, pages 252-259 (1999)
14. Y. Oh, S.H. Son: An Algorithm for Real-Time Fault-Tolerant Scheduling in Multiprocessor Systems. Euromicro Workshop on Real-Time Systems, pages 190-195 (1992)
15. A.A. Bertossi, L.V. Mancini, F. Rossini: Fault-Tolerant Rate-Monotonic First-Fit Scheduling in Hard Real-Time Systems. IEEE Transactions on Parallel and Distributed Systems 10, pages 934-945 (1999)
16. S. Malo, A. Choquet-Geniet, M. Bikienga: PFair Scheduling of Late Released Tasks with Constrained Deadlines. Colloque National sur la Recherche en Informatique et ses Applications, pages 142-149 (2012)