<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dynamic Feasibility Window Reconfiguration for a Failure-Tolerant PFair Scheduling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yves Mouafo Tchinda</string-name>
          <email>yves.mouafo@ensma.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annie Choquet-Geniet</string-name>
          <email>annie.geniet@univ-poitiers.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gaelle Largeteau-Skapin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIAS-ENSMA Teleport2, 1 av. Clement Ader BP 40109 86961 Futuroscope-Chasseneuil France XLIM-UNIVERSITE DE POITIERS Teleport2, 11 bd Marie et Pierre Curie</institution>
          ,
          <addr-line>BP 30179 86962 Futuroscope-Chasseneuil</addr-line>
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This work addresses the problem of failure tolerance for real-time applications. We propose a technique to schedule a system of tasks running under PD2 on a multicore architecture subject to the failure of one of its cores at a given time, with re-execution of the task running on it. The technique is based on limited hardware redundancy and on the reconfiguration of subtask feasibility windows at runtime. We limit the study to the re-execution of one time unit.</p>
      </abstract>
      <kwd-group>
        <kwd>Failure Tolerance</kwd>
        <kwd>Multicore Architecture</kwd>
        <kwd>Task Re-execution</kwd>
        <kwd>Pfair Scheduling</kwd>
        <kwd>Dynamic Reconfiguration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent decades, there has been a significant increase in the use of multicore
architectures in real-time embedded systems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A multicore processor integrates two or
more independent cores in a single package. The drawback of such an
architecture is its vulnerability: a core may fail at any time, requiring an adaptation
of the system [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Therefore, designers of systems running on such architectures
attach significant importance to failure tolerance.
      </p>
      <p>We consider systems of critical periodic tasks running on a multicore
architecture. We use the classical temporal modelling of tasks. Each task τi is
characterized by four temporal parameters: the first release date or offset ri, the
worst-case execution time (WCET) Ci, the period Ti and the relative deadline
Di, which is the maximum delay allowed from the release of any instance of the
task to its completion.</p>
      <p>If all the first release dates are equal, the tasks are said to have simultaneous first
releases. When the deadlines verify Di ≤ Ti, we say that they are constrained;
when Di = Ti, they are implicit. A system S is characterised by its processor
utilisation factor, which represents the processor workload due to the system.
It is defined by U = Σ_{i=1}^{n} (Ci/Ti). Finally, H = LCM(T1, ..., Tn) denotes the
hyper-period of the system.</p>
      <p>
        In the sequel, we consider a system composed of n periodic independent tasks
with simultaneous first releases and implicit deadlines. A task τi is denoted
by a triplet &lt; Ci; Di; Ti &gt; when the deadlines are constrained, or by a pair
&lt; Ci; Ti &gt; when they are implicit. We consider here global scheduling with the
algorithm PD2 (see 2.1), which is optimal in our context. In addition, there exists
a necessary and sufficient feasibility condition that is easy to compute: the system is
feasible on k cores iff Σ_{i=1}^{n} (Ci/Ti) ≤ k [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. PD2 requires each task to be divided into
unitary subtasks. If one of the cores fails, the failure affects the subtasks running on it. In
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we proved that if the affected subtasks are simply ignored, limited hardware
redundancy (see 4.1), which consists in the addition of a single further core,
provides a valid schedule (i.e. a schedule which meets all its temporal constraints).
In the case where the affected subtask must be re-executed, the main issue is to
ensure the validity of the schedule, even if the re-execution causes delays.
      </p>
      <p>We propose a technique based on a dynamic reconfiguration of the subtask
feasibility windows at runtime. The execution starts with constrained feasibility
windows. After the failure detection, the feasibility windows are relaxed. The
constrained windows are computed by means of a task system with constrained
deadlines, using a method named the ghost subtask method (see 4.2).
We prove that, together with limited hardware redundancy, this technique
of contraction and relaxation of feasibility windows provides a valid schedule.</p>
      <p>The remainder of this paper is organized as follows: in Section 2, we present
the scheduling algorithm PD2 and the failure models of interest. Then, a state
of the art is given (Section 3). It is followed by the presentation of our approach
(Section 4) and of the priority inversion issue (Section 5). The paper is concluded
with the presentation of experimental results and some elements of proof of our
central feasibility result.</p>
    </sec>
    <sec id="sec-2">
      <title>Context and problem statement</title>
      <sec id="sec-2-1">
        <title>Scheduling algorithm</title>
        <p>
          PD2 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is a PFair (Proportionate Fair) algorithm. PFair algorithms require
the execution of tasks at a regular rate; the objective is to approach an ideal
schedule in which each task τi receives exactly Ui·t processor time units in
the range [0; t). The construction of a PFair schedule requires dividing each
task τi into unitary subtasks τij (j ≥ 0). Each subtask has a pseudo-release date
rij = ⌊j/Ui⌋ and a pseudo-deadline dij = ⌈(j+1)/Ui⌉, with j ≥ 0 and Ui = Ci/Ti. The
interval [rij, dij) represents the feasibility window of the subtask τij. Scheduling a
subtask τij in its feasibility window means that τi runs for one time unit within
the time interval [rij, dij). Here the notion of subtask refers to a time division,
not to a software division.
        </p>
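        <p>As an illustration of these formulas, here is a minimal sketch in Python, with exact rational arithmetic (the function name is ours):</p>

```python
from fractions import Fraction
from math import floor, ceil

def feasibility_windows(C, T, count):
    """Pseudo-release dates and pseudo-deadlines of the first `count`
    subtasks of a task with WCET C and period T:
    r_ij = floor(j / Ui), d_ij = ceil((j + 1) / Ui), with Ui = C/T."""
    U = Fraction(C, T)
    return [(floor(j / U), ceil((j + 1) / U)) for j in range(count)]

# The three subtasks of a task with C = 3, T = 4 (Ui = 3/4)
print(feasibility_windows(3, 4, 3))  # [(0, 2), (1, 3), (2, 4)]
```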
        <p>PD2 is recognized as the most efficient PFair algorithm because it has a low
complexity for decision-making. A subtask has priority over another if it has the
smallest pseudo-deadline. In the case of equality of deadlines, PD2 uses two additional
criteria, based on the successor bit bij and the group deadline Dij.
If τij and τi,j+1 denote two successive subtasks of the task τi, bij = 1 if their
feasibility windows overlap and bij = 0 otherwise. If Ui &lt; 0.5, Dij = 0 (j ≥ 0);
for a heavy task (Ui ≥ 0.5, Ui ≠ 1), Dij = ⌈(dij − (j+1))/(1 − Ui)⌉.</p>
        <p>Then, a subtask τij has priority over the subtask τqk if
(dij &lt; dqk) OR (dij = dqk ∧ bij &gt; bqk) OR (dij = dqk ∧ bij = bqk = 1 ∧ Dij &gt; Dqk) OR
(dij = dqk ∧ bij = bqk = 1 ∧ Dij = Dqk ∧ i &lt; q). The last condition guarantees
determinism. In Section 5 we present an example.</p>
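        <p>The full PD2 comparison can be sketched as follows (our own illustrative Python helpers, using the definitions above; the Ui = 1 case is excluded from the group deadline for simplicity):</p>

```python
from fractions import Fraction
from math import ceil

def deadline(U, j):
    # Pseudo-deadline d_ij = ceil((j + 1) / Ui)
    return ceil((j + 1) / U)

def successor_bit(U, j):
    # b_ij = 1 iff the windows of subtasks j and j + 1 overlap,
    # i.e. iff (j + 1) / Ui is not an integer
    return 0 if ((j + 1) / U).denominator == 1 else 1

def group_deadline(U, j):
    # D_ij = 0 for light tasks; for heavy tasks (Ui at least 1/2, Ui not 1):
    # D_ij = ceil((d_ij - (j + 1)) / (1 - Ui))
    if 2 * U > 1 and U != 1:
        return ceil((deadline(U, j) - (j + 1)) / (1 - U))
    return 0

def has_priority(i, Ui, j, q, Uq, k):
    """True if subtask j of task i has priority over subtask k of task q
    under PD2: smaller deadline, then successor bit, then group deadline,
    then smaller task index (determinism)."""
    di, dq = deadline(Ui, j), deadline(Uq, k)
    if di != dq:
        return dq > di
    bi, bq = successor_bit(Ui, j), successor_bit(Uq, k)
    if bi != bq:
        return bi > bq
    if bi == bq == 1:
        Di, Dq = group_deadline(Ui, j), group_deadline(Uq, k)
        if Di != Dq:
            return Di > Dq
    return q > i

# First subtasks of a weight-3/4 task and a weight-1/3 task: d = 2 vs d = 3
print(has_priority(3, Fraction(3, 4), 0, 1, Fraction(1, 3), 0))  # True
```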
      </sec>
      <sec id="sec-2-2">
        <title>Failure modeling</title>
        <p>
          All computer systems are subject to hardware and software failures. Hardware
failures can be categorised into permanent, transient and intermittent failures
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. A permanent failure, such as the wear-out of a part, requires the replacement
of the failing part to restore the system functionality, and does not disappear by
itself with time. When a core cycles between working and not working,
the failure is said to be intermittent. Transient failures are short-term failures. They
may occur because of external noise or a temporary disturbance from an unidentified
source; the core then recovers after a while. We consider here permanent failures.
We assume that during the processing only one core fails. In this context, there
are two possible scenarios, depending on the detection delay:
1. Either the failure is detected instantly by means of a mechanism such as
the ones presented in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. This means that no execution is lost, and thus the
problem of giving additional time to some tasks does not arise.
2. Or the failure is detected after x time units. In this case, we can consider
three cases: (a) The current instances of the affected tasks must be fully
executed: additional time is allocated to regain the lost execution and complete
the tasks; (b) It is not mandatory to fully execute the current instances of
the impacted tasks: there is thus no further time allocation. The affected
tasks will use their remaining time to re-execute what has been lost and then
continue execution until full use of the allocated time. This is for example
the case for tasks which compute a result by means of successive iterations:
a shorter execution means a less precise result, but does not lead to
malfunctions; (c) The current instances of the affected tasks can be discarded. The
scheduling continues with the unaffected task instances.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Outline of the study</title>
        <p>
          Failure treatment consists of failure diagnosis and failure passivation [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Failure
diagnosis consists in locating the source of the failure, i.e. the affected
(hardware or software) component(s), and in determining the nature of the
failure (transient, permanent or intermittent). Passivation consists in providing
an emergency response to the identified failure. It is our concern in this paper.
        </p>
        <p>We focus on permanent failures. We consider that the diagnosis stage has
been completed, that the failure is detected one time unit after its occurrence
and that the failing core has been identified. Thus, only one task is affected, and we assume
that its current instance must be fully completed. We are therefore in case
2 (a) of the previous description, with x = 1. According to the task decomposition
into unitary subtasks, only one subtask of the affected task will have to be
re-executed. Since the tasks are periodic with simultaneous first releases, the system
is cyclic with period H; thus, we limit the study to the first hyper-period [0; H).
Before presenting our solution, based on limited hardware redundancy combined
with feasibility window reconfiguration, we first give an overview of existing
related results in the literature.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>State of the art</title>
      <sec id="sec-3-1">
        <title>Related works</title>
        <p>
          The aim of failure tolerance is to schedule tasks in such a way that deadlines are
still met despite processor or software failure. Several studies have been devoted
to that issue. In this section, we are interested in those on tolerance to hardware
failures on a multicore or multiprocessor architecture. The classical way to
provide fault-tolerance on multicore platforms is to use redundancy [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The idea
is to introduce redundant copies of the elements to be protected (processor or
other components), and exploit them in the case of a fault. Therefore, hardware,
time, information and software redundancies are used for fault-tolerance [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Of
these types, time and hardware redundancy are the most frequently used. There
are two main techniques for hardware redundancy: Triple Modular Redundancy
(TMR) and Primary/Backup (PB). In TMR, three processors run redundant
copies of the same workload and mask errors by voting on their outputs. In PB,
two copies of each task are scheduled on different processors. The primary copy
is executed and its output is checked for correctness by an acceptance test. If
the acceptance test is negative, the execution of the backup copy is initiated.
        </p>
        <p>While there are some variations from one approach to another, the general
method to respond to a failure can be described as follows:
{ Transient failures: If the system is designed only to withstand transients
that disappear quickly, re-execution of the failed task, or of a shorter,
simpler version of that task, is carried out. The scheduling problem reduces to
ensuring that there is always enough time to carry out such an execution
before the deadline.
{ Permanent failures: Backup versions of the tasks assigned to the failing core
must be invoked. The steps are: provide each task with a backup copy; place
the backups in the schedule; if the processor fails, activate one backup for
each task that has been affected.</p>
        <p>
          Most papers found in the literature concern transient failures. The main
techniques are given in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. In [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], the authors propose an algorithm which is the
combination of TMR and DMR (Double Modular Redundancy) with a hybrid
scheduling, based on the addition of an architectural
vulnerability factor (AVF) to the task parameters. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] introduces checkpointing optimization. While
running, each task records its state at checkpoints. When a failure occurs, the
backup can resume execution from the latest checkpoint.
        </p>
        <p>
          Other authors address the use of the PB technique for a fault-tolerant EDF
scheduling [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Note that the papers cited above concern partitioned scheduling. For
global scheduling, there is little work. We can mention [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] which proposes an
algorithm (named FT-PF) for a fault-tolerant PFair scheduling. This algorithm
increases the utilization of a task to ensure that it completes execution Vi time
units before its deadline. So, when a task must recover, it can be allocated Vi
time units for recovery. FT-PF also uses a spare core to ensure that recovery
will not fail for lack of resources. This technique is similar to what we propose
for a permanent failure, in that it modifies task temporal parameters
and uses one more core. The difference is that, in our approach, this core is
not kept idle at the beginning of the scheduling and, after the failure, only the
affected task keeps its modified parameters. Moreover, in the FT-PF approach,
the additional time for task recovery is created by increasing the WCET while
keeping implicit deadlines, whereas in our approach the WCET is kept at its
initial value and the deadlines are constrained.
        </p>
        <p>
          There are still some works on permanent failures in the context of partitioned
scheduling. We can cite [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] whose idea is to provide each core with a twin, and to
assign to that twin all the tasks assigned to the initial core. Then each pair of
cores can suffer one failure without any deadline being missed. However, such
a pairing approach can require more cores than necessary and lead to system
overload. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] enriches the PB technique by distinguishing two types of backup copies:
active backups and passive backups. The active backup is released when we know
that there is not enough remaining time to complete the primary copy. So, it
can start running even before we know that the primary has failed. The passive
backup starts after the failure of the primary. In summary, the active backup is
used as a preventive solution and the passive one as a curative solution.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Originality of our contribution</title>
        <p>PB, which is the main technique used to manage a permanent failure, is not
suitable in a global scheduling context, because it assumes that primary copies and
backups are statically assigned to the cores and that no task migration is allowed.
Due to backup scheduling, the PB technique increases the system load and
therefore the number of redundant cores required to guarantee feasibility.
In light of these remarks, building a tolerant schedule with PFair
requires a different approach. We thus propose a technique based on the
modification of the temporal parameters. Since tasks are divided into subtasks, there is no
need to use checkpointing for the recovery. The originality of this contribution
is thus at two levels: 1- it covers a domain where there is little work, namely
tolerance to a permanent failure in a context of global scheduling,
specifically with PFair algorithms; 2- it offers a new approach that limits
hardware redundancy while preserving the full functionality of the system.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Our approach</title>
      <p>To guarantee a valid behavior despite the failure of a core, we propose to act,
on the one hand, at the architecture level, through limited hardware redundancy
and, on the other hand, at the system level, through dynamic reconfiguration of
feasibility windows. The main problem of the latter is that it can produce priority
modifications (inversions) that must be managed.</p>
      <sec id="sec-4-1">
        <title>Limited hardware redundancy</title>
        <p>The minimal required number of cores for the application to be feasible is m =
⌈U⌉, according to the feasibility condition. Now, if U = m, the system cannot
support an additional execution time unit on m cores, since S is fully loaded.
Thus, if U is an integer, we set m = U + 1. Therefore, we finally state m =
⌊U⌋ + 1. The limited hardware redundancy consists in providing one more core
than necessary. So, S will run on m + 1 cores. When a failure occurs, after
recovery, the system will remain feasible on the m remaining cores.
We denote by tp the time at which the failure occurs. The system thus switches to
m cores at time tp + 1.</p>
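        <p>A minimal sketch of this rule (Python, exact arithmetic; names are ours):</p>

```python
from fractions import Fraction
from math import floor

def cores_needed(tasks):
    """Number of cores m for limited hardware redundancy:
    m = floor(U) + 1, which equals ceil(U) except when U is an
    integer, where one extra core is taken; the application then
    actually runs on m + 1 cores."""
    U = sum(Fraction(C, T) for C, T in tasks)
    return floor(U) + 1

# The five-task example system of Section 4.4: U = 31/12, so m = 3
S = [(1, 3), (3, 6), (3, 4), (5, 12), (7, 12)]
print(cores_needed(S))  # 3
```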
      </sec>
      <sec id="sec-4-2">
        <title>Dynamic reconfiguration of windows: the ghost subtask method</title>
        <p>The objective is to create for each task a feasibility window, called the tolerance
window, which will be used as additional time to permit the rescheduling of the
affected subtask. For each task, we replace the implicit deadline with a constrained
deadline. To compute them, we use the ghost subtask method, which simulates
the addition of a subtask to each task.</p>
        <p>For each task τi, we consider a WCET equal to Ci + 1 in the calculation of the
subtask feasibility windows: each instance is assumed to have one more subtask.
The pseudo-deadline of the penultimate subtask (τi,Ci−1) of the first instance is taken
as the relative deadline: D'i = di,Ci−1 (see Fig. 1).</p>
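        <p>Since the ghost task has weight (Ci + 1)/Ti, this deadline works out to D'i = ⌈Ci·Ti/(Ci + 1)⌉; a minimal sketch (the function name is ours):</p>

```python
def ghost_deadline(C, T):
    """Constrained relative deadline from the ghost subtask method:
    with ghost WCET C + 1 (weight (C + 1)/T), the pseudo-deadline of
    the penultimate subtask, of index C - 1, is
    D' = ceil(C / ((C + 1) / T)) = ceil(C * T / (C + 1))."""
    return -((-C * T) // (C + 1))   # integer ceiling division

# Tasks of the example system in Section 4.4
print([ghost_deadline(C, T) for C, T in [(1, 3), (3, 6), (3, 4), (5, 12), (7, 12)]])
# [2, 5, 3, 10, 11]
```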
        <p>
          The subtask feasibility windows of the first instance of a task τ'i with
constrained deadlines are computed by [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] as r'ij = ⌊j/CHi⌋ and d'ij = ⌈(j+1)/CHi⌉, j ≥ 0,
where CHi = Ci/D'i denotes the load factor of the task. To obtain the
corresponding parameters for the hth (h &gt; 0) instance of τ'i, we just add h·Ti to the first
instance values.
        </p>
        <p>Before the failure, the feasibility windows are calculated with constrained
deadlines. When the failure is detected, the subtasks of the unaffected tasks
switch to their windows with implicit deadlines, whereas the affected task τi0
keeps on using the constrained windows until the end of the current
hyper-period. In addition, the WCET of τi0 switches from Ci0 to Ci0 + 1 to integrate
the tolerance window needed for the re-execution of the affected subtask.
We thus use the following systems:
{ The initial system S : τi &lt; Ci; Ti &gt;, having tasks with implicit deadlines;
{ The ghost system S+ : τi &lt; Ci + 1; Ti &gt;, in which each task has one more
subtask (the ghost subtask used to determine the task deadline);
{ The constrained system S' : τ'i &lt; Ci; D'i; Ti &gt;, with constrained deadlines
deduced from the S+ feasibility windows;
{ The intermediate system Si0 : τi (i ≠ i0) &lt; Ci; Ti &gt;; τi0 &lt; Ci0 + 1; Ti0 &gt;.</p>
        <p>Our assumptions on these systems are the following:</p>
        <sec id="sec-4-2-1">
          <title>Assumptions</title>
          <p>1. S is feasible, which is guaranteed by the choice of m cores.
2. Each task in S has at least one free time unit between its WCET and its
period, i.e. ∀ τi, Ti − Ci ≥ 1.
3. S' is feasible on m + 1 cores: Σ_{i=1}^{n} (Ci/D'i) &lt; m + 1 [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ].
4. Si0 is feasible on m cores for any i0, i.e. max over i0 of ( Σ_{i ≠ i0} (Ci/Ti) + (Ci0 + 1)/Ti0 ) ≤ m.
The subtasks use their constrained feasibility windows before the failure, then the
intermediate windows from the time of failure detection tp + 1 until the next
hyper-period of the system NextH, and finally their initial windows after this
hyper-period.
          </p>
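          <p>These four assumptions can be checked mechanically; a sketch (Python, exact arithmetic, with the constrained deadlines recomputed by the ghost subtask method; the helper names are ours):</p>

```python
from fractions import Fraction
from math import floor, ceil

def check_assumptions(tasks):
    """tasks: list of (Ci, Ti) pairs. Checks assumptions 1 to 4 for
    m = floor(U) + 1 cores, with D'i = ceil(Ci * Ti / (Ci + 1))."""
    U = sum(Fraction(C, T) for C, T in tasks)
    m = floor(U) + 1
    a1 = m >= U                                  # 1: S feasible on m cores
    a2 = all(T - C >= 1 for C, T in tasks)       # 2: one free unit per period
    ghost_D = [ceil(Fraction(C * T, C + 1)) for C, T in tasks]
    a3 = m + 1 > sum(Fraction(C, D) for (C, _), D in zip(tasks, ghost_D))
    a4 = all(m >= U - Fraction(C, T) + Fraction(C + 1, T) for C, T in tasks)
    return a1 and a2 and a3 and a4

# The example system of Section 4.4 satisfies all four assumptions
S = [(1, 3), (3, 6), (3, 4), (5, 12), (7, 12)]
print(check_assumptions(S))  # True
```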
          <p>Formally, reconfiguration is performed as follows:</p>
          <p>Unaffected tasks τi (i ≠ i0):
not yet scheduled subtasks switch to their windows in S: [r'ij; d'ij) ↦ [rij; dij).</p>
          <p>Affected task τi0: three steps.</p>
          <p>Before H, the remaining subtasks keep their constrained feasibility windows:
[r'i0j; d'i0j) ↦ [r'i0j; d'i0j).
After H, the remaining subtasks switch to their windows in S:
[r'i0j; d'i0j) ↦ [ri0j; di0j).
The affected subtask τi0,j0 is rescheduled in the tolerance window of the current
instance: [r'i0,j0; d'i0,j0) ↦ [r'i0,th; d'i0,th).</p>
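          <p>The switching rule can be sketched as a window selector (an illustrative helper for first instances only, not the authors' implementation):</p>

```python
from fractions import Fraction
from math import floor, ceil

def pfair_window(num, den, j):
    # Window of subtask j for a weight num/den: [floor(j/w), ceil((j+1)/w))
    w = Fraction(num, den)
    return floor(j / w), ceil((j + 1) / w)

def active_window(C, T, Dc, j, affected, t, tp, next_H):
    """Feasibility window used at time t by subtask j (first instance) of
    a task with WCET C, period T and constrained deadline Dc = D'i.
    The failure hits at tp; reconfiguration happens at tp + 1."""
    if t > tp:
        if (not affected) or t >= next_H:
            return pfair_window(C, T, j)   # relaxed windows of S
    # before detection, and for the affected task until NextH:
    return pfair_window(C, Dc, j)          # constrained windows (weight Ci/D'i)

# tau3 = (C=3, T=4, D'=3) from Section 4.4: tolerance window of subtask 3
print(active_window(3, 4, 3, 3, affected=True, t=2, tp=1, next_H=12))  # (3, 4)
```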
          <p>In the sequel, we denote by S' →tp Si0 the system running from time 0 to
NextH, with a failure at time tp.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>Justification of the approach</title>
        <p>Consider first the architecture. For reasons of weight, cost and energy
consumption, we decided to use the lowest possible number of cores; thus, we only add
one core. We could have decided to use it as a spare core: the application
runs on m cores and, when one of them fails, the spare one takes over. But since
we wanted to be able to give additional time to the affected task, we preferred
to fully use the (m + 1) cores, to get free slots for the additional processing time
unit.</p>
        <p>At the system level, we wanted to constrain the system as little as possible, in
order to manage the largest number of feasible systems. Therefore, we considered
the three different systems. The system S' has constrained deadlines but keeps the
initial WCETs: starting with the increased-WCET system S+ would have
increased the load factor, and some actually feasible systems would have
become infeasible. Then the intermediate system Si0, whose feasibility windows
are constrained only for one task and relaxed for the other ones, provides more
flexibility to reschedule the affected subtask. Finally, the return to the initial system
at the hyper-period ensures the feasibility of the system on the remaining cores
after recovery.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Illustration of the approach</title>
        <sec id="sec-4-4-1">
          <title>Consider the following system of tasks:</title>
          <p>S : τ1 &lt; 1; 3 &gt;; τ2 &lt; 3; 6 &gt;; τ3 &lt; 3; 4 &gt;; τ4 &lt; 5; 12 &gt;; τ5 &lt; 7; 12 &gt;.
We have U = 31/12, thus m = 3 and H = 12.</p>
          <p>We first add a 4th core. We then compute the constrained deadlines. Each task
is supposed to have a WCET increased by 1. We consider the first instance of
each task and get (D'i = d'i,Ci−1, with d'ij = ⌈(j+1)·Ti/(Ci+1)⌉):
D'1 = d'1,0 = ⌈3/2⌉ = 2; D'2 = d'2,2 = ⌈18/4⌉ = 5; D'3 = d'3,2 = ⌈12/4⌉ = 3;
D'4 = d'4,4 = ⌈60/6⌉ = 10; D'5 = d'5,6 = ⌈84/8⌉ = 11.</p>
          <p>We obtain the following constrained system:
S' : τ'1 &lt; 1; 2; 3 &gt;; τ'2 &lt; 3; 5; 6 &gt;; τ'3 &lt; 3; 3; 4 &gt;; τ'4 &lt; 5; 10; 12 &gt;; τ'5 &lt; 7; 11; 12 &gt;.
We assume that a failure occurs on the core C2 at time 1, and thus is detected at
time 2. The affected task is τ3 (i.e. i0 = 3). We therefore have
Si0 : τ1 &lt; 1; 3 &gt;; τ2 &lt; 3; 6 &gt;; τ'3 &lt; 4; 4 &gt;; τ4 &lt; 5; 12 &gt;; τ5 &lt; 7; 12 &gt;.
Figure 2 shows the feasibility windows of each subtask in the different systems
S, S' and S3, and the actual feasibility windows used by each subtask during
the scheduling (see column R).</p>
          <p>In S3, for each instance h of τ3, an additional subtask τ3,th is provided, whose
feasibility window can be used as a tolerance window for re-execution. Thus, the
following windows are reserved: [3; 4) in the first instance, [7; 8) in the second and
[11; 12) in the third. Since the affected subtask belongs to the first instance of τ3,
the first tolerance window [3; 4) is used for its rescheduling. Figure 3 shows the
tolerant scheduling of system S where the failure on the core C2 is detected at
time tp + 1 = 2. An additional subtask is scheduled in the tolerance window [3; 4)
to fully complete τ3. From time 2, the other tasks use their relaxed feasibility
windows.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>An issue: priority inversion</title>
      <sec id="sec-5-1">
        <title>The problem</title>
        <p>According to PD2, priorities among subtasks are based on their feasibility
windows. Because these windows are not the same in the systems S, S' and Si0, our
approach may induce changes of priorities between subtasks when switching
from one system to another. Moreover, a subtask may be released later in Si0
than in S'. This priority inversion between subtasks has consequences on the
scheduling after the failure. Even if, after reconfiguration, subtasks have switched
to their windows in the system Si0, S' →tp Si0 does not necessarily behave
identically to Si0, for three main reasons. Firstly, at a time t &gt; tp there may exist
subtasks pending in Si0 which have already been scheduled in S'. We call them
anticipated subtasks. These subtasks are pending in Si0 but not in S' →tp Si0;
for example, τ5,1 and τ2,1 in Fig. 3 and τ0,0, τ1,0 and τ2,0 in Fig. 5. Secondly, there
may exist subtasks scheduled in Si0 before the failure that were planned after
the failure in S'. We name them staggered subtasks. These are pending in
S' →tp Si0 but not in Si0; it is the case of τ8,0 and τ9,0 in Fig. 5. Finally, a subtask
can be scheduled at a time t in Si0 but later in S' →tp Si0, because it was
replaced by a higher-priority subtask. We call it a postponed subtask. In Fig. 5,
107 and 108 are postponed subtasks.</p>
        <p>Note that a subtask is said to be anticipated, staggered or postponed by reference to
the time planned for its execution in Si0.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Illustration</title>
        <p>Consider the behavior of the systems S' →1 Si0 (Fig. 3) and Si0 (Fig. 4). Due
to reconfiguration, subtasks have the same feasibility windows in both systems.
Moreover, in this example, there is no staggered subtask, since at time t = 1
all the subtasks already scheduled in Si0 have also been scheduled in S' →1 Si0.</p>
        <p>This leads to a first remark: if there is no staggered subtask, a subtask τij
is scheduled in S' →tp Si0 at the same time as in Si0 (e.g. τ3,2 and τ5,2) or
earlier (e.g. τ4,1 and τ1,1).</p>
        <p>Let us now study an example with staggered subtasks. Consider the following
system composed of 48 tasks, where τi–τj &lt; C; T &gt; denotes a list of tasks τi, τi+1, ..., τj
with common temporal parameters.</p>
        <p>S1 : τ0–τ3 &lt; 1; 20 &gt;; τ4–τ7 &lt; 1; 36 &gt;; τ8–τ47 &lt; 2; 38 &gt;.</p>
        <p>U ≈ 2.41 and m = ⌊2.41⌋ + 1 = 3. Thus, the system is feasible on 3 cores.
Applying the ghost subtask method, we get the constrained system
S1' : τ'0–τ'3 &lt; 1; 10; 20 &gt;; τ'4–τ'7 &lt; 1; 18; 36 &gt;; τ'8–τ'47 &lt; 2; 26; 38 &gt;, which runs on 4 cores.</p>
        <p>We suppose that a failure occurs at time tp = 0 on core C4, affecting the
subtask τ3,0. The feasibility windows of the first instances of the tasks in the different
systems are given below:
S1: τ0–τ3: subtask 0 in [0; 20); τ4–τ7: subtask 0 in [0; 36); τ8–τ47: subtask 0 in [0; 19) and subtask 1 in [19; 38).</p>
        <p>S1': τ'0–τ'3: subtask 0 in [0; 10); τ'4–τ'7: subtask 0 in [0; 18); τ'8–τ'47: subtask 0 in [0; 13) and subtask 1 in [13; 26).</p>
        <p>S13: τ0–τ2: subtask 0 in [0; 20); τ3: subtask 0 in [0; 10) and subtask 1 in [10; 20); τ4–τ7: subtask 0 in [0; 36); τ8–τ47: subtask 0 in [0; 19) and subtask 1 in [19; 38).</p>
        <p>We can note the priority inversion between the subtasks of τ0–τ3 and those of τ8–τ47:
in S1 and S13, τ8,0–τ47,0 have priority over τ0,0–τ3,0 (deadline 19 versus 20), whereas
τ0,0–τ3,0 have priority over τ8,0–τ47,0 in S1' (deadline 10 versus 13).
Figure 5 shows the PD2 schedules of the systems S1' →0 S13 and S13.</p>
        <p>The comparison of the two figures leads to a second remark:
at time tp, the 3 anticipated subtasks τ0,0, τ1,0 and τ2,0 correspond to the 2 staggered
subtasks τ8,0 and τ9,0. In S1' →0 S1i0, the staggered subtasks are scheduled at time
tp + 1 = 1, causing the postponement of the subtasks (104 and 105) scheduled at
this time in S1i0. The 2 postponed subtasks are scheduled at time 2, and 2 other
subtasks (107 and 108) are in their turn postponed. Subtask postponement ends
at time t = 15. From this time on, subtasks are scheduled in S1' →0 S1i0 at
the same time as in S1i0 or earlier.</p>
        <p>These remarks give the key arguments for the proof of the validity of our
approach, specifically for Proposition 1 (the first remark) and for Proposition
2 (the second remark).</p>
        <p>To experiment with our approach, we designed a software prototype, named FTA
(Failure Tolerance Analyser), to simulate PD2 scheduling with failures. This
prototype has three main modules: a random system generator, a scheduler (which
randomly determines the time of the failure and the affected core, and schedules
the system according to the proposed approach) and a diagnostic tool (which
analyzes the sequence produced by the scheduler and determines whether it is
valid and fair).</p>
        <p>Several parameters can be set before the analysis: the system load, the
number of heavy tasks (i.e. tasks with Ui ≥ 0.5), the time of occurrence of the failure
and the affected core. 550 random systems have been generated and
submitted to the simulation, by means of 11 random generations,
each composed of 50 systems containing different numbers of heavy tasks. The
simulation is repeated many times, to let the parameters vary and to observe their
impact. All the obtained schedules were valid.</p>
        <p>In the following paragraphs, we give some elements of proof of this result.
We first define some terms and notations.</p>
        <p>{ Sched^nb(Sys): PD2 schedule of the system Sys on nb cores.
{ Sched^(m+1)→m(S' →tp Si0): PD2 schedule of the system S' running on m + 1 cores, with a
failure of one core at time tp and a switch to the system Si0 at time tp + 1.
{ Pending(S; t): list of the pending subtasks in the system S at time t. In our context,
a subtask τij is pending at time t if it is released and not yet scheduled and,
if j &gt; 0, the previous subtask τi,j−1 has already been scheduled.
{ Exec(τij; Sched): execution time of the subtask τij in the schedule Sched.
{ tp: failure time; τi0,j0: the affected subtask.
{ NextH: next hyper-period after tp.
{ rij(S) and dij(S): pseudo-release date and pseudo-deadline of τij in the system S.</p>
        <p>We assume that Sched^m(S), Sched^(m+1)(S') and Sched^m(Si0) are valid and fair.</p>
        <p>Our main result is that if S, S' and S'i are feasible on m, m + 1
and m cores respectively, then S' →tp S'i is feasible.</p>
        <p>The proof of the validity of S' →tp S'i is very technical; thus, we only present
some elements for two cases, as stated in the following theorem:
Theorem 1. If there is no staggered subtask, or if there are staggered subtasks
and the failure occurs at the start of a hyper-period (i.e. tp ≡ 0 [H]),
then Sched(S' →tp S'i, (m+1)→m) is valid and fair.</p>
        <p>To prove this theorem, we first consider the case where no priority inversion
takes place, i.e. there is no staggered subtask. This can be expressed by the
following condition [H1]:
[H1]: ∀ τij ≠ τi0,j0, Exec(τij, Sched(S'i, m)) ≤ tp ⇒ Exec(τij, Sched(S', m+1)) ≤ tp.</p>
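<p>Condition [H1] can be read literally over execution-time maps, as in the sketch below; the dict encoding and function name are hypothetical, introduced only to make the quantifier concrete.</p>

```python
# Literal check of [H1]: every subtask other than the affected one that
# executes by tp in Sched(S'i, m) must also execute by tp in Sched(S', m+1).
def holds_H1(exec_si, exec_sprime, tp, affected):
    """exec_si / exec_sprime map subtask ids (task, index) to execution times."""
    return all(exec_sprime[st] <= tp
               for st, t in exec_si.items()
               if st != affected and t <= tp)

exec_si = {("t1", 0): 0, ("t2", 0): 1, ("t1", 1): 3}  # times in Sched(S'i, m)
exec_sp = {("t1", 0): 0, ("t2", 0): 1, ("t1", 1): 2}  # times in Sched(S', m+1)
print(holds_H1(exec_si, exec_sp, tp=1, affected=("t2", 0)))  # True
```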
        <sec id="sec-5-2-1">
          <title>We then state the following proposition:</title>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>Proposition 1 (R(t))</title>
        <p>If there is no staggered subtask, then at any time t &gt; tp, a subtask τij is not
scheduled later in Sched(S' →tp S'i, (m+1)→m) than in Sched(S'i, m).</p>
        <p>To prove this proposition, we proceed by induction on t, using two further
properties:
Property 1. Prop1(t): a subtask pending at time t (t ≥ tp + 1) in S'i is either
also pending in S' →tp S'i or has already been processed:
τij ∈ Pending(S'i, t) ⇒ τij ∈ Pending(S' →tp S'i, t) or
Exec(τij, Sched(S' →tp S'i, (m+1)→m)) &lt; t.</p>
        <p>Property 2. Prop2(t): at time t (t ≥ tp + 1), if there are k pending subtasks
with higher priority than τij in S' →tp S'i, then in S'i there are at least k
pending subtasks with higher priority than τij.</p>
        <p>Then, we consider the case where there are some priority inversions, i.e.
some staggered subtasks. For simplicity, we present the case where the failure
occurs at the beginning of a hyper-period (i.e. tp ≡ 0 [H]). In this case, the first
subtask of each task is pending in S'. We then prove the next proposition:</p>
      </sec>
      <sec id="sec-5-4">
        <title>Proposition 2. Prop(t)</title>
        <p>At any time t &gt; 0, if all the subtasks already scheduled in S'i have also
been scheduled in S' →tp S'i, then from this time on, each subtask is scheduled in
S' →tp S'i no later than in S'i, i.e.
if Exec(τij, Sched(S'i, m)) = t0 with t0 &gt; tp, then
Exec(τij, Sched(S' →tp S'i, (m+1)→m)) ≤ t0.</p>
        <sec id="sec-5-4-1">
          <title>With these two propositions we now prove the theorem.</title>
        </sec>
        <sec id="sec-5-4-2">
          <title>Proof. (Theorem 1)</title>
          <p>- First case: there is no staggered subtask.</p>
          <p>According to R(t), after the failure we have
Exec(τij, Sched(S' →tp S'i, (m+1)→m)) ≤ Exec(τij, Sched(S'i, m)).
Since Sched(S'i, m) is valid and fair by assumption, Theorem 1 holds.</p>
          <p>- Second case: there are k staggered subtasks at tp = 0.</p>
          <p>Using the second remark of Paragraph 5.2 and Prop(t), we prove that:
(1) At time tp = 0, to the k staggered subtasks τa1,0, ..., τak,0 correspond k + 1
anticipated subtasks τb1,0, ..., τb(k+1),0 (in decreasing priority order) such that
tbi = Exec(τbi,0, Sched(S'i, m)) &gt; tp and Exec(τbi,0, Sched(S', m+1)) = tp,
with tb1 ≤ ... ≤ tb(k+1).
(2) At time t = 1, the staggered subtasks are scheduled in place of anticipated subtasks
while meeting their deadlines: in fact, Exec(τai,0, Sched(S', m+1)) &gt; 0 ⇒ dai,0(S') &gt; 1
and thus dai,0(S) &gt; 1. Their execution can lead to the postponement of other planned
subtasks τir,jr.
(3) From time t = 2 to time t = tmax − 1, subtasks postponed at t − 1
are scheduled at t. We thus have a cascade of postponements of subtasks which
are shifted by one time unit, and we prove that they still meet their
pseudo-deadlines, because of the assumption U(S'i) ≤ m (for reasons of space, we
cannot detail the proof here).
(4) At time tmax, all the staggered and postponed
subtasks have been scheduled in place of anticipated subtasks. Now, all the
subtasks already scheduled in S'i are already scheduled in S' →tp S'i, and
thus Prop(tmax) can be applied.</p>
          <p>Since Prop(t) is true and Sched(S'i, m) is valid by assumption [A3], all the
pseudo-deadlines dij(S) are met in S' →tp S'i. Thus, Sched(S' →tp S'i, (m+1)→m)
is valid and Theorem 1 holds.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and perspectives</title>
      <p>Failure tolerance is a fundamental aspect of embedded system design. Today,
many such systems run on multicore architectures. Multicore architectures improve
timeliness thanks to their increased number of processing units, but they also raise
issues such as hardware failure risks. In this paper,
we considered a permanent failure of one core, detected within the next time
unit after its occurrence, for systems scheduled under PD2. One subtask
is then affected and its execution is resumed. We proposed an approach to failure
tolerance based on limited hardware redundancy and dynamic reconfiguration
of subtask feasibility windows. We have experimentally tested this method, and
the resulting schedules were always valid and fair, whatever the system load,
the time of the failure and the percentage of heavy tasks. This approach raises
the problem of priority inversion between subtasks during scheduling. When
there is no staggered subtask, we proved the validity of the resulting schedule.
We also proved it when there are staggered subtasks and the failure occurs
at the start of a hyper-period (i.e. tp ≡ 0 [H]). We are currently completing the
proof for the last case, where there are staggered subtasks and the failure occurs
at a time different from a hyper-period; this part of the proof is more technical
than the one presented in this paper.</p>
      <p>In future work, we will study the case where the failure detection delay is larger,
so that more subtasks, possibly belonging to different tasks, must be re-executed.
We will also study the case where several cores may fail.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Park</surname>
          </string-name>
          , S. Kim,
          <string-name>
            <given-names>H.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          <article-title>Ha : A Task Remapping Technique for Reliable Multicore Embedded Systems</article-title>
          .
          <source>International Conference on Hardware/Software Codesign and System Synthesis</source>
          , pages
          <volume>307</volume>
          to 316 (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>S.</given-names>
            <surname>Malhotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Narkhede</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Makaraju</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Shanmugasundaram: A Review of Fault Tolerant Scheduling in Multicore Systems</article-title>
          .
          <source>International Journal of Scientific and Technology Research</source>
          , Volume
          <volume>4</volume>
          , pages 132 to 136 (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mouafo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Choquet-Geniet</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Largeteau-Skapin: Failure Tolerance for a Multicore Real-Time System Scheduled by PD2</article-title>
          .
          <source>Proceedings of the 9th Junior Researcher Workshop on Real-Time Computing</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>J. Anderson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Srinivasan: A New Look at Pfair Priorities</article-title>
          .
          <source>Rap. tech. TR00-023</source>
          , University of North Carolina at Chapel Hill (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>S.K.</given-names>
            <surname>Baruah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.K.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.G.</given-names>
            <surname>Plaxton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.A.</given-names>
            <surname>Varvel</surname>
          </string-name>
          :
          <article-title>A Notion of Fairness in Resource Allocation</article-title>
          .
          <source>Algorithmica</source>
          <volume>15</volume>
          (
          <issue>6</issue>
          ), Pg 600-
          <fpage>625</fpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>S.</given-names>
            <surname>Shiravi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.E.</given-names>
            <surname>Salehi</surname>
          </string-name>
          <article-title>: Fault Tolerant Task Scheduling Algorithm for Multicore Systems</article-title>
          .
          <source>The 22nd Iranian Conference on Electrical Engineering (ICEE)</source>
          ,
          <source>Pg 885- 890</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>E.</given-names>
            <surname>Chantery</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <article-title>Pencole: Modélisation et Intégration du Diagnostic Actif dans une Architecture Embarquée</article-title>
          .
          <source>In: Journal Européen des Systèmes Automatisés, MSR 2009, Pg</source>
          <volume>789</volume>
          -
          <fpage>803</fpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondavalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chiaradonna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Di Giandomenico: Efficient Fault</surname>
          </string-name>
          <article-title>Tolerance: An Approach to Deal With Transient Faults in Multiprocessor Architectures</article-title>
          .
          <source>International Conference on Parallel and Distributed Systems, Pg</source>
          <volume>354</volume>
          -
          <fpage>359</fpage>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M.</given-names>
            <surname>Cirinei</surname>
          </string-name>
          , E. Bini,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lipari</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <article-title>Ferrari: A Flexible Scheme for Scheduling FaultTolerant Real-Time Tasks on Multiprocessors</article-title>
          .
          <source>IEEE International Parallel and Distributed Processing Symposium, Pg 1-8</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>C.M.</surname>
          </string-name>
          <article-title>Krishna: Fault-Tolerant Scheduling In Homogeneous Real-Time Systems</article-title>
          .
          <source>In: Journal ACM Computing Surveys (CSUR)</source>
          , Volume
          <volume>46</volume>
          Issue 4,
          <string-name>
            <surname>Article</surname>
            <given-names>No 48</given-names>
          </string-name>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>K.</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. HONG</surname>
          </string-name>
          :
          <article-title>Fault-Tolerant Real-Time Scheduling Using Passive Replicas</article-title>
          . Pacific Rim
          <source>International Symposium on Fault-Tolerance, Pg</source>
          <volume>98</volume>
          -
          <fpage>103</fpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>A.</given-names>
            <surname>Christy</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.R</surname>
          </string-name>
          . Gopalakrishnan
          <source>Nair: Fault-Tolerant Real Time Systems: International Conference on Managing Next Generation Software Application</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>F.</given-names>
            <surname>Liberato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lauzac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Melhem</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <article-title>Mosse: Fault-Tolerant Real-Time Global Scheduling on Multiprocessors</article-title>
          .
          <source>In: Proceedings of the 11th Euromicro Conference on Real-time Systems, Pg</source>
          <volume>252</volume>
          -
          <fpage>259</fpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.H.</given-names>
            <surname>Son</surname>
          </string-name>
          :
          <article-title>An Algorithm For Real-Time Fault-Tolerant Scheduling in Multiprocessor Systems</article-title>
          .
          <source>Euromicro Workshop on Real-Time Systems, Pages</source>
          <volume>190</volume>
          -
          <fpage>195</fpage>
          (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>A.A.</given-names>
            <surname>Bertossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.V.</given-names>
            <surname>Mancini</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.</surname>
          </string-name>
          <article-title>Rossini: Fault-Tolerant Rate-Monotonic First-Fit Scheduling in Hard Real-Time Systems</article-title>
          .
          <source>IEEE Transactions on Parallel and Distributed Systems 10, Pg</source>
          <volume>934</volume>
          -
          <fpage>945</fpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>S.</given-names>
            <surname>Malo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Choquet-Geniet</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Bikienga: PFair Scheduling of Late Released Tasks with Constrained Deadlines. Colloque National sur la Recherche en Informatique et ses applications</article-title>
          , pages
          <fpage>142</fpage>
          -
          <lpage>149</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>