=Paper=
{{Paper
|id=Vol-1689/paper5
|storemode=property
|title=Dynamic Feasibility Windows Reconfiguration For a Failure-Tolerant PFair Scheduling
|pdfUrl=https://ceur-ws.org/Vol-1689/paper5.pdf
|volume=Vol-1689
|authors=Yves Mouafo Tchinda,Annie Geniet-Choquet,Gaëlle Largeteau-Skapin
|dblpUrl=https://dblp.org/rec/conf/vecos/TchindaGL16
}}
==Dynamic Feasibility Windows Reconfiguration For a Failure-Tolerant PFair Scheduling==
Dynamic Feasibility Window Reconfiguration for a Failure-Tolerant PFair Scheduling

Yves Mouafo Tchinda, Annie Choquet-Geniet, and Gaëlle Largeteau-Skapin

LIAS-ENSMA, Téléport 2, 1 av. Clément Ader, BP 40109, 86961 Futuroscope-Chasseneuil, France
XLIM, Université de Poitiers, Téléport 2, 11 bd Marie et Pierre Curie, BP 30179, 86962 Futuroscope-Chasseneuil, France
yves.mouafo@ensma.fr, annie.geniet@univ-poitiers.fr, gaelle.largeteau.skapin@univ-poitiers.fr
http://www.lias-lab.fr; http://www.xlim.fr/sic

Abstract. This work addresses the problem of failure tolerance for real-time applications. We propose a technique to schedule a system of tasks running under PD2 on a multicore architecture subject to the failure of one of its cores at a given time, with re-execution of the task running on it. The technique is based on limited hardware redundancy and on the reconfiguration of subtask feasibility windows at runtime. We limit the study to the re-execution of one time unit.

Keywords: Failure Tolerance, Multicore Architecture, Task Re-execution, PFair Scheduling, Dynamic Reconfiguration

1 Introduction

In recent decades, there has been a significant increase in the use of multicore architectures in real-time embedded systems [1]. A multicore processor gathers two or more independent cores in a single package. The drawback of such an architecture is its fragility: a core may fail at any time, requiring an adaptation of the system [2]. Therefore, designers of systems running on such architectures attach significant importance to failure tolerance.

We consider systems of critical periodic tasks running on a multicore architecture, and we use the classical temporal modelling of tasks. Each task $\tau_i$ is characterized by four temporal parameters: the first release date or offset $r_i$, the worst-case execution time (WCET) $C_i$, the period $T_i$ and the relative deadline $D_i$, which is the maximum delay allowed from the release of any instance of the task to its completion. If all the first release dates are equal, the tasks are said to have simultaneous first releases. When the deadlines verify $D_i \le T_i$, we say that they are constrained; when $D_i = T_i$, they are implicit. A system S is characterised by its processor utilisation factor, which represents the processor workload due to the system; it is defined by $U = \sum_{i=1}^{n}(C_i/T_i)$. Finally, $H = LCM(T_1, \ldots, T_n)$ denotes the hyper-period of the system.

In the sequel, we consider a system composed of n periodic independent tasks with simultaneous first releases and implicit deadlines. A task $\tau_i$ is denoted by a triplet $<C_i, D_i, T_i>$ when the deadlines are constrained, or by a pair $<C_i, T_i>$ when they are implicit.

We consider here global scheduling with the algorithm PD2 (see Section 2.1), which is optimal in our context. In addition, there exists a necessary and sufficient feasibility condition which is easy to compute: the system is feasible on k cores iff $\sum_{i=1}^{n}(C_i/T_i) \le k$ [5]. PD2 requires dividing tasks into unitary subtasks. If one of the cores fails, it affects the subtasks running on it. In [3], we proved that if the affected subtasks are simply ignored, limited hardware redundancy (see Section 4.1), which consists of the addition of a single further core, provides a valid schedule (i.e. a schedule which meets all its temporal constraints). In the case where the affected subtask must be re-executed, the main issue is to ensure the validity of the schedule, even if the re-execution causes delays.
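As a minimal illustration of these definitions, the utilisation factor, the hyper-period and the PFair feasibility test translate directly into code. The sketch below is ours; the representation of tasks as (C, T) pairs is an assumption for the example, not the paper's tooling:

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def utilization(tasks):
    """U = sum(Ci / Ti), computed exactly with rationals."""
    return sum(Fraction(c, t) for c, t in tasks)

def hyperperiod(tasks):
    """H = LCM(T1, ..., Tn)."""
    return reduce(lambda a, b: a * b // gcd(a, b), (t for _, t in tasks))

def feasible_on(tasks, k):
    """Necessary and sufficient PFair feasibility condition [5]: U <= k."""
    return utilization(tasks) <= k

# The system of Section 4.4: U = 31/12, H = 12, feasible on 3 cores.
S = [(1, 3), (3, 6), (3, 4), (5, 12), (7, 12)]
assert feasible_on(S, 3) and hyperperiod(S) == 12
```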
We propose a technique based on a dynamic reconfiguration of the subtask feasibility windows at runtime. The execution starts with constrained feasibility windows; after the failure detection, the feasibility windows are relaxed. The constrained windows are computed by means of a task system with constrained deadlines, using a method named the ghost subtask method (see Section 4.2). We prove that, together with limited hardware redundancy, this technique of contraction and relaxation of feasibility windows provides a valid schedule.

The remainder of this paper is organized as follows: in Section 2, we present the scheduling algorithm PD2 and the failure models of interest. Then, a state of the art is given (Section 3). It is followed by the presentation of our approach (Section 4) and the priority inversion issue (Section 5). The paper concludes with the presentation of experimental results and some elements of proof of our central feasibility result.

2 Context and problem statement

2.1 Scheduling algorithm

PD2 [4] is a PFair (Proportionate Fair) algorithm. PFair algorithms require the execution of tasks at a regular rate: the objective is to approach an ideal schedule in which each task $\tau_i$ receives exactly $U_i \times t$ processor time units in the range $[0, t)$. The construction of a PFair schedule requires dividing each task $\tau_i$ into unitary subtasks $\tau_i^j$ ($j \ge 0$). Each subtask has a pseudo-release date $r_i^j = \lfloor j/U_i \rfloor$ and a pseudo-deadline $d_i^j = \lceil (j+1)/U_i \rceil$, with $j \ge 0$ and $U_i = C_i/T_i$. The interval $[r_i^j, d_i^j)$ represents the feasibility window of the subtask $\tau_i^j$. Scheduling a subtask $\tau_i^j$ in its feasibility window means that $\tau_i$ runs for one time unit within the time interval $[r_i^j, d_i^j)$. Here the notion of subtask refers to a time division, not to a software division.
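These window formulas can be computed mechanically. The following sketch is ours, reusing the (C, T) pair representation assumed earlier:

```python
from fractions import Fraction
from math import ceil, floor

def window(j, C, T):
    """Feasibility window [r_i^j, d_i^j) of subtask j (j >= 0) of a task
    with WCET C and period T, from the PFair definitions above."""
    U = Fraction(C, T)
    r = floor(j / U)        # pseudo-release date  r_i^j = floor(j / U_i)
    d = ceil((j + 1) / U)   # pseudo-deadline      d_i^j = ceil((j+1) / U_i)
    return r, d

# Subtasks of tau_3 <3, 4> (Section 4.4): windows [0,2), [1,3), [2,4).
assert [window(j, 3, 4) for j in range(3)] == [(0, 2), (1, 3), (2, 4)]
```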
PD2 is recognized as the most efficient of the PFair algorithms because of its low decision-making complexity. A subtask has priority over another if it has the smaller pseudo-deadline. In the case of equal deadlines, PD2 uses two additional criteria, based on the successor bit $b_i^j$ and the group deadline $D_i^j$. If $\tau_i^j$ and $\tau_i^{j+1}$ denote two successive subtasks of the task $\tau_i$, then $b_i^j = 1$ if their feasibility windows overlap and $b_i^j = 0$ otherwise. If $U_i \ge 0.5$, then $D_i^j = \lceil (d_i^j - j - 1)/(1 - U_i) \rceil$; otherwise $D_i^j = 0$ ($j \ge 0$). Then, a subtask $\tau_i^j$ has priority over the subtask $\tau_q^k$ if $(d_i^j < d_q^k)$ OR $(d_i^j = d_q^k \wedge b_i^j > b_q^k)$ OR $(d_i^j = d_q^k \wedge b_i^j = b_q^k = 1 \wedge D_i^j > D_q^k)$ OR $(d_i^j = d_q^k \wedge b_i^j = b_q^k = 1 \wedge D_i^j = D_q^k \wedge i < q)$. The last condition guarantees determinism. An example is presented in Section 5.
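This priority order can be encoded as a sort key (lower key = higher priority). The sketch below is our illustration, not the paper's implementation; as a simplifying assumption, ties between subtasks whose successor bits are both 0 are also broken by task index here, and a task with $U_i = 1$ would need a dedicated convention for the group deadline:

```python
from fractions import Fraction
from math import ceil, floor

def pd2_key(i, j, C, T):
    """Priority key of subtask tau_i^j under PD2: pseudo-deadline first,
    then successor bit b (1 wins), then group deadline D (larger wins),
    then the task index (smaller wins, for determinism)."""
    U = Fraction(C, T)
    d = ceil((j + 1) / U)           # pseudo-deadline d_i^j
    r_next = floor((j + 1) / U)     # release of the successor tau_i^{j+1}
    b = 1 if r_next < d else 0      # overlapping windows => b_i^j = 1
    if Fraction(1, 2) <= U < 1:
        D = ceil((d - j - 1) / (1 - U))   # group deadline D_i^j
    else:
        D = 0                             # light task (U < 0.5)
    return (d, -b, -D, i)

# At each time slot, run the m pending subtasks with the smallest keys.
```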
2.2 Failure modeling

All computer systems are subject to hardware and software failures. Hardware failures can be categorised into permanent, transient and intermittent failures [6]. A permanent failure, such as the wear-out of some part, does not disappear by itself with time and requires the replacement of the faulty part to restore the system functionality. When a core cycles between working and out of work, the failure is said to be intermittent. Transient failures are short-term failures; they may occur because of external noise or a temporary disturbance from an unidentified source, and the core then recovers after a while.

We consider here permanent failures, and we assume that only one core fails during the processing. In this context there are two possible scenarios, depending on the detection delay:
1. Either the failure is detected instantly, by means of a mechanism such as the ones presented in [7]. This means that no execution is lost, and thus the problem of giving additional time to some tasks does not arise.
2. Or the failure is detected after x time units. In this case, we can consider three cases:
(a) The current instances of the affected tasks must be fully executed: additional time is allocated to regain the lost execution and complete the tasks;
(b) It is not mandatory to fully execute the current instances of the impacted tasks: there is thus no further time allocation. The affected tasks will use their remaining time to re-execute what has been lost and then continue execution until full use of the allocated time. This is for example the case for tasks which compute a result by means of successive iterations: a shorter execution means a less precise result, but does not lead to malfunctions;
(c) The current instances of the affected tasks can be discarded. The scheduling continues with the unaffected task instances.

2.3 Outline of the study

Failure treatment consists of failure diagnosis and failure passivation [8]. Failure diagnosis consists of locating the source of the failure, i.e. the affected (hardware or software) component(s), and determining the nature of the failure (transient, permanent or intermittent). Passivation consists in providing an emergency response to the identified failure; it is our concern in this paper.

We focus on permanent failures. We consider that the diagnosis stage has been completed, that the failure is detected one time unit after its occurrence, and that the failing core has been identified. Thus, only one task is affected, and we assume that its current instance must be fully completed. We are therefore in case 2-a of the previous description, with x = 1. According to the task decomposition into unitary subtasks, only one subtask of the affected task will have to be re-executed. Since the tasks are periodic with simultaneous first releases, the system is cyclic with period H; thus, we limit the study to the first hyper-period [0, H). Before presenting our solution, based on limited hardware redundancy combined with feasibility window reconfiguration, we first give an overview of existing related results in the literature.

3 State of the art

3.1 Related works

The aim of failure tolerance is to schedule tasks in such a way that deadlines are still met despite processor or software failures. Several studies have been devoted to that issue; in this section, we are interested in those on tolerance to hardware failures on a multicore or multiprocessor architecture. The classical way to provide fault tolerance on multicore platforms is to use redundancy [9]. The idea is to introduce redundant copies of the elements to be protected (processors or other components) and to exploit them in the case of a fault. Hardware, time, information and software redundancies are used for fault tolerance [10]; of these, time and hardware redundancy are the most frequently used. There are two main techniques for hardware redundancy: Triple Modular Redundancy (TMR) and Primary/Backup (PB). In TMR, three processors run redundant copies of the same workload and mask errors by voting on their outputs. In PB, two copies of each task are scheduled on different processors: the primary copy is executed and its output is checked for correctness by an acceptance test; if the acceptance test is negative, the execution of the backup copy is initiated.

While there are some variations from one approach to another, the general method to respond to a failure can be described as follows:
– Transient failures: if the system is designed only to withstand transients that disappear quickly, re-execution of the failed task, or of a shorter, simpler version of that task, is carried out. The scheduling problem reduces to ensuring that there is always enough time to carry out such an execution before the deadline.
– Permanent failures: backup versions of the tasks assigned to the failing core must be invoked. The steps are: provide each task with a backup copy; place the backups in the schedule; if the processor fails, activate one backup for each task that has been affected.

Most papers found in the literature concern transient failures; the main techniques are given in [2]. In [6], the authors propose an algorithm combining TMR and DMR (Double Modular Redundancy) with a hybrid scheduling, based on adding an architectural vulnerability factor (AVF) to the task parameters. [11] introduces checkpointing optimization: while running, each task records its state at checkpoints, and when a failure occurs, the backup can resume execution from the latest checkpoint. Other authors address the use of the PB technique for a fault-tolerant EDF scheduling [12]. Notice that the papers cited above concern partitioned scheduling.

For global scheduling there is little work. We can mention [13], which proposes an algorithm (named FT-PF) for a tolerant PFair scheduling. This algorithm increases the utilization of a task to ensure that it completes execution $V_i$ time units before its deadline; so, when a task must recover, it can be allocated $V_i$ time units for recovery. FT-PF also uses a spare core to ensure that recovery will not fail for lack of resources. This technique is similar to what we propose for a permanent failure, in that it modifies task temporal parameters and uses one more core. The difference is that, in our approach, this core is not kept idle at the beginning of the scheduling, and after the failure, only the affected task keeps its parameters modified. Moreover, in the FT-PF approach, the additional time for a task recovery is created by increasing the WCET while keeping implicit deadlines, whereas in our approach, the WCET is kept at its initial value and the deadlines are constrained.

There is still some work on permanent failures in the context of partitioned scheduling. We can cite [14], whose idea is to provide each core with a twin and to assign to that twin all the tasks assigned to the initial core. Then each pair of cores can suffer a failure without any deadline being missed. However, such a pairing approach can require more cores than necessary and leads to system overload. [15] enriches the PB technique by distinguishing two types of backup copy: active backups and passive backups. An active backup is released when we know that there is not enough remaining time left to complete the primary copy, so it can start running even before we know that the primary has failed; a passive backup starts after the failure of the primary. In summary, the active backup is used as a preventive solution and the passive one as a curative solution.
3.2 Originality of our contribution

PB, the main technique used to manage a permanent failure, is not suitable in a global scheduling context, because it assumes that primary copies and backups are assigned statically to the cores and that no task migration is allowed. Due to backup scheduling, the PB technique increases the system load, and therefore the number of redundant cores required to guarantee feasibility. In light of the previous remarks, building a tolerant scheduling with PFair requires a different approach. We thus propose a technique based on the modification of the temporal parameters. Since tasks are divided into subtasks, there is no need to use checkpointing for the recovery. The originality of this contribution is thus at two levels: 1) it covers a domain where there is little work, namely tolerance to a permanent failure in a context of global scheduling, specifically with PFair algorithms; 2) it offers a new approach that limits hardware redundancy while preserving the full functionality of the system.

4 Our approach

To guarantee a valid behavior despite the failure of a core, we propose to act on the one hand at the architecture level, through limited hardware redundancy, and on the other hand at the system level, through dynamic reconfiguration of feasibility windows. The main problem of the latter is that it can produce priority modifications (inversions) that must be managed.

4.1 Limited hardware redundancy

The minimal number of cores required for the application to be feasible is $m = \lceil U \rceil$, according to the feasibility condition. Now, if $U = m$, the system cannot support an additional execution time unit on m cores, since S is fully loaded. Thus, if U is an integer, we set $m = U + 1$. Therefore, we finally state $m = \lfloor U \rfloor + 1$. The limited hardware redundancy consists in providing one more core than necessary: S will run on m + 1 cores. When a failure occurs, after recovery, the system will remain feasible on the m remaining cores. We denote by $t_p$ the time at which the failure occurs. The system thus switches to m cores at time $t_p + 1$.
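A one-line sketch of this rule (ours), which covers the integer and non-integer cases of U at once:

```python
from fractions import Fraction
from math import floor

def cores_needed(U):
    """m = floor(U) + 1: enough cores for feasibility, with strictly
    positive slack even when U is an integer (Section 4.1)."""
    return floor(U) + 1

assert cores_needed(Fraction(31, 12)) == 3   # example of Section 4.4
assert cores_needed(3) == 4                  # fully loaded integer case
# The application runs on m + 1 cores; after the failure it falls back to m.
```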
4.2 Dynamic reconfiguration of windows: the ghost subtask method

The objective is to create for each task a feasibility window, called the tolerance window, which will be used as additional time to permit the rescheduling of the affected subtask. For each task, we replace the implicit deadline with a constrained deadline. To compute these deadlines, we use the ghost subtask method, which simulates the addition of a subtask to each task: for each task $\tau_i$, we consider a WCET equal to $C_i + 1$ in the calculation of the subtask feasibility windows, so that each instance is assumed to have one more subtask. The pseudo-deadline of the penultimate subtask $\tau_i^{C_i - 1}$ of the first instance is taken as the relative deadline: $D_i' = d_i^{C_i - 1}$ (see Fig. 1).

Fig. 1. The ghost subtask method.

The subtask feasibility windows of the first instance of a task $\tau_i'$ with constrained deadlines are computed by [16]: $r_i'^j = \lfloor j/CH_i \rfloor$ and $d_i'^j = \lceil (j+1)/CH_i \rceil$, $j \ge 0$, where $CH_i = C_i/D_i'$ denotes the load factor of the task. To obtain the corresponding parameters for the h-th (h > 0) instance of $\tau_i'$, we just add $h \times T_i$ to the first-instance values. Before the failure, the feasibility windows are calculated with constrained deadlines. When the failure is detected, the subtasks of the unaffected tasks switch to their windows with implicit deadlines, whereas the affected task $\tau_{i_0}$ keeps on using the constrained windows until the end of the current hyper-period. In addition, the WCET of $\tau_{i_0}$ switches from $C_{i_0}$ to $C_{i_0} + 1$ to integrate the tolerance window needed for the re-execution of the affected subtask. We thus use the following systems:
– The initial system S: $\tau_i <C_i, T_i>$, whose tasks have implicit deadlines;
– The ghost system $S^+$: $\tau_i <C_i + 1, T_i>$, in which each task has one more subtask (the ghost subtask used to determine the task deadline);
– The constrained system $S'$: $\tau_i' <C_i, D_i', T_i>$, with constrained deadlines deduced from the $S^+$ feasibility windows;
– The intermediate system $S_{i_0}$: $\tau_{i \ne i_0} <C_i, T_i>$, $\tau_{i_0} <C_{i_0} + 1, T_{i_0}>$.

Our assumptions on these systems are the following:
1. S is feasible, which is guaranteed by the choice of m cores.
2. Each task in S has at least one time unit left between its WCET and its period, i.e. $\forall \tau_i, T_i - C_i \ge 1$.
3. $S'$ is feasible on m + 1 cores: $\sum_{i=1}^{n}(C_i/D_i') < m + 1$ [16].
4. $S_{i_0}$ is feasible on m cores for any $i_0$, i.e. $\max_{i_0}\left(\sum_{i \ne i_0}(C_i/T_i) + (C_{i_0} + 1)/T_{i_0}\right) \le m$.
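A sketch of the two computations defined above (ours, with the same assumed (C, T) task representation): the constrained deadline obtained from the ghost subtask, and the constrained windows of $S'$:

```python
from fractions import Fraction
from math import ceil, floor

def ghost_deadline(C, T):
    """Constrained deadline D_i' = d_i^{C-1}: pseudo-deadline of the
    penultimate subtask once a ghost subtask is added (WCET C + 1)."""
    U_plus = Fraction(C + 1, T)
    return ceil(C / U_plus)

def constrained_window(j, C, D):
    """Window [r'_i^j, d'_i^j) in S' [16], with load factor CH_i = C_i / D_i'."""
    CH = Fraction(C, D)
    return floor(j / CH), ceil((j + 1) / CH)
```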
The subtasks use their constrained feasibility windows before the failure, then the intermediate windows from the time of failure detection $t_p + 1$ until the next hyper-period of the system $NextH$, and finally their initial windows after this hyper-period. Formally, the reconfiguration is performed as follows:

Non-affected tasks $\tau_i$ ($i \ne i_0$): the not yet scheduled subtasks switch to their windows in S: $[r_i'^j, d_i'^j) \mapsto [r_i^j, d_i^j)$.

Affected task $\tau_{i_0}$: three steps. Before H, the remaining subtasks keep their constrained feasibility windows: $[r_{i_0}'^j, d_{i_0}'^j) \mapsto [r_{i_0}'^j, d_{i_0}'^j)$. After H, the remaining subtasks switch to their windows in S: $[r_{i_0}'^j, d_{i_0}'^j) \mapsto [r_{i_0}^j, d_{i_0}^j)$. The affected subtask $\tau_{i_0}^{j_0}$ is rescheduled in the tolerance window of the current instance: $[r_{i_0}'^{j_0}, d_{i_0}'^{j_0}) \mapsto [r_{i_0}^{t_h}, d_{i_0}^{t_h})$.

In the sequel, we denote by $S' \rightarrow_{t_p} S_{i_0}$ the system running from time 0 to $NextH$, with a failure at time $t_p$.

4.3 Justification of the approach

Consider first the architecture. For the sake of weight, cost and energy consumption, we decided to use the lowest possible number of cores; thus, we only add one core. We could have decided to use it as a spare core: the application runs on m cores and, when one of them fails, the spare one takes over. But since we wanted to be able to give additional time to the affected task, we preferred to fully use the m + 1 cores, so as to get free slots for the additional processing time unit. At the system level, we wanted to constrain the system as little as possible, in order to handle the largest possible number of feasible systems. Therefore, we considered the three different systems. The system $S'$ has constrained deadlines but the initial WCETs: starting with the increased-WCET system $S^+$ would have increased the load factor, and some actually feasible systems would have become infeasible. The intermediate system $S_{i_0}$, whose feasibility windows are constrained only for one task and relaxed for the other ones, provides more flexibility to reschedule the affected subtask. Finally, the return to the initial system at the hyper-period ensures the feasibility of the system on the remaining cores after recovery.

4.4 Illustration of the approach

Consider the following system of tasks: S: $\tau_1 <1, 3>$, $\tau_2 <3, 6>$, $\tau_3 <3, 4>$, $\tau_4 <5, 12>$, $\tau_5 <7, 12>$. We have $U = 31/12$, thus m = 3 and H = 12. We first add a 4th core. We then compute the constrained deadlines: each task is supposed to have a WCET increased by 1, and we consider the first instance of each task, obtaining ($D_i' = d_i^{C_i - 1}$):
$D_1' = d_1^0 = \lceil \frac{0+1}{2/3} \rceil = 2$; $D_2' = d_2^2 = \lceil \frac{2+1}{4/6} \rceil = 5$; $D_3' = d_3^2 = \lceil \frac{2+1}{4/4} \rceil = 3$; $D_4' = d_4^4 = \lceil \frac{4+1}{6/12} \rceil = 10$; $D_5' = d_5^6 = \lceil \frac{6+1}{8/12} \rceil = 11$.
We obtain the following constrained system: $S'$: $\tau_1' <1, 2, 3>$, $\tau_2' <3, 5, 6>$, $\tau_3' <3, 3, 4>$, $\tau_4' <5, 10, 12>$, $\tau_5' <7, 11, 12>$.

We assume that a failure occurs on the core C2 at time 1, and is thus detected at time 2. The affected task is $\tau_3$ (i.e. $i_0 = 3$). We therefore have $S_{i_0}$: $\tau_1 <1, 3>$, $\tau_2 <3, 6>$, $\tau_3' <4, 4>$, $\tau_4 <5, 12>$, $\tau_5 <7, 12>$. Figure 2 shows the feasibility windows of each subtask in the different systems S, $S'$ and $S_3$, together with the actual feasibility windows used by each subtask during the scheduling (see column R). In $S_3$, for each instance h of $\tau_3$, an additional subtask $\tau_3^{t_h}$ is provided, whose feasibility window can be used as a tolerance window for re-execution. Thus, the following windows are reserved: [3, 4) in the first instance, [7, 8) in the second and [11, 12) in the third. Since the affected subtask belongs to the first instance of $\tau_3$, the first tolerance window [3, 4) is used for its rescheduling.

Fig. 2. Subtask feasibility windows in the different systems.

Figure 3 shows the tolerant scheduling of system S where the failure on the core C2 is detected at time $t_p + 1 = 2$. An additional subtask is scheduled in the tolerance window [3, 4) to fully complete $\tau_3$. From time 2, the other tasks use their relaxed feasibility windows.

Fig. 3. A failure-tolerant scheduling of system S.
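Reusing the ghost_deadline sketch from Section 4.2, the constrained deadlines of this example can be checked mechanically:

```python
S = [(1, 3), (3, 6), (3, 4), (5, 12), (7, 12)]
assert [ghost_deadline(C, T) for C, T in S] == [2, 5, 3, 10, 11]
```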
5 An issue: priority inversion

5.1 The problem

According to PD2, priorities among subtasks are based on their feasibility windows. Because these windows are not the same in the systems S, $S'$ and $S_{i_0}$, our approach may induce changes of priorities between subtasks when switching from one system to another. Moreover, a subtask may be released later in $S_{i_0}$ than in $S'$. This priority inversion between subtasks has consequences on the scheduling after the failure. Even if, after reconfiguration, the subtasks have switched to their windows in the system $S_{i_0}$, $S' \rightarrow_{t_p} S_{i_0}$ does not necessarily behave identically to $S_{i_0}$, for three main reasons. Firstly, at a time $t > t_p$ there may exist subtasks pending in $S_{i_0}$ which have already been scheduled in $S'$. We call them anticipated subtasks: these subtasks are pending in $S_{i_0}$ but not in $S' \rightarrow_{t_p} S_{i_0}$. For example, $\tau_5^1$ and $\tau_2^1$ in Fig. 3, and $\tau_0^0$, $\tau_1^0$ and $\tau_2^0$ in Fig. 5. Secondly, there may exist subtasks scheduled in $S_{i_0}$ before the failure that were planned after the failure in $S'$. We name them the staggered subtasks: these are pending in $S' \rightarrow_{t_p} S_{i_0}$, but not in $S_{i_0}$. It is the case of $\tau_8^0$ and $\tau_9^0$ in Fig. 5. Finally, a subtask can be scheduled at some time t in $S_{i_0}$ but later in $S' \rightarrow_{t_p} S_{i_0}$, because it was replaced by a higher-priority subtask. We call it a postponed subtask; in Fig. 5, $\tau_{17}^0$ and $\tau_{18}^0$ are postponed subtasks. Note that a subtask is said to be anticipated, staggered or postponed by reference to the time planned for its execution in $S_{i_0}$.

5.2 Illustration

Consider the behavior of the systems $S' \rightarrow_1 S_{i_0}$ (Fig. 3) and $S_{i_0}$ (Fig. 4). Due to the reconfiguration, the subtasks have the same feasibility windows in both systems. Moreover, in this example, there is no staggered subtask, since at time t = 1 all subtasks already scheduled in $S_{i_0}$ have already been scheduled in $S' \rightarrow_1 S_{i_0}$.

Fig. 4. $S_{i_0}$ scheduling on 3 cores.

This leads to a first remark: if there is no staggered subtask, a subtask $\tau_i^j$ is scheduled in $S' \rightarrow_{t_p} S_{i_0}$ at the same time as in $S_{i_0}$ (e.g. $\tau_3^2$ and $\tau_5^2$) or earlier (e.g. $\tau_4^1$ and $\tau_1^1$).

Let us now study an example with staggered subtasks. Consider the following system composed of 48 tasks, where $\tau_{i-j} <C, T>$ denotes a list of tasks $\tau_i, \tau_{i+1}, \ldots, \tau_j$ with common temporal parameters: $S_1$: $\tau_{0-3} <1, 20>$, $\tau_{4-7} <1, 36>$, $\tau_{8-47} <2, 38>$. We have $U = 2.41$ and $m = \lfloor 2.41 \rfloor + 1 = 3$; the system is thus feasible on 3 cores. Applying the ghost subtask method, we get the constrained system $S_1'$: $\tau_{0-3} <1, 10>$, $\tau_{4-7} <1, 18>$, $\tau_{8-47} <2, 26>$, which runs on 4 cores.

We suppose that a failure occurs at time $t_p = 0$ on core C4, affecting the subtask of $\tau_3'$. The feasibility windows of the first instances of the tasks in the different systems are given below:
$S_1$: $\tau_{0-3}^0 (0, 20)$, $\tau_{4-7}^0 (0, 36)$, $\tau_{8-47}^0 (0, 19)$, $\tau_{8-47}^1 (19, 38)$.
$S_1'$: $\tau_{0-3}^0 (0, 10)$, $\tau_{4-7}^0 (0, 18)$, $\tau_{8-47}^0 (0, 13)$, $\tau_{8-47}^1 (13, 26)$.
$S_{1_3}$: $\tau_{0-2}^0 (0, 20)$, $\tau_3^0 (0, 10)$, $\tau_3^1 (10, 20)$, $\tau_{4-7}^0 (0, 36)$, $\tau_{8-47}^0 (0, 19)$, $\tau_{8-47}^1 (19, 38)$.

We can note the priority inversion between $\tau_{0-3}^0$ and $\tau_{8-47}^0$: in $S_1$ and $S_{1_3}$, $\tau_{8-47}^0$ has priority over $\tau_{0-3}^0$, whereas $\tau_{0-3}^0$ has priority over $\tau_{8-47}^0$ in $S_1'$. Figure 5 shows the PD2 schedules for the systems $S_1' \rightarrow_0 S_{1_3}$ and $S_{1_3}$.

Fig. 5. A tolerant scheduling with staggered subtasks.

The comparison of the two figures leads to a second remark: at time $t_p$, the 3 anticipated subtasks $\tau_0^0$, $\tau_1^0$ and $\tau_2^0$ correspond to the 2 staggered subtasks $\tau_8^0$ and $\tau_9^0$. In $S_1' \rightarrow_0 S_{1_3}$, the staggered subtasks are scheduled at time $t_p + 1 = 1$, causing the postponement of the subtasks ($\tau_{14}^0$ and $\tau_{15}^0$) scheduled at this time in $S_{1_3}$. The 2 postponed subtasks are scheduled at time 2, and 2 other subtasks ($\tau_{17}^0$ and $\tau_{18}^0$) are in their turn postponed. The subtask postponement ends at time t = 15; from this time, subtasks are scheduled in $S_1' \rightarrow_0 S_{1_3}$ at the same time as in $S_{1_3}$ or earlier.

These remarks give the key arguments for the proof of the validity of our approach, specifically for Proposition 1 (the first remark) and for Proposition 2 (the second remark).
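These three families of subtasks can be identified mechanically by comparing the two schedules. The sketch below is our illustration, under the simplifying assumption that each schedule is given as a map from subtask to its (single) execution slot within the studied window:

```python
def classify(exec_ref, exec_tol, tp):
    """Classify subtasks as in Section 5.1, given the execution slots
    in Sched(S_i0) (exec_ref) and in Sched(S' ->tp S_i0) (exec_tol)."""
    anticipated = {s for s, t in exec_ref.items()
                   if t > tp and exec_tol[s] <= tp}   # done early in S'->S_i0
    staggered = {s for s, t in exec_ref.items()
                 if t <= tp and exec_tol[s] > tp}     # still pending after tp
    postponed = {s for s, t in exec_ref.items()
                 if t > tp and exec_tol[s] > t}       # displaced by priorities
    return anticipated, staggered, postponed
```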
6 Results

6.1 Experiments

To test our approach, we designed a software prototype named FTA (Failure Tolerance Analyser) to simulate PD2 scheduling with failures. This prototype has three main modules: a random system generator, a scheduler (which randomly determines the time of the failure and the affected core, and schedules the system according to the proposed approach) and a diagnostic tool (which analyzes the sequence produced by the scheduler and determines whether it is valid and fair). Several parameters can be set before the analysis: the system load, the number of heavy tasks (i.e. tasks with $U_i \ge 0.5$), the time of occurrence of the failure and the affected core. 550 random systems were generated and submitted to the simulation, by means of 11 random generations of 50 systems each, containing different numbers of heavy tasks. The simulation was repeated many times, letting the parameters vary so as to observe their impact. All the obtained schedules were valid. In the following section, we give some elements of proof of this result.
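The structure of such a campaign can be sketched as follows. This is only a skeleton reflecting our reading of the FTA description, not published FTA code: generate_system, pd2_schedule_with_failure and is_valid_and_fair are hypothetical stand-ins for the three modules, while utilization, hyperperiod and cores_needed come from the earlier sketches:

```python
import random

def fta_campaign(n_systems=50, n_heavy=0, seed=0):
    """One generation of an FTA-style campaign: schedule each random
    system with a random failure and check validity and fairness."""
    rng = random.Random(seed)
    verdicts = []
    for _ in range(n_systems):
        S = generate_system(rng, n_heavy=n_heavy)        # hypothetical generator
        tp = rng.randrange(hyperperiod(S))               # random failure time
        core = rng.randrange(cores_needed(utilization(S)) + 1)
        sched = pd2_schedule_with_failure(S, tp, core)   # hypothetical scheduler
        verdicts.append(is_valid_and_fair(sched, S))     # hypothetical diagnostic
    return all(verdicts)
```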
6.2 Validity of our approach

We first define some terms and notations:
– $Sched_{nb}^{Sys}$: PD2 schedule of the system Sys on nb cores.
– $Sched_{(m+1)\rightarrow m}^{S' \rightarrow_{t_p} S_{i_0}}$: PD2 schedule of the system $S'$ running on m + 1 cores, with a failure of one core at time $t_p$ and a switch to the system $S_{i_0}$ at time $t_p + 1$.
– $Pending(S, t)$: list of pending subtasks in system S at time t. In our context, a subtask $\tau_i^j$ is pending at time t if it is released and not yet scheduled and, if j > 0, the previous subtask $\tau_i^{j-1}$ has already been scheduled.
– $Exec(\tau_i^j, Sched)$: execution time of the subtask $\tau_i^j$ in the schedule Sched.
– $t_p$: failure time; $\tau_{i_0}^{j_0}$: the affected subtask.
– $NextH$: next hyper-period after $t_p$.
– $r_i^{j(S)}$ and $d_i^{j(S)}$: pseudo-release date and pseudo-deadline of $\tau_i^j$ in system S.

We assume that $Sched_m^S$, $Sched_{m+1}^{S'}$ and $Sched_m^{S_{i_0}}$ are valid and fair. Our main result is that if S, $S'$ and $S_{i_0}$ are feasible on m, m + 1 and m cores respectively, then $S' \rightarrow_{t_p} S_{i_0}$ is feasible. The proof of the validity of $S' \rightarrow_{t_p} S_{i_0}$ is very technical; thus, we only present some elements in two cases, as stated in the following theorem:

Theorem 1. If there is no staggered subtask, or if there are staggered subtasks and the failure occurs at the time of a hyper-period (i.e. $t_p \equiv 0 \pmod H$), then $Sched_{(m+1)\rightarrow m}^{S' \rightarrow_{t_p} S_{i_0}}$ is valid and fair.

To prove this theorem, we first consider the case where no priority inversion takes place, i.e. there is no staggered subtask. This can be expressed by the following condition [H1]:
[H1]: $\forall \tau_i^j \ne \tau_{i_0}^{j_0}$, $Exec(\tau_i^j, Sched_m^{S_{i_0}}) \le t_p \Rightarrow Exec(\tau_i^j, Sched_{m+1}^{S'}) \le t_p$.

We then state the following proposition:

Proposition 1 (R(t)). If there is no staggered subtask, then at any time $t > t_p$, a subtask $\tau_i^j$ is not scheduled later in $Sched_{(m+1)\rightarrow m}^{S' \rightarrow_{t_p} S_{i_0}}$ than in $Sched_m^{S_{i_0}}$.

To prove this proposition, we proceed by induction on t, using two further properties:

Property 1 (Prop1(t)). A subtask pending at time t ($t \ge t_p + 1$) in $S_{i_0}$ is either also pending in $S' \rightarrow_{t_p} S_{i_0}$, or has already been processed: $\tau_i^j \in Pending(S_{i_0}, t) \Rightarrow \tau_i^j \in Pending(S' \rightarrow_{t_p} S_{i_0}, t)$ or $Exec(\tau_i^j, Sched_{(m+1)\rightarrow m}^{S' \rightarrow_{t_p} S_{i_0}}) < t$.

Property 2 (Prop2(t), t > 0). If all the subtasks already scheduled in $S_{i_0}$ have already been scheduled in $S' \rightarrow_{t_p} S_{i_0}$, then from this time on, each subtask is scheduled in $S' \rightarrow_{t_p} S_{i_0}$ no later than in $S_{i_0}$: if $Exec(\tau_i^j, Sched_m^{S_{i_0}}) = t_0$ with $t_0 > t_p$, then $Exec(\tau_i^j, Sched_{(m+1)\rightarrow m}^{S' \rightarrow_{t_p} S_{i_0}}) \le t_0$.

With these two properties, we now prove the theorem.

Proof (Theorem 1).
First case: there is no staggered subtask. According to R(t), after the failure we have $Exec(\tau_i^j, Sched_{(m+1)\rightarrow m}^{S' \rightarrow_{t_p} S_{i_0}}) \le Exec(\tau_i^j, Sched_m^{S_{i_0}})$. Since $Sched_m^{S_{i_0}}$ is valid and fair by assumption, Theorem 1 holds.

Second case: there are k staggered subtasks at $t_p = 0$. Using the second remark of Section 5.2 and Prop2(t), we prove that:
(1) At time $t_p = 0$, to the k staggered subtasks $\tau_{a_1}^0, \ldots, \tau_{a_k}^0$ correspond k + 1 anticipated subtasks $\tau_{b_1}^0, \ldots, \tau_{b_{k+1}}^0$ (in decreasing priority order) such that $t_{b_i} = Exec(\tau_{b_i}^0, Sched_m^{S_{i_0}}) > t_p$ and $Exec(\tau_{b_i}^0, Sched_{m+1}^{S'}) = t_p$, with $t_{b_1} \le t_{b_2} \le \ldots \le t_{b_{k+1}}$ and $t_{max} = \max_i(t_{b_i})$.
(2) At time t = 1, the staggered subtasks $\tau_{a_1}^0, \ldots, \tau_{a_k}^0$ are among the m highest-priority subtasks in $S' \rightarrow_{t_p} S_{i_0}$ and are thus executed while meeting their deadlines: in fact, $Exec(\tau_{a_i}^0, Sched_{m+1}^{S'}) > 0 \Rightarrow d_{a_i}^{0(S')} > 1$ and thus $d_{a_i}^{0(S)} > 1$. Their execution can lead to the postponement of other planned subtasks.
(3) From time t = 2 to time $t = t_{max} - 1$, the subtasks postponed at t - 1 are scheduled at t. We thus have a cascade of postponements of subtasks, which are shifted by one time unit, and we prove that they still meet their pseudo-deadlines, because of the assumption $U(S_{i_0}) \le m$ (for reasons of space, we cannot detail the proof here).
(4) At time $t_{max}$, all the staggered and postponed subtasks have been scheduled in place of anticipated subtasks. Now all the subtasks already scheduled in $S_{i_0}$ are already scheduled in $S' \rightarrow_{t_p} S_{i_0}$, and thus Prop2($t_{max}$) can be applied.

Since Prop2(t) holds and $Sched_m^{S_{i_0}}$ is valid by assumption [A3], all the pseudo-deadlines $d_i^{j(S)}$ are met in $S' \rightarrow_{t_p} S_{i_0}$. Thus $Sched_{(m+1)\rightarrow m}^{S' \rightarrow_{t_p} S_{i_0}}$ is valid and Theorem 1 holds.

7 Conclusion and perspectives

Failure tolerance is a fundamental aspect of embedded system design. Today, many such systems run on multicore architectures. The advent of multicore improves timeliness, thanks to the increased number of processing units, but these architectures also raise issues, such as hardware failure risks. In this paper, we considered a permanent failure of one core, detected within the time unit following its occurrence, for systems scheduled under PD2. In this setting, one subtask is affected and its execution must be resumed. We proposed an approach to failure tolerance based on limited hardware redundancy and dynamic reconfiguration of the subtask feasibility windows. We tested this method experimentally, and the resulting schedules were always valid and fair, whatever the system load, the time of the failure and the percentage of heavy tasks. The approach raises the problem of priority inversion between subtasks during the scheduling. When there is no staggered subtask, we proved the validity of the resulting schedule; we also proved it when there are staggered subtasks and the failure occurs at the time of a hyper-period (i.e. $t_p \equiv 0 \pmod H$). We are currently completing the proof for the last case, where there are staggered subtasks and the failure occurs at a time which is not a hyper-period; this part of the proof is more technical than the one presented in this paper. In future work, we will study the case where the failure detection delay is larger, so that several subtasks, possibly belonging to different tasks, must be re-executed. We will also study the case where several cores may fail.

References
1. C. Lee, H. Kim, H. Park, S. Kim, H. Oh, S. Ha: A Task Remapping Technique for Reliable Multicore Embedded Systems. International Conference on Hardware/Software Codesign and System Synthesis, pages 307-316 (2015)
2. S. Malhotra, P. Narkhede, K. Shah, S. Makaraju, M. Shanmugasundaram: A Review of Fault Tolerant Scheduling in Multicore Systems.
International Journal of Scientific and Technology Research, Volume 4, pages 132-136 (2015)
3. Y. Mouafo, A. Choquet-Geniet, G. Largeteau-Skapin: Failure Tolerance for a Multicore Real-Time System Scheduled by PD2. Proceedings of the 9th Junior Researcher Workshop on Real-Time Computing, pages 1-4 (2015)
4. J. Anderson, A. Srinivasan: A New Look at Pfair Priorities. Tech. Rep. TR00-023, University of North Carolina at Chapel Hill (1999)
5. S.K. Baruah, N.K. Cohen, C.G. Plaxton, D.A. Varvel: A Notion of Fairness in Resource Allocation. Algorithmica 15(6), pages 600-625 (1996)
6. S. Shiravi, M.E. Salehi: Fault Tolerant Task Scheduling Algorithm for Multicore Systems. The 22nd Iranian Conference on Electrical Engineering (ICEE), pages 885-890 (2014)
7. E. Chantery, Y. Pencole: Modélisation et Intégration du Diagnostic Actif dans une Architecture Embarquée. Journal Européen des Systèmes Automatisés, MSR 2009, pages 789-803 (2009)
8. A. Bondavalli, S. Chiaradonna, F. Di Giandomenico: Efficient Fault Tolerance: An Approach to Deal With Transient Faults in Multiprocessor Architectures. International Conference on Parallel and Distributed Systems, pages 354-359 (1994)
9. M. Cirinei, E. Bini, G. Lipari, A. Ferrari: A Flexible Scheme for Scheduling Fault-Tolerant Real-Time Tasks on Multiprocessors. IEEE International Parallel and Distributed Processing Symposium, pages 1-8 (2007)
10. C.M. Krishna: Fault-Tolerant Scheduling in Homogeneous Real-Time Systems. ACM Computing Surveys (CSUR), Volume 46, Issue 4, Article No. 48 (2014)
11. K. Ahn, J. Kim, S. Hong: Fault-Tolerant Real-Time Scheduling Using Passive Replicas. Pacific Rim International Symposium on Fault-Tolerance, pages 98-103 (1997)
12. A. Christy, T.R. Gopalakrishnan Nair: Fault-Tolerant Real Time Systems. International Conference on Managing Next Generation Software Application (2008)
13. F. Liberato, S. Lauzac, R. Melhem, D. Mosse: Fault-Tolerant Real-Time Global Scheduling on Multiprocessors. Proceedings of the 11th Euromicro Conference on Real-Time Systems, pages 252-259 (1999)
14. Y. Oh, S.H. Son: An Algorithm for Real-Time Fault-Tolerant Scheduling in Multiprocessor Systems. Euromicro Workshop on Real-Time Systems, pages 190-195 (1992)
15. A.A. Bertossi, L.V. Mancini, F. Rossini: Fault-Tolerant Rate-Monotonic First-Fit Scheduling in Hard Real-Time Systems. IEEE Transactions on Parallel and Distributed Systems 10, pages 934-945 (1999)
16. S. Malo, A. Choquet-Geniet, M. Bikienga: PFair Scheduling of Late Released Tasks with Constrained Deadlines. Colloque National sur la Recherche en Informatique et ses Applications, pages 142-149 (2012)