On-line Reinforcement Learning for Trajectory Following with Unknown Faults

Yves Sohége [0000-0002-3942-0454] and Gregory Provan [0000-0003-3678-046X]

Insight Centre for Data Analytics, University College Cork, Cork, Ireland
yves-sohege@insight-centre.ie, g.provan@cs.ucc.ie

Abstract. Reinforcement learning (RL) is a key method for providing robots with appropriate control algorithms. Controller blending is a technique for combining the control outputs of several controllers. In this article we use on-line RL to learn an optimal blending of controllers for novel faults. Since one cannot anticipate all possible fault states, which are exponential in the number of possible faults, we instead apply learning to the effects the faults have on the system. We use a quadcopter path-following simulation in the presence of unknown rotor actuator faults for which the system has not been tuned. We empirically demonstrate the effectiveness of our novel on-line learning framework on a quadcopter trajectory-following task with unknown faults, even after a small number of learning cycles. The authors are not aware of any other use of on-line RL for fault-tolerant control under unknown faults.

Keywords: Reinforcement Learning · Fault-tolerant Control · Quadcopter control

1 Introduction

One of the most important uses of reinforcement learning (RL) is for controlling robots, using reinforcement learning control (RLC). It has been shown, e.g., in [1], that RLC can provide a model-free method for learning control of a robot, e.g., a quadrotor. Model-free RLC assumes that the control system starts with no model and solves the Bellman equation based on running experiments with appropriate rewards, creating a matrix of values that serves as the model. Although model-free RLC can prove accurate, its main disadvantage is a long convergence time. Since the policy space for robotic interactions can be extremely large, RLC requires a large number of iterations to achieve convergence. In addition, if the plant is unstable (as is that of a quadcopter), or safety is an issue, using RLC can prove difficult in practice.

The alternative approach is to use model-based RLC, which is also known as iterative learning control (ILC). ILC refines the reference or input signals of a desired maneuver based on data from previous executions. This can be used to update model parameters or extend an existing model.

We assume that we have modelled the quadcopter with a linear approximation of the underlying non-linear flight dynamics. Further, we assume that unmodelled dynamics (in our case, faults) can be represented as a linear multiplicative term on the actuation dynamics. Given planned trajectory inputs $u_k$, each ILC iteration can be decomposed into two steps: (1) disturbance estimation, where a Kalman filter computes the current estimate of the disturbance; and (2) input update, where we compute an improved quadrotor input $u_{k+1}$. Using this framework, the input can be abstracted at any level, e.g., from robot thrust and angular velocities [2] to the position commands for trajectory following [3–5]. Experiments show that few iterations (on the order of 10–20) are needed to characterize repeatable disturbances and improve tracking performance.
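The two-step ILC iteration can be outlined in a short sketch. The following Python fragment is purely illustrative and is not the implementation used here: it assumes a scalar repeatable disturbance and replaces the Kalman filter with a hypothetical fixed estimator gain K; L is a hypothetical learning gain.

```python
def ilc_iteration(u_k, y_k, y_ref, d_hat, K=0.5, L=0.8):
    """One ILC iteration (illustrative sketch only).

    u_k   : input applied on iteration k
    y_k   : measured output of iteration k
    y_ref : desired (reference) output
    d_hat : current estimate of the repeatable disturbance
    K, L  : hypothetical estimator and learning gains
    """
    # (1) Disturbance estimation: correct the estimate with the tracking error
    #     (a simple stand-in for the Kalman-filter update described above).
    e_k = y_k - y_ref
    d_hat = d_hat + K * (e_k - d_hat)

    # (2) Input update: compensate the estimated disturbance on the next run.
    u_next = u_k - L * d_hat
    return u_next, d_hat
```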
The majority of applications of RL assume that learning is done off-line. However, there are many situations in which a robot will encounter novel situations and needs to adapt to those situations. We address that scenario in this article.

In particular, we examine a quadcopter that has been pre-programmed with a set of controls and uses control blending to operate within a known control environment. However, for novel scenarios the robot must adapt. We introduce novel actuator faults into a quadcopter, and use ILC to learn new control laws. We have defined the quadcopter to have a hierarchical control architecture. The lower-level controllers use PID methods to control each of the quadcopter's three axes of movement. The higher-level controller uses blends of the lower-level controllers to control flight trajectories. We have pre-defined a set of control laws for nominal and fault modes. We then subject the quadcopter to an unseen fault and allow it to repeat the novel conditions for the unseen fault in order to learn new high-level control laws.

This article proposes the first use of ILC for learning novel fault-tolerant control laws. We empirically demonstrate that we can learn new controls from a small number of learning trials.

2 Related Work

This work builds on extensive prior work in RL, fault-tolerant control (FTC) and fault detection and isolation (FDI).

A significant body of work exists on RL for robotics applications, dating from [1]. This work is related to work on trajectory following, e.g., [3–5]. For this class of application, several instances of learning quadcopter control have been achieved [6]; however, we are not aware of prior work that uses reinforcement learning to learn an optimal blending of controllers and achieve fault-tolerant control.

In the area of FTC [7], a significant body of work has been developed and applied to real-world systems. [8] presents a recent overview of FTC, and [9] presents FTC in relation to system safety. Traditional methods for FTC employ a bank of observers coupled with dedicated controllers, and perform discrete switching. This approach enables designers to tune the system to dedicated faults, but the speed of the system hinges on the speed of FDI. More recent approaches use mixing controllers, e.g., [10, 11], which blend the outputs of multiple controllers and are less reliant on FDI.

Fault-tolerant control can be divided into two types, passive and active. Passive fault-tolerant controllers are designed off-line against predefined models for certain operating conditions and have no ability to react to unanticipated faults. Passive FTC enables fast adaptation to faults within the predefined operating conditions. Active FTC uses on-line data to reconfigure the controller to stabilise the plant. For a comprehensive comparison of the two approaches see [12]. Both active and passive FTC rely on specifying the space of faults that the system will encounter. For passive FTC, approaches such as blending of controllers tuned to nominal and failure modes are used to maintain system stability, e.g., [13]. Analogously, active FTC can rely on being able to detect pre-specified faults, such as using a bank of observers, with each observer tuned to a particular fault, e.g., [14]. For complex systems, it is impossible to pre-specify all faults, since there are too many fault combinations to consider, and it may be impossible to know all possible faults a priori. As a consequence, it is imperative that a system designer understand the space of possible faults and their impact on a system. Very little work has been conducted on exploring the space of faults and their impact on active vs. passive FTC.
3 Reinforcement Learning

Reinforcement learning (RL) [15] is a technique for learning control actions that are optimal for particular states, using interactions of an agent with the environment in which the agent obtains rewards for actions. Reinforcement learning is typically formalized as a Markov decision process (MDP), which is a tuple $\mathcal{M} = \langle S, U, T, R, \gamma \rangle$, where

– $S$ is the set of possible world states,
– $U$ is the set of possible control actions,
– $T$ is a transition function $T : S \times U \rightarrow P(S)$,
– $R$ is the reward function $R : S \times U \rightarrow \mathbb{R}$,
– and $\gamma$ is a discount factor such that $0 \le \gamma \le 1$.

Reinforcement learning learns a policy $\Pi : S \rightarrow U$, which defines which actions should be taken in each state. Q-learning [16] is a model-free reinforcement learning technique that uses a Q-value $Q(s, u)$ to estimate the expected future discounted rewards for taking action $u$ in state $s$. At each step, Q-learning applies an update equation for $Q(s, u)$ given by
\[
Q(s_t, u_t) \leftarrow Q(s_t, u_t) + \alpha \left[ r_{t+1} + \gamma \max_{u} Q(s_{t+1}, u) - Q(s_t, u_t) \right],
\]
where $r_{t+1}$ is the reward observed after performing action $u_t$ in state $s_t$, $\alpha$ is the learning rate ($0 \le \alpha \le 1$), and $s_{t+1}$ is the state that the agent transitions to after performing action $u_t$. After $Q(s_t, u_t)$ converges, the optimal action for the agent in state $s_t$ is $\arg\max_u Q(s_t, u)$.

In Q-learning and related algorithms, an agent maintains a table $Q[S, U]$ based on its history of interaction with the environment. An experience $\langle s, u, r, s' \rangle$ provides one data point for the value of $Q(s, u)$: the agent received the future value of $r + \gamma V(s')$, where $V(s') = \max_{u'} Q(s', u')$; this is the actual current reward plus the discounted estimated future value. This new data point is called a return. The agent can use the temporal-difference equation to update its estimate for $Q(s, u)$:
\[
Q[s, u] \leftarrow Q[s, u] + \alpha \left( r + \gamma \max_{u'} Q[s', u'] - Q[s, u] \right)
\]
or, equivalently,
\[
Q[s, u] \leftarrow (1 - \alpha)\, Q[s, u] + \alpha \left( r + \gamma \max_{u'} Q[s', u'] \right).
\]
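As an illustrative sketch of the tabular temporal-difference update above, the following Python fragment maintains a Q-table over discretized states and actions with epsilon-greedy action selection. The partition sizes and parameter values shown are placeholders, not the values used in our experiments.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 10          # placeholder partition sizes
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # placeholder learning parameters

Q = np.zeros((N_STATES, N_ACTIONS))
rng = np.random.default_rng()

def select_action(s):
    """Epsilon-greedy action selection over the Q-table."""
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[s]))

def td_update(s, u, r, s_next):
    """Q[s,u] <- Q[s,u] + alpha * (r + gamma * max_u' Q[s',u'] - Q[s,u])."""
    target = r + GAMMA * np.max(Q[s_next])
    Q[s, u] += ALPHA * (target - Q[s, u])
```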
4 RL Results

This section presents a summary of our results. We have shown that RL can be used to dynamically learn how to stabilise a robot's trajectory given previously unseen faults. We used a quadcopter to demonstrate how path-following deviations can be reduced following learning epochs.

We first present the final Q-Matrix (Figure 1) that was learned. Positive and negative Q-values represent good and bad performance respectively. There is a large number of negative values, which can be attributed to our reward function (see Section 5). We say the matrix has converged perfectly if only a single positive Q-value exists per row. This is not evident in this matrix, but with adjustments to the reward function and enough learning epochs the matrix would converge. Another interesting point is the magnitude of the Q-values. The first row has significantly higher values than the rest. This is because the first row represents the lowest deviation rate, and this row is trained for every fault, since for the deviation rate to reach the higher partitions it must first pass through the first partition.

To see the improvement in trajectory-following performance, we compared the RLC with the original quadcopter configuration. As far as the authors are aware, there is currently no baseline controller for unknown faults to compare against. In future we intend to run comparisons against other FTC architectures.

We use total trajectory deviation in centimetres to gauge how much the tracking accuracy improved. Table 1 shows this comparison for several magnitudes of unknown rotor faults. We denote the number of learning cycles of RLC by µ. We observed a 63–75% decrease in trajectory deviation after a small number of learning cycles. Figure 2 shows the time it took to re-stabilize the quadcopter along its trajectory with acceptable deviation after each completed learning cycle. It is clearly visible that after a small number of learning iterations the time to re-stabilize drastically reduces, but it never converges to 0. The spikes in the graph correspond to increases in the magnitude of the induced benchmark fault.

Fig. 1. Matrix learned after 250 learning cycles. Rows represent different system states and each column represents a blended controller.

Rotor Fault | Original Quadcopter (cm) | RLC Quadcopter (cm) | µ   | Improvement
4%          | 532.81                   | 131.39              | 100 | 75.34%
6%          | 2188.1                   | 650.92              | 200 | 70.25%
7%          | 5439.9                   | 1995.1              | 250 | 63.32%

Table 1. Empirical analysis of RLC improvement in path deviation error against the original controller.

Fig. 2. Evaluation phase τ timings.

5 Reinforcement Learning Blended Control

This section describes the mapping of the RL approach to our experimental domain, i.e., what the RL tuple $\langle S, U, T, R, \gamma \rangle$ means for our quadcopter domain. For the experimental set-up we used an open-source Matlab implementation of a quadcopter flight simulation [20]. This implementation uses four tuned PID (Proportional-Integral-Derivative) controllers to control roll, pitch, yaw and altitude (Figure 3, right), respectively.

Fig. 3. Left: Simulation path. Right: Roll, pitch and yaw axes of the quadcopter.

Our objective with this simulation is to implement on-line reinforcement learning for faults of unknown origin and magnitude. As far as the authors are aware, little work has been done using RL to learn a blended controller for unknown faults on-line. The simulation's core task is trajectory following, which is defined over a list of time-indexed waypoint locations in the form of a triplet $(X, Y, t)$, representing the desired X and Y position at time $t$. Given the quadcopter's current position at time $t$ as $(\bar{x}, \bar{y})$, the deviation from the trajectory can be computed as
\[
\delta_t = \sqrt{(X - \bar{x})^2 + (Y - \bar{y})^2}.
\]

The trajectory used is a square and can be seen in Figure 3, left. We define a single learning cycle for RL as a single execution of the shown trajectory with one unknown fault. The fault is injected at the same point in the trajectory for consistency. After every learning cycle we allow time for the quadcopter to fully stabilize along its trajectory, so that no residual effects from previous learning cycles influence future learning quality. Since we have a cyclic trajectory we can continuously repeat the learning cycles and analyse the performance improvements over time and on-line. Each learning cycle can be defined in terms of four phases:

1. Stabilize - Ensure nominal operating conditions.
2. Learning - Random fault in a specified range, and stabilizing using RL.
3. Stabilize - Ensure nominal operating conditions.
4. Evaluate - Test against a benchmark error to check improvement.

The Stabilize phases simply ensure the quadcopter returns to its trajectory with an acceptable deviation rate and in nominal operating conditions before proceeding to the following phases.
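Before detailing the remaining phases, the overall cycle can be outlined in a short sketch. This Python outline is illustrative only and is not the Matlab simulation code; the simulation object `sim` and its methods (`stabilize`, `inject_random_fault`, `recover_with_rl`, `evaluate_benchmark`) are hypothetical names.

```python
def trajectory_deviation(waypoint, position):
    """Euclidean deviation delta_t between the desired waypoint (X, Y)
    and the current position (x_bar, y_bar)."""
    X, Y = waypoint
    x_bar, y_bar = position
    return ((X - x_bar) ** 2 + (Y - y_bar) ** 2) ** 0.5

def learning_cycle(sim, fault_limit):
    """One learning cycle, following the four phases listed above."""
    sim.stabilize()                               # 1. return to nominal conditions
    fault = sim.inject_random_fault(fault_limit)  # 2. random fault in [0, Gamma] ...
    sim.recover_with_rl(fault)                    #    ... stabilized using RL
    sim.stabilize()                               # 3. return to nominal conditions
    return sim.evaluate_benchmark()               # 4. test against a benchmark error
```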
The Learning phase consists of random fault injection and stabilization with RL. The Evaluation phase evaluates the current learned policy against a benchmark fault to record the improvement learned over time (Figure 2).

RL States. The space of unknown faults is simply too large to apply learning to each fault individually. We therefore classify the states $S$ of our RL implementation in terms of the effect that unknown faults have on the trajectory-following task. For this metric we chose the deviation rate from the desired trajectory, which we define as
\[
\rho = \frac{d}{dt} \sum_{i=0}^{4} \delta_{t-i},
\]
where $\delta_{t-i}$ is the trajectory error between the current and desired position at time step $t-i$. We then define $S$ as a partition of the parameter space of ρ into N regions of equal size.

RL Actions. Our action space $U$ is defined in terms of blended control, for which we use a linear combination of predefined controllers. In other words, given a set of M predefined controllers $\Lambda = \{\Lambda_1, \cdots, \Lambda_M\}$ and corresponding weights $\{\varphi_1, \cdots, \varphi_M\}$, our applied blended control is given by
\[
\Lambda^* = \sum_{i=1}^{M} \varphi_i \Lambda_i, \qquad (1)
\]
where $\sum_{i=1}^{M} \varphi_i = 1$. We define our action space $U$ as a partition of size P over the parameter space of ϕ. The granularity of this partition dictates how many different blended controllers are being learned on. We then define our Q-Matrix as $Q(S, U)$ and set N and P to 5 and 10, respectively. In other words, we are learning the optimal blended controller for each deviation-rate sub-partition. The size of the matrix increases drastically with the size of the partitions. We therefore chose partition sizes small enough to concentrate the learning into a smaller region.

The transition function $T$ maps the current state and chosen action to the new state. In our context $T$ is already given by the simulation itself.

For RL to learn on-line we must have a performance metric that evaluates how well a controller performed during the fault recovery of each learning cycle. For this metric we choose the time, τ, until the quadcopter is stabilized on its desired trajectory.

RL Rewards. Since we use τ as our performance metric, we must also compute a baseline value to compare it against, which indicates positive or negative reward for the blended controller used. Since we have no prior knowledge about the faults, we compute τ̄, a running average of τ. For each subspace $S^i$ we define $\bar{\tau}^i$, to allow larger errors more time to stabilize. We assign credit based on the performance against the running average:
\[
C = \begin{cases} 1 & \text{if } \tau \le \bar{\tau}^i \\ -1 & \text{if } \tau > \bar{\tau}^i \end{cases}
\]
given that max(ρ) falls within partition $S^i$. The τ̄ values for each fault mode after the RL simulation can be found in Table 2.

S^i         | S^1 | S^2 | S^3 | S^4 | S^5
τ̄^i (sec)   | 2.8 | 3.6 | 3.9 | 4.3 | 4.7

Table 2. τ̄^i values for each partition of S after the RL simulation.
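A minimal sketch of this reward computation is given below, assuming a hypothetical per-partition store of running averages; the treatment of the first recovery in a partition (positive credit when no baseline exists yet) is an assumption made only for the sketch.

```python
N_PARTITIONS = 5

# Running average of stabilization time tau for each deviation-rate partition S^i.
tau_bar = {i: None for i in range(N_PARTITIONS)}
counts = {i: 0 for i in range(N_PARTITIONS)}

def update_running_average(i, tau):
    """Update the running average tau_bar^i for partition S^i."""
    counts[i] += 1
    if tau_bar[i] is None:
        tau_bar[i] = tau
    else:
        tau_bar[i] += (tau - tau_bar[i]) / counts[i]

def credit(i, tau):
    """Credit C = +1 if recovery was at least as fast as the running average, else -1."""
    if tau_bar[i] is None:
        return 1  # assumption: first recovery in a partition counts as positive
    return 1 if tau <= tau_bar[i] else -1
```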
Credit Assignment Problem. The credit assignment problem [21] refers to the problem of identifying which action deserves the credit. In our case, the problem is identifying which partition $S^i$ should receive the credit after a learning phase. This is because ρ naturally varies across multiple sub-partitions of $S$ during a single learning cycle. We therefore apply RL to each of the sub-partitions. The blended controller used to control the system is changed when ρ transitions from its current $S^i$ to $S^{i-1}$ or $S^{i+1}$. Figure 4 shows various signals from the quadcopter simulation during a single learning cycle, with the bounds of each $S^i$ indicated as Levels. The green signal represents the changing ρ. We indicate the transitions between sub-partitions of $S$ with red arrows. Notice that the blue signal, representing the applied control signal, changes when the deviation magnitude changes.

However, this implementation makes it difficult to assign credit to a single subspace, since learning was potentially applied to multiple subspaces. We address this by assigning proportional credit to each subspace depending on how long the value of ρ stayed in each of the parameter subspaces. That is, the credit assigned to each $S^i$ is
\[
R(S^i) = C \cdot \frac{t}{\tau},
\]
where $t$ is the duration that ρ was within the bounds of $S^i$, and $C$ and τ are as previously defined.

For this implementation we do not use γ, the future expected reward. Estimating the expected deviation rate $\rho_{t+1}$ is challenging, as the controller is synthesised on-line and the faults are unknown. In future work we will extend the reward function to include the γ term.

Fig. 4. Graph showing ρ (green), current error (orange) and the blended controller used (blue). The subspaces $S^1, S^2, \ldots, S^5$ are indicated along the Y-axis using the Level terminology.

Error Generation. For simplicity we focus solely on unknown rotor faults in this article, but this framework works for any unknown faults that cause trajectory deviations. More specifically, we use a multiplicative term ι on the rotor speed to represent the unknown fault. To achieve an even distribution of learning we randomize ι such that 0 ≤ ι ≤ Γ, where Γ is the upper limit on the fault. We incrementally increase Γ after a number of learning cycles. This gives a larger variance of deviations, allowing the system to apply learning across all sub-partitions. Figure 5 shows the rotor fault magnitudes used over 250 learning cycles. Note that this figure shows the benchmark errors used as well as the randomized errors used for training.

Fig. 5. Rotor fault magnitude ι and benchmark errors Γ throughout the simulation. Note: larger errors are given longer time to stabilize.

5.1 Simulation Details

For completeness we give a full list of the low-level controllers and other parameters used for the experiments. The PID coefficients of the nominal and fault controllers (indicated by superscripts N and F) for each control axis (indicated by subscripts φ, θ, ψ) are given in Table 3. It is worth noting that these controllers are not tuned for any specific fault; they are simply tuned with more "aggressive" coefficients. Figure 1 shows the matrix after 250 learning cycles. We set the initial value of Γ to 3% and increased it after every 50 learning iterations, up to 7%. We apply the fault to the same rotor every time.

Controller | λ^N_φ | λ^F_φ | λ^N_θ | λ^F_θ | λ^N_ψ | λ^F_ψ
P          | 2     | 10    | 2     | 10    | 4     | 14
I          | 1.1   | 5     | 1.1   | 5     | 0.5   | 2
D          | 1.2   | 3     | 1.2   | 3     | 3.5   | 5

Table 3. Low-level nominal and fault controller tuning for each control axis.
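To illustrate how the blended control of Equation (1) combines these tunings for a single axis, the sketch below blends the nominal and "aggressive" fault PID controllers for roll with a weight φ selected by the learned policy. This is an illustrative Python sketch under simplified assumptions (no output limits, no anti-windup), not the Matlab simulation code; only the gains are taken from Table 3.

```python
class PID:
    """Minimal PID controller (no output limits or anti-windup)."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_error = 0.0, 0.0

    def output(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Roll-axis controllers from Table 3: nominal (N) and aggressive fault (F) tuning.
nominal_roll = PID(kp=2.0, ki=1.1, kd=1.2)
fault_roll = PID(kp=10.0, ki=5.0, kd=3.0)

def blended_roll_command(error, dt, phi):
    """Equation (1) with M = 2: the weights (1 - phi) and phi sum to 1,
    and phi in [0, 1] is the blending weight chosen by the learned policy."""
    return (1.0 - phi) * nominal_roll.output(error, dt) + phi * fault_roll.output(error, dt)
```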
6 Conclusion

In this article we presented a novel reinforcement learning approach for on-line, real-time learning for unknown faults. We empirically demonstrated the effectiveness of this approach on a quadcopter trajectory-following task. We observed a 63–75% decrease in the trajectory deviation due to an unknown fault after a small number of learning cycles. In this set of experiments only rotor faults were investigated, but given appropriate controller pairs for blending, the authors believe this approach will work for other system disturbances such as wind or sensor faults. Since we are learning on the disturbance space, instead of the fault parameter space, this learning approach will work for most faults that cause a path deviation. The novelty of this approach is that the learning phase can be conducted on-line and needs very few iterations before converging, compared to traditional RL methods. Furthermore, we are not aware of any other RL-based approach for FTC under novel faults. This could provide adaptive FTC capabilities to systems that are not easily fixable or reconfigurable, such as satellites or space rovers.

Our future work includes improving the learning process and training on a multitude of errors. For simplicity, this article also only explores a two-controller blending strategy, but in theory this is not a limitation. Larger sets of predefined controllers for blending can be trained using our described method, at the cost of a considerably larger number of training phases, which will also be explored in future work.

References

1. Richard S. Sutton, Andrew G. Barto, and Ronald J. Williams. Reinforcement learning is direct adaptive optimal control. IEEE Control Systems, 12(2):19–22, 1992.
2. Angela P. Schoellig, Fabian L. Mueller, and Raffaello D'Andrea. Optimization-based iterative learning for precise quadrocopter trajectory tracking. Autonomous Robots, 33(1-2):103–127, 2012.
3. Fabian L. Mueller, Angela P. Schoellig, and Raffaello D'Andrea. Iterative learning of feed-forward corrections for high-performance tracking. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 3276–3281. IEEE, 2012.
4. Markus Hehn and Raffaello D'Andrea. A frequency domain iterative feed-forward learning scheme for high performance periodic quadrocopter maneuvers. In Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, pages 2445–2451. IEEE, 2013.
5. Jemin Hwangbo, Inkyu Sa, Roland Siegwart, and Marco Hutter. Control of a quadrotor with reinforcement learning. IEEE Robotics and Automation Letters, 2(4):2096–2103, 2017.
6. Michael C. Koval, Christopher R. Mansley, and Michael L. Littman. Autonomous quadrotor control with reinforcement learning.
7. Mogens Blanke, Michel Kinnaert, Jan Lunze, Marcel Staroswiecki, and J. Schröder. Diagnosis and Fault-Tolerant Control, volume 691. Springer, 2006.
8. Ron J. Patton. Fault-tolerant control. Encyclopedia of Systems and Control, pages 422–428, 2015.
9. Xiang Yu and Jin Jiang. A survey of fault-tolerant controllers based on safety-related issues. Annual Reviews in Control, 39:46–57, 2015.
10. Matthew Kuipers and Petros Ioannou. Multiple model adaptive control with mixing. IEEE Transactions on Automatic Control, 55(8):1822–1836, 2010.
11. Youmin Zhang and Jin Jiang. Integrated active fault-tolerant control using IMM approach. IEEE Transactions on Aerospace and Electronic Systems, 37(4):1221–1235, 2001.
12. Jin Jiang and Xiang Yu. Fault-tolerant control systems: A comparative study between active and passive approaches. Annual Reviews in Control, 36:60–72, 2012.
13. Kemal Büyükkabasakal, Bariş Fidan, and Aydogan Savran. Mixing adaptive fault tolerant control of quadrotor UAV. Asian Journal of Control, 19(5):1–14, 2017.
14. Jan Lunze. From fault diagnosis to reconfigurable control: A unified concept. In Control and Fault-Tolerant Systems (SysTol), 2016 3rd Conference on, pages 413–421. IEEE, 2016.
15. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
16. Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
17. Kumpati S. Narendra and Zhuo Han. The changing face of adaptive control: the use of multiple models. Annual Reviews in Control, 35(1):1–12, 2011.
18. Said G. Khan, Guido Herrmann, Frank L. Lewis, Tony Pipe, and Chris Melhuish. Reinforcement learning and optimal adaptive control: An overview and implementation examples. Annual Reviews in Control, 36(1):42–59, 2012.
19. Weicun Zhang. Stable weighted multiple model adaptive control: discrete-time stochastic plant. International Journal of Adaptive Control and Signal Processing, 27(7):562–581, 2013.
20. D. Hartman, K. Landis, M. Mehrer, S. Moreno, and J. Kim. Quadcopter Simulation.
21. Richard S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, 1984.