Intelligent Control of Morphing Aircraft Based on Soft Actor-Critic Algorithm
Shaojie Ma1*, Xuan Zhang1, Yuhang Wang1, Junpeng Hui2, Zhu Han3
1 Research and Development Center, China Academy of Launch Vehicle Technology, Beijing, China
2 Beijing Institute of Space Long March Vehicle, Beijing, China
3 Key Laboratory of Digital Earth Science, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China

                 Abstract
                 A morphing aircraft can optimize its flight performance by changing its aerodynamic shape.
                 However, the deformation poses a great challenge to the control system, whose most
                 prominent characteristics are strong nonlinearity and large uncertainty. Therefore, an
                 intelligent control method based on the Soft Actor-Critic algorithm is proposed. First, the
                 state space, action space and reward function required by the algorithm are designed. Then
                 the training efficiency of the algorithm is improved by pre-training the policy network.
                 Mathematical simulation shows that the control strategy keeps the altitude and velocity
                 stable during deformation, and that it is strongly robust to the uncertainty caused by
                 deformation and to complex external disturbances.

                 Keywords
                 Intelligent control; Morphing aircraft; Flight control; Soft Actor-Critic
1. INTRODUCTION
    A morphing aircraft is an aircraft that can change its aerodynamic shape autonomously
according to different flight states and task requirements. The aerodynamic and structural
parameters of a morphing aircraft change nonlinearly during deformation, which gives the aircraft
dynamics model strong nonlinearity. The relative motion between the wings, the body and the air
produces additional disturbances, which introduce large uncertainty into the model.
    For the flight control problem of morphing aircraft, a commonly used approach is switching
linear parameter-varying robust control based on a linearized model [1-2]. However, linearization
partly loses the nonlinear characteristics of the morphing aircraft model. Therefore, nonlinear
control methods have become the main research direction [3]. Reference [4] realized adaptive
control of a morphing aircraft based on dynamic inversion, but such methods depend heavily on
model accuracy. Reference [5] therefore designed a controller based on active disturbance rejection
control theory, which is strongly robust to disturbances during deformation; however, it has many
parameters, which increases the complexity of the controller design.
    With the development of intelligent control theory, deep reinforcement learning is increasingly
applied to complex control tasks and shows good performance [6-7]. Reference [8] applied the Soft
Actor-Critic algorithm to fault-tolerant flight control. Reference [9] designed a composite controller
combining the deep deterministic policy gradient algorithm with a traditional controller. Reference
[10] proposed a fixed-time disturbance rejection controller whose parameters are tuned with the
help of the twin delayed deep deterministic policy gradient algorithm.
    On this basis, this paper proposes a controller for morphing aircraft based on the Soft
Actor-Critic algorithm, taking a variable-sweep UAV as the object. First, a longitudinal
mathematical model is established that accounts for the multi-rigid-body structure of the aircraft.
Then the state space, action space, reward function and network structure required by the algorithm
are designed within the framework of a Markov decision process. Finally, the control accuracy and
strong robustness of the proposed control strategy are verified by mathematical simulations.

ICCEIC2022@3rd International Conference on Computer Engineering and Intelligent Control
EMAIL: * Corresponding author: mashj05@163.com (Shaojie Ma)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)
2. MATHEMATICAL MODEL OF MORPHING AIRCRAFT
    In this paper, a class of variable-sweep aircraft is considered. The model is similar to that
studied in references [11-12]. The longitudinal motion of the morphing aircraft can be described by
    \begin{cases}
    \dot{V} = \left( -X + P\cos\alpha - mg\sin\gamma + F_{sx} \right)/m \\
    \dot{\gamma} = \left( Y + P\sin\alpha - mg\cos\gamma - F_{sy} \right)/(mV) \\
    \dot{h} = V\sin\gamma \\
    \dot{\theta} = \omega_z \\
    \dot{\omega}_z = \left( M_z + M_{sz} - S_x g\cos\theta - \dot{I}_z\omega_z \right)/I_z \\
    \alpha = \theta - \gamma
    \end{cases}    (1)
    Where V, γ, α, h denote the velocity, flight path angle, angle of attack and altitude,
respectively; θ, ω_z denote the pitch angle and pitch angular rate; m, I_z represent the mass and
moment of inertia of the aircraft; g is the gravitational acceleration; P is the engine thrust; and
X, Y, M_z denote the drag force, lift force and pitch moment, respectively, which are given as

                                     X = Cx 0 ( ) + Cx ( )  + Cx 2 ( )  2  QS
                                                                                 
                                     Y = C y 0 ( ) + C y ( )   QS
                                     
                                      M z = Cm 0 ( ) + Cm ( )  + Cm ( )  z + Cm ( )  z  QSL
                                                                                                                        (2)
    Where Q = 0.5V is dynamic pressure, S is the wing surface, L is the mean aerodynamic chord,
                        2


Cx 0 ( ) , Cx ( ) , C x 2 ( ) , C y 0 ( ) , C y ( ) , Cm 0 ( ) , Cm ( ) , Cm ( ) , Cm ( ) are the aerodynamic
derivatives which can be formulated as polynomial functions of the sweep angle    0, 45 .
    Fsx , Fsy , M sz are the inertial forces and moment caused by deformation, and S x is the static
moment distribution in the body frame varies with the sweep angle  , which are given as

                                    Fsx = (z sin  + z2 cos  ) S x + 2S xz sin  − S x cos 
                                   
                                    Fsy = (z cos  − z2 sin  ) S x + 2S xz cos  + S x sin 
                                    
                                     M sz = (V sin  + V  cos  − V z cos  ) S x
                                    
                                     S x = 2m1r1 + m3 r3                                                              (3)
    Where m_1, m_3 represent the masses of the wings and the body of the aircraft, respectively,
and r_1, r_3 denote the positions of the related components in the body frame.
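    As an illustration, the longitudinal model of Eq. (1) can be written as a state-derivative
function. The sketch below assumes that the forces of Eq. (2) and the deformation terms of Eq. (3)
are computed elsewhere and passed in; all function and variable names are placeholders, not the
authors' implementation.

    import numpy as np

    # Sketch of the longitudinal dynamics of Eq. (1). The aerodynamic forces
    # X, Y, M_z and the deformation terms F_sx, F_sy, M_sz are supplied by the
    # caller (hypothetical interface).
    G = 9.81  # gravitational acceleration, m/s^2

    def longitudinal_derivatives(state, P, m, I_z, I_z_dot, S_x, forces):
        """state = [V, gamma, h, theta, omega_z]; forces = (X, Y, M_z, F_sx, F_sy, M_sz)."""
        V, gamma, h, theta, omega_z = state
        X, Y, M_z, F_sx, F_sy, M_sz = forces
        alpha = theta - gamma                                    # angle of attack
        V_dot     = (-X + P*np.cos(alpha) - m*G*np.sin(gamma) + F_sx) / m
        gamma_dot = ( Y + P*np.sin(alpha) - m*G*np.cos(gamma) - F_sy) / (m*V)
        h_dot     = V*np.sin(gamma)
        theta_dot = omega_z
        omega_dot = (M_z + M_sz - S_x*G*np.cos(theta) - I_z_dot*omega_z) / I_z
        return np.array([V_dot, gamma_dot, h_dot, theta_dot, omega_dot])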
3. DESIGN OF CONTROLLER BASED ON SOFT ACTOR-CRITIC ALGORITHM
3.1.     Principle of Soft Actor-Critic Algorithm
   Soft Actor-Critic (SAC) is a deep reinforcement learning algorithm based on the Actor-Critic
(AC) framework; it uses deep neural networks to represent the policy π and the action-state value
function Q(s, a). SAC uses a stochastic policy network π(s | θ_π), which outputs the mean and
variance of the action and obtains the action command by sampling, so as to improve the
exploration of the algorithm. SAC uses two critic networks Q_i(s, a | θ_Qi) to reduce the estimation
error of the Q-function, an idea inherited from Double Q-Learning. Moreover, as an off-policy
algorithm, SAC maintains a replay buffer and two target critic networks Q_i′(s, a | θ_Qi′). In
addition, SAC encourages more exploration by maximizing an entropy-regularized cumulative
reward rather than the cumulative reward alone, which makes it a widely used reinforcement
learning algorithm for continuous control.
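   A minimal sketch of such a stochastic policy head is given below (written in PyTorch, which is
an assumption; the paper does not name its framework). The tanh squashing and the log-standard-
deviation parameterization are conventional SAC implementation choices, not stated in the paper.

    import torch
    import torch.nn as nn

    class GaussianPolicy(nn.Module):
        # Stochastic policy: outputs mean and log std of a Gaussian and samples
        # an action with the reparameterization trick.
        def __init__(self, state_dim=7, action_dim=2, hidden=256):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
            self.mean = nn.Linear(hidden, action_dim)
            self.log_std = nn.Linear(hidden, action_dim)

        def sample(self, state):
            x = self.body(state)
            mean, log_std = self.mean(x), self.log_std(x).clamp(-20, 2)
            dist = torch.distributions.Normal(mean, log_std.exp())
            u = dist.rsample()                    # reparameterized sample
            action = torch.tanh(u)                # squash to [-1, 1]
            # log-probability with the tanh change-of-variables correction
            log_prob = (dist.log_prob(u) - torch.log(1 - action**2 + 1e-6)).sum(-1)
            return action, log_prob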
   SAC selects the minimum value of the two target critic networks when updating the Bellman
equation, so the loss function of the critic networks can be given as
    \begin{cases}
    y_t = r_{t+1} + \gamma \left[ \min_{i=1,2} Q_i'\left( s_{t+1}, a_{t+1} \mid \theta_{Q_i'} \right) - \alpha \log\pi\left( s_{t+1} \mid \theta_\pi \right) \right] \\
    L(\theta_{Q_i}) = \dfrac{1}{N} \sum_{j=1}^{N} \left( y_j - Q_i\left( s_j, a_j \mid \theta_{Q_i} \right) \right)^2 \\
    \theta_{Q_i, t+1} = \theta_{Q_i, t} + \alpha_Q \nabla_{\theta_{Q_i}} L(\theta_{Q_i})
    \end{cases}    (4)
   Where N denotes the batch size, α_Q is the learning rate of the critic networks, a_{t+1} represents
the action corresponding to the next state, and α represents the temperature parameter. Similarly,
the loss function of the policy network can be given as
                                J ( ) =   log  ( st |  ) − min i =1,2 ( Qi ( s j , a j ) ) 
                                                                                                 
                                                 1 N
                                J ( ) =    ( s j |  )  a J ( )
                                                N j =1
                                ,t +1 =  ,t +    J ( )
                                                        
                                                                                                                (5)
   Where  is the learning rate of policy network. Besides, SAC also provides a method to adjust the
temperature parameter automatically, and the loss function can be given as

                                   J ( ) =    − log  ( st |  ) −  
                                  
                                   t +1 =  t +   J ( )
                                                                                                                (6)
   Where  is target entropy, which can dynamically find the lowest temperature that still ensures a
certain minimum entropy,  is the learning rate of  . SAC updates the target network by exponential
smoothing rather than direct replacement like DDPG, which can make the target network update more
slowly and stably, and improve the stability of algorithm.
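   The updates of Eqs. (4)-(6) can be summarized in a single training-step sketch. The following is
a hedged illustration rather than the authors' code: the networks, optimizers and replay batch are
assumed to exist, and the temperature is handled through the common log-α parameterization of
Eq. (6).

    import torch
    import torch.nn.functional as F

    def sac_update(policy, critics, target_critics, critic_opts, policy_opt,
                   log_alpha, alpha_opt, batch, gamma=0.99, tau=0.005, target_entropy=-2.0):
        s, a, r, s_next = batch            # tensors sampled from the replay buffer
        alpha = log_alpha.exp()

        # Eq. (4): critic target uses the minimum of the two target critics
        with torch.no_grad():
            a_next, logp_next = policy.sample(s_next)
            q_next = torch.min(target_critics[0](s_next, a_next),
                               target_critics[1](s_next, a_next))
            y = r + gamma * (q_next - alpha * logp_next)
        for critic, opt in zip(critics, critic_opts):
            loss_q = F.mse_loss(critic(s, a), y)
            opt.zero_grad(); loss_q.backward(); opt.step()

        # Eq. (5): policy loss is alpha*log(pi) minus the minimum critic value
        a_new, logp = policy.sample(s)
        q_new = torch.min(critics[0](s, a_new), critics[1](s, a_new))
        loss_pi = (alpha.detach() * logp - q_new).mean()
        policy_opt.zero_grad(); loss_pi.backward(); policy_opt.step()

        # Eq. (6): automatic temperature adjustment toward the target entropy
        loss_alpha = -(log_alpha * (logp.detach() + target_entropy)).mean()
        alpha_opt.zero_grad(); loss_alpha.backward(); alpha_opt.step()

        # Target networks updated by exponential smoothing (soft update)
        for c, ct in zip(critics, target_critics):
            for p, pt in zip(c.parameters(), ct.parameters()):
                pt.data.mul_(1 - tau).add_(tau * p.data)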
3.2.    Design of Controller
   The longitudinal plane of the aircraft is controlled in altitude and velocity. Because the aircraft
has a complex continuous action space, a randomly initialized policy network can hardly keep the
flight stable, and the samples collected in the early training episodes are of poor quality, which
makes the training efficiency extremely low. Therefore, network pre-training is adopted in this
paper: a traditional controller is first fitted by deep learning, and the learned deep neural network is
used as the initial policy network of SAC. The training structure is shown in Figure 1. The policy
network serves directly as the aircraft controller, and the action output by the network is the
command to the aircraft control actuators.




Figure 1. Controller structure
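   A sketch of this pre-training step is given below, reusing the GaussianPolicy sketch from
Section 3.1. The state/action data set recorded from the traditional controller is an assumption, and
only the mean head of the policy is regressed (behaviour cloning).

    import torch
    import torch.nn.functional as F

    def pretrain_policy(policy, states, expert_actions, epochs=200, lr=1e-3):
        # states: (N, 7) tensor; expert_actions: (N, 2) tensor scaled to [-1, 1],
        # both recorded while the traditional controller flies the aircraft.
        opt = torch.optim.Adam(policy.parameters(), lr=lr)
        for _ in range(epochs):
            mean = policy.mean(policy.body(states))          # deterministic part of the policy
            loss = F.mse_loss(torch.tanh(mean), expert_actions)
            opt.zero_grad(); loss.backward(); opt.step()
        return policy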

   The control model is transformed into a Markov decision process, and the state space, action
space, reward function and deep neural network structure are designed under this framework.



3.2.1. State space and action space
   Drawing on traditional controller design ideas, and considering the influence of deformation on
the model, the seven-dimensional state vector is designed as
    s = \left[ \Delta h, \dot{h}, \Delta V, \dot{V}, \theta, \omega_z, \lambda \right]    (7)
   The actuator of the altitude channel is mainly the elevator, and that of the velocity channel is
mainly the adjustable-thrust engine, so the action vector is designed as
    a = \left[ \delta_z, \delta_T \right]    (8)

3.2.2. Reward function
   In order to ensure that the aircraft accurately tracks the altitude and velocity commands while
reducing the control energy demand, the reward function is designed as
    r = \omega_h \left| \Delta h \right| + \omega_V \left| \Delta V \right| + \omega_\delta \left| \delta_z \right| + \omega_T \left| \delta_T \right| + \sigma_1 + \sigma_2 + \sigma_d    (9)

   Where the first four terms are penalties related to the tracking errors, the elevator deflection
angle and the thrust: when the altitude tracking error, velocity tracking error, elevator deflection or
thrust increases, the penalty increases, and ω_i (i = h, V, δ, T) are the corresponding weights. σ_1
and σ_2 are sparse rewards for tracking accuracy, applied when the corresponding tracking error is
smaller than a threshold. σ_d is the penalty for state divergence: in this paper, when the altitude
tracking error exceeds 500 m, the states are judged to have diverged and the episode is terminated.
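   A sketch of this reward with the weights of TABLE 1 is shown below; the thresholds of the
sparse rewards are not given in the paper, so the values used here are placeholders.

    def reward(dh, dV, delta_z, delta_T, h_threshold=5.0, V_threshold=1.0):
        # Weighted penalties (weights from TABLE 1) on tracking errors and control effort
        r = -2.0*abs(dh) - 0.5*abs(dV) - 1.0*abs(delta_z) - 0.001*abs(delta_T)
        if abs(dh) < h_threshold:            # sparse reward for altitude accuracy
            r += 0.5
        if abs(dV) < V_threshold:            # sparse reward for velocity accuracy
            r += 0.5
        if abs(dh) > 500.0:                  # divergence: large altitude error ends the episode
            r += -1000.0
        return r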
3.2.3. Deep neural networks
    All the networks used in this paper are back-propagation neural networks. The input layer of
the policy network has 7 neurons corresponding to the 7-dimensional state vector; the hidden part
consists of 2 fully connected layers of 256 neurons each with ReLU activation; and the output layer
gives the mean and the variance of the action. The two critic networks have the same structure: the
input layer has 9 neurons corresponding to the 7-dimensional state vector and the 2-dimensional
action vector; the hidden part consists of 3 fully connected layers of 64 neurons each, also with
ReLU activation; and the output layer has 1 neuron corresponding to the action-state value function.
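    The critic structure described above can be sketched as follows (again in PyTorch, as an
assumption); the policy network with two 256-neuron hidden layers corresponds to the
GaussianPolicy sketch in Section 3.1.

    import torch
    import torch.nn as nn

    class Critic(nn.Module):
        # 9 inputs (7 states + 2 actions), three hidden layers of 64 ReLU neurons,
        # one Q-value output, matching the description in Section 3.2.3.
        def __init__(self, state_dim=7, action_dim=2, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1))

        def forward(self, state, action):
            return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)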
4. NUMERICAL SIMULATION
   The initial simulation states are h_0 = 1000 m ± 5 m, V_0 = 30 m/s ± 5 m/s, θ_0 = 0.995° ± 1°,
α_0 = 0.995° ± 1°, and the initial altitude and velocity commands are 1000 m and 30 m/s,
respectively. The variation of the sweep angle is shown in Figure 2.




Figure 2. Variation of the sweep angle

   The control step is 10 ms, the network update step is 100 ms, and the simulation time of each
episode is 100 s. The algorithm training parameters and the weights of the reward function are
shown in TABLE 1.



TABLE I.    PARAMETER DESIGN
                  Parameters       Values      Parameters      Values
                  Episodes         2000        Steps           10000
                  Batch Size       256         ω_h             -2
                  σ_1              0.5         ω_V             -0.5
                  σ_2              0.5         ω_δ             -1
                  σ_d              -1000       ω_T             -0.001
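   The training loop with the timing given above (10 ms control step, a network update every
100 ms, i.e. every 10 control steps, and 100 s episodes of 10000 steps) can be sketched as below;
env, replay_buffer and update_networks are assumed helpers and not part of the paper.

    import torch

    CONTROL_DT, UPDATE_EVERY, STEPS_PER_EPISODE = 0.01, 10, 10000

    def train(env, policy, update_networks, replay_buffer, episodes=2000):
        for episode in range(episodes):
            state = env.reset()
            for step in range(STEPS_PER_EPISODE):
                action, _ = policy.sample(torch.as_tensor(state, dtype=torch.float32))
                next_state, r, done = env.step(action.detach().numpy())
                replay_buffer.append((state, action, r, next_state))
                # network update every 10 control steps, once the buffer holds a batch
                if step % UPDATE_EVERY == 0 and len(replay_buffer) > 256:
                    update_networks(replay_buffer.sample(batch_size=256))
                state = next_state
                if done:                      # e.g. altitude error exceeds 500 m
                    break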

4.1.    Result analyses of SAC
    Figure 3 shows the change of the reward with the training episodes. A total of 3000 episodes of
training were conducted. The thick line is the smoothed reward. It can be seen that the average
reward generally increases during training. In addition, because of the strong exploration of SAC,
the flight states sometimes diverge, which appears as the fluctuations of the light line, but the
upward trend of the reward is not affected.




Figure 3. Cumulative reward during training
4.2.    Control Performance Analyses of the Controller
   In order to verify the effectiveness of the control policy obtained by training, simulations under
nominal and deviation states are carried out based on the longitudinal motion model of the
morphing aircraft.
4.2.1. Simulation under nominal states
   Figure 4 shows the altitude and velocity tracking results, the elevator deflection angle and the
thrust. The altitude command changes from 1000 m to 1050 m, and the velocity command is kept at
30 m/s. The blue line is the SAC-optimized controller and the green dotted line is the pre-trained
controller.




Figure 4. Altitude and velocity tracking results under nominal states

   The integral absolute errors of altitude and velocity before optimization are 167.1219 m and
114.5735 m/s, respectively; after optimization they are reduced to 7.4009 m and 7.0559 m/s,
respectively. The control accuracy is thus greatly improved, and the impact of the deformation is
greatly reduced.
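   For reference, the integral absolute error quoted here can be computed from the logged tracking
errors as in the short sketch below, assuming a fixed 10 ms control step.

    import numpy as np

    def integral_absolute_error(errors, dt=0.01):
        # errors: array of tracking errors sampled every dt seconds
        return np.sum(np.abs(np.asarray(errors))) * dt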
4.2.2. Simulation under deviation states
    In order to verify the robustness of the proposed control policy to complex external
disturbances, 20% aerodynamic deviation, 15% structural deviation and 10% atmospheric density
disturbance are considered. Figure 5 shows the altitude and velocity tracking results.




Figure 5. Altitude and velocity tracking results under deviation states

   The integral absolute errors of altitude and velocity under the deviation states are 10.2310 m and
7.9277 m/s, respectively. This shows that the trained control policy achieves stable control under
the deviation states and maintains altitude and velocity accuracy during the deformation transition,
which proves its robustness to external disturbances.
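   As an illustration of how such deviation cases could be generated, the sketch below applies
multiplicative perturbations of the stated magnitudes; the parameter dictionary and its keys are
assumptions for illustration only.

    import numpy as np

    def perturb_parameters(params, rng=None):
        # 20% on aerodynamic coefficients, 15% on structural parameters, 10% on air density
        rng = rng or np.random.default_rng()
        perturbed = dict(params)
        for key in ("Cx0", "Cy0", "Cm0"):                    # aerodynamic coefficients
            perturbed[key] = params[key] * (1 + rng.uniform(-0.2, 0.2))
        for key in ("m", "Iz"):                              # structural parameters
            perturbed[key] = params[key] * (1 + rng.uniform(-0.15, 0.15))
        perturbed["rho"] = params["rho"] * (1 + rng.uniform(-0.1, 0.1))   # air density
        return perturbed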
5. CONCLUSIONS
    Aiming at the strongly nonlinear dynamics model of morphing aircraft and the complex internal
and external disturbances during deformation, and taking a class of variable-sweep aircraft as an
example, an altitude and velocity controller for morphing aircraft was designed based on the SAC
deep reinforcement learning algorithm. Network pre-training was adopted to ensure stable control
at the initial stage of training and to improve the sample quality. The simulation results show that
the proposed control policy greatly improves the accuracy of altitude and velocity control and is
strongly robust to both internal and external uncertainties of the model during deformation.
However, this paper only carried out verification based on mathematical simulation; practical
application verification remains to be carried out.
6. REFERENCES
[1] K Boothe, K Fitzpatrick, R Lind, “Controllers for disturbance rejection for a linear input-varying
    class of morphing aircraft,” 46th AIAA/ASME/ASCE/AHS/ASC Structures, Structural
    Dynamics & Materials Conference, Austin, Texas, April, 2005.
[2] W Jiang, Ch Dong, T Wang, Q Wang, “Smooth switching LPV robust control for morphing
    aircraft,” Control and Decision, vol 31, pp. 66-72, January, 2016.
[3] M Ran, Ch Wang, H Liu, et al, “Research status and future development of morphing aircraft
    control technology,” Acta Aeronautica et Astronautica Sinica, vol 43, pp. 527449, 2022.
[4] T Lombaerts, J Kaneshige, S Schuet, “Dynamic inversion based full envelope flight control for a
    VTOL vehicle using a unified framework,” AIAA Scitech 2020 Forum, Orlando, FL, January,
    2020.
[5] H Song, L Jin, “Dynamic modeling and stability control of folding wing aircraft,” Chinese
    Journal of Theoretical and Applied Mechanics, vol 52, pp. 1548-1559, November, 2020.
[6] W Koch, R Mancuso, R West, et al, “Reinforcement learning for UAV attitude control,” ACM
    Transactions on Cyber-Physical Systems, vol 3, pp. 1-21, 2019.
[7] Y Wang, J Sun, H He, et al, “Deterministic policy gradient with integral compensator for robust
    quadrotor control,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol, 50, pp.
    3713-3725, 2019.



[8] K Dally, E V Kampen, “Soft actor-critic deep reinforcement learning for fault tolerant flight
     control,” AIAA Scitech 2022 Forum, San Diego, CA & Virtual, January, 2022.
[9] X Huang, J Liu, Ch Jia, et al, “Deep deterministic policy gradient algorithm for UAV control,”
     Acta Aeronautica et Astronautica Sinica, vol 42, pp. 524688, 2021.
[10] Y Liu, H Wang, T Wu, et al, “Attitude control for hypersonic reentry vehicles: An efficient deep
     reinforcement learning method,” Applied Soft Computing, vol 123, pp. 108865, 2022.
[11] Z Wu, J Lu, Q Zhou, et al. “Modified adaptive neural dynamic surface control for morphing
     aircraft with input and output constraints,” Nonlinear Dyn, vol 87, pp. 2367–83, 2017.
[12] L Gong, Q Wang, Ch Hu, et al, “Switching control of morphing aircraft based on Q-learning,”
     Chinese Journal of Aeronautics, vol 33, pp. 672–687, 2020.



