                         Reinforcement Learning-based Service Assurance of
                         Microservice Systems
                         Xiaojian Liu1, Yangyang Zhang2, Wen Gu1, Qiao Duan1 and Qingqing Ji3
                         1 Beijing University of Technology, Beijing, China
                         2 China Electronics Standardization Institute, Beijing, China
                         3 Chinese Academy of Sciences, Beijing, China



                                              Abstract
                                              As microservices architecture has steadily emerged as the prevailing direction in software system
                                              design, the assurance of services within microservices systems has garnered increasing attention. The
                                              concept of intelligent service assurance within microservices systems offers a novel approach to
                                              addressing adaptation challenges in complex, risk-laden environments. This paper introduces a
                                              groundbreaking approach known as the Reinforcement Learning (RL) Based Service Assurance Method
                                              for Microservice Systems (RL-SAMS), which incorporates the fundamental RL principle of "improving
                                              performance through experience" into service assurance activities. Through the implementation of an
                                              intelligent service degradation mechanism, the continuity of services is ensured. Within the framework
                                              of our designed microservices system, two essential components are introduced: the Adapter
                                              Component (AC) and the RL Decision-making Component (RLDC). Each microservice is treated as an
                                              independent RL agent, resulting in the construction of a multi-agent RL decision-making architecture
                                              that balances "centralized learning and decentralized decision-making." This intelligent decision-
                                              making model undergoes training and learning, accumulating positive experiences through continuous
                                              trial and error. Experimental cases demonstrate that RL-SAMS outperforms the widely adopted Hystrix
                                              across various service risk scenarios, particularly excelling in intelligently critical service assurance.

                                              Keywords
Reinforcement learning; Microservice system; Intelligent service assurance

1. Introduction

In 2014, Martin Fowler formally introduced the concept of "Microservices" through his blog post titled "Microservices." This approach to software architecture breaks a software system down into numerous small services, each operating independently in its own process. Compared to traditional monolithic systems, microservices architectures offer several notable advantages, including independent deployment, effortless scalability, and decentralization. An increasing number of network applications have made the transition to microservices architecture, with notable examples including Amazon, Netflix, Twitter, SoundCloud, and PayPal. To illustrate the scale: a single page on Amazon can trigger approximately 100 to 150 microservice calls, while the Netflix system manages roughly 5 billion microservice interactions daily [1]. It is evident that microservice architecture has progressively emerged as the predominant direction for software system architecture [2][3][4].
    The autonomy and collaborative interaction among microservices offer advantages but, at the same time, present significant service reliability risks. On one hand, this autonomy entails separate operation, maintenance, and independent decision-making. This can lead to a focus on local interests at the expense of global considerations, sometimes even resulting in conflicting service assurance efforts among microservices. On the other hand, the intricate business interactions among microservices often amplify "local failures" into "cascading failures," triggering an "avalanche effect." In such cases, problem resolution becomes difficult because the root cause remains elusive.
    The key to addressing these service assurance challenges lies in establishing an effective group decision-making mechanism within the microservices system, one that empowers each microservice to comprehend the bigger picture and make decisions for the benefit of the entire system. This paper, using a reinforcement learning approach, explores a service assurance decision-making method tailored for microservices systems. Each microservice is conceptualized as an independent reinforcement learning agent. Through continuous interactions with

5th International Workshop on Experience with SQuaRE Series and its Future Direction, December 04, 2023, Seoul, Korea
liuxj@bjut.edu.cn (X. Liu); zhangyy@cesi.cn (Y. Zhang)
0000-0002-0666-4102 (X. Liu); 0009-0006-4940-8527 (Y. Zhang)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)




the service environment and the operational and maintenance environment, the fundamental concept of "enhancing performance through experiential learning" is woven into the fabric of microservice assurance. This equips the decision-making system with the capacity to intelligently differentiate between assurance targets and to flexibly provide assurance for critical elements.
    Section 2 provides a summary of related research, with a particular emphasis on the current state of microservice assurance technology and reinforcement learning methods. In Section 3, we present an overview of the RL-SAMS method along with an introduction to its key components. Section 4 showcases the effectiveness of RL-SAMS through pre-experimental results. Finally, in Section 5, we summarize the contributions of this paper and outline directions for future research.

2. Related Works

Technologies related to microservice assurance include service degradation [5][6], service fault tolerance [7], service elastic scaling [8][9], service rate limiting [10], etc. Santos et al. [6] proposed a strategy for online service degradation based on quality of service (QoS), which aims to minimize request congestion caused by a lack of system resources. Combining architecture analysis and sensitivity analysis methods, Wang et al. [7] proposed a fault-tolerance strategy algorithm based on a reliability-criticality measurement. Coulson et al. [9] designed a prototype system for automatic scaling of microservices based on supervised learning. Firmani et al. [10] put forward an API call rate-limit selection strategy to prevent unauthorized users from obtaining an ultra-high SLA. Most existing research on microservice assurance focuses on the local situation of the respective microservices and cannot comprehensively consider the guarantee of service expectations from the users' perspective. A key open problem is how to establish a service assurance system capable of global decision-making without breaking the original distributed and independent framework of microservices.
    The existing research on reinforcement learning-enabled software adaptive control can be roughly divided into: (1) Strategy generation and evolution research. Wang et al. [11] used a reinforcement learning method to solve the problem of dynamic service configuration in an integrated adaptive system. Wang et al. [12] used reinforcement learning, combined with a Markov model and Gaussian processes, to establish a multi-agent game model aimed at the self-adaptive composition of services. Rao et al. [13] proposed a distributed learning mechanism to solve the problem of resource allocation in cloud environments. Dongsun et al. [14] proposed a framework-based online planning method for self-management, which enables a software system to change and improve its plan through online RL. Amoui et al. [15] used RL in the planning process to support action selection, and clarified why, how, and when RL can benefit autonomous software systems. (2) System and environment modelling research. Zhao et al. [16] proposed a learning framework that integrates online and offline work based on reinforcement learning and case sets. Belhaj et al. [17] put forward a framework named "autonomic container", which endows applications with run-time adaptive action capability based on an RL method. Using a model-based reinforcement learning method, Ho et al. [18] applied a Markov process to model the environment state for the planning and continuous optimization of adaptive software systems. Tesauro et al. [19] utilized a reinforcement learning method to solve the problem of service ranking.
    Regarding multi-agent RL, representative studies in recent years include MADDPG (Multi-Agent Deep Deterministic Policy Gradient) [20] and COMA (Counterfactual Multi-Agent actor-critic) [21], both of which are based on the classic Actor-Critic architecture. At present, multi-agent RL is one of the most actively and widely researched directions in reinforcement learning.
    In summarizing the current state of research, it is clear that while various technologies and effective measures have been developed for microservice system assurance from different angles, most of them primarily address localized issues and decision-making within their own domains. As a result, they often fall short of comprehensively addressing the decision-making requirements for the overall system's assurance. The challenge now lies in merging the decision-making traits inherent to microservice architecture with the insights gained from research on reinforcement learning methods in the realm of adaptive control. The objective is to empower each microservice with a global perspective and intelligent decision-making capabilities. This remains at the forefront of ongoing research efforts.

3. RL-SAMS Methodology

The comprehensive architecture of RL-SAMS is illustrated in Figure 1. Building upon the Microservice System Component (MC), we introduce the Adapter Component (AC) and the RL Decision-making Component (RLDC). Within the MC, each microservice is enhanced by incorporating the AC, which adds a state monitoring module (SMM) and a dynamic configuration module (DCM), both of which provide interfaces for interaction with the RLDC. To keep the illustration straightforward, Figure 1 simplifies the interdependence among multiple microservices. The RLDC establishes a mechanism characterized by "centralized learning and decentralized decision-making."
    The fundamental concept of "enhancing performance through experiential learning" is embedded into microservice assurance. This integration is achieved through the ongoing interactive learning of multiple agents, taking into account the effects of system operation and maintenance, user expectations, and various other state factors.




                                          Figure 1: Architecture of RL-SAMS
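The closed loop among the components in Figure 1 can be sketched as follows. This is a minimal, illustrative sketch only; the class and method names are our own stand-ins, not the paper's implementation:

```python
# Illustrative control loop for the Figure 1 architecture (names are our
# own, not the paper's code): the AC exposes the microservice state to
# the RLDC, the RLDC decides an assurance action, and the AC applies the
# decision back to its microservice.

class Adapter:
    """AC: bridges one microservice and the RLDC (SMM + DCM roles)."""
    def __init__(self, name, load=0.0):
        self.name, self.load, self.degraded = name, load, False

    def observe(self):                       # SMM role: report local state
        return {"service": self.name, "load": self.load}

    def apply(self, action):                 # DCM role: execute the decision
        self.degraded = (action == "degrade")

class RLDC:
    """Centralized learning, decentralized decision-making (toy policy)."""
    def decide(self, state):
        # placeholder for the learned per-agent strategy: degrade under load
        return "degrade" if state["load"] > 0.8 else "keep"

rldc = RLDC()
adapters = [Adapter("Core_client", 0.9), Adapter("Non_core_client", 0.3)]
for ac in adapters:                          # one perceive-decide-act cycle
    ac.apply(rldc.decide(ac.observe()))
```

In the actual method, the toy threshold policy is replaced by the trained per-microservice strategy networks described in Section 3.2.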

3.1. Adapter Component

The core function of the AC is to provide an interactive interface through which the RLDC perceives the running service state of the microservice system and controls, in a timely manner, the configuration and execution of the various types of assurance actions. The main functional modules are a state monitoring module (SMM) and a dynamic configuration module (DCM).
    1. State monitoring module (SMM). The content of state monitoring depends on the actual requirements, such as request volume, correct rate, response time, etc.; it can also include specific business parameters, exception codes, etc. The Spring Cloud framework provides the "/metrics", "/health", "/trace" and other endpoints for regular microservice state monitoring. The experiment in Section 4 activates these endpoints to achieve simple state monitoring and demonstrate the effectiveness of RL-SAMS. Customized SMMs and interfaces are also compatible with the mechanism proposed in this paper.
    2. Dynamic configuration module (DCM). To achieve runtime-oriented dynamic assurance, the RLDC must be able to dynamically configure and execute assurance actions without restarting the microservice. We establish a configuration center server to centrally manage the configuration files of each microservice; the RLDC controls the content of each microservice's configuration file according to the decision result, and triggers the microservice's configuration update, thereby realizing the service assurance, as shown in Figure 2.




                                    Figure 2: Interaction between AC and RLDC
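The AC-RLDC interaction of Figure 2 can be sketched as below. The module, key, and metric names are hypothetical illustrations, not the paper's actual implementation:

```python
# Hypothetical sketch of the Figure 2 interaction (names are illustrative):
# the SMM condenses raw endpoint metrics into a state vector for the RLDC,
# and the DCM writes the decided assurance action into the configuration
# center, which the microservice picks up without a restart.

def smm_state(metrics):
    """SMM: condense raw monitoring metrics into the RLDC's state input."""
    total = max(metrics["requests"], 1)      # avoid division by zero
    return (
        metrics["requests"],                 # request volume
        metrics["success"] / total,          # correct rate
        metrics["avg_response_ms"],          # response time
    )

class ConfigCenter:
    """Stands in for the centralized configuration server."""
    def __init__(self):
        self.files = {}                      # service name -> config entries

    def push(self, service, key, value):
        self.files.setdefault(service, {})[key] = value

def dcm_apply(center, service, degrade):
    """DCM: write the RLDC's decision into the service's config file."""
    center.push(service, "degradation.enabled", "on" if degrade else "off")

# Example: the RLDC degrades the non-core client and reads fresh state.
center = ConfigCenter()
dcm_apply(center, "Non_core_client", degrade=True)
state = smm_state({"requests": 120, "success": 96, "avg_response_ms": 250.0})
```

A production SMM would instead query the Spring Cloud monitoring endpoints mentioned above; the condensing step stays the same.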




                                                        Figure 3: Framework of RLDC
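The core mechanics of the Figure 3 framework, detailed in Section 3.2 below, can be sketched as follows. This is illustrative only: the toy scalar "networks" stand in for the paper's neural networks, and the threshold policy is our own placeholder:

```python
import random
from collections import deque

# Minimal sketch of the Figure 3 mechanics: an experience replay pool
# storing (s_1..s_n; a_1..a_n; R; s_1'..s_n') records, and a
# target/evaluation strategy pair where the target is synced only
# periodically for stability.

class ReplayPool:
    def __init__(self, capacity=200):
        self.pool = deque(maxlen=capacity)   # oldest records are evicted

    def record(self, states, actions, reward, next_states):
        self.pool.append((states, actions, reward, next_states))

    def sample(self, n):
        # random sampling breaks the correlation between training samples
        return random.sample(list(self.pool), min(n, len(self.pool)))

class ActionDecision:
    """One agent's evaluation/target strategy pair (toy scalar weights)."""
    def __init__(self):
        self.theta_eval = 0.0    # trained continuously from critic feedback
        self.theta_target = 0.0  # updated only by periodic copying

    def act(self, s_i):
        # stand-in for a_i = mu_i(s_i | theta_eval): simple threshold policy
        return "on" if s_i + self.theta_eval > 0.5 else "off"

    def sync_target(self):
        # periodic copy stabilizes learning, as the text describes
        self.theta_target = self.theta_eval

pool = ReplayPool()
pool.record((0.9, 0.2), ("on", "off"), 3.2, (0.7, 0.3))
agent = ActionDecision()
agent.theta_eval = 0.3
agent.sync_target()              # e.g. after a fixed number of updates
batch = pool.sample(32)          # only one record stored so far
```

The shared critic ("value decision" module) would score each sampled batch to produce the Q-value feedback that trains `theta_eval`.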

3.2. RL Decision-making Component

In the RLDC, each microservice with decision-making ability is modelled as an independent agent for centralized training and decentralized execution. That is, in the training stage, the learning of each agent is performed using global states so as to take the strategies of the other agents into account; in the execution stage, each agent makes decisions based only on its own state perception. In addition, an experience replay pool is set up, and the experience replay mechanism is used to address the correlation between training samples and the non-stationary probability distribution of training samples. Each state transition is recorded as a state-action pair together with the corresponding reward and next state, as follows:

        (s_1, s_2, …, s_n; a_1, a_2, …, a_n; R; s_1', s_2', …, s_n')

where s_i is the current state of each microservice, a_i is the assurance action selected by each microservice, R is the reward value, such as the degree of satisfaction of the various users' expectations after each assurance action is performed, and s_i' is the next state of each microservice. The framework and process for two microservices are shown in Figure 3. Each microservice corresponds to an independent "action decision" module and a shared "value decision" module. Each "action decision" module contains two strategy networks with the same structure, the target strategy μ_i' and the evaluation strategy μ_i, which are used for assurance decision-making based on the local microservice state:
    1. The target strategy μ_i' takes the next state of the local microservice s_i' as input, and outputs the assurance action a_i' corresponding to s_i':

        a_i' = μ_i'(s_i' | θ^μ_target)

    The target strategy μ_i' does not actively train; instead, it is periodically updated with the parameters of the continuously learning evaluation strategy μ_i, thereby increasing the stability of the learning process. θ^μ_target is the parameter of the target strategy network.
    2. The evaluation strategy μ_i takes the current state of the local microservice s_i as input, and outputs the assurance action a_i corresponding to s_i:

        a_i = μ_i(s_i | θ^μ_eval)

    The evaluation strategy μ_i is continuously trained based on the Q-value feedback from the "value decision" module. θ^μ_eval is the parameter of the evaluation strategy network.
    Although decision-making is decentralized, the microservices are closely related in business logic, so the service effect of each microservice is mostly evaluated comprehensively. Therefore, unlike MADDPG, which designs a critic module for each agent, this paper designs a single shared critic module (i.e., the "value decision" module) for all microservices, which outputs the corresponding Q-value of each microservice according to the comprehensive reward function. The "value decision" module contains two neural networks with the same structure, the value decision target network Net_target_critic and the value decision evaluation network Net_evaluation_critic, which are used to output the Q-value of each microservice assurance action based on the global state of the microservice system:
    1. Net_target_critic takes the next state of the microservice system (s_1', s_2', …, s_n') and the corresponding actions (a_1', a_2', …, a_n') as input, and outputs the Q-value corresponding to the next state of each microservice:

        Q_i'(s_i', a_i' | θ^Q_target)

    where θ^Q_target is the parameter of Net_target_critic. Net_target_critic does not actively train and learn; instead, it is periodically updated with the continuously learned parameters of Net_evaluation_critic to increase the stability of the learning process.
    2. Net_evaluation_critic takes the current state of the microservice system (s_1, s_2, …, s_n) and the corresponding actions (a_1, a_2, …, a_n) as input, and outputs the Q-value corresponding to the current state of each microservice:

        Q_i(s_i, a_i | θ^Q_eval)




   where        πœƒπ‘’π‘£π‘Žπ‘™    is     the       parameter     of        simulation modules for three business function
   𝑁𝑒𝑑_π‘’π‘£π‘Žπ‘™π‘’π‘Žπ‘‘π‘–π‘œπ‘›_π‘π‘Ÿπ‘–π‘‘π‘–c.        𝑁𝑒𝑑_π‘’π‘£π‘Žπ‘™π‘’π‘Žπ‘‘π‘–π‘œπ‘›_π‘π‘Ÿπ‘–π‘‘π‘–π‘            microservices: πΆπ‘œπ‘Ÿπ‘’_𝑐𝑙𝑖𝑒𝑛𝑑, π‘π‘œπ‘›_π‘π‘œπ‘Ÿπ‘’_𝑐𝑙𝑖𝑒𝑛𝑑, and
   periodically selects several state transition records          π‘ƒπ‘Ÿπ‘œπ‘£π‘–π‘‘π‘’π‘Ÿ_π‘’π‘ π‘’π‘Ÿ. We set three simulation modules with
   randomly from the experience replay pool for                   different pressure cycles to simulate different
   training and learning, let’s say 𝑁. The process of             pressure sources of the microservice system to verify
   training and learning is the process of continuously           the core business priority assurance capability of RL-
   optimizing the difference between the estimated                SAMS in the face of different pressure sources.
   Q-value and the actual Q-value. The loss function is
   defined as:
                   1                            𝑄
                                                                  4.2. Experimental Design
      𝐿(πœƒπ‘’π‘£π‘Žπ‘™ ) = βˆ‘(π‘Ÿ + 𝛾 βˆ— 𝑄𝑖′ (𝑠𝑖′ , π‘Žπ‘–β€² |πœƒπ‘‘π‘Žπ‘Ÿπ‘”π‘’π‘‘ )
                   𝑁
                                          𝑄       2               The experiment takes whether the two request
                          βˆ’ 𝑄𝑖 (𝑠𝑖 , π‘Žπ‘– |πœƒπ‘’π‘£π‘Žπ‘™ ))
                                                                  microservices perform service degrade as action space,
   where 𝛾 is the learning rate, 𝛾 ∈ [0,1] . The                  π‘Žπ‘π‘œπ‘Ÿπ‘’ ∈ [on, off] , π‘Žπ‘›π‘œπ‘›_π‘π‘œπ‘Ÿπ‘’ ∈ [on, off] , and compares
   larger the 𝛾 , the more emphasis on long-term                  the average reward value of all heartbeat monitoring
   rewards in the learning process. The evaluation                requests for two client microservices within 15s after
   strategy of each microservice πœ‡π‘– updates the                   each assurance action. π‘Žπ‘π‘œπ‘Ÿπ‘’ = π‘œπ‘› means that the
   parameters according to gradient descent (J1 and               service degradation mechanism is enabled to ensure
   J2 in Figure 3):                                               service continuity, and π‘Žπ‘π‘œπ‘Ÿπ‘’ = π‘œπ‘“π‘“ means the
            1             πœ‡
      βˆ‡π½ β‰ˆ βˆ‘ βˆ‡πœ‡π‘– (𝑠𝑖 |πœƒπ‘’π‘£π‘Žπ‘™ ) βˆ™ βˆ‡ 𝑄𝑖 (𝑠𝑖 , π‘Žπ‘– , πœƒπ‘’π‘£π‘Žπ‘™ )           opposite. Reward function is defined as:
            𝑁                                                                     βˆ‘ 𝑅CC                 βˆ‘ 𝑅NC
                                                                        𝑅=                    +
                                                                             πΆπ‘œπ‘Ÿπ‘’_π‘Ÿπ‘’π‘žπ‘’π‘’π‘ π‘‘π‘  π‘π‘œπ‘›_π‘π‘œπ‘Ÿπ‘’_π‘Ÿπ‘’π‘žπ‘’π‘’π‘ π‘‘π‘ 
4. Experiments                                                        Where,
                                                                                        4,    π‘›π‘œπ‘šπ‘Žπ‘™_π‘ π‘’π‘Ÿπ‘£π‘–π‘π‘’
4.1. Experimental scene                                                       𝑅CC = { 1,     π‘‘π‘’π‘”π‘Ÿπ‘Žπ‘‘π‘’_π‘ π‘’π‘Ÿπ‘£π‘–π‘π‘’
                                                                                      βˆ’3,      π‘ π‘’π‘Ÿπ‘£π‘–π‘π‘’_π‘“π‘Žπ‘–π‘™π‘’π‘Ÿπ‘’
                                                                                        1,     π‘›π‘œπ‘šπ‘Žπ‘™_π‘ π‘’π‘Ÿπ‘£π‘–π‘π‘’
In order to verify the effectiveness of RL-SAMS, we
                                                                              𝑅NC = { 0,     π‘‘π‘’π‘”π‘Ÿπ‘Žπ‘‘π‘’_π‘ π‘’π‘Ÿπ‘£π‘–π‘π‘’
build a user-information-querying system consisting
                                                                                      βˆ’1,      π‘ π‘’π‘Ÿπ‘£π‘–π‘π‘’_π‘“π‘Žπ‘–π‘™π‘’π‘Ÿπ‘’
five microservices with "VMware Workstation 16 Pro",
                                                                      πΆπ‘œπ‘Ÿπ‘’_π‘Ÿπ‘’π‘žπ‘’π‘’π‘ π‘‘π‘  and π‘π‘œπ‘›_π‘π‘œπ‘Ÿπ‘’_π‘Ÿπ‘’π‘žπ‘’π‘’π‘ π‘‘π‘  are the
as shown in Figure 4. The system includes three
                                                                  total number of microservice state heartbeat
business microservices, one configuration center
                                                                  monitoring requests sent randomly in the
microservice and one registry center microservice.
                                                                  corresponding period, βˆ‘ 𝑅CC and βˆ‘ 𝑅NC are the
Each microservice is developed based on "Spring
                                                                  sum of the heartbeat monitoring request rewards for
Cloud" framework[22] and deployed on an
                                                                  πΆπ‘œπ‘Ÿπ‘’_𝑐𝑙𝑖𝑒𝑛𝑑 and π‘π‘œπ‘›_π‘π‘œπ‘Ÿπ‘’_𝑐𝑙𝑖𝑒𝑛𝑑 respectively. Three
independent VMware virtual machine. The
                                                                  responses are as following:
configuration of each virtual machine is as follows:
                                                                      β€’     π‘›π‘œπ‘Ÿπ‘šπ‘Žπ‘™_π‘ π‘’π‘Ÿπ‘£π‘–π‘π‘’. Returning the correct
memory 1GB, number of processors 1, hard disk (SCSI)
                                                                      request result within the specified time;
20GB, operating system Ubuntu-16.04.
                                                                      β€’     π‘‘π‘’π‘”π‘Ÿπ‘Žπ‘‘π‘’π‘‘_π‘ π‘’π‘Ÿπ‘£π‘–π‘π‘’. The microservice is
   Three business microservices include:
                                                                      degraded and in this experiment, it is designed that
    1.    Two client microservices, πΆπ‘œπ‘Ÿπ‘’_𝑐𝑙𝑖𝑒𝑛𝑑 and
                                                                      a default value is returned without actually
    π‘π‘œπ‘›_π‘π‘œπ‘Ÿπ‘’_𝑐𝑙𝑖𝑒𝑛𝑑 which are used to receive
                                                                      processing;
    requests for querying user information, and call
    the π‘ƒπ‘Ÿπ‘œπ‘£π‘–π‘‘π‘’π‘Ÿ_π‘’π‘ π‘’π‘Ÿ microservice to return the                      β€’     π‘ π‘’π‘Ÿπ‘£π‘–π‘π‘’_π‘“π‘Žπ‘–π‘™π‘’π‘Ÿπ‘’. Timing out or returning
    result to the requesting user. There is no difference             error. Different reward value is designed between
    in business logic between the two microservices,                  πΆπ‘œπ‘Ÿπ‘’_𝑐𝑙𝑖𝑒𝑛𝑑 and the π‘π‘œπ‘›_π‘π‘œπ‘Ÿπ‘’_𝑐𝑙𝑖𝑒𝑛𝑑 to
    just to verify that the RL-SAMS has the ability to                encouraging business-critical service assurance.
                                                                      In RL, a 2-layer 𝑁𝑒𝑑_π‘‘π‘Žπ‘Ÿπ‘”π‘’π‘‘_π‘π‘Ÿπ‘–π‘‘π‘–π‘ and
    guarantee core business priority, one of the two
                                                                  𝑁𝑒𝑑_π‘’π‘£π‘Žπ‘™π‘’π‘Žπ‘‘π‘–π‘œπ‘›_π‘π‘Ÿπ‘–π‘‘π‘–π‘ are constructed based on
    client microservices is selected as the core
                                                                  TensorFlow. 𝑁𝑒𝑑_π‘’π‘£π‘Žπ‘™π‘’π‘Žπ‘‘π‘–π‘œπ‘›_π‘π‘Ÿπ‘–π‘‘π‘–π‘ updates the
    business microservice.
                                                                  parameters to 𝑁𝑒𝑑_π‘‘π‘Žπ‘Ÿπ‘”π‘’π‘‘_π‘π‘Ÿπ‘–π‘‘π‘–π‘ every 200 learning.
    2.    One        π‘ƒπ‘Ÿπ‘œπ‘£π‘–π‘‘π‘’π‘Ÿ_π‘’π‘ π‘’π‘Ÿ         microservice,
                                                                  The optimization of the neural network adopts
    responsible for background business processing.
                                                                  RMSprop optimizer. The learning rate 𝛾 is set to 0.9,
    The microservice receives user information query
                                                                  and the exploration strategy πœ€ is set to 0.8. The
    requests, and returns the query results. In order to
                                                                  capacity of the experience replay pool is 200, and
    simulate the performance bottleneck of each
                                                                  𝑁𝑒𝑑_π‘’π‘£π‘Žπ‘™π‘’π‘Žπ‘‘π‘–π‘œπ‘›_π‘π‘Ÿπ‘–π‘‘π‘–π‘ randomly selects 32 sets of
    microservice, set the π‘ƒπ‘Ÿπ‘œπ‘£π‘–π‘‘π‘’π‘Ÿ_π‘’π‘ π‘’π‘Ÿ microservice
                                                                  state transition records from the experience replay
    to execute the information query service after
                                                                  pool every 5 steps as training samples for learning, and
    sleeping for one second.
                                                                  simultaneously trains two behavioral decision
    We simulate high concurrent business requests
                                                                  evaluation strategies.
based on the performance testing framework "Locust".
In the experiment, we deploy three pressure
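The paper does not list the per-response reward values explicitly, but the (R_CC, R_NC) columns of Table 1, the reported "Blank" average of -4, and the "Hystrix" average of 1 pin down one consistent mapping. The sketch below is our reconstruction, not code from the paper; all names and the non-core failure value are inferred:

```python
# Hypothetical per-response reward mapping, reconstructed so that it
# reproduces every (R_CC, R_NC) pair in Table 1 and the reported
# averages. The paper itself does not list these values.
CORE_REWARD = {
    "normal_service": 4,     # correct result within the specified time
    "degraded_service": 1,   # default value returned, no real processing
    "service_failure": -4,   # timeout or error
}
NON_CORE_REWARD = {
    "normal_service": 1,
    "degraded_service": 0,
    "service_failure": 0,    # inferred so the "Blank" average equals -4
}

def step_reward(core_response: str, non_core_response: str) -> int:
    """Joint reward R = R_CC + R_NC for one pair of heartbeat responses."""
    return CORE_REWARD[core_response] + NON_CORE_REWARD[non_core_response]
```

For example, degrading both clients (the Hystrix behavior under high concurrency) yields 1 + 0 = 1, matching the reported average reward.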


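The degraded_service response (a default value returned without actual processing) behaves like a simple fallback switch around the real handler. The following is a minimal Python sketch, not the paper's Spring Cloud implementation; `query_user` and the default payload are invented for illustration:

```python
import functools

def with_degradation(default):
    """Return `default` immediately when degradation is switched on,
    mimicking the degraded_service response; otherwise run the handler."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, degraded=False, **kwargs):
            if degraded:            # a_core / a_non_core == "on"
                return default      # default value, no real processing
            return handler(*args, **kwargs)
        return wrapper
    return decorator

@with_degradation(default={"user": "unknown"})
def query_user(user_id):
    # Stand-in for the Provider_user call, which in the experiment
    # sleeps for one second to simulate a performance bottleneck.
    return {"user": f"user-{user_id}"}
```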


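The training schedule above (a replay pool of 200, a batch of 32 sampled every 5 steps, and a parameter copy from Net_evaluation_critic to Net_target_critic every 200 learning steps) can be sketched as follows. A tabular Q-array stands in for the paper's two-layer TensorFlow critics, a plain update replaces RMSprop for brevity, and reading ε = 0.8 as the greedy-action probability is our assumption:

```python
import random
from collections import deque

import numpy as np

GAMMA = 0.9          # discount factor
EPSILON = 0.8        # assumed greedy-action probability of the ε-greedy policy
POOL_CAPACITY = 200  # experience replay pool size
BATCH_SIZE = 32      # transitions sampled per learning step
LEARN_EVERY = 5      # environment steps between learning steps
SYNC_EVERY = 200     # learning steps between target-network syncs

class CriticPair:
    """Illustrative stand-in for Net_evaluation_critic / Net_target_critic."""
    def __init__(self, n_states: int, n_actions: int):
        self.eval_q = np.zeros((n_states, n_actions))
        self.target_q = self.eval_q.copy()
        self.pool = deque(maxlen=POOL_CAPACITY)   # old records drop out automatically
        self.learn_steps = 0

    def act(self, state: int) -> int:
        if random.random() > EPSILON:              # explore
            return random.randrange(self.eval_q.shape[1])
        return int(np.argmax(self.eval_q[state]))  # exploit

    def store(self, transition):
        self.pool.append(transition)               # (s, a, r, s')

    def learn(self, lr: float = 0.01):
        if len(self.pool) < BATCH_SIZE:
            return
        for s, a, r, s2 in random.sample(list(self.pool), BATCH_SIZE):
            # Bootstrapped target uses the (frozen) target network.
            target = r + GAMMA * np.max(self.target_q[s2])
            self.eval_q[s, a] += lr * (target - self.eval_q[s, a])
        self.learn_steps += 1
        if self.learn_steps % SYNC_EVERY == 0:     # periodic parameter copy
            self.target_q = self.eval_q.copy()
```

One such pair would be trained per request microservice, matching the two behavioral decision evaluation strategies trained simultaneously.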
                                             Figure 4: Experimental Scene

Table 1
Comparative Experiment Scenarios

 Service risk scenario | Core_client concurrent users | Non_core_client concurrent users | Expected action                  | R_CC | R_NC | R
 HJC-HCC-LNC           | 200                          | 50                               | [a_core = on,  a_non_core = off] | 1    | 1    | 2
 HJC-LCC-HNC           | 50                           | 200                              | [a_core = off, a_non_core = on]  | 4    | 0    | 4
 LJC-LCC-LNC           | 50                           | 50                               | [a_core = off, a_non_core = off] | 4    | 1    | 5
 HJC-LCC-LNC           | 100                          | 100                              | [a_core = off, a_non_core = on]  | 4    | 0    | 4
 HJC-HCC-HNC           | 200                          | 200                              | [a_core = on,  a_non_core = off] | 1    | 0    | 1
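The expectations in Table 1 can be transcribed and checked mechanically. The dictionary below mirrors the table rows; the helper names are ours, not part of RL-SAMS:

```python
# Expected optimal actions and rewards per scenario, transcribed from Table 1.
# Action tuples are (a_core, a_non_core); "on" means degradation is activated.
EXPECTATIONS = {
    "HJC-HCC-LNC": (("on", "off"), 1, 1),
    "HJC-LCC-HNC": (("off", "on"), 4, 0),
    "LJC-LCC-LNC": (("off", "off"), 4, 1),
    "HJC-LCC-LNC": (("off", "on"), 4, 0),
    "HJC-HCC-HNC": (("on", "off"), 1, 0),
}

def total_reward(scenario: str) -> int:
    """Joint expected reward R = R_CC + R_NC for a scenario."""
    _, r_cc, r_nc = EXPECTATIONS[scenario]
    return r_cc + r_nc

def parse_scenario(name: str):
    """Split an X1JC-X2CC-X3NC scenario name into its three pressure fields."""
    jc, cc, nc = name.split("-")
    return jc[0], cc[0], nc[0]   # each 'H' (high) or 'L' (low)
```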



4.3. Comparative Experiment

4.3.1. Effectiveness Analysis
The experiment takes the widely used Hystrix [23] as the baseline method and compares the assurance effect of the Hystrix service circuit-breaker mechanism and RL-SAMS in the five service risk scenarios shown in Table 1. In addition, the service effect without any assurance method, named "Blank" in Figure 5, is compared as another baseline to verify the successful implementation of Hystrix and RL-SAMS. Table 1 shows the five service risk scenarios together with the expected optimal decision action and average reward. The name of a service risk scenario combines three fields, X1JC-X2CC-X3NC, corresponding to different concurrent pressure models. X1JC is the joint concurrency field, indicating whether requests from Core_requests and Non_core_requests together reach performance saturation. X2CC and X3NC are independent concurrency fields, indicating whether requests from Core_requests or Non_core_requests alone reach performance saturation. H means high concurrent pressure; L means low concurrent pressure. Preliminary experiments indicate that around 150 concurrent users subject the microservices in this experiment to high concurrency pressure.

Figure 5: Comparative Experiment

    The average reward value of heartbeat monitoring requests for the three service assurance methods in the five service risk scenarios is shown in Figure 5.
    In all HJC scenarios: (1) The "Blank" method causes the response time of all requests to time out; according to the reward function, the average reward value is -4. (2) The "Hystrix" method activates the circuit breakers of both request microservices whenever independent concurrency is high (HCC or HNC), so the average reward value is 1. Due to Hystrix's retransmission mechanism, the average reward value fluctuates in the range 1±0.2. (3) Comparing the effects in HJC-HCC-LNC, HJC-LCC-HNC, and HJC-LCC-LNC verifies that the decision model trained by RL-SAMS intelligently and selectively degrades the microservices according to the source of pressure. In




π»π½πΆβˆ’π»πΆπΆβˆ’πΏπ‘πΆ, since 𝐻𝐢𝐢 causes 𝐻𝐽𝐢, πΆπ‘œπ‘Ÿπ‘’_π‘Ÿπ‘’π‘žπ‘’π‘’π‘ π‘‘π‘  is               Non_core_client will receive the maximum reward
impossible to assurance. So, it is best to degrade its            values with [acore = off, anon_core = on] , and their
service to assurance π‘π‘œπ‘›_π‘π‘œπ‘Ÿπ‘’_π‘Ÿπ‘’π‘žπ‘’π‘’π‘ π‘‘π‘ ; In                        corresponding Q-values will also be the highest.
π»π½πΆβˆ’πΏπΆπΆβˆ’π»π‘πΆ, since 𝐻𝑁𝐢 causes 𝐻𝐽𝐢, it is best to degrade          Therefore, as training progresses, the proportion of
π‘π‘œπ‘›_π‘π‘œπ‘Ÿπ‘’_π‘Ÿπ‘’π‘žπ‘’π‘’π‘ π‘‘π‘  to assurance πΆπ‘œπ‘Ÿπ‘’_π‘Ÿπ‘’π‘žπ‘’π‘’π‘ π‘‘π‘ ; In                  [acore = off, anon_core = on] increases. After Period 6,
π»π½πΆβˆ’πΏπΆπΆβˆ’πΏπ‘πΆ, πΆπ‘œπ‘Ÿπ‘’_π‘Ÿπ‘’π‘žπ‘’π‘’π‘ π‘‘π‘  and π‘π‘œπ‘›_π‘π‘œπ‘Ÿπ‘’_π‘Ÿπ‘’π‘žπ‘’π‘’π‘ π‘‘π‘                   the proportion of           [acore = off, anon_core = on]
together cause 𝐻𝐽 𝐢, it is also best to degrade and               exceeds 90% and stabilizes, reaching 98% in Period 8.
sacrifice    π‘π‘œπ‘›_π‘π‘œπ‘Ÿπ‘’_π‘Ÿπ‘’π‘žπ‘’π‘’π‘ π‘‘π‘          to      assurance          In other words, in service risk scenario HJC-LCC-HNC, RL-
πΆπ‘œπ‘Ÿπ‘’_π‘Ÿπ‘’π‘žπ‘’π‘’π‘ π‘‘π‘ , according to reward function.                      SAMS can, with a probability of 98% * 98% = 96%,
    The experiment verify that RL-SAMS can not only               ensure the normal service of the Core_client by only
effectively select the assurance action, but also                 degrading the concurrent requests of the
distinguish the degraded objects according to the                 Non_core_client. The accuracy performance in other
source of the service risk, so as to realize intelligent          service risk scenarios is similar.
elastic Microservice System assurance.

                                                                  5. Conclusion
                                                                  This paper introduces an innovative decision-making
                                                                  method for microservice systems, leveraging
                                                                  reinforcement learning principles. It seamlessly
                                                                  incorporates the core concept of "enhancing
                                                                  performance through experiential learning" into
                                                                  service assurance processes within the microservices
                                                                  architecture. The flexible assurance capability
                                                                  targeting critical assurance components paves the way
                                                                  for novel approaches to intelligent service assurance
                                                                  and maintenance. Through a thorough analysis and
                                                                  validation     via    case    experiments,    RL-SAMS
            Figure 6: RL process in HJC-LCC-HNC                   demonstrates its prowess across various service risk
                                                                  scenarios, particularly excelling in its ability to
                                                                  intelligently differentiate key assurance elements and
4.3.2. Model Accuracy and Training                                proactively ensure the continuity of core business
       Process Analysis                                           operations.
                                                                      While this paper has introduced reinforcement
During the model training process, two Locust                     learning methods into service assurance activities
modules for handling requests as microservices                    within microservice systems, there are still many
continuously simulate concurrent request pressures                aspects that require further research and exploration.
with a random cycle duration of 1800 seconds.                     These include:
Considering coverage of risk scenarios for five types of          β€’    Efficient Learning with Expanding State and
services and RL state space control to shorten the                     Action Spaces: Reinforcement learning is
learning cycle, the random range for concurrent users                  fundamentally about accumulating experiential
is set to [0, 50, 100, 150, 200]. Logs record the state of             knowledge to maximize rewards and minimize
each step and the selection of safeguarding actions                    losses. As the state and action spaces grow, the
during the model training process. Taking service risk                 cost of model training and learning also increases
scenario HJC-LCC-HNC as an example, Figure 6 presents                  rapidly. It will be necessary to investigate and
the proportion of assurance actions at each stage of                   improve methods for accumulating positive
training.Due to the random nature of simulating                        experiences more efficiently and enhancing
concurrent request pressures, HJC-LCC-HNC does not                     convergence rates.
occur continuously. The number of cycles in Figure 6              β€’    Decentralized Training and Centralized Learning:
refers to the extraction of all assurance action selection             The approach taken in this paper involves
records when HJC-LCC-HNC occurs throughout the entire                  centralized training and learning. However, in
training process. These records are sorted                             real-world scenarios where microservices come
chronologically, and every 100 data points are used to                 from different providers, there may be obstacles
calculate the proportion of assurance actions in a                     to sharing operational data. Addressing how to
Period. The decision of whether to degrade Core_client                 limit data sharing while enabling decentralized
and Non_core_client microservices to break their                       training for individual microservices and
concurrent requests will be made. As shown in Figure                   centralized learning of experiences is a pressing
6, in Period 1, the intelligent agents of the two request              challenge.
microservices almost randomly decide whether to                   β€’    Integration with Log Analysis and Risk Prediction:
activate the degradation. Since both client                            Exploring how to combine reinforcement learning
microservices experience low concurrent pressure,                      with log analysis and risk prediction to leverage
they both exhibit a trend of not activating degradation                prior knowledge and accelerate learning
in Period 2, resulting in an increase in the proportion                efficiency is an area worth investigating.
of [acore = off, anon_core = off]. Under the influence of              Integrating reinforcement learning with existing
the "value decision" module, Core_client and                           systems for proactive risk management and




     incident response can enhance the overall effectiveness of service assurance activities.
    These areas of research and improvement will contribute to the further development and refinement of reinforcement learning methods in the context of microservices and service assurance.

References

[1]  Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chenjie Xu, Chao Ji, and Wenyun Zhao. Poster: Benchmarking microservice systems for software engineering research. In 2018 IEEE/ACM 40th International Conference on Software Engineering: Companion (ICSE-Companion), pages 323–324. IEEE, 2018.
[2]  Holger Knoche and Wilhelm Hasselbring. Using microservices for legacy software modernization. IEEE Software, 35(3):44–49, 2018.
[3]  Florian Rademacher, Jonas Sorgalla, and Sabine Sachweh. Challenges of domain-driven microservice design: A model-driven perspective. IEEE Software, 35(3):36–43, 2018.
[4]  Claus Pahl, Antonio Brogi, Jacopo Soldani, and Pooyan Jamshidi. Cloud container technologies: a state-of-the-art review. IEEE Transactions on Cloud Computing, 7(3):677–692, 2017.
[5]  Zhizhen Zhong, Jipu Li, Nan Hua, Gustavo B Figueiredo, Yanhe Li, Xiaoping Zheng, and Biswanath Mukherjee. On qos-assured degraded provisioning in service-differentiated multi-layer elastic optical networks. In 2016 IEEE Global Communications Conference (GLOBECOM), pages 1–5. IEEE, 2016.
[6]  Alex S Santos, Andre K Horota, Zhizhen Zhong, Juliana De Santi, Gustavo B Figueiredo, Massimo Tornatore, and Biswanath Mukherjee. An online strategy for service degradation with proportional qos in elastic optical networks. In 2018 IEEE International Conference on Communications (ICC), pages 1–6. IEEE, 2018.
[7]  Lei Wang. Architecture-based reliability-sensitive criticality measure for fault-tolerance cloud applications. IEEE Transactions on Parallel and Distributed Systems, 30(11):2408–2421, 2019.
[8]  Chenhao Qu, Rodrigo N Calheiros, and Rajkumar Buyya. Auto-scaling web applications in clouds: A taxonomy and survey. ACM Computing Surveys (CSUR), 51(4):1–33, 2018.
[9]  Nathan Cruz Coulson, Stelios Sotiriadis, and Nik Bessis. Adaptive microservice scaling for elastic applications. IEEE Internet of Things Journal, 7(5):4195–4202, 2020.
[10] Donatella Firmani, Francesco Leotta, and Massimo Mecella. On computing throttling rate limits in web apis through statistical inference. In 2019 IEEE International Conference on Web Services (ICWS), pages 418–425. IEEE, 2019.
[11] Hongbing Wang, Xiaojun Wang, Xingguo Hu, Xingzhi Zhang, and Mingzhu Gu. A multi-agent reinforcement learning approach to dynamic service composition. Information Sciences, 363:96–119, 2016.
[12] Hongbing Wang, Qin Wu, Xin Chen, Qi Yu, Zibin Zheng, and Athman Bouguettaya. Adaptive and dynamic service composition via multi-agent reinforcement learning. In 2014 IEEE International Conference on Web Services, pages 447–454. IEEE, 2014.
[13] Jia Rao, Xiangping Bu, Kun Wang, and Cheng-Zhong Xu. Self-adaptive provisioning of virtualized resources in cloud computing. In Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, pages 129–130, 2011.
[14] Dongsun Kim and Sooyong Park. Reinforcement learning-based dynamic adaptation planning method for architecture-based self-managed software. In 2009 ICSE Workshop on Software Engineering for Adaptive and Self-Managing Systems, pages 76–85. IEEE, 2009.
[15] Mehdi Amoui, Mazeiar Salehie, Siavash Mirarab, and Ladan Tahvildari. Adaptive action selection in autonomic software using reinforcement learning. In Fourth International Conference on Autonomic and Autonomous Systems (ICAS'08), pages 175–181. IEEE, 2008.
[16] Tianqi Zhao, Wei Zhang, Haiyan Zhao, and Zhi Jin. A reinforcement learning-based framework for the generation and evolution of adaptation rules. In 2017 IEEE International Conference on Autonomic Computing (ICAC), pages 103–112. IEEE, 2017.
[17] Nabila Belhaj, Djamel Belaïd, and Hamid Mukhtar. Framework for building self-adaptive component applications based on reinforcement learning. In 2018 IEEE International Conference on Services Computing (SCC), pages 17–24. IEEE, 2018.
[18] Han Nguyen Ho and Eunseok Lee. Model-based reinforcement learning approach for planning in self-adaptive software system. In Proceedings of the 9th International Conference on Ubiquitous Information Management and Communication, pages 1–8, 2015.
[19] Gerald Tesauro, Nicholas K Jong, Rajarshi Das, and Mohamed N Bennani. A hybrid reinforcement learning approach to autonomic resource allocation. In 2006 IEEE International Conference on Autonomic Computing, pages 65–73. IEEE, 2006.
[20] Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems, 30, 2017.
[21] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[22] Iuliana Cosmina. Spring microservices with Spring Cloud. In Pivotal Certified Professional Spring Developer Exam: A Study Guide, pages 435–459. 2017.
[23] H. Molchanov and A. Zhmaiev. Circuit breaker in systems based on microservices architecture. 2018.



