=Paper= {{Paper |id=Vol-2978/faacs-paper3 |storemode=property |title=Scenario-based Resilience Evaluation and Improvement of Microservice Architectures: An Experience Report |pdfUrl=https://ceur-ws.org/Vol-2978/faacs-paper3.pdf |volume=Vol-2978 |authors=Sebastian Frank,Alireza Hakamian,Lion Wagner,Dominik Kesim,Jóakim von Kistowski,André van Hoorn |dblpUrl=https://dblp.org/rec/conf/ecsa/FrankHWKKH21 }} ==Scenario-based Resilience Evaluation and Improvement of Microservice Architectures: An Experience Report== https://ceur-ws.org/Vol-2978/faacs-paper3.pdf
Scenario-based Resilience Evaluation and Improvement of
Microservice Architectures: An Experience Report
Sebastian Frank¹, Alireza Hakamian¹, Lion Wagner¹, Dominik Kesim¹, Jóakim von Kistowski² and André van Hoorn¹
¹ Institute of Software Engineering, University of Stuttgart, Stuttgart, Germany
² DATEV eG, Nürnberg, Germany


Abstract
Context. Microservice-based architectures are expected to be resilient. However, various systems still suffer severe quality degradation from changes, e.g., service failures or workload variations. Problem. In practice, the elicitation of resilience requirements and the quantitative evaluation of whether the system meets these requirements are either not systematic or not conducted at all. Objective. We explore (1) the scenario-based Architecture Trade-Off Analysis Method (ATAM) for resilience requirement elicitation and (2) resilience testing through chaos experiments for architecture assessment and improvement. Method. In an industrial case study, we design a structured ATAM-based workshop, including the system's stakeholders, to elicit resilience requirements. We specify these requirements using the ATAM scenario template. We transform those scenarios into resilience experiments to quantitatively evaluate and improve system resilience. Result. We identified 12 resilience scenarios. We use and extend ChaosToolkit to automate and execute two scenarios. We quantitatively evaluate resilience requirements and suggest resilience improvements in the scope of both scenarios. We share lessons learned from the case study. In particular, our work provides evidence that an ATAM-based workshop is intuitive to stakeholders in an industrial setting. Conclusion. Our approach helps requirement and quality engineering teams in the process of resilience requirements elicitation.


ECSA 2021 Companion Volume, Växjö, Sweden, 13-17 September 2021
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction
An intrinsic quality property of the microservices architectural style is resilience, i.e., the system meets performance and other Quality of Service (QoS) requirements despite different failure modes or workload variations [1]. However, real-world postmortems [2] show that systems suffer from either unacceptable QoS degradation or unacceptable recovery times. It is therefore necessary to assure system resilience in the context of microservice-based architectures.

Practitioners use Chaos Engineering [3], including tools such as Chaos Toolkit (CTK) [4, 5] or Chaos Monkey (https://github.com/Netflix/chaosmonkey), for resilience testing. They need to (1) think about hazards [6] as causes of QoS degradation, (2) set up chaos experiments by specifying failure mode types and hypotheses of expected quality behavior, and (3) execute each experiment to detect deviations from the hypotheses. First, this approach lacks the systematic identification of causes of a hazard through hazard analysis methods. We contributed to this problem in our previous work [7], which serves as a foundation for this paper. In particular, we now integrate hazard analysis into a more systematic elicitation process and use a more formal description of requirements (scenarios). Second, the approach lacks a systematic process of eliciting and refining resilience requirements.

In the context of an industrial case study, our objective is to explore the application and adoption of the Architecture Trade-Off Analysis Method (ATAM) [8] for (1) the system's resilience requirement elicitation and (2) resilience testing through resilience experiments (aka chaos experiments) for architecture assessment and improvement. We hypothesize that ATAM, which has already been used in practice to elicit and specify quality requirements other than resilience, e.g., availability, performance, and maintainability, can be adopted for effectively eliciting resilience requirements and evaluating them through chaos experiments. Therefore, our research question is: How to leverage ATAM to elicit resilience requirements, which can be utilized to evaluate resilience through resilience experiments and suggest architectural improvements quantitatively?

We use an ATAM-based workshop to elicit and specify resilience requirements by involving system stakeholders. ATAM allows us to describe resilience requirements as scenarios in semi-structured textual language. The scenario template consists of the following elements: (1) source, (2) stimuli, (3) artifact(s), (4) system's environment, (5) its response, and (6) response measure.

The designed structured workshop aims to identify hazards and architectural design decisions. During the workshop, we employ a hazard analysis based on the Fault Tree Analysis (FTA) [6]. The result is a set of 12 resilience scenarios, which we turn into experiments. Next, we use CTK to automate these experiments and conduct a measurement-based resilience evaluation. Furthermore, we improve system resilience by applying a resilience pattern [9, 1], namely retry. We validate the improvement by re-executing the respective scenario.
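The six-element scenario template and the measurement-based check of a response measure mentioned in the introduction can be illustrated with a minimal Python sketch. It is not the authors' tooling: the class name `ResilienceScenario`, the helper functions, the example field values for `response` and `response_measure`, and the 1 s / 99 % threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ResilienceScenario:
    """One row of an ATAM-style scenario table (field names are illustrative)."""
    source: str
    stimulus: str
    artifact: str
    environment: str
    response: str
    response_measure: str


def fraction_within(response_times_s: Sequence[float], threshold_s: float) -> float:
    """Return the share of measured response times at or below the threshold."""
    if not response_times_s:
        raise ValueError("no measurements collected")
    within = sum(1 for t in response_times_s if t <= threshold_s)
    return within / len(response_times_s)


def measure_is_met(
    response_times_s: Sequence[float],
    threshold_s: float = 1.0,         # assumed threshold, not taken from the paper
    required_fraction: float = 0.99,  # assumed share of requests that must comply
) -> bool:
    """Check a percentile-style response measure against collected measurements."""
    return fraction_within(response_times_s, threshold_s) >= required_fraction


# Hypothetical scenario instance, loosely following a load-peak example.
load_peak = ResilienceScenario(
    source="Users of the system",
    stimulus="Unexpected workload peak",
    artifact="A service",
    environment="Payslip calculation period",
    response="Requests are still answered",                        # assumed wording
    response_measure="Response time <= 1 s in 99 % of the cases",  # assumed wording
)

# Synthetic measurements in seconds (made up for the example).
baseline = [0.2] * 995 + [1.5] * 5   # 99.5 % of requests within 1 s
degraded = [0.2] * 90 + [1.5] * 10   # 90 % of requests within 1 s

print(load_peak.stimulus, "->", measure_is_met(baseline))  # measure met
print(load_peak.stimulus, "->", measure_is_met(degraded))  # measure violated
```

Keeping the scenario as plain data and the check as a separate function mirrors the separation between eliciting a requirement and evaluating it from measurements; in a real setup the measurement lists would come from the experiment run rather than from literals.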
To summarize, the paper makes the following contributions:

• Leveraging ATAM and FTA in an industrial system to elicit resilience requirements, then evaluating the requirements and improving the system's resilience.
• Automating scenario execution using CTK for measurement-based evaluation.
• Sharing lessons learned that benefit both practitioners and researchers regarding resilience requirement elicitation, evaluation, and improvement.
• Providing artifacts, including scenarios, resilience experiments, and results of the experimentation, online [10].

2. Related Work
A workshop is an effective technique for requirement elicitation [11]. In our case, the workshop's preparation and execution are based on the scenario template of Bass et al. [12]. Our work differs from existing work on measurement-based resilience evaluation in that it includes an explicit step for eliciting resilience requirements. In the next paragraphs, we elaborate on this in more detail.

Cámara et al. [13, 14, 15] propose an approach for resilience analysis of self-adaptive systems. The core idea consists of three parts: (1) specification of resilience properties using Probabilistic Computation Tree Logic, (2) modeling causes of a hazard, e.g., high load, using experimentation and collecting traces of system behavior, and (3) verification of resilience properties using model checking. In contrast to model checking-based verification, we evaluate a resilience scenario's response measure by analyzing collected measurements. Furthermore, Cámara et al. do not focus on the elicitation of resilience requirements using requirement engineering methods.

Chaos Engineering [5, 3] is a technique for evaluating system resilience through injecting failures [16]. There are works on both (1) using engineering methods to identify failure modes [7], i.e., causes of a hazard, systematically before failure injection, and (2) ad-hoc failure injection with no systematic failure mode identification [2]. However, they do not explicitly specify resilience requirements and lack a methodical way for requirement elicitation. Our work is a step toward closing this gap.

In the context of resilience requirement elicitation, Yin et al. [17] propose a goal-oriented technique for representing resilience requirements. The high-level idea is to represent a resilience goal (e.g., all requests are processed correctly) and identify possible causes of hazards that act as obstacles for achieving a resilience goal (e.g., node failure). However, they do not discuss how to identify hazards and their causes. Goal orientation and developing scenarios are two activities in requirements engineering [11] that benefit the elicitation process. According to Pohl [11], scenario development benefits elicitation by making goals understandable for stakeholders and may refine or identify new goals. Our work uses scenario development without goal-oriented modeling, as all stakeholders know the system's high-level quality goal.

To our knowledge, this is the first work using ATAM for eliciting and specifying resilience requirements before evaluating the resilience through experiments.

3. Research Methodology
Section 3.1 explains the domain context and describes the high-level architecture of the case study system. Section 3.2 summarizes our research methodology.

3.1. Domain Context
The case study system's purpose is to calculate payments. An accounting department's wage clerks use the payment accounting system to calculate each registered employee's income taxes. The payment accounting system has to gather data from health insurance providers and send its results to the corresponding tax office to execute the calculations. This process presumes that a company that wants to use the payment accounting system provides its employee and tax information to the health insurance provider and tax offices.

All of the payment accounting system's tasks are currently taken care of by a monolithic legacy system. In peak times, up to 13 million calculation requests have to be handled in a day or single night. Under normal circumstances, this number is significantly lower. In order to handle such varying loads more efficiently, stakeholders desired a system that scales better. Therefore, the old system will be replaced by a more scalable microservice-based Spring application in the coming years. The investigated part of the system under study, which is still under development, consists of seven services. It is deployed to a Platform as a Service (PaaS), i.e., Cloud Foundry (CF). Together with the industrial partner, we decided on a scenario-based approach, as our industrial partner already employed ATAM for other quality attributes.

3.2. Research Procedure
To answer our research question, "How to leverage ATAM to elicit resilience requirements, which can be utilized to evaluate resilience through resilience experiments and suggest architectural improvements quantitatively?", we conduct the following steps:

1. We gather relevant system stakeholders, i.e., product owners, software architects, and quality engineers,
   into an ATAM-based workshop. The objective is to identify resilience scenarios that lead to QoS degradation and downtime.
2. We derive resilience experiments from the scenarios. The experiment description comprises the stimuli, artifact, and response according to the scenario.
3. We use CTK to automate the execution of the resilience experiments. We assess the response measure by analyzing the QoS metrics measurements.
4. After executing resilience experiments, we apply suitable resilience patterns. We re-execute the resilience experiments to assess the pattern's effect by comparing QoS-related behavior with and without the resilience pattern.

4. Elicitation and Specification of Scenarios
This section elaborates on the planning, execution, and results of the workshop.

4.1. Elicitation through Structured Workshop
Before the workshop, we received documentation regarding the architecture of the system. This allowed us to specify an architecture model of the case study system, including a component diagram and an explanation of the implemented components. ATAM requires knowledge of the key architectural design decisions. Therefore, knowing the architecture description in advance allowed us to focus more on the hazard analysis and on developing resilience scenarios.

The full-day workshop consisted of four sessions leveraging different methods, as described next. The moderators explained each technique and method at the beginning of each session. The participants were stakeholders of the system and comprised two software architects, one product owner, and one quality assurance engineer.

Session 1: Introduction and Architecture Description for achieving a common understanding of the workshop process and the system's architecture. We (1) resolved misunderstandings regarding the elicited architecture description by asking questions, and (2) refined the prepared architectural models.

Session 2: Hazard Analysis to identify potential causes for degradation in QoS. Index cards were used as a means to collect hazards. Afterward, the participants arranged the hazards and their causes in a fault-tree-like fashion. To not break the participants' creative flow, we relaxed the strict construction rules of fault trees, e.g., we allowed events having multiple parents, which resulted in a graph. For this reason, we refer to the result of this session as a fault graph. Note that the (directed acyclic) fault graph can be transformed into an equivalent fault tree by creating duplicate sub-trees for nodes having more than one parent.

Session 3: Resilience Scenarios for collecting and prioritizing resilience scenarios based on the previously identified hazards. We provided a scenario template based on the layout used in ATAM. Then, the stakeholders jointly created scenarios by informally analyzing the fault graph in a sequence driven by the associated severity (in descending order) of the hazards.

Session 4: Retrospective to collect feedback about the workshop from the participants and to inform them about the next steps, which comprise (1) refinement of resilience requirements, and (2) execution of resilience experiments.

4.2. Workshop Results
Elicited Architecture Description: Figure 1 shows the component diagram of the system as specified in the first session of the workshop. It describes a snapshot of the system as used in the workshop and the subsequent activities. It represents a typical microservice-based architecture. As such, the system is deployed to CF and contains several services. Each service has its own PostgreSQL database. The only exception is the Calculations service, which employs a MongoDB database. The API-Gateway service handles all incoming connections and routes all communications. A Eureka service is employed to provide service discovery for all internal components. The Frontend service is the only external component that a user can directly access. The Calculations service is the central hub of the system since the calculation of payments is the system's main feature. Once this service receives a calculation request from the gateway, it collects all necessary data asynchronously from the other services. The Companies service is used to handle data the Frontend displays, but is not relevant for the calculation.

Hazard Analysis: Figure 2 shows the fault graph created in the second workshop session. The stakeholders agreed on unavailability or long response times of settlement calculations as the main system hazard. Therefore, "user's settlement cannot be calculated" is the top event in the fault graph. We analyzed possible causes from the top event until we reached basic events that we could not further decompose. We connected different causes by logical operators, i.e., AND and OR. For example, users cannot calculate their settlement if it is not processed in time. This can occur when the assigned instance stalls OR responds too slowly. We argue that the latter can be experienced if the system receives a sudden (work)load peak AND its (auto) scaling does not work correctly. The hazards at the leaf nodes are potential candidates for fault/failure injection during resilience experiments and
can be initiated by tools such as CTK. The stakeholders selected and prioritized the set of resilience experiments.

[Figure 1: Component Diagram of Payment Accounting System. Services shown: Payments, Calculations, Employees, Working-Hours, Eureka, Taxes, Social-Insurances, Companies, API-Gateway, and Frontend, together with the Eureka Discovery Zone. Diagram note: The API Gateway can communicate with each Service directly.]

[Figure 2: Cleaned Fault Graph. Legend: Intermediate Event, Undeveloped Event, Basic Event, AND Gate, OR Gate. Top event: User's Settlement cannot be Calculated. Further events shown: (Large) Clients cannot be Processed in Time; Data cannot be Captured; Calculation cannot be Executed; Incorrect Calculation; Instance has too slow Response Time; Instance stalled; Service does not answer; Service answers with technical error; Calculation with Inconsistent Data; Calculation with Wrong Version; Technically Incorrect Retry; Gateway Service crashes; Database is full; Data Loss; Data is not correctly replicated between Datacenters; Bad (auto) Scaling; Load Peak; external Service does not Answer; Middleware crashes; Instance dies while doing work; Eureka crashes; other Middleware crashes; Instance gets migrated; other Types of CF-related Errors; Outage; Data cannot be recovered after outage.]

Resilience Scenarios: We gave the participants an empty table according to the ATAM scenario template with the columns (1) source, (2) stimulus, (3) artifacts, (4) environment, (5) response, and (6) response measure. Further, we explained the meaning of each table column to the participants. By using index cards again, the participants steadily added content to the table. We began by identifying possible sources. The stimuli and artifacts were then derived from the previously created fault graph. The environment represents different time periods when the identified stimuli occur. The responses are the stakeholders' assumptions about how the system should respond to the particular stimulus. The response measures are based on their internal Service Level Objectives (SLOs). For example, a workload peak resulting in a system failure was turned into multiple scenarios. Users of the system are the source of the scenario since they cause the load peak. The respective stimulus is the workload peak itself. A service was chosen as the artifact to represent that a load peak can influence all service instances. As the environment, the payslip calculation period was chosen to imply an existing base workload. Finally, the stakeholders chose the responses and response measures based on their SLOs.

The stakeholders elaborated 12 resilience scenarios, summarized in Table 1. Scenarios 01 to 04 are different variations of an unexpected load peak, including linearly and exponentially increasing loads. Scenarios 05 and 06 describe the failure of a single service instance. Scenarios 07 and 08 are about middleware failures. Scenarios 09 and 10 revolve around gateway failures. Lastly, Scenarios 11 and 12 describe the failure of multiple instances. The established sources comprise actors such as end-users, elements of the CF platform, different bugs, technical issues caused by the middleware or deployment artifacts, and issues intrinsic to individual services of the system. In total, all scenarios can affect all services. The environments cover different states of the system according to the identified system domain context, e.g., payslip calculation periods or simply services being non-idle independent of the different calculations. The responses and response measures were specified by the stakeholders based on their internal SLOs.

Retrospection: The brief retrospective at the end of the workshop showed that the participants were satisfied with the agenda, content, and outcomes. However, comments were made concerning time management.

5. Resilience Evaluation
This section aims to evaluate the case study system's resilience. To this end, we implemented a subset of the previously elicited resilience scenarios as resilience experiments using CTK. We compare the system's behavior against the expected behavior described in the scenarios' response part.

5.1. Experiment Setup
5.1.1. Examined Software System
Due to legal constraints and to maintain anonymity, our industrial partner provided us with a mocked version as a proxy for the real payroll accounting system. This version, shown in Figure 3, is used throughout this paper
as the system under test. It implements a similar business logic but with less computational overhead. The system uses typical patterns of the microservice architectural style, i.e., an API-Gateway-service as a central gateway that manages all incoming requests and Eureka [18] to provide service discovery. The payslip-service utilizes an H2 in-memory database and the third-party API Jollyday. It can forward requests to payslip-service2. Requests can also be sent directly to payslip-service2 using a different endpoint.

Figure 3: Mocked Payroll Accounting System (components of the Mock Payroll Accounting System: «Service» API-Gateway, «Service» payslip, «Service» payslip2, «Service» Eureka, and the Jollyday API)

The following six endpoints are used during the experiments:

INTERNAL_DEP.: Calls payslip-service2 via payslip-service.
DB_READ: Reads an entry from the database of the payslip-service.
EXTERNAL_DEP.: Calls the third-party API Jollyday via payslip-service.
DB_WRITE: Writes an entry into the database of the payslip-service.
GATEWAY_PING: Checks whether the API-Gateway-service responds.
UNAFF._SERVICE: Sends a request directly to payslip-service2.

The actual payment accounting system is deployed to a paid CF. Due to financial constraints and legal issues, the mock system is deployed to a local CF environment [19], which has properties similar to those of a paid CF. As CF is a constraint given by the stakeholders, we did not consider other cloud providers.

Table 1: Scenarios created during the workshop
Columns: ID, Short Name, Source, Stimulus, Artifact, Environment, Response, Response Measure. Values appearing in each column:
ID: 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, 12
Short Name: Peak(LinCo)/Ser/Abr; Peak(LinCo)/Ser/NoAbr; Peak(ExCo)/Ser/Abr; Peak(ExCo)/Ser/NoAbr; Failure(CF)/Ins/Ber; Bug/Ins/Ber; Failure(MW)/Bac/Ber; Failure(MW)/Bac/Abr; Failure(GW)/FroBac/NoIdle; Depl(GW)/FroBac/NoIdle; Failure(SerE)/Ser/Ber; Failure(SerE)/Ins/Ber
Source: User; Cloud Foundry; Bug; Operator; Middleware; Deployment; Technical Issue; Receiving Service
Stimulus: Linear increasing load peak (cold start); Exponentially increasing load peak (cold start); Instance terminates; Middleware terminates; Middleware terminates but recovers; Gateway terminates; Service terminates
Artifact: Service; Instance; Backend; Front- and Backend; Sending Service S, Receiving Service R
Environment: Payslip calculation period; Not during payslip calculation period; During wage calculation; Async. payslip calculation; Not Idle; During wage calculation, one instance not available; During wage calculation, no instance available
Response: All requests are handled correctly and in time; User unaware, calc. correct & in time, ≥1 instance running; Developer gets notified; Abort calculation, caller will be notified; Process is aborted but can be picked up; Frontend shows error, gateway restarts; Error message and services R restarts; Calculation correct and in time
Response Measure: Wage calculation ≤ 1 s, in 99 % of the cases, payslip calculation ≤ 20 s (300 Employees); Notification arrives within 1 s in 99 % of the cases; Developer gets notified within 5 min; Restarts and SLOs are satisfied; Downtime of gateway instance is below 1 min; Downtime is below 1 min; Wage calculation response time is below 2 s

5.1.2. Experiment Tools
Figure 4 shows our experiment framework comprising four tools, i.e., CTK, load generator, hypothesis validator, and dashboard. During an experiment, these tools interact with the system to monitor the experiments and provide detailed insights, e.g., response times of calls to individual endpoints.

To execute the experiments, we used CTK [20], which can execute and monitor chaos tests and has drivers for various PaaS solutions. We leveraged the CF driver to terminate a service instance at a specific point in time
[Figure 4 diagram. Components: System Under Test, ChaosToolkit, Hypothesis Validation, Load-generator, InfluxDB, Load Profile. Edge labels: execute experiment, retrieve results, generate load, query results, write results.]

Target Service     payslip-service
Experiment Type    Terminate payslip-service application instance
Hypothesis         Response measure of Scenario 05 holds
Blast Radius       payslip-service

Table 2
Resilience experiment design for Scenario 05’
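For illustration only, the design in Table 2 could be encoded as a Chaos Toolkit experiment roughly as follows. This is a minimal sketch under stated assumptions, not the experiment definition used in the study: the provider module `chaoscf.actions` and its function `terminate_app_instance` are assumed to come from the Chaos Toolkit Cloud Foundry extension, while the instance index, the validation URL, the HTTP tolerance, and the helper `build_experiment` are hypothetical.

```python
import json


def build_experiment(target_service: str, instance_index: int, validation_url: str) -> dict:
    """Assemble a Chaos Toolkit style experiment description as a plain dict."""
    return {
        "version": "1.0.0",
        "title": f"Terminate {target_service} application instance",
        # The blast radius is assumed to be limited to the target service.
        "description": f"Scenario 05': blast radius limited to {target_service}.",
        "steady-state-hypothesis": {
            "title": "Response measure of Scenario 05 holds",
            "probes": [
                {
                    "type": "probe",
                    "name": "hypothesis-validation-reports-success",
                    # Assumption: the validation service answers HTTP 200
                    # when the response measure holds.
                    "tolerance": 200,
                    "provider": {"type": "http", "url": validation_url, "timeout": 10},
                }
            ],
        },
        "method": [
            {
                "type": "action",
                "name": f"terminate-{target_service}-instance",
                "provider": {
                    "type": "python",
                    # Assumed module and function of the Cloud Foundry driver.
                    "module": "chaoscf.actions",
                    "func": "terminate_app_instance",
                    "arguments": {
                        "app_name": target_service,
                        "instance_index": instance_index,
                    },
                },
            }
        ],
        "rollbacks": [],
    }


# Hypothetical values: instance index 0 and a local validation endpoint.
experiment = build_experiment("payslip-service", 0, "http://localhost:8080/hypothesis")
print(json.dumps(experiment, indent=2))
```

The printed JSON could be saved to a file and passed to the `chaos run` command, assuming the Cloud Foundry extension is installed and its credentials are configured separately.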

and validate the steady-state hypothesis. The load that the system receives is
controlled by an adapted version of the load generator from the TeaStore
microservices benchmark [21] that monitors response times and the number of
successful, dropped, and failed requests. The collected data is written into
an InfluxDB [22] for a time-series-based evaluation. During the evaluation, a
Spring service collects the necessary data from the InfluxDB and calculates
whether a hypothesis holds. We also created a dashboard application that
provides convenient features, such as synchronized starting of CTK and the
load generator, live monitoring, and automated CTK setup. Since the dashboard
does not add functionality for executing experiments, it is not part of
Figure 4.

Figure 4: Used structure of the experiment framework

5.2. Experiment Execution

Based on Scenarios 04 and 05, we implemented three resilience experiments. The
first experiment investigates a load peak with an exponential increase
(Scenario 04), while the remaining two investigate instance termination due to
an internal CF error for random instances (Scenario 05) and specifically for
the payslip-service (Scenario 05’). The selection of experiments is based on
the industrial partner’s preferences. In all experiments, the effect on all
endpoints is examined. In the following, we only discuss the results of a
subset of endpoints for Scenario 05’. The remaining results can be found in
the supplementary material [10].

The design of the experiment related to Scenario 05’ is given in Table 2. The
target service of this experiment is the payslip-service, which holds the core
business logic of the mock system. We use CTK to terminate running CF
application instances to simulate the scenario’s stimulus. The stimulus refers
to an error that occurs in CF, which leads to a loss of an application
instance. We assume that the blast radius only affects the payslip-service and
that CF registers the loss of the payslip-service instance and starts a new
instance. Our hypothesis is that the response measure of Scenario 05 still
holds.

During the experiments, the system is exposed to an almost constant, synthetic
load. We generated a load profile with a target load of 20 requests per second
and some noise. The requests are evenly distributed over all six endpoints. To
assess whether the system still responds correctly and in time, we measure the
response times of the requests and compute their success rate.

5.3. Experiment Results

Figure 5 shows the steady-state, injection, and recovery phases of the
experiment for endpoints INTERNAL_DEP., GATEWAY_PING, and UNAFF._SERVICE. In
the steady-state phase, we assume that the system is working as expected,
i.e., the response times satisfy the SLOs. In the injection phase, CTK
terminates the payslip-service instance. In the recovery phase, we assume that
the system recovers and returns to a steady state, i.e., the response times
satisfy the SLOs. For readability and analysis purposes, we omitted the load
generator’s warmup and cooldown phases, i.e., the first and last 300 s
overall. Further, a 30 s binning was applied, and extreme outliers (>100 ms)
are not shown.

The success rates at the endpoints INTERNAL_DEP. (Figure 5a), DB_READ,
EXTERNAL_DEP., and DB_WRITE drop to 0 % when the payslip-service is terminated
after 600 s and rise back to 100 % as it recovers in about 1.5 min. During
this downtime, no response times are recorded since no requests arrive at the
payslip-service. During the steady-state and recovery phases, the response
times are stable at around 20 ms and 15 ms, respectively. During the injection
phase, there is a slight increase as the payslip-service has restarted. The
results for GATEWAY_PING (Figure 5c) and UNAFF._SERVICE (Figure 5e) show a
similar structure. However, the load generator did not record any successful
or failed requests during the downtime. Therefore, no success rate could be
calculated.

5.4. Discussion of Results

As visible in Figure 5 (left side), the response time and success rate values
are almost identical in the steady-state phase and the recovery phase.
Furthermore, the increase in the success rate indicates that the
payslip-service becomes available after 30 s to 60 s. Thus, the CF platform
can re-instantiate the payslip-service quickly, leading to a quick recovery of
the system.

Response times are slightly higher while the payslip-service is
re-instantiated, which was expected as normal cold-start behavior. Endpoints
GATEWAY_PING and
[Figure 5 plots (extraction residue omitted). Panels: (a) INTERNAL_DEP. (without retry), (b) INTERNAL_DEP. (with retry). Each panel plots response times (ms) and success rate (%) of HTTP requests over the experiment duration (s), with the steady state, injection, and recovery phases marked.]
                                                                                                                                                                         ●                      ● ● ● ● ● ● ● ●   ● ●
                                                                                                                                                                                                                  ● ●   ● ● ●   ● ● ●
                             ● ●
                               ●
                               ●
                               ●
                                 ● ●
                                   ●
                                   ● ●
                                   ● ● ●
                                       ● ●
                                     ● ● ● ●
                                           ● ●
                                           ● ●
                                               ●
                                             ● ●
                                               ● ●
                                               ●
                                                             ● ● ●
                                                                 ●
                                                                 ● ● ● ●
                                                                 ● ●
                                                                   ●   ●
                                                                         ●
                                                                       ● ●
                                                                         ●
                                                                         ●
                                                                               ● ●
                                                                                 ●
                                                                                 ● ●
                                                                                   ● ●
                                                                                 ● ● ●
                                                                                     ●
                                                                                         ●
                                                                                         ● ● ●
                                                                                             ●                                                           Injection
                                                                                                                                                         ● ● ● ● ● ●
                                                                                                                                                                   ● ● ●
                                                                                                                                                                   ●   ● ●
                                                                                                                                                                       ● ● ●
                                                                                                                                                                         ● ● ●
                                                                                                                                                                           ● ●
                                                                                                                                                                                                ●
                                                                                                                                                                                                ●
                                                                                                                                                                                                ●   ●
                                                                                                                                                                                                    ●
                                                                                                                                                                                                      ● ●
                                                                                                                                                                                                    ● ●
                                                                                                                                                                                                      ●
                                                                                                                                                                                                          ● ●
                                                                                                                                                                                                            ●
                                                                                                                                                                                                              ● ●
                                                                                                                                                                                                              ●
                                                                                                                                                                                                            ● ●
                                                                                                                                                                                                              ● ●
                                                                                                                                                                                                                  ●
                                                                                                                                                                                                                ● ●
                                                                                                                                                                                                                  ● ●
                                                                                                                                                                                                                ● ● ●
                                                                                                                                                                                                                      ● ●
                                                                                                                                                                                                                      ●
                                                                                                                                                                                                                      ● ●
                                                                                                                                                                                                                    ● ●
                                                                                                                                                                                                                          ● ●
                                                                                                                                                                                                                        ● ● ● ●
                                                                                                                                                                                                                          ● ●
                                                                                                                                                                                                                                ● ● ●
                                                                                                                                                                                                                              ● ●
                                                                                                                                                                                                                              ● ●
                                                                                                                                                                                                                                ●
                                                                                                                                                                                                                                                                           Injection
                                                                                                                                                                             ●                                            ●           ●
                                                                                                                                                         Recovery                                                                                                          Recovery
                                                                                                       0




                                                                                                                                                                                                                                                   0
                      0




                                                                                                                                                  0
                              0




                                                   0




                                                                         0




                                                                                                  0




                                                                                                                                                          0




                                                                                                                                                                                       0




                                                                                                                                                                                                              0




                                                                                                                                                                                                                                              0
                            30




                                                 60




                                                                       80




                                                                                                   0




                                                                                                                                                        30




                                                                                                                                                                                     60




                                                                                                                                                                                                            80




                                                                                                                                                                                                                                             0
                                                                                                12




                                                                                                                                                                                                                                          12
                 S04 Direct
                Experiment   Run 1(s)
                           Duration                                                                                                                                 S04 Direct
                                                                                                                                                                   Experiment   Run 2(s)
                                                                                                                                                                              Duration
     ID 5 Response Times and Success Rate of                                                                                                            ID 5 Response Times and Success Rate of
(c) GATEWAY_PINGHTTP
                   (without retry)
                         Requests                                                                                                                         (d) GATEWAY_PING
                                                                                                                                                                     HTTP (with   retry)
                                                                                                                                                                            Requests
                                                                                                       100




                                                                                                                                                                                                                                                   100
                      100




                                                                                                                                                  100




                                                                                                                                                         Success Rate                                                                                                      Success Rate
Response Times (ms)




                                                                                                                            Response Times (ms)
                                                                                                         Success Rate (%)




                                                                                                                                                                                                                                                     Success Rate (%)
                                                                                                       75




                                                                                                                                                                                                                                                   75
                      75




                                                                                                                                                  75




                                                                       ●                                                                                 Response times                                                                                                    Response times
                                                                                                       50




                                                                                                                                                                                                                                                   50
                                                                                                                                                                   ●                                                  ●   ●
                      50




                                                                                                                                                  50




                                                                                            ●
                                      ●
                                                                                    ●
                                                                                                                                                  Experiment Phases                                                   ●
                                                                                                                                                                                                                                                                        Experiment Phases
                                                                                                                                                                               ●            ●                         ●
                                             ●    ●
                                                                                                                                                         Steady state                                                                                                      Steady state
                                                                                                       25




                                                                                                                                                                                                                                                   25
                                                                                                                                                                                        ●                                 ●
                                                                       ●                                                                                                                                ●
                                                                             ● ●       ●                                                                                                                                      ●
                      25




                                                                                                                                                  25




                                     ●       ●         ●         ●                                                                                       ● ●
                             ●
                             ● ●
                               ● ●       ●   ●                                 ● ●         ●
                                                                                           ●                                                                                            ●           ●         ●
                                                                                                                                                                                                              ● ● ●
                                     ● ●
                                       ● ●
                                         ●             ●                       ●     ● ●                                                                    ● ● ● ●          ●                      ● ● ● ● ● ●
                                                                                                                                                                                                              ●         ● ● ●     ● ●
                             ●
                             ●
                             ●
                               ●
                               ● ●
                             ● ●
                                 ● ●
                               ● ●
                                 ● ●
                                     ●
                                   ● ●
                                     ● ●
                                     ●
                                   ● ●
                                         ● ●
                                       ● ●
                                       ●
                                           ● ●
                                           ●
                                         ● ●
                                             ●
                                             ● ● ●
                                             ● ●       ●
                                                             ●
                                                           ● ●
                                                             ● ●
                                                             ● ●
                                                                 ● ● ● ● ●
                                                               ● ●
                                                                 ● ●
                                                                         ● ●
                                                                 ● ● ● ● ● ●
                                                                         ● ●
                                                                   ● ● ● ● ●
                                                                             ● ●
                                                                           ● ●
                                                                             ● ●
                                                                             ●
                                                                             ●
                                                                               ● ●
                                                                               ● ●
                                                                                   ● ●
                                                                                     ● ●
                                                                                 ● ● ● ● ● ●
                                                                                       ●
                                                                                     ● ●
                                                                                           ● ●
                                                                                             ●
                                                                                             ●
                                                                                                                                                         Injection
                                                                                                                                                          ● ●
                                                                                                                                                          ● ● ●
                                                                                                                                                              ●    ●
                                                                                                                                                                             ●
                                                                                                                                                                           ● ●
                                                                                                                                                                                    ●
                                                                                                                                                                                    ●
                                                                                                                                                                                                ● ●
                                                                                                                                                                                                ●
                                                                                                                                                                                                  ● ●
                                                                                                                                                                                                ● ● ● ● ●
                                                                                                                                                                                                ●
                                                                                                                                                                                                        ● ● ●
                                                                                                                                                                                                        ●
                                                                                                                                                                                                            ●
                                                                                                                                                                                                            ●
                                                                                                                                                                                                                ● ● ● ● ●
                                                                                                                                                                                                                ● ●
[Figure 5, panels (e) and (f): results plotted over Experiment Duration (s), with the Injection and Recovery phases marked.]

(e) UNAFF._SERVICE (without retry)        (f) UNAFF._SERVICE (with retry)

Figure 5: Comparison of experiment results at different endpoints with and without the implemented retry pattern



UNAFF._SERVICE should remain unaffected during the injection because the payslip-service
is not required to answer the requests. Nevertheless, response times at endpoint
GATEWAY_PING are affected, which indicates a propagation of the failure effects from the
payslip-service to the API-Gateway-service.
   After the injection started, the success rate drops to 0 % at the endpoints
INTERNAL_DEP., DB_READ, EXTERNAL_DEP., and DB_WRITE. The CTK terminates the single
payslip-service instance. The load generator flags all requests as failed, leading to a
success rate of 0 %. The plots show neither successful nor failing request responses at
the endpoints GATEWAY_PING and UNAFF._SERVICE during injection, which indicates that no
requests exist in the system. One possible explanation is that requests have been dropped.
The raw data tables disprove this, as there are no dropped requests. Another explanation
is that no requests arrived at the system, which leads to a lack of data in the time
frame between approximately 600 s and 660 s.
   We hypothesized that the response measure of Scenario 05 holds, i.e., requests are
answered in time (99 % in less than 1 s) and correctly. As the response times are far
below 1 s, our hypothesis regarding the response times
               |           Steady State            |            Injection              |             Recovery
               |  w/o Pattern    |   w Pattern     |  w/o Pattern    |   w Pattern     |  w/o Pattern    |   w Pattern
Endpoint       | p5 med mean p99 | p5 med mean p99 | p5 med mean p99 | p5 med mean p99 | p5 med mean p99 | p5 med mean p99
--------------------------------------------------------------------------------------------------------------------------
INTERNAL_DEP.  | 19  22 22.5  33 | 19  22 24.0  51 | 19  21 22.4  32 | 19  22 24.6  90 | 19  21 22.1  31 | 19  22 23.0  34
DB_READ        | 11  12 13.3  21 | 11  12 13.0  24 | 11  12 13.1  20 | 11  12 13.2  30 | 11  12 12.9  20 | 11  12 12.8  19
EXTERNAL_DEP.  | 11  12 12.6  21 | 10  12 13.3  23 | 11  12 12.5  21 | 10  12 14.1  31 | 11  12 12.3  19 | 10  12 12.2  19
DB_WRITE       | 11  13 13.4  21 | 11  12 13.1  22 | 11  12 13.1  20 | 11  12 13.3  27 | 11  12 13.0  20 | 11  12 12.7  19
GATEWAY_PING   | 11  12 13.3  21 | 11  12 13.3  24 | 11  12 13.1  20 | 11  12 13.6  32 | 11  12 12.9  19 | 11  12 12.9  21
UNAFF._SERVICE | 10  11 11.9  19 | 10  11 11.6  19 | 10  11 11.7  18 | 10  11 11.6  19 | 10  11 11.8  18 | 10  11 11.5  19

Table 3
Statistical summaries of the three experiment phases. p5 and p99: 5th and 99th percentile; med: median; mean: arithmetic mean. Values are given in ms.
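Summary statistics of the kind reported in Table 3 can be derived from raw response times along the following lines. This is a minimal sketch and not the authors' R scripts: the function names percentile and summarize are hypothetical, and the linear-interpolation percentile definition is an assumption, since the paper does not state which definition was used.

```python
from statistics import mean, median


def percentile(sorted_values, alpha):
    """Return the alpha-th percentile of an ascending list.

    Assumption: linear interpolation between neighbouring samples.
    Other percentile definitions give slightly different values.
    """
    if not sorted_values:
        raise ValueError("need at least one sample")
    rank = (alpha / 100) * (len(sorted_values) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def summarize(response_times_ms):
    """Compute p5, median, mean, and p99 for one endpoint in one phase."""
    ordered = sorted(response_times_ms)
    return {
        "p5": percentile(ordered, 5),
        "median": median(ordered),
        "mean": mean(ordered),
        "p99": percentile(ordered, 99),
    }
```

For instance, summarize([10, 11, 12, 13, 100]) gives a median of 12 ms and a mean of 29.2 ms: a single slow request moves the mean and the 99th percentile but not the median, which is why the table reports all of them.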



is technically fulfilled. However, several requests are not answered at all, which is
indicated by the dropped success rate. We consider these incorrect responses. Therefore,
we conclude that the hypothesis regarding correctness is not fulfilled.

6. Resilience Improvement

The previous section’s experiments showed that the system does not respond as described
in Scenario 05 to a failure of an instance of the payslip-service. While the response
times are technically below 1 s in 99 % of all cases, requests are temporarily not
answered at all, and thus, not correctly. Therefore, we aim to improve the system’s
success rate concerning Scenario 05 by applying resilience pattern(s). We then determine
the efficacy of improvements to the system’s resilience by re-executing the experiments.

6.1. Architectural Modifications
The system under test was fortified with a retry pattern [9], i.e., the
API-Gateway-service sends another request to the payslip-service if a request fails or
remains unanswered. The retry pattern seems to be a reasonable choice since response
times are far below the threshold of 1 s, as indicated by the previous experiment. Due to
its specific purpose, the system has to accept requests near real-time and always answer
correctly. Thus, resilience patterns that rely on backup or restricting behavior, like
circuit breakers or flow limiters, are unsuited. To avoid bad retry behavior, we
configured Spring-Retry as follows. We set the maximum number of retries of each
payslip-service request to 4, the initial delay to 10 ms, the factor for the exponential
increase to 3, and the maximum delay to 150 ms, resulting in retries after 10 ms, 30 ms,
90 ms, and 150 ms.

6.2. Experiment Results and Discussion
Each plot on the right side of Figure 5 visualizes the system’s response times and
success rates with the retry pattern for an endpoint. As in the experiment presented in
Section 5, each plot is divided into the steady state phase, injection phase, and
recovery phase. Table 3 shows the associated statistical values.
   In general, similar behavior can be observed at all the endpoints. Comparing the left
and right plots of Figure 5 shows that the mean response times in the steady state phase
do not vary significantly when the retry pattern is activated. However, at the beginning
of the injection phase, far more high response times can be observed. In addition, the
boxplots show a slightly higher interquartile range in the plot where the retry pattern
is integrated.
   The plots also show that the success rate does not drop to zero anymore when the
pattern is active. For the endpoints INTERNAL_DEP., the success rate drops to
approximately 70 %. For the two endpoints GATEWAY_PING and UNAFF._SERVICE, requests are
arriving and the success rate remains at 100 %.
   The application of the retry pattern can explain the response time spikes during the
injection (see Figure 5). Requests sent shortly before the restart of the payslip-service
fail, but are retried by the API-Gateway-service until the payslip-service has recovered
after approximately 10 s. However, as several retries have been aggregated, the
payslip-service has to handle a large number of requests upon recovery, resulting in a
visible spike in response times.
   The endpoints UNAFF._SERVICE and GATEWAY_PING do not depend on the payslip-service.
This explains the high success rate at these endpoints.
   In contrast to the experiment without the retry pattern, the success rate does not
drop entirely. Therefore, the retry pattern improves the scenario satisfaction as it
increases the percentage of correct responses while keeping the response times below 1 s.

7. Discussion

7.1. Key Lessons Learned
Lesson 1: Elicitation of resilience requirements involves hazard analysis. It is
essential to include stakeholders with different roles and particular expertise in
the business domain to quickly prepare a list of relevant hazards. Other roles, such as
software developers and infrastructure engineers, help to identify causes of hazards that
stem from software and its running environment.
   Lesson 2: ATAM is a useful method to adopt for resilience elicitation. Stakeholders of
the software project were already familiar with scenario development for quality
requirements. Therefore, the structure of the scenario template of Bass et al. [12] was
intuitive for the stakeholders.
   Lesson 3: Loose adoption of formalisms is already good enough. Researchers and
practitioners have used fault tree formalism for both qualitative and quantitative hazard
analysis in safety engineering. To identify the causes of a hazard, we did not have to
comply with fault tree formalism rigorously. The informal way of constructing a fault
tree was easy to understand for stakeholders.
   Lesson 4: The ATAM workshop requires considerable refinement that can be done
“offline”. The outcome of the well-prepared one-day workshop needed further refinement.
In particular, it was necessary to refine the stimulus and response measure parts of each
scenario, e.g., we modeled the workload and tried to express the scenarios in temporal
logic. This revealed that the initial requirements were partially ambiguous and
imprecise, which was easy to resolve through clarification requests to the stakeholders.
Therefore, we hypothesize that formalization benefits both validation and quantitative
evaluation of resilience requirements and that an explicit (offline) formalization step
could complement the proposed workshop well.
   Lesson 5: A tightly planned one-day workshop is sufficient. We managed to collect
resilience scenarios in a one-day workshop because it was well prepared (knowing the
architecture description) and well conducted (strict time management). Refinement can be
done offline by engineers who are more skilled in formalizing stimuli and response
measures (similar to writing SLOs). However, it is important to ask for feedback to check
the validity of the requirements.
   Lesson 6: The resilience elicitation helps to refine “classical” QoS requirements. All
response measures are based on non-resilience specifications, which makes them imprecise.
For example, maximum degradation and time to recovery were not specified. Thus, it is
unclear whether experimentation shows acceptable or unacceptable degradation in
performance or availability quality.

7.2. Threats to Validity
We discuss the threats to validity for the workshop and our experiment design.

7.2.1. Workshop

   Conclusion validity. One threat is the reliability of measures, i.e., whether
repeating the workshop yields the same resilience requirements list. Elicitation of
resilience requirements involves human judgment. Hence, it is a subjective measure.
Therefore, we cannot entirely rule out this threat.
   Internal validity. One threat is instrumentation, which means our tools and techniques
were not suitable. We conducted a one-day structured workshop and used the scenario
template of Bass et al. [12] for eliciting resilience requirements. We refined all the
resilience requirements through several iterations after the workshop and validated them
with the workshop participants.
   Construct validity. For us, the main threat in this category is mono-method bias,
which means we did not use other elicitation methods. Therefore, there is a threat that
elicited resilience requirements are biased. We cannot entirely rule out this threat as
we did not apply other methods and cross-check the results.
   External validity. The heterogeneity of the participants, i.e., their different roles
and expertise, poses a threat. Workshops with less heterogeneity in the stakeholders
could lead to no resilience requirements. We cannot entirely rule out this threat.

7.2.2. Experiment design
We used the mock system for quantitative evaluation of resilience requirements that are
based on the actual system. There is a threat that evaluation results are inaccurate.
However, the purpose of the experiments is to show by example how elicited requirements
and derived experiments can help to improve the system; we do not claim the accuracy of
the quantitative results. Furthermore, due to legal issues, we used CF Dev [19]. We faced
instability, e.g., resource drainage of Dev nodes, in the environment during
experimentation. There is a threat of a negative impact on results due to this
instability. To counteract this threat, we re-executed experiments to gain insight into
approximate measurements, ensuring reliable data with no unintended node or service
crash.

8. Conclusion

The successful development of resilience scenarios depends on the outcome of the hazard
analysis. Our approach to scenario-based resilience evaluation assumes a business domain
expert to derive an initial list of hazards. FTA can then be a means to analyze the
hazards and derive resilience scenarios. We plan to (1) extend our process with an
explicit formalization step after the workshop for refinement of the scenarios, (2)
formally verify
response measures of resilience scenarios, and (3) create processes for continuous hazard
analysis when a system faces changes, e.g., updates and refinement/development of
resilience scenarios.

Acknowledgments

This work has been supported by the Baden-Württemberg Stiftung (ORCAS – Efficient
Resilience Benchmarking of Microservice Architectures) and the German Federal Ministry of
Education and Research (Software Campus 2.0 – Microproject: DiSpel).

Data Availability
Our artifacts [10] comprise (i) the resilience scenarios and (ii) the data and R scripts
as a CodeOcean capsule. We are working on making parts of the created/modified experiment
tools available as open-source software. For confidentiality reasons, the system under
test cannot be published.

References

 [1] S. Newman, Building Microservices, O’Reilly, 2015.
 [2] V. Heorhiadi, S. Rajagopalan, H. Jamjoom, M. K. Reiter, V. Sekar, Gremlin: Systematic
     resilience testing of microservices, in: Proc. 36th IEEE Int. Conf. on Distributed
     Computing Systems (ICDCS), 2016, pp. 57–66.
 [3] A. Basiri, N. Behnam, R. de Rooij, L. Hochstein, L. Kosewski, J. Reynolds,
     C. Rosenthal, Chaos engineering, IEEE Softw. 33 (2016) 35–41.
 [4] Chaos toolkit, 2020. URL: https://github.com/chaostoolkit.
 [5] R. Miles, Learning Chaos Engineering – Discovering and Overcoming System Weaknesses
     through Experimentation, O’Reilly Media, Inc., 2019.
 [6] N. G. Leveson, Safeware – System Safety and Computers: A Guide to Preventing
     Accidents and Losses Caused by Technology, Addison-Wesley, 1995.
 [7] D. Kesim, A. van Hoorn, S. Frank, M. Häussler, Identifying and prioritizing chaos
     experiments by using established risk analysis techniques, in: Proc. 31st Int.
     Symposium on Software Reliability Engineering (ISSRE), 2020.
 [8] R. Kazman, M. Klein, M. Barbacci, T. Longstaff, H. Lipson, J. Carriere, The
     architecture tradeoff analysis method, in: Proc. 4th IEEE Int. Conf. on Engineering
     of Complex Computer Systems (ICECCS), 1998, pp. 68–78.
 [9] M. T. Nygard, Release It!: Design and Deploy Production-ready Software, Pragmatic
     Bookshelf, 2018.
[10] S. Frank et al., Supplementary material, 2020. Artifacts:
     https://doi.org/10.5281/zenodo.5142006 (Scenarios);
     https://doi.org/10.24433/CO.0520280.v1 (Code Ocean capsule).
[11] K. Pohl, Requirements Engineering - Fundamentals, Principles, and Techniques,
     Springer, 2010.
[12] L. Bass, P. Clements, R. Kazman, Software Architecture in Practice, 2 ed.,
     Addison-Wesley Longman Publishing Co., Inc., USA, 2003.
[13] J. Cámara, R. de Lemos, Evaluation of resilience in self-adaptive systems using
     probabilistic model-checking, in: Proc. 7th Int. Symposium on Software Engineering
     for Adaptive and Self-Managing Systems (SEAMS), 2012, pp. 53–62.
[14] J. Cámara, R. de Lemos, M. Vieira, R. Almeida, R. Ventura, Architecture-based
     resilience evaluation for self-adaptive systems, Computing 95 (2013) 689–722.
[15] J. Cámara, R. de Lemos, N. Laranjeiro, R. Ventura, M. Vieira, Robustness-driven
     resilience evaluation of self-adaptive software systems, IEEE Transactions on
     Dependable and Secure Computing 14 (2017) 50–64.
[16] R. Natella, D. Cotroneo, H. Madeira, Assessing dependability with software fault
     injection: A survey, ACM Computing Surveys (CSUR) 48 (2016) 44:1–44:55.
[17] K. Yin, Q. Du, W. Wang, J. Qiu, J. Xu, On representing and eliciting resilience
     requirements of microservice architecture systems, CoRR abs/1909.13096 (2020).
     URL: https://arxiv.org/abs/1909.13096v3. arXiv:1909.13096.
[18] Netflix Inc., Eureka, 2020. URL: https://github.com/Netflix/eureka.
[19] Cloud Foundry Foundation, Cloud foundry dev documentation, 2020.
     URL: https://github.com/cloudfoundry-incubator/cfdev.
[20] Chaos Toolkit, Chaos toolkit documentation, 2020. URL: https://chaostoolkit.org.
[21] J. von Kistowski, S. Eismann, N. Schmitt, A. Bauer, J. Grohmann, S. Kounev, Teastore:
     A micro-service reference application for benchmarking, modeling and resource
     management research, in: Proc. IEEE 26th Int. Symp. on Modeling, Analysis, and
     Simulation of Computer and Telecommunication Systems (MASCOTS), 2018, pp. 223–236.
[22] InfluxData Inc., InfluxDB website, 2020. URL: https://www.influxdata.com/.