=Paper=
{{Paper
|id=Vol-2978/faacs-paper3
|storemode=property
|title=Scenario-based Resilience Evaluation and Improvement of Microservice Architectures: An Experience Report
|pdfUrl=https://ceur-ws.org/Vol-2978/faacs-paper3.pdf
|volume=Vol-2978
|authors=Sebastian Frank,Alireza Hakamian,Lion Wagner,Dominik Kesim,Jóakim von Kistowski,André van Hoorn
|dblpUrl=https://dblp.org/rec/conf/ecsa/FrankHWKKH21
}}
==Scenario-based Resilience Evaluation and Improvement of Microservice Architectures: An Experience Report==
Scenario-based Resilience Evaluation and Improvement of Microservice Architectures: An Experience Report

Sebastian Frank¹, Alireza Hakamian¹, Lion Wagner¹, Dominik Kesim¹, Jóakim von Kistowski², and André van Hoorn¹
¹ Institute of Software Engineering, University of Stuttgart, Stuttgart, Germany
² DATEV eG, Nürnberg, Germany

Abstract
Context. Microservice-based architectures are expected to be resilient. However, various systems still suffer severe quality degradation from changes, e.g., service failures or workload variations. Problem. In practice, the elicitation of resilience requirements and the quantitative evaluation of whether the system meets these requirements are not systematic or are not conducted at all. Objective. We explore (1) the scenario-based Architecture Trade-Off Analysis Method (ATAM) for resilience requirement elicitation and (2) resilience testing through chaos experiments for architecture assessment and improvement. Method. In an industrial case study, we design a structured ATAM-based workshop, including the system's stakeholders, to elicit resilience requirements. We specify these requirements using the ATAM scenario template. We transform those scenarios into resilience experiments to quantitatively evaluate and improve system resilience. Result. We identified 12 resilience scenarios. We use and extend ChaosToolkit to automate and execute two scenarios. We quantitatively evaluate resilience requirements and suggest resilience improvements in the scope of both scenarios. We share lessons learned from the case study. In particular, our work provides evidence that an ATAM-based workshop is intuitive to stakeholders in an industrial setting. Conclusion. Our approach helps requirement and quality engineering teams in the process of resilience requirements elicitation.

1. Introduction

An intrinsic quality property of the microservices architectural style is resilience, i.e., the system meets performance and other Quality of Service (QoS) requirements despite different failure modes or workload variations [1]. However, real-world postmortems [2] show that systems suffer either unacceptable QoS degradation or unacceptable recovery times. It is necessary to assure system resilience in the context of microservice-based architectures.

Practitioners use Chaos Engineering [3], including tools such as ChaosToolkit (CTK) [4, 5] or Chaos Monkey (https://github.com/Netflix/chaosmonkey), for resilience testing. They need to (1) think about hazards [6] as causes of QoS degradation, (2) set up chaos experiments by specifying failure mode types and hypotheses of the expected quality behavior, and (3) execute each experiment to detect deviations from the hypotheses.
First, this approach lacks the systematic identification of the causes of a hazard through hazard analysis methods. We contributed to this problem in our previous work [7], which serves as a foundation for this paper. In particular, we now integrate hazard analysis into a more systematic elicitation process and use a more formal description of requirements (scenarios). Second, the approach lacks a systematic process of eliciting and refining resilience requirements.

In the context of an industrial case study, our objective is to explore the application and adoption of the Architecture Trade-Off Analysis Method (ATAM) [8] for (1) the system's resilience requirement elicitation and (2) resilience testing through resilience experiments (aka chaos experiments) for architecture assessment and improvement. ATAM has already been used in practice to elicit and specify quality requirements other than resilience, e.g., availability, performance, and maintainability. We hypothesize that it can be adopted for effectively eliciting resilience requirements and evaluating them through chaos experiments. Therefore, our research question is: How to leverage ATAM to elicit resilience requirements, which can be utilized to evaluate resilience through resilience experiments and suggest architectural improvements quantitatively?

We use an ATAM-based workshop to elicit and specify resilience requirements by involving system stakeholders. ATAM allows us to describe resilience requirements as scenarios in a semi-structured textual language. The scenario template consists of the following elements: (1) source, (2) stimuli, (3) artifact(s), (4) the system's environment, (5) its response, and (6) the response measure. The designed structured workshop aims to identify hazards and architectural design decisions. During the workshop, we employ a hazard analysis based on Fault Tree Analysis (FTA) [6]. The result is a set of 12 resilience scenarios, which we turn into experiments. Next, we use CTK to automate these experiments and conduct a measurement-based resilience evaluation. Furthermore, we improve system resilience by applying a resilience pattern [9, 1], namely retry. We validate the improvement by re-executing the respective scenario.

To summarize, the paper makes the following contributions:

• Leveraging ATAM and FTA in an industrial system to elicit resilience requirements, then evaluating the requirements and improving the system's resilience.
• Automating scenario execution using CTK for measurement-based evaluation.
• We share lessons learned that benefit both practitioners and researchers regarding resilience requirement elicitation, evaluation, and improvement.
• Artifacts — including scenarios, resilience experiments, and results of the experimentation — are available online [10].

2. Related Work

A workshop is an effective technique for requirement elicitation [11]. In our case, the workshop's preparation and conduction are based on the scenario template of Bass et al. [12]. The difference between our work and existing works on measurement-based resilience evaluation is that we include an explicit step for eliciting resilience requirements. In the next paragraphs, we elaborate on this in more detail.
Cámara et al. [13, 14, 15] propose an approach for resilience analysis of self-adaptive systems. The core idea consists of three parts: (1) specification of resilience properties using Probabilistic Computation Tree Logic, (2) modeling causes of a hazard, e.g., high load, using experimentation and collecting traces of system behavior, and (3) verification of resilience properties using model checking. In contrast to model checking-based verification, we evaluate a resilience scenario's response measure by analyzing collected measurements. Furthermore, Cámara et al. do not focus on the elicitation of resilience requirements using requirements engineering methods.

Chaos Engineering [5, 3] is a technique for evaluating system resilience through injecting failures [16]. There are works on both (1) using engineering methods to identify failure modes [7], i.e., causes of a hazard, systematically before failure injection, and (2) ad-hoc failure injection with no systematic failure mode identification [2]. However, these works do not explicitly specify resilience requirements and lack a methodical way for requirement elicitation. Our work is a step toward closing this gap.

In the context of resilience requirement elicitation, Yin et al. [17] propose a goal-oriented technique for representing resilience requirements. The high-level idea is to represent a resilience goal — e.g., all requests are processed correctly — and to identify possible causes of hazards that act as obstacles to achieving the resilience goal — e.g., node failure. However, they do not discuss how to identify hazards and their causes. Goal orientation and developing scenarios are two activities in requirements engineering [11] that benefit the elicitation process. According to Pohl [11], scenario development benefits elicitation by making goals understandable for stakeholders and may refine or identify new goals. Our work uses scenario development without goal-oriented modeling, as all stakeholders know the system's high-level quality goal. To our knowledge, this is the first work using ATAM for eliciting and specifying resilience requirements before evaluating the resilience through experiments.

3. Research Methodology

Section 3.1 explains the domain context and describes the high-level architecture of the case study system. Section 3.2 summarizes our research procedure.

3.1. Domain Context

The case study system's purpose is to calculate payments. An accounting department's wage clerks use the payment accounting system to calculate each registered employee's income taxes. To execute the calculations, the payment accounting system has to gather data from health insurance providers and send its results to the corresponding tax office. This process presumes that a company that wants to use the payment accounting system provides its employee and tax information to the health insurance providers and tax offices.

All of the payment accounting system's tasks are currently taken care of by a monolithic legacy system. In peak times, up to 13 million calculation requests have to be handled in a single day or night. Under normal circumstances, this number is significantly lower. In order to handle such varying loads more efficiently, the stakeholders desired a better scaling system. Therefore, the old system will be replaced by a more scalable microservice-based Spring application in the coming years. The investigated part of the system under study, which is still under development, consists of seven services. It is deployed to a Platform as a Service (PaaS), i.e., Cloud Foundry (CF). Together with the industrial partner, we decided on a scenario-based approach, as our industrial partner already employed ATAM for other quality attributes.
3.2. Research Procedure

To answer our research question How to leverage ATAM to elicit resilience requirements, which can be utilized to evaluate resilience through resilience experiments and suggest architectural improvements quantitatively?, we conduct the following steps:

1. We gather relevant system stakeholders, i.e., product owners, software architects, and quality engineers, in an ATAM-based workshop. The objective is to identify resilience scenarios that lead to QoS degradation and downtime.
2. We derive resilience experiments from the scenarios. The experiment description comprises the stimuli, artifact, and response according to the scenario.
3. We use CTK to automate the execution of the resilience experiments. We assess the response measure by analyzing the QoS metric measurements.
4. After executing the resilience experiments, we apply suitable resilience patterns. We re-execute the resilience experiments to assess a pattern's effect by comparing the QoS-related behavior with and without the resilience pattern.

4. Elicitation and specification of scenarios

This section elaborates on the planning, execution, and results of the workshop.

4.1. Elicitation through Structured Workshop

Before the workshop, we received documentation regarding the architecture of the system. This allowed us to specify an architecture model of the case study system, including a component diagram and an explanation of the implemented components. For ATAM, we needed to know the key architectural design decisions. Therefore, knowing the architecture description in advance allowed us to focus more on the hazard analysis and on developing resilience scenarios.

The full-day workshop consisted of four sessions leveraging different methods, as described next. The moderators explained each technique and method at the beginning of each session. The participants were stakeholders of the system and comprised two software architects, one product owner, and one quality assurance engineer.

Session 1: Introduction and Architecture Description for achieving a common understanding of the workshop process and the system's architecture. (1) We resolved misunderstandings regarding the elicited architecture description by asking questions, and (2) we refined the prepared architectural models.

Session 2: Hazard Analysis to identify potential causes for degradation in QoS. Index cards were used as a means to collect hazards. Afterward, the participants arranged the hazards and their causes in a fault-tree-like fashion. To not break the participants' creative flow, we relaxed the strict construction rules of fault trees, e.g., we allowed events having multiple parents, which resulted in a graph. For this reason, we refer to the result of this session as a fault graph. Note that the (directed acyclic) fault graph can be transformed into an equivalent fault tree by creating duplicate sub-trees for nodes having more than one parent, as sketched below.
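For illustration only, the unfolding of a fault graph into a fault tree boils down to recursively copying shared sub-graphs so that every node ends up with a single parent. The data structure below is our own minimal sketch, not part of the authors' tooling:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (our own, simplified): a fault graph node with a label,
// a gate type, and its child causes. Because children may be shared between
// several parents, the structure is a directed acyclic graph, not a tree.
final class FaultNode {
    final String label;
    final String gate; // "AND", "OR", or null for basic/undeveloped events
    final List<FaultNode> children = new ArrayList<>();

    FaultNode(String label, String gate) {
        this.label = label;
        this.gate = gate;
    }

    // Unfolds the acyclic fault graph into an equivalent fault tree by
    // duplicating the sub-tree of every node reached via more than one
    // parent; each path from the top event yields its own copy.
    FaultNode toFaultTree() {
        FaultNode copy = new FaultNode(label, gate);
        for (FaultNode child : children) {
            copy.children.add(child.toFaultTree());
        }
        return copy;
    }
}
```

Termination is guaranteed because the fault graph is acyclic; the duplication only affects shared sub-trees.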
Session 3: Resilience Scenarios for collecting and prioritizing resilience scenarios based on the previously identified hazards. We provided a scenario template based on the layout used in ATAM. Then, the stakeholders jointly created scenarios by informally analyzing the fault graph in a sequence driven by the associated severity (in descending order) of the hazards.

Session 4: Retrospective to collect feedback about the workshop from the participants and to inform them about the next steps, which comprise (1) the refinement of the resilience requirements and (2) the execution of resilience experiments.

4.2. Workshop Results

Elicited Architecture Description: Figure 1 shows the component diagram of the system as specified in the first session of the workshop. It describes a snapshot of the system as used in the workshop and the subsequent activities. It represents a typical microservice-based architecture. As such, the system is deployed to a CF and contains several services. Each service has its own PostgreSQL database. The only exception is the Calculations service, which employs a Mongo database. The API-Gateway service handles all incoming connections and routes all communications. A Eureka service is employed to provide service discovery for all internal components. The Frontend service is the only external component that a user can directly access. The Calculations service is the central hub of the system since the calculation of payments is the system's main feature. Once this service receives a calculation request from the gateway, it collects all necessary data asynchronously from the other services. The Companies service is used to handle data the Frontend displays, but it is not relevant for the calculation.

Figure 1: Component Diagram of the Payment Accounting System (services: API-Gateway, Frontend, Eureka, Working-Hours, Payments, Calculations, Employees, Social-Insurances, Taxes, and Companies; the API-Gateway can communicate with each service directly).

Hazard Analysis: Figure 2 shows the fault graph created in the second workshop session. The stakeholders agreed on unavailability or long response times of settlement calculations as the main system hazard. Therefore, user's settlement cannot be calculated is the top event in the fault graph. We analyzed possible causes from the top event until we reached basic events that we could not further decompose. We connected different causes by logical operators, i.e., AND and OR. For example, users cannot calculate their settlement if it is not processed in time. This can occur when the assigned instance stalls OR responds too slowly. We argue that the latter can be experienced if the system receives a sudden (work)load peak AND its (auto) scaling does not work correctly. The hazards at the leaf nodes are potential candidates for fault/failure injection during resilience experiments and can be initiated by tools such as CTK. The stakeholders selected and prioritized the set of resilience experiments.

Figure 2: Cleaned Fault Graph (top event: user's settlement cannot be calculated; nodes are intermediate, undeveloped, and basic events connected by AND/OR gates).
Resilience Scenarios: We gave the participants an empty table according to the ATAM scenario template with the columns (1) source, (2) stimulus, (3) artifacts, (4) environment, (5) response, and (6) response measure. Further, we explained the meaning of each table column to the participants. By using index cards again, the participants steadily added content to the table. We began by identifying possible sources. The stimuli and artifacts were then derived from the previously created fault graph. The environment represents different time periods when the identified stimuli occur. The responses are the stakeholders' assumptions about how the system should respond to the particular stimulus. The response measures are based on their internal Service Level Objectives (SLOs). For example, a workload peak resulting in a system failure was transposed into multiple scenarios. Users of the system are the source of the scenario since they cause the load peak. The respective stimulus is the workload peak itself. A service was chosen as the artifact to represent that a load peak can influence all service instances. As the environment, the payslip calculation period was chosen to imply an existing base workload. At last, the stakeholders chose the responses and response measures based on their SLOs.

The stakeholders elaborated 12 resilience scenarios, summarized in Table 1. Scenarios 01 to 04 are different variations of an unexpected load peak, including linearly and exponentially increasing loads. Scenarios 05 and 06 describe the failure of a single service instance. Scenarios 07 and 08 are about middleware failures. Scenarios 09 and 10 revolve around gateway failures. Lastly, Scenarios 11 and 12 describe the failure of multiple instances. Actors such as end-users, elements of the CF platform, different bugs, technical issues caused by the middleware or deployment artifacts, and issues intrinsic to individual services of the system comprise the established sources. In total, all scenarios can affect all services. The environments cover different states of the system according to the identified system domain context, e.g., payslip calculation periods or simply services being non-idle independently of the different calculations. The responses and response measures were specified by the stakeholders based on their internal SLOs.

Table 1: Scenarios created during the workshop (for each of the 12 scenarios: ID, short name, source, stimulus, artifact, environment, response, and response measure; not reproduced here, the scenarios are part of the online artifacts [10]).

Retrospection: The brief retrospective at the end of the workshop showed that the participants were satisfied with the agenda, content, and outcomes. However, comments were made concerning time management.

5. Resilience Evaluation

This section aims to evaluate the case study system's resilience. Therefore, we implemented a subset of the previously elicited resilience scenarios as resilience experiments using CTK. We compare the system's behavior against the expected behavior described in the scenarios' response part.

5.1. Experiment Setup

5.1.1. Examined Software System

Due to legal constraints and to maintain anonymity, our industrial partner provided us with a mocked version as a proxy for the real payroll accounting system. This version, shown in Figure 3, is used throughout this paper as the system under test. It implements a similar business logic but with less computational overhead. The system uses typical patterns of the microservice architectural style, i.e., the API-Gateway-service as a central gateway that manages all incoming requests and Eureka [18] to provide service discovery. The payslip-service utilizes an H2 in-memory database and the third-party API Jollyday.

Figure 3: Mocked Payroll Accounting System (services: API-Gateway, payslip, payslip2, Eureka, and the Jollyday API).
The payslip-service can forward requests to the payslip-service2. Requests can also be sent directly to payslip-service2 using a different endpoint. The following six endpoints are used during the experiments:

INTERNAL_DEP. — Calls the payslip-service2 via the payslip-service.
DB_READ — Reads an entry from the database of the payslip-service.
EXTERNAL_DEP. — Calls the third-party API Jollyday via the payslip-service.
DB_WRITE — Writes an entry into the database of the payslip-service.
GATEWAY_PING — Checks whether the API-Gateway-service responds.
UNAFF._SERVICE — Sends a request directly to payslip-service2.

The actual payment accounting system is deployed to a paid CF. Due to financial constraints and legal issues, the mock system is deployed to a local CF environment [19], which has similar properties as a paid CF. As CF is a constraint given by the stakeholders, we did not consider other cloud providers.

5.1.2. Experiment Tools

Figure 4 shows our experiment framework comprising four tools, i.e., CTK, the load generator, the hypothesis validator, and the dashboard. During an experiment, these tools interact with the system to monitor the experiments and provide detailed insights, e.g., response times of calls to individual endpoints.

To execute the experiments, we used CTK [20], which can execute and monitor chaos tests and has drivers for various PaaS solutions. We leveraged the CF driver to terminate a service instance at a specific point in time and to validate the steady-state hypothesis. The load that the system receives is controlled by an adapted version of the load generator from the TeaStore microservices benchmark [21], which monitors response times and the numbers of successful, dropped, and failed requests. The collected data is written into an InfluxDB [22] for a time-series-based evaluation. During the evaluation, a Spring service collects the necessary data from the InfluxDB and calculates whether a hypothesis holds. We also created a dashboard application that provides convenient features, like synchronized starting of CTK and the load generator, live monitoring, and automated CTK setup. Since the dashboard does not add functionality for executing experiments, it is not part of Figure 4.

Figure 4: Used structure of the experiment framework (CTK executes the experiment against the system under test, the load generator loads the system and writes results to InfluxDB, and the hypothesis validation queries InfluxDB to retrieve the results).
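The hypothesis validator itself is not published; as a rough sketch of what such a check can look like under our assumptions (the class and method names are ours, the 99 %/1 s threshold is taken from the scenario's response measure), evaluating Scenario 05 amounts to computing the share of requests that were answered correctly within one second:

```java
import java.util.List;

// Illustrative sketch of an SLO check as performed by the hypothesis validator.
// One sample corresponds to one request recorded by the load generator; in the
// actual setup, these values are queried from InfluxDB.
record RequestSample(double responseTimeMs, boolean successful) {}

final class HypothesisValidator {

    // Scenario 05 response measure (stakeholder SLO): 99 % of the requests
    // are answered correctly within 1 s.
    static boolean scenario05ResponseMeasureHolds(List<RequestSample> samples) {
        if (samples.isEmpty()) {
            return false; // no data recorded, hypothesis cannot be confirmed
        }
        long correctAndInTime = samples.stream()
                .filter(s -> s.successful() && s.responseTimeMs() <= 1000.0)
                .count();
        return (double) correctAndInTime / samples.size() >= 0.99;
    }
}
```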
5.2. Experiment Execution

Based on Scenarios 04 and 05, we implemented three resilience experiments. The first experiment investigates a load peak with an exponential increase (Scenario 04), while the remaining two investigate instance termination due to an internal CF error, for random instances (Scenario 05) and specifically for the payslip-service (Scenario 05'). The selection of experiments is based on the industrial partner's preferences. In all experiments, the effect on all endpoints is examined. In the following, we will only discuss the results of a subset of endpoints for Scenario 05'. The residual results can be found in the supplementary material [10].

The design of the experiment related to Scenario 05' is given in Table 2. The target service of this experiment is the payslip-service, which holds the core business logic of the mock system. We use CTK to terminate running CF application instances to simulate the scenario's stimulus. The stimulus refers to an error that occurs in CF, which leads to the loss of an application instance. We assume that the blast radius only affects the payslip-service, and that CF registers the loss of the payslip-service instance and starts a new instance. Our hypothesis is that the response measure of Scenario 05 still holds.

Table 2: Resilience experiment design for Scenario 05'. Target Service: payslip-service; Experiment Type: terminate a payslip-service application instance; Hypothesis: the response measure of Scenario 05 holds; Blast Radius: payslip-service; Load Profile: almost constant synthetic load (see below).

During the experiments, the system is exposed to an almost constant, synthetic load. We generated a load profile with a target load of 20 requests per second and some noise. The requests are evenly distributed over all six endpoints. To assess whether the system still responds correctly and in time, we measure the response times of the requests and compute their success rate.
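For illustration, a CTK experiment for this design could look roughly like the following sketch. This is not the experiment published with the paper [10]; the Cloud Foundry driver module and function names, the probe URL, and all arguments are assumptions that have to be adapted to the installed chaostoolkit-cloud-foundry version and the concrete environment.

```yaml
# Illustrative sketch only -- not the authors' published experiment.
# Driver module/function names, URLs, and arguments are assumptions.
version: "1.0.0"
title: "Scenario 05': terminate a payslip-service application instance"
description: "Hypothesis: the response measure of Scenario 05 still holds."
steady-state-hypothesis:
  title: "Requests are answered correctly and in time"
  probes:
    - type: probe
      name: internal-dep-endpoint-responds
      tolerance: 200                                # expected HTTP status code
      provider:
        type: http
        url: http://api-gateway.local/internal-dep  # assumed endpoint URL
        timeout: 1                                  # seconds, from the 1 s SLO
method:
  - type: action
    name: terminate-payslip-service-instance
    provider:
      type: python
      module: chaoscf.actions                 # CF driver module (assumed)
      func: terminate_some_random_instance    # action name (assumed)
      arguments:
        app_name: payslip-service
rollbacks: []
```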
5.3. Experiment Results

Figure 5 shows the steady-state, injection, and recovery phases of the experiment for the endpoints INTERNAL_DEP., GATEWAY_PING, and UNAFF._SERVICE. In the steady-state phase, we assume that the system is working as expected, i.e., the response times satisfy the SLOs. In the injection phase, CTK terminates the payslip-service instance. In the recovery phase, we assume that the system recovers and returns to a steady state, i.e., the response times satisfy the SLOs. We omitted the load generator's warmup and cooldown phases, i.e., the overall first and last 300 s, for readability and analysis purposes. Further, a 30 s binning was applied, and extreme outliers (>100 ms) are not shown.

Figure 5: Comparison of experiment results at different endpoints with and without the implemented retry pattern: (a) INTERNAL_DEP. without retry, (b) INTERNAL_DEP. with retry, (c) GATEWAY_PING without retry, (d) GATEWAY_PING with retry, (e) UNAFF._SERVICE without retry, (f) UNAFF._SERVICE with retry; each panel plots response times and success rates of HTTP requests over the experiment duration, annotated with the steady-state, injection, and recovery phases.

The success rates at the endpoints INTERNAL_DEP. (Figure 5a), DB_READ, EXTERNAL_DEP., and DB_WRITE drop to 0 % as the payslip-service is terminated after 600 s and rise back to 100 % as it recovers within about 1.5 min. During this downtime, no response times are recorded since no requests arrive at the payslip-service. During the steady-state and recovery phases, the response times are stable at around 20 ms and 15 ms, respectively. During the injection phase, there is a slight increase after the payslip-service has restarted. The results for GATEWAY_PING (Figure 5c) and UNAFF._SERVICE (Figure 5e) show a similar structure. However, the load generator did not record any successful or failed requests during the downtime. Therefore, no success rate could be calculated.

5.4. Discussion of Results

As visible in Figure 5 (left side), the response time and success rate values are almost identical in the steady-state phase and the recovery phase. Furthermore, the increase in the success rate indicates that the payslip-service becomes available again after 30 s to 60 s. Thus, the CF platform can re-instantiate the payslip-service quickly, leading to a quick recovery of the system.

Response times are slightly higher while the payslip-service is re-instantiated, which was expected as normal cold-start behavior. The endpoints GATEWAY_PING and UNAFF._SERVICE should remain unaffected during the injection because the payslip-service is not required to answer their requests. Nevertheless, response times at the endpoint GATEWAY_PING are affected, which indicates a propagation of the failure effects from the payslip-service to the API-Gateway-service.

After the injection started, the success rate drops to 0 % at the endpoints INTERNAL_DEP., DB_READ, EXTERNAL_DEP., and DB_WRITE: CTK terminates the single payslip-service instance, and the load generator flags all requests as failed, leading to a success rate of 0 %. The plots show neither successful nor failing request responses at the endpoints GATEWAY_PING and UNAFF._SERVICE during the injection, which indicates that no such requests exist in the system. Another possibility is that requests have been dropped; looking at the raw data tables disproves this argument, as there are no dropped requests. Another explanation is that no requests arrived at the system, which leads to a lack of data in the time frame between approximately 600 s and 660 s.
We hypothesized that the response measure of Scenario 05 holds, i.e., requests are answered in time (99 % in less than 1 s) and correctly. As the response times are far below 1 s, our hypothesis regarding the response times is technically fulfilled. However, several requests are not answered at all, which is indicated by the dropped success rate. We consider these as incorrect responses. Therefore, we assume that the hypothesis regarding correctness is not fulfilled.

6. Resilience Improvement

The previous section's experiments showed that the system does not respond as described in Scenario 05 to a failure of an instance of the payslip-service. While the response times are technically below 1 s in 99 % of all cases, requests are temporarily not answered at all and, thus, not correctly. Therefore, we aim to improve the system's success rate concerning Scenario 05 by applying resilience pattern(s). We then determine the efficacy of the improvements to the system's resilience by re-executing the experiments.

6.1. Architectural Modifications

The system under test was fortified with a retry pattern [9], i.e., the API-Gateway-service sends another request to the payslip-service if a request fails or remains unanswered. The retry pattern seems to be a reasonable choice since the response times are far below the threshold of 1 s, as indicated by the previous experiment. Due to its specific purpose, the system has to accept requests in near real-time and always answer correctly. Thus, resilience patterns that rely on backup or restricting behavior, like circuit breakers or flow limiters, are unsuited. To avoid bad retry behavior, we configured Spring-Retry as follows: we set the maximum number of retries of each payslip-service request to 4, the initial delay to 10 ms, the factor for the exponential increase to 3, and the maximum delay to 150 ms — resulting in retries after 10 ms, 30 ms, 90 ms, and 150 ms.
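The paper does not show how Spring-Retry was wired into the gateway; one common way to obtain the described behavior is an annotated client method, roughly as in the following sketch (class, method, and URL names are our assumptions):

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.EnableRetry;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Configuration
@EnableRetry
class RetryConfiguration { }

@Service
class PayslipClient {

    private final RestTemplate restTemplate = new RestTemplate();

    // maxAttempts = 1 initial call + 4 retries; the exponential back-off yields
    // delays of 10 ms, 30 ms, 90 ms, and then 150 ms (capped by maxDelay),
    // matching the configuration reported in Section 6.1.
    @Retryable(maxAttempts = 5,
               backoff = @Backoff(delay = 10, multiplier = 3, maxDelay = 150))
    public String calculatePayslip(String requestBody) {
        // Placeholder URL for the payslip-service endpoint behind the gateway.
        return restTemplate.postForObject(
                "http://payslip-service/payslip", requestBody, String.class);
    }
}
```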
6.2. Experiment Results and Discussion

Each plot on the right side of Figure 5 visualizes the system's response times and success rates with the retry pattern for one endpoint. As in the experiment presented in Section 5, each plot is divided into the steady-state phase, the injection phase, and the recovery phase. Table 3 shows the associated statistical values.

Table 3: Statistical summaries of the three experiment phases. Per endpoint and phase, the values are the 5th percentile, median, mean, and 99th percentile of the response times in ms, without (w/o) and with (w) the retry pattern.
INTERNAL_DEP.: Steady state w/o 19/22/22.5/33, w 19/22/24.0/51; Injection w/o 19/21/22.4/32, w 19/22/24.6/90; Recovery w/o 19/21/22.1/31, w 19/22/23.0/34.
DB_READ: Steady state w/o 11/12/13.3/21, w 11/12/13.0/24; Injection w/o 11/12/13.1/20, w 11/12/13.2/30; Recovery w/o 11/12/12.9/20, w 11/12/12.8/19.
EXTERNAL_DEP.: Steady state w/o 11/12/12.6/21, w 10/12/13.3/23; Injection w/o 11/12/12.5/21, w 10/12/14.1/31; Recovery w/o 11/12/12.3/19, w 10/12/12.2/19.
DB_WRITE: Steady state w/o 11/13/13.4/21, w 11/12/13.1/22; Injection w/o 11/12/13.1/20, w 11/12/13.3/27; Recovery w/o 11/12/13.0/20, w 11/12/12.7/19.
GATEWAY_PING: Steady state w/o 11/12/13.3/21, w 11/12/13.3/24; Injection w/o 11/12/13.1/20, w 11/12/13.6/32; Recovery w/o 11/12/12.9/19, w 11/12/12.9/21.
UNAFF._SERVICE: Steady state w/o 10/11/11.9/19, w 10/11/11.6/19; Injection w/o 10/11/11.7/18, w 10/11/11.6/19; Recovery w/o 10/11/11.8/18, w 10/11/11.5/19.

In general, similar behavior can be observed at all endpoints. Comparing the plots on the left and right of Figure 5 shows that the mean response times in the steady-state phase do not vary significantly when the retry pattern is activated. However, at the beginning of the injection phase, far more high response times can be observed. In addition, the boxplots show a slightly higher interquartile range in the plots where the retry pattern is integrated.

The plots also show that the success rate does not drop to zero anymore when the pattern is active. For the endpoint INTERNAL_DEP., the success rate drops to approximately 70 %. For the two endpoints GATEWAY_PING and UNAFF._SERVICE, requests are arriving and the success rate remains at 100 %.

The application of the retry pattern can explain the response time spikes during the injection (see Figure 5). Requests sent shortly before the restart of the payslip-service fail, but are retried by the API-Gateway-service until the payslip-service has recovered after approximately 10 s. However, as several retries have been aggregated, the payslip-service has to handle a high amount of requests upon recovery, resulting in a visible spike in response times.

The endpoints UNAFF._SERVICE and GATEWAY_PING do not depend on the payslip-service. This explains the high success rate at these endpoints.

In contrast to the experiment without the retry pattern, the success rate does not drop entirely. Therefore, the retry pattern improves the scenario satisfaction, as it increases the percentage of correct responses while keeping the response times below 1 s.

7. Discussion

7.1. Key Lessons Learned

Lesson 1: Elicitation of resilience requirements involves hazard analysis. It is essential to include stakeholders with different roles and particular expertise in the business domain to quickly prepare a list of relevant hazards. Other roles, such as software developers and infrastructure engineers, help to identify causes of hazards that stem from the software and its running environment.

Lesson 2: ATAM is a useful method to adopt for resilience elicitation. Stakeholders of the software project were already familiar with scenario development for quality requirements. Therefore, the structure of the scenario template of Bass et al. [12] was intuitive for the stakeholders.

Lesson 3: Loose adoption of formalisms is already good enough. Researchers and practitioners have used the fault tree formalism for both qualitative and quantitative hazard analysis in safety engineering. To identify the causes of a hazard, we did not have to comply with the fault tree formalism rigorously. The informal way of constructing a fault tree was easy to understand for the stakeholders.

Lesson 4: The ATAM workshop requires considerable refinement that can be done "offline". The outcome of the well-prepared one-day workshop needed further refinement. In particular, it was necessary to refine the stimulus and response measure parts of each scenario, e.g., we modeled the workload and tried to express the scenarios in temporal logic. This revealed that the initial requirements were partially ambiguous and imprecise, which was easy to resolve through clarification requests to the stakeholders. Therefore, we hypothesize that formalization benefits both the validation and the quantitative evaluation of resilience requirements and that an explicit (offline) formalization step could complement the proposed workshop well.
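The paper does not state which logic or concrete formulas were attempted; purely as an illustration of the direction, a response measure such as "requests are answered correctly within 1 s in 99 % of the cases" could be made precise as a frequency constraint over the observed requests, with a time-bounded temporal-logic-style reading of the per-request obligation (our notation, not the authors'):

```latex
% Illustrative formalization sketch (our notation, not taken from the paper).
% (1) Frequency reading of the response measure over the set R of requests:
\frac{\left|\{\, r \in R \mid \mathit{correct}(r) \wedge \mathit{respTime}(r) \leq 1\,\mathrm{s} \,\}\right|}{|R|} \;\geq\; 0.99
% (2) Per-request obligation in a time-bounded temporal-logic style:
\Box\,\bigl(\mathit{request} \rightarrow \Diamond_{\leq 1\,\mathrm{s}}\ \mathit{correctResponse}\bigr)
```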
Lesson 5: A tightly planned one-day workshop is sufficient. We managed to collect resilience scenarios in a one-day workshop because it was well prepared (knowing the architecture description) and well conducted (strict time management). Refinement can be done offline by engineers who are more skilled in formalizing stimuli and response measures (similar to writing SLOs). However, it is important to ask for feedback to check the validity of the requirements.

Lesson 6: The resilience elicitation helps to refine "classical" QoS requirements. All response measures are based on non-resilience specifications, which makes them imprecise. For example, the maximum degradation and the time to recovery were not specified. Thus, it is unclear whether the experimentation shows acceptable or unacceptable degradation in performance or availability quality.

7.2. Threats to Validity

We discuss the threats to validity for the workshop and our experiment design.

7.2.1. Workshop

Conclusion validity: One threat is the reliability of measures, i.e., whether repeating the workshop yields the same list of resilience requirements. Elicitation of resilience requirements involves human judgment. Hence, it is a subjective measure. Therefore, we cannot entirely rule out this threat.

Internal validity: One threat is instrumentation, which means our tools and techniques were not suitable. We conducted a one-day structured workshop and used the scenario template of Bass et al. [12] for eliciting resilience requirements. We refined all resilience requirements through several iterations after the workshop and validated them with the workshop participants.

Construct validity: For us, the main threat in this category is mono-method bias, which means we did not use other elicitation methods. Therefore, there is a threat that the elicited resilience requirements are biased. We cannot entirely rule out this threat, as we did not apply other methods and cross-check the results.

External validity: The heterogeneity, i.e., the different roles and expertise of the participants, poses a threat. Workshops with less heterogeneity among the stakeholders could lead to no resilience requirements. We cannot entirely rule out this threat.

7.2.2. Experiment design

We used the mock system for the quantitative evaluation of resilience requirements that are based on the actual system. There is a threat that the evaluation results are inaccurate. However, the purpose of the experiments is to show by example how elicited requirements and derived experiments can help to improve the system — we do not claim the accuracy of the quantitative results. Furthermore, due to legal issues, we used CF Dev [19]. We faced instability, e.g., resource drainage of Dev nodes, in the environment during experimentation. There is a threat of a negative impact on the results due to this instability. To counteract this threat, we re-executed experiments to gain insight into the approximate measurements, ensuring reliable data without unintended node or service crashes.

8. Conclusion

The successful development of resilience scenarios depends on the outcome of the hazard analysis. Our approach to scenario-based resilience evaluation assumes a business domain expert to derive an initial list of hazards. FTA can then be a means to analyze the hazards and derive resilience scenarios. We plan to (1) extend our process with an explicit formalization step after the workshop for the refinement of the scenarios, (2) formally verify response measures of resilience scenarios, and (3) create processes for continuous hazard analysis when a system faces changes, e.g., updates and refinement/development of resilience scenarios.

Acknowledgments

This work has been supported by the Baden-Württemberg Stiftung (ORCAS — Efficient Resilience Benchmarking of Microservice Architectures) and the German Federal Ministry of Education and Research (Software Campus 2.0 — Microproject: DiSpel).

Data Availability

Our artifacts [10] comprise (i) the resilience scenarios and (ii) the data and R scripts as a CodeOcean capsule. We are working on making parts of the created/modified experiment tools available as open-source software. For confidentiality reasons, the system under test cannot be published.

References

[1] S. Newman, Building Microservices, O'Reilly, 2015.
[2] V. Heorhiadi, S. Rajagopalan, H. Jamjoom, M. K. Reiter, V. Sekar, Gremlin: Systematic resilience testing of microservices, in: Proc. 36th IEEE Int. Conf. on Distributed Computing Systems (ICDCS), 2016, pp. 57–66.
[3] A. Basiri, N. Behnam, R. de Rooij, L. Hochstein, L. Kosewski, J. Reynolds, C. Rosenthal, Chaos engineering, IEEE Software 33 (2016) 35–41.
[4] Chaos Toolkit, 2020. URL: https://github.com/chaostoolkit.
[5] R. Miles, Learning Chaos Engineering — Discovering and Overcoming System Weaknesses through Experimentation, O'Reilly Media, Inc., 2019.
[6] N. G. Leveson, Safeware — System Safety and Computers: A Guide to Preventing Accidents and Losses Caused by Technology, Addison-Wesley, 1995.
[7] D. Kesim, A. van Hoorn, S. Frank, M. Häussler, Identifying and prioritizing chaos experiments by using established risk analysis techniques, in: Proc. 31st Int. Symposium on Software Reliability Engineering (ISSRE), 2020.
[8] R. Kazman, M. Klein, M. Barbacci, T. Longstaff, H. Lipson, J. Carriere, The architecture tradeoff analysis method, in: Proc. 4th IEEE Int. Conf. on Engineering of Complex Computer Systems (ICECCS), 1998, pp. 68–78.
[9] M. T. Nygard, Release It!: Design and Deploy Production-ready Software, Pragmatic Bookshelf, 2018.
[10] S. Frank et al., Supplementary material, 2020. Artifacts: https://doi.org/10.5281/zenodo.5142006 (Scenarios); https://doi.org/10.24433/CO.0520280.v1 (Code Ocean capsule).
[11] K. Pohl, Requirements Engineering — Fundamentals, Principles, and Techniques, Springer, 2010.
[12] L. Bass, P. Clements, R. Kazman, Software Architecture in Practice, 2nd ed., Addison-Wesley Longman Publishing Co., Inc., USA, 2003.
[13] J. Cámara, R. de Lemos, Evaluation of resilience in self-adaptive systems using probabilistic model-checking, in: Proc. 7th Int. Symposium on Software Engineering for Adaptive and Self-Managing Systems (SEAMS), 2012, pp. 53–62.
[14] J. Cámara, R. de Lemos, M. Vieira, R. Almeida, R. Ventura, Architecture-based resilience evaluation for self-adaptive systems, Computing 95 (2013) 689–722.
[15] J. Cámara, R. de Lemos, N. Laranjeiro, R. Ventura, M. Vieira, Robustness-driven resilience evaluation of self-adaptive software systems, IEEE Transactions on Dependable and Secure Computing 14 (2017) 50–64.
[16] R. Natella, D. Cotroneo, H. Madeira, Assessing dependability with software fault injection: A survey, ACM Computing Surveys (CSUR) 48 (2016) 44:1–44:55.
[17] K. Yin, Q. Du, W. Wang, J. Qiu, J. Xu, On representing and eliciting resilience requirements of microservice architecture systems, CoRR abs/1909.13096 (2020). URL: https://arxiv.org/abs/1909.13096v3. arXiv:1909.13096.
[18] Netflix Inc., Eureka, 2020. URL: https://github.com/Netflix/eureka.
[19] Cloud Foundry Foundation, Cloud Foundry Dev documentation, 2020. URL: https://github.com/cloudfoundry-incubator/cfdev.
[20] Chaos Toolkit, Chaos Toolkit documentation, 2020. URL: https://chaostoolkit.org.
[21] J. von Kistowski, S. Eismann, N. Schmitt, A. Bauer, J. Grohmann, S. Kounev, TeaStore: A micro-service reference application for benchmarking, modeling and resource management research, in: Proc. IEEE 26th Int. Symp. on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2018, pp. 223–236.
[22] InfluxData Inc., InfluxDB website, 2020. URL: https://www.influxdata.com/.