Detecting Operator Errors in Cloud Computing Using Anti-Patterns

Arthur Vetter
Horus software GmbH, Ettlingen, Germany
arthur.vetter@horus.biz

Abstract. IT services are subject to several maintenance operations such as upgrades, reconfigurations or redeployments. Monitoring those changes is crucial to detect operator errors, which are a main source of service failures. A further challenge, which exacerbates operator errors, is the increasing frequency of changes, e.g. because of continuous deployment. In this paper, we propose a monitoring approach to detect operator errors in real time by using complex event processing and anti-patterns. The basis of the monitoring approach is a novel business process modelling method, combining TOSCA and Petri nets. This model is used to derive pattern instances, which are input for a complex event processing engine in order to analyze them against the generated events of the monitored applications.

Keywords: Complex Event Processing, Anti-Pattern, TOSCA, IT Service Management, Anomaly Detection.

1 Introduction

Operator errors have been one of the major reasons for IT service failures [1]–[6] and will probably continue to be, given current trends like continuous delivery, DevOps and infrastructure-as-code [7]. In recent years, several studies and methods were developed to detect errors in very complex IT systems [8]. Those traditional methods are suited for detecting errors during “normal” operations, but not during change operations like reconfigurations or rolling upgrades, in which one node after the other is upgraded [9].

This paper presents current research results of a novel monitoring approach for those change operations. The monitoring approach is based on a process model, combining TOSCA and high-level Petri nets [8], which explicitly models the maintenance operations of the IT service applications. Pattern instances are derived from this process model.
Those pattern instances are checked by a complex event processing engine against state events and transaction events. State events describe the state of the application, whereas transaction events describe each single operation performed on the application. To this end, the logs of the applications are filtered for meaningful transaction events, which are sent to the complex event processing engine, allowing the detection of operator errors almost in real time. The complex event processing engine compares the pattern instances with the generated events through anti-patterns and creates an error message when an anti-pattern instance is detected. Fig. 1 gives an overview of the general monitoring approach.

[Fig. 1. General monitoring approach. Operator actions on the applications of the IT service (application 1, 2, 3, ...) produce transaction events and state events. The complex event processing engine of the monitoring application checks these events for anti-patterns against the pattern instances derived from the process model and emits anti-pattern instances (error messages).]

The remainder of this paper is organized as follows: The next section gives a short overview of typical operator errors. Section three describes the fundamentals of TOSCA and XML nets, which are used to model the actual maintenance. Section four describes the concept of patterns and anti-patterns. Section five presents the proof of concept implementation. Afterwards, related work is presented. Section seven concludes the paper.

2 Operator Errors

Oppenheimer et al. [5] and many other authors like [4], [5], [11], [12] classify operator errors into process errors and configuration errors. Process errors can be further differentiated into the following errors: a forgotten activity, an unneeded activity that was executed, a wrong activity that was executed, or correct activities that were executed in the wrong order. Configuration errors can be separated into formatting errors and configuration value errors [13].
Formatting errors can be further separated into lexical errors, syntactical errors and typos. Configuration value errors can be further classified into local value inconsistencies and global environment inconsistencies. A monitoring approach to detect operator errors should be able to detect all those process and configuration error types. Table 1 gives an example for every type of operator error and a reference to a study with further information and examples.

Table 1. Operator error examples

Operator error | Example / Description | Reference
Forgotten activity | Forgot to restart a server | [4]
Unneeded activity | Unnecessary restart of a server | [9]
Wrongly executed activity | Restoration of a wrong backup | [4]
Wrong order | Bringing down two servers in parallel for configuration instead of sequentially maintaining the servers | [9]
Local inconsistency | log_output = "Table", log = query.log. According to the value “log”, the user wanted to store logs in a file, but the value “log_output” causes the data to be stored in a database table | [10]
Global inconsistency | datadir = /some/old/path. “datadir” points to an old path, which does not exist anymore | [10]
Lexical error | InitiatorName: iqn:DEV_domain. Only lowercase letters are allowed (“DEV”) | [10]
Syntactical error | extension = mysql.so ..... extension = recode.so. “mysql.so” depends on “recode.so” and was configured in the wrong order | [10]
Typo | extension = recdoe.so, extension = mysql.so. The correct spelling of “recdoe.so” is “recode.so” | [10]

3 Fundamentals

The process model is a combination of TOSCA and XML nets and was introduced in a former paper [8]. In this chapter, we briefly describe the fundamentals of TOSCA and XML nets and then describe how maintenance operations can be modelled with TOSCA and XML nets.
3.1 TOSCA

TOSCA (Topology and Orchestration Specification for Cloud Applications) is a standard released by OASIS [14] to support the portability of cloud applications between different cloud providers and the automation of cloud application provisioning. To this end, TOSCA provides a modelling language to describe cloud applications as Service Templates. A Service Template consists of a Topology Template and of optional Plans.

[Fig. 2. TOSCA Service Template]

A Topology Template describes the structure of a cloud application as a directed graph and consists of Node Templates and Relationship Templates. A Node Template represents a component of the cloud application, e.g. an application server, and is described by a Node Type. A Node Type defines
• properties of the component (Properties Definition),
• available operations to manipulate the component (Interfaces),
• requirements of the component (Requirement Definitions),
• possible lifecycle states of the component (Instance States) and
• capabilities it offers to satisfy other components’ requirements (Capability Definitions).

Plans are models to orchestrate the management operations which are offered by the cloud application components, and can be written in BPMN, BPEL or other languages. We use the notation of XML nets for the creation of Plans, which we call “maintenance plans” in the rest of the paper.

3.2 XML nets

XML nets [15] are a high-level variant of Petri nets, in which places represent containers for XML documents. The XML documents must conform to the XML Schema which is assigned to a specific place. Edges are labeled with Filter Schemas, which are used to read or manipulate XML documents. Transitions can be inscribed with a logical expression, whose variables are contained in the adjacent edges. A transition in an XML net is enabled and can be fired for a given marking when the following three conditions hold.
First, every place in the pre-set of the transition holds at least one valid XML document, which conforms to the Filter Schema inscribing the edge to the transition. Second, every place in the post-set of a transition must contain one valid XML document if the XML document has to be modified. If an XML document has to be created from scratch, the place must not already contain this XML document. Third, for the given instantiation of the variables, the transition inscription has to evaluate to true in order to enable the transition. If an enabled transition is fired, XML documents in the pre-set places are (partially) deleted or read for the given instantiation of variables, and new XML documents are created or existing XML documents are modified in the post-set places of the transition.

3.3 Modelling maintenance plans

This section describes the modelling of maintenance plans with TOSCA and XML nets, which allows applications and the orchestration of the applications’ management operations to be modelled in one integrated model. Such a model can then be used to derive pattern instances. For this purpose, we extend our former approach, introduced in [8].

The following adjustments are made to the general definition of TOSCA Node Templates:
• A Node Template represents exactly one instance of an application, that is, the attributes minInstances and maxInstances are both set to 1.
• Node Templates are extended with the complex element InstanceState, which stores the current state of the corresponding application.

The notation of XML nets is adjusted as follows:
• Places are containers for Service Templates. Every place is assigned to the general TOSCA XML schema and additionally to a single Node Type, which restricts the allowed Filter Schemas for corresponding Node Templates.
• Transitions represent operations defined in Interfaces of the adjacent Node Types.
• Filter Schemas can either be used to select Node Templates or to modify Properties or Instance States of a Node Template.
In contrast to general XML nets, deleting whole Node Templates is not allowed. Node Templates can only change their status, e.g. to undeployed, but they cannot be deleted. The reason is that, for error detection purposes, even an undeployed Node has to be monitored to be sure that it was really undeployed and, e.g., has not been deployed again by accident afterwards. Deleting parts of a Node Template is allowed, e.g. deleting a property.
• Transitions hold the attributes start and end, which define when the operation has to be executed earliest and latest.

We define a maintenance plan as a tuple $MP = \langle P, T, A, \Psi, I_P, I_N, I_A, I_T, M_0 \rangle$, where
(i) $\langle P, T, A \rangle$ is a Petri net with a set of places $P$, a set of transitions $T$, and a set of edges $A$ connecting places and transitions (the definition and description of Petri nets is omitted in this paper, but can be found, e.g., in [11]).
(ii) $\Psi = \langle D, FT, PR \rangle$ is a structure consisting of a finite and non-empty individual set $D$, a set of term and formula functions $FT$ defined on $D$, and a set of predicates $PR$ defined on $D$.
(iii) $I_P$ is the function that assigns the TOSCA XML Schema to each place.
(iv) $I_N$ is the function that additionally assigns a Node Type to each place.
(v) $I_A$ is the function that assigns a Filter Schema to each edge. The Filter Schema must conform to the XML Schema and Node Type of the adjacent place.
(vi) $I_T$ is the function that assigns a predicate logical expression as inscription to each transition. The inscription is built on a given structure $\Psi$ and a set of variables. Only variables which are contained in the Filter Schemas of adjacent edges are allowed. The inscription must evaluate to true in order to enable the transition.
(vii) Each transition represents a value of the element operation, which is defined in the complex element Interfaces of the Node Type in the post-set of the transition.
(viii) $M_0$ is the initial marking. Markings are TOSCA Service Templates.
(ix) Each transition holds the attributes start and end.

Fig. 3 shows an example of a maintenance plan to configure the database connection of the application MyApplication (Filter Schemas are written informally for readability reasons). MyApplication is hosted on MyAppServer and additionally requires the database TestDatabase. It is assumed that MyApplication is started when the change is performed. In the first place, which is linked to the Node Type Application, MyApplication is one possible representation. The first Filter Schema selects MyApplication with the condition that it is started. Before MyApplication can be configured, it has to be stopped, which is represented by the first transition. Stopping is one possible operation, which is given by the Node Type Application. If MyApplication is already stopped at the beginning of the change execution, this is a hint that an incident or something unexpected has happened, so the change execution should be interrupted. When MyApplication is stopped, the database connection can be set. For this purpose, the Node Template TestDatabase is selected, and the database connection is built from the properties of TestDatabase and inserted into MyApplication through the Filter Schema FS5. Afterwards MyApplication can be started again, but only if TestDatabase is running.

4 Pattern and Anti-Pattern for operator error detection

In computer science, the term pattern has been popular since the publication of the book about design patterns by Gamma et al. [12]. In this book, Gamma et al. describe patterns as solutions for recurring problems in a specific context. Aalst et al. [13] used the concept of patterns for business process modelling and described several patterns for the control flow perspective. Since then, many patterns have been described for different perspectives of business process modelling, such as the data perspective [14], [15].
Riehle and Züllighoven define a pattern more generally as an abstraction of a recurring concrete form in a specific context [16]. A form is a finite number of distinguishable elements and their relationships [16]. A context restricts the possible usage of a form, because the form has to fit into this specific context. Based on this definition, we define a pattern and an anti-pattern as follows:

DEFINITION 4.1: (PATTERN). A pattern is an abstraction of a welcomed, recurring, concrete form in a specific context.

DEFINITION 4.2: (ANTI-PATTERN). An anti-pattern is an abstraction of an unwelcomed, concrete form in a specific context.

In our work, we use patterns to describe the planned (to-be) control flow, application configurations and application states for the scheduled maintenance. Patterns are thus used during the design phase. Anti-patterns are used to check, during the actual execution of the maintenance, whether a form of events exists which does not fit the planned forms. In the following, we restrict and formalize the context of the used patterns and anti-patterns as well as the form of these patterns and anti-patterns.

[Fig. 3. Example of a TOSCA based XML net. The figure shows the Node Template MyApplication of type Application before the change (NT1, InstanceState.state = Started) and after being stopped (NT3, InstanceState.state = Stopped), the Node Template MyDatabase of type Database (NT2, InstanceState.state = Started, Properties Type: MySQL, Host: localhost, Port: 1521, Name: XE), the transitions Stop, Configure and Start, and the Filter Schemas FS1 to FS7. The Configure step sets Application.Properties.DB Connection to "jdbc:" & Database.Properties.Type & "://" & Database.Properties.Host & ":" & Database.Properties.Port & "/" & Database.Properties.Name.]

4.1 Context

As described in chapter three, the monitoring approach is based on the comparison between produced events of monitored applications and pattern instances of the TOSCA management plan. Those parameters form the context of the patterns. We distinguish two kinds of events in our context: state events and transaction events. Definitions 4.3 and 4.4 formalize state events and transaction events in this paper.

DEFINITION 4.3: (STATE EVENT). A state event is a tuple se=(timestamp, app, state), where:
• timestamp is the timestamp of the event creation.
• app is the Node Template id of the monitored application.
• state is the actual state of the application. Only values which are defined in the Node Type of the application by the element Instance States are allowed.
The set of all state events is defined as SES.

DEFINITION 4.4: (TRANSACTION EVENT). A transaction event is a tuple te=(timestamp, st, app, op, prop, value), where:
• timestamp is the timestamp of the event creation.
• st is the Service Template id, which identifies the service the application belongs to.
• app is the Node Template id of the monitored application.
• op describes the operation which was conducted on the application.
The value of op must correspond to one of the values which are defined in the element operation of the Node Type of the application.
• prop describes the property which was changed when the operation was executed. If no property was changed during the operation, prop is null.
• value is the value of the property which was changed. If prop is null, value also has to be null.
The set of all transaction events is defined as TES.

State events and transaction events represent the actual events during a maintenance. The corresponding “to be” events are conditions and activities, which can be derived from a TOSCA management plan. A condition represents a possible transition inscription, whereas activities represent firing sequences.

DEFINITION 4.5: (CONDITION). A condition is a tuple (app, op, prop, zapp, state), where:
• app is the id of the Node Template on which the operation is performed.
• op is the operation which is performed on the Node Template and is restricted in the Node Type of the Node Template.
• prop is the property of the Node Template which is changed during the operation.
• zapp is the id of the Node Template which has to be in a specific state in order to perform the operation.
• state describes in which state zapp has to be.
Let $SM$ be the set of all maintenance plans. The set of all conditions of a maintenance plan is defined as $SC_i$, $i \in SM$. The set of all transition inscriptions of a maintenance plan is defined as $STI_i$, $i \in SM$. The function $F: SC_i \to STI_i$ assigns a transition inscription to each condition.

DEFINITION 4.6: (ACTIVITY). An activity is a tuple a=(st, app, op, prop, value, start, end), where:
• st is the Service Template id, which identifies the service template in the TOSCA management plan.
• app is the id of the Node Template on which the operation is performed.
• op is the operation which is performed on the Node Template and is restricted in the Node Type of the Node Template.
• prop is the property of the Node Template which is changed during the operation.
• value is the value of the property which was changed. If prop is null, value also has to be null.
• start describes when the activity has to start earliest.
• end describes when the activity has to end latest.
Let $SM$ be the set of all maintenance plans. The set of all activities of a maintenance plan is defined as $SA_i$, $i \in SM$. The set of all transitions of a maintenance plan is defined as $ST_i$, $i \in SM$. The function $F: SA_i \to ST_i$ assigns a transition to each activity.

Additionally, for some anti-patterns we need the history of transaction events and the latest state of an application, called the state event history.

DEFINITION 4.7: (TRANSACTION EVENT HISTORY). A transaction event history is a selection $\sigma$ on the set of transaction events which are in the time scope of the scheduled maintenance: TEH ≔ $\sigma$ [...]

    [...] e2=incoming_te[e1.appfar == e2.app and e1.opfar == e2.op and e1.propfar == e2.prop]
    select e1.timestamp, e1.appcur, e1.opcur, e1.propcur, e1.appnex, e1.opnex, e1.propnex, e2.timestamp as timestampfar
    insert into #temp2;

    from #temp2[not((appnex == TEH.app and opnex == TEH.op and propnex == TEH.prop and timestamp < TEH.timestamp and timestampfar > TEH.timestamp) in TEH)]
    select str:concat("The activity ", appnex, ", ", opnex, ", ", propnex, " was not performed after the activity ", appcur, ", ", opcur, ", ", propcur, ".") as message
    insert into error_message;

Footnotes:
1 www.horus.biz
2 https://elastic.co
3 https://nagios.org
4 https://aws.amazon.com/en/cloudwatch/
5 https://wso2.com/products/complex-event-processor/
6 https://github.com/wso2/siddhi

Apart from the modelling component, all components and Siddhi queries are already implemented in a prototype. In the coming months, an evaluation of the whole method will be performed. For this purpose, it is planned to replay common configuration settings as described, for example, in [9], [20].
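The check expressed by the query fragment above can be illustrated outside of a complex event processing engine. The following Python code is only a minimal sketch under stated assumptions, not the prototype's implementation: the class `TransactionEvent` mirrors the tuple of Definition 4.4, while the function name `find_skipped_activities`, the representation of the plan as an ordered list of (app, op, prop) triples and the sample values are hypothetical and chosen for illustration. For every planned triple of consecutive activities (current, next, far), it reports the next activity as not performed if the far activity was observed after the current one without the next activity in between.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TransactionEvent:
    """Transaction event te=(timestamp, st, app, op, prop, value) as in Definition 4.4."""
    timestamp: float
    st: str
    app: str
    op: str
    prop: Optional[str] = None
    value: Optional[str] = None


# Hypothetical, simplified view of a planned activity: (app, op, prop).
ActivityKey = Tuple[str, str, Optional[str]]


def key(te: TransactionEvent) -> ActivityKey:
    """Project an event onto the fields that are compared in the query."""
    return (te.app, te.op, te.prop)


def find_skipped_activities(plan: List[ActivityKey],
                            history: List[TransactionEvent]) -> List[str]:
    """Return one message per planned activity that was skipped.

    For each consecutive triple (cur, nex, far) of the plan, nex counts as
    skipped if far was observed after cur and no nex event lies in between.
    """
    messages = []
    for cur, nex, far in zip(plan, plan[1:], plan[2:]):
        for e1 in history:
            if key(e1) != cur:
                continue
            for e2 in history:
                if key(e2) != far or e2.timestamp <= e1.timestamp:
                    continue
                performed = any(
                    key(e) == nex and e1.timestamp < e.timestamp < e2.timestamp
                    for e in history
                )
                if not performed:
                    messages.append(
                        f"The activity {nex[0]}, {nex[1]}, {nex[2]} was not "
                        f"performed after the activity {cur[0]}, {cur[1]}, {cur[2]}."
                    )
    return messages


# Illustrative plan modelled on the Stop, Configure, Start example of Fig. 3.
plan = [
    ("MyApplication", "Stop", None),
    ("MyApplication", "Configure", "DB Connection"),
    ("MyApplication", "Start", None),
]
# The operator stopped and started the application but forgot to configure it.
history = [
    TransactionEvent(1.0, "st1", "MyApplication", "Stop"),
    TransactionEvent(3.0, "st1", "MyApplication", "Start"),
]
for message in find_skipped_activities(plan, history):
    print(message)
```

Unlike a streaming query, this sketch scans a complete event history at once, which keeps the example short; an engine such as Siddhi would evaluate the same condition incrementally as events arrive.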
Main evaluation criteria will be the time needed to detect operator errors, the ratio of detected to injected operator errors and the false positive rate.

[Fig. 4. Implementation architecture]

6 Related Work

Related work can be separated into different areas. One area is the automation of typical operations like redeployments with integrated error and exception handling, as provided by popular configuration management tools, e.g. Chef [21]. Those tools have the disadvantage that they have only local information for error handling and no global view of the whole maintenance, which could also involve legacy systems [20].

Another area of work is the detection of configuration errors. Those approaches can be divided into rule based methods and online configuration validation [22]. Rule based methods try to avoid configuration errors a priori by correctness checks. These help to detect wrongly planned configurations. However, those approaches do not check whether the configuration operation itself was executed as planned. So, forgotten configurations, e.g. because a server was down, or typos, when the configuration was done manually, cannot be detected.

The work most closely related to ours is that of Xu et al. [20] and Farshchi et al. [23]. Both works describe an approach to monitor sporadic operations in cloud environments. Xu et al. developed a method called “POD-Diagnosis”. They use a process model to detect operator errors through token replay by checking the conformance of observed logs with the pre-built model, and an additional fault tree analysis in order to find the root cause of the error. In contrast to our work, only the control flow of the process is modelled and can therefore be checked. Apart from that, in our approach no additional fault tree has to be built. Farshchi et al. build a regression-based model to find correlations and causalities between events described in logs and observed metrics of resources.
In their approach, assertions are derived from the regression-based model. However, they are also limited to control flow. Additionally, enough learning data is needed, which practically limits their approach to automated cloud environments. Our approach does not need learning data and can therefore also be used to monitor manually executed steps or changes in legacy systems.

7 Conclusion

In this paper, we describe an approach to detect operator errors during the execution of maintenance operations. For this purpose, we define different anti-patterns, which are implemented as complex event processing queries and check, in real time, log entries and state metrics of observed resources against pattern instances of a pre-defined process model. The process model itself is realized as a TOSCA based XML net, combining the modelling of the control flow and the resources. A prototype to check the effectiveness of our approach is currently under construction. In order to evaluate the approach, typical maintenance operations such as the configuration of servers or a rolling upgrade will be performed. During these maintenance operations, typical errors will be injected on purpose. The prototype should be able to detect all those injected errors.

References

[1] H. S. Gunawi et al., “What Bugs Live in the Cloud?: A Study of 3000+ Issues in Cloud Systems,” in Proceedings of the ACM Symposium on Cloud Computing, 2014, pp. 1–14.
[2] S. Hagen, M. Seibold, and A. Kemper, “Efficient verification of IT change operations or: How we could have prevented Amazon’s cloud outage,” presented at the Network Operations and Management Symposium (NOMS), 2012 IEEE, 2012, pp. 368–376.
[3] T. Dumitraş and P. Narasimhan, “Why do upgrades fail and what can we do about it?: toward dependable, online upgrades in enterprise system,” in Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware, 2009, p. 18.
[4] S. Pertet and P.
Narasimhan, “Causes of failure in web applications,” Parallel Data Lab., p. 48, 2005.
[5] D. Oppenheimer, A. Ganapathi, and D. A. Patterson, “Why Do Internet Services Fail, and What Can Be Done About It?,” in Proceedings of the 4th Conference on USENIX Symposium on Internet Technologies and Systems - Volume 4, Berkeley, CA, USA, 2003, pp. 1–1.
[6] D. Scott, “Making smart investments to reduce unplanned downtime,” Tactical Guidel. Res. Note Note TG-07-4033 Gart. Group Stamford CT, 1999.
[7] S. Elliot, “DevOps and the cost of downtime: Fortune 1000 best practice metrics quantified,” Int. Data Corp. IDC, 2014.
[8] A. Vetter, “Detecting Operator Errors in Cloud Maintenance Operations,” in 2016 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 2016, pp. 639–644.
[9] K. Nagaraja, F. Oliveira, R. Bianchini, R. P. Martin, and T. D. Nguyen, “Understanding and Dealing with Operator Mistakes in Internet Services,” in OSDI ’04: 6th Symposium on Operating Systems Design and Implementation, 2004.
[10] Z. Yin, X. Ma, J. Zheng, Y. Zhou, L. N. Bairavasundaram, and S. Pasupathy, “An empirical study on configuration errors in commercial and open source systems,” in Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, 2011, pp. 159–172.
[11] J. L. Peterson, “Petri net theory and the modeling of systems,” 1981.
[12] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design patterns: elements of reusable object-oriented software. Pearson Education, 1994.
[13] W. M. van der Aalst, A. H. Ter Hofstede, B. Kiepuszewski, and A. P. Barros, “Workflow patterns,” Distrib. Parallel Databases, vol. 14, no. 1, pp. 5–51, 2003.
[14] N. Russell, A. H. Ter Hofstede, D. Edmond, and W. M. van der Aalst, “Workflow data patterns,” QUT Technical report, FIT-TR-2004-01, Queensland University of Technology, Brisbane, 2004.
[15] N. Russell, A. H. Ter Hofstede, D. Edmond, and W. M. van der Aalst, “Workflow resource patterns,” 2005.
[16] D.
Riehle and H. Züllighoven, “Understanding and using patterns in software development,” TAPOS, vol. 2, no. 1, pp. 3–13, 1996.
[17] M. B. Dwyer, G. S. Avrunin, and J. C. Corbett, “Property specification patterns for finite-state verification,” in Proceedings of the second workshop on Formal methods in software practice, 1998, pp. 7–15.
[18] W. Van Der Aalst, Process mining: discovery, conformance and enhancement of business processes. Springer Science & Business Media, 2011.
[19] M. Weidlich, J. Mendling, and M. Weske, “Computation of behavioural profiles of process models,” Bus. Process Technol. Hasso Plattner Inst. IT-Syst. Eng. Potsdam, 2009.
[20] X. Xu, L. Zhu, I. Weber, L. Bass, and others, “POD-diagnosis: Error diagnosis of sporadic operations on cloud applications,” in 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2014, pp. 252–263.
[21] Chef, “About Handlers,” 08-Nov-2017. [Online]. Available: https://docs.chef.io/handlers.html.
[22] T. Xu and Y. Zhou, “Systems Approaches to Tackling Configuration Errors: A Survey,” 2014.
[23] M. Farshchi, J.-G. Schneider, I. Weber, and J. Grundy, “Metric selection and anomaly detection for cloud operations using log and metric correlation analysis,” J. Syst. Softw., Mar. 2017.