<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detecting Operator Errors In Cloud Computing Using Anti-Patterns</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arthur Vetter</string-name>
          <email>arthur.vetter@horus.biz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Horus software GmbH</institution>
          ,
          <addr-line>Ettlingen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>68</fpage>
      <lpage>82</lpage>
      <abstract>
        <p>IT services are subject to several maintenance operations such as upgrades, reconfigurations or redeployments. Monitoring those changes is crucial to detect operator errors, which are a main source of service failures. Another challenge, which exacerbates operator errors, is the increasing frequency of changes, e.g. because of continuous deployment. In this paper, we propose a monitoring approach to detect operator errors in real time by using complex event processing and anti-patterns. The basis of the monitoring approach is a novel business process modelling method that combines TOSCA and Petri nets. This model is used to derive pattern instances, which are the input for a complex event processing engine that checks them against the events generated by the monitored applications.</p>
      </abstract>
      <kwd-group>
        <kwd>Complex Event Processing</kwd>
        <kwd>Anti-Pattern</kwd>
        <kwd>TOSCA</kwd>
        <kwd>IT Service Management</kwd>
        <kwd>Anomaly Detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Operator errors have been one of the major reasons for IT service failures [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]–[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and
will probably remain so given current trends such as continuous delivery, DevOps
and infrastructure-as-code [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In recent years, several studies and methods were
developed to detect errors in very complex IT systems [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Those traditional methods are
suited for detecting errors during “normal” operations, but not during change operations
like reconfigurations or rolling upgrades, when one node after the other is upgraded [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        This paper presents current research results of a novel monitoring approach for those
change operations. The monitoring approach is based on a process model, combining
TOSCA and high-level Petri nets [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which explicitly models the maintenance
operations of the IT service applications. Pattern instances are derived from this process model
and checked by a complex event
processing engine against state events and transaction events. State events describe the
state of the application, whereas transaction events describe each single operation
performed on the application. To obtain the transaction events, the logs of the applications are filtered for
meaningful entries, which are sent to the complex event processing engine, allowing
operator errors to be detected in near real time. The complex event processing
engine compares the pattern instances with the generated events by means of anti-patterns and
creates an error message when an anti-pattern instance is detected. Fig. 1 gives an
overview of the general monitoring approach.
      </p>
      <p>Fig. 1. Overview of the general monitoring approach. Operator actions are performed on the applications (application 1, application 2, application 3, ...) of an IT service. The monitoring application receives transaction events and state events, derives pattern instances from the process model and passes them to the complex event processing engine, which checks for anti-patterns and reports anti-pattern instances as error messages.</p>
      <p>The remainder of this paper is organized as follows: Section two gives a short
overview of typical operator errors. Section three describes the fundamentals of
TOSCA and XML nets, which are used to model the actual maintenance. Section four
describes the concept of patterns and anti-patterns. Section five presents the proof of
concept implementation. Section six presents related work, and section seven
concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>OPERATOR ERRORS</title>
      <p>
        Oppenheimer et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and many other authors like [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] classify operator
errors into process errors and configuration errors. Process errors can be further
divided into the following types: a forgotten activity, an unneeded activity that was executed, a
wrong activity that was executed, or correct activities that were executed in the wrong
order. Configuration errors can be separated into formatting errors and configuration
value errors [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Formatting errors can be further separated into lexical errors, syntactical
errors and typos. Configuration value errors can be further classified into local value
inconsistencies and global environment inconsistencies. A monitoring approach for
operator errors should be able to detect all of these process and configuration error types.
      </p>
      <p>Table I gives an example for every type of operator error and a reference to a study
with further information and examples.</p>
    </sec>
    <sec id="sec-3">
      <title>FUNDAMENTALS</title>
      <p>
        The process model is a combination of TOSCA and XML nets and was introduced in an
earlier paper [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In this section, we briefly describe the fundamentals of TOSCA and XML
nets and then describe how maintenance operations can be modelled with
TOSCA and XML nets.
      </p>
      <sec id="sec-3-1">
        <title>TOSCA</title>
        <p>
          TOSCA (Topology and Orchestration Specification for Cloud Applications) is a
standard, released by OASIS [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] to support the portability of cloud applications between
different cloud providers and the automation of cloud application provisioning.
To this end, TOSCA provides a modelling language to describe cloud applications as Service
Templates. A Service Template consists of a Topology Template and optional Plans.
        </p>
        <p>A Topology Template describes the structure of a cloud application as a directed
graph and consists of Node Templates and Relationship Templates.</p>
        <p>A Node Template represents a component of the cloud application, e.g. an
application server and is described by a Node Type. A Node Type defines
• properties of the component (Properties Definition),
• available operations to manipulate the component (Interfaces),
• requirements of the component (Requirement Definitions),
• possible lifecycle states of the component (Instance States) and
• capabilities it offers to satisfy other components’ requirements (Capability
Definitions).</p>
        <p>Plans are models that orchestrate the management operations offered by
the cloud application components. They can be written in BPMN, BPEL or other
languages. We use the notation of XML nets for the creation of Plans, which we call
“maintenance plans” in the rest of the paper.</p>
      </sec>
      <sec id="sec-3-2">
        <title>XML nets</title>
        <p>
          XML nets [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] are a high-level variant of Petri nets in which places represent
containers for XML documents. The XML documents must conform to the XML Schema
that is assigned to the respective place. Edges are labelled with Filter Schemas, which are
used to read or manipulate XML documents. Transitions can be inscribed with a logical
expression whose variables are contained in the Filter Schemas of the adjacent edges. A transition in an XML
net is enabled and can fire for a given marking when the following three conditions
hold. First, every place in the pre-set of the transition holds at least one valid XML
document that matches the Filter Schema inscribing the edge to the transition.
Second, every place in the post-set of the transition contains a valid XML
document if an XML document has to be modified; if an XML document has to be created
from scratch, the place must not already contain this XML document. Third, for the
given instantiation of the variables, the transition inscription evaluates to true.
If an enabled transition fires, XML documents in
the pre-set places are (partially) deleted or read for the given instantiation of variables,
and new XML documents are created or existing XML documents are modified in the
post-set places of the transition.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Modelling maintenance plans</title>
        <p>
          This section describes the modelling of maintenance plans with TOSCA and XML nets,
which allows applications and the orchestration of their management
operations to be modelled in one integrated model. Such a model can then be used to derive pattern
instances. To this end, we extend our earlier approach, introduced in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The following
adjustments are made to the general definition of TOSCA Node Templates:
• A Node Template represents exactly one instance of an application, i.e. the
attributes minInstances and maxInstances are both set to 1.
• Node Templates are extended with the complex element InstanceState, which stores
the current state of the corresponding application.
        </p>
        <p>The notation of XML nets is adjusted as follows:
• Places are containers for Service Templates. Every place is assigned the general
TOSCA XML Schema and additionally a single Node Type, which restricts the
allowed Filter Schemas for the corresponding Node Templates.
• Transitions represent operations defined in the Interfaces of the adjacent Node Types.
• Filter Schemas can be used either to select Node Templates or to modify Properties
or Instance States of a Node Template. In contrast to general XML nets, deleting whole Node Templates
is not allowed. Node Templates can only change their state, e.g.
to undeployed, but they cannot be deleted. The reason is that, for error detection
purposes, even an undeployed node has to be monitored to make sure that it was really
undeployed and, e.g., has not been deployed again by accident afterwards. Deleting parts
of a Node Template, e.g. a property, is allowed.
• Transitions hold the attributes start and end, which define the earliest and the latest time at which the operation has to
be executed.</p>
        <p>We define a maintenance plan as a tuple <inline-formula><tex-math>MP = \langle P, T, A, \Psi, I_P, I_N, I_A, I_T, M_0 \rangle</tex-math></inline-formula>,
where</p>
        <p>
          (i) <inline-formula><tex-math>\langle P, T, A \rangle</tex-math></inline-formula> is a Petri net with a set of places P, a set of transitions T, and a set of edges A
connecting places and transitions (the definition and description of Petri nets
is omitted in this paper, but can be found, e.g., in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]).
        </p>
        <p>(ii) <inline-formula><tex-math>\Psi = \langle D, FT, PR \rangle</tex-math></inline-formula> is a structure consisting of a finite and non-empty individual
set D, a set of term and formula functions FT defined on D, and a set of predicates PR
defined on D.</p>
        <p>(iii) <inline-formula><tex-math>I_P</tex-math></inline-formula> is the function that assigns the TOSCA XML Schema to each place.
(iv) <inline-formula><tex-math>I_N</tex-math></inline-formula> is the function that additionally assigns a Node Type to each place.
(v) <inline-formula><tex-math>I_A</tex-math></inline-formula> is the function that assigns a Filter Schema to each edge. The Filter Schema
must conform to the XML Schema and Node Type of the adjacent place.</p>
        <p>(vi) <inline-formula><tex-math>I_T</tex-math></inline-formula> is the function that assigns a predicate logical expression as inscription to each
transition. The inscription is built on the given structure Ψ and a set of variables. Only
variables that are contained in the Filter Schemas of adjacent edges are allowed. The
inscription must evaluate to true in order to enable the transition.</p>
        <p>(vii) Each transition represents a value of the element operation, which is defined in
the complex element Interfaces of the Node Type in the post-set of the transition.
(viii) <inline-formula><tex-math>M_0</tex-math></inline-formula> is the initial marking. Markings are TOSCA Service Templates.
(ix) Each transition holds the attributes start and end.</p>
        <p>Fig. 3 shows an example of a maintenance plan to configure the database connection
of the application MyApplication (Filter Schemas are written informally for
readability). MyApplication is hosted on MyAppServer and additionally requires the
database TestDatabase. It is assumed that MyApplication is
started when the change is performed. MyApplication
is one possible marking of the first place, which is linked to the Node Type Application. The first Filter Schema selects MyApplication under the
condition that it is started. Before MyApplication can be configured, it has to be stopped,
which is represented by the first transition. Stopping is one of the operations
offered by the Node Type Application. If
MyApplication is already stopped at the beginning of the change, this indicates that an incident or something unexpected
has happened, so the change execution should be interrupted. Once MyApplication is
stopped, the database connection can be set. For this purpose, the Node Template
TestDatabase is selected, and the database connection is built from the properties of
TestDatabase and inserted into MyApplication through the Filter Schema FS5. Afterwards,
MyApplication can be started again, but only if TestDatabase is running.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Pattern and Anti-Pattern for operator error detection</title>
      <p>
        In computer science, the term pattern has been popular since the publication of the book on
design patterns by Gamma et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In this book, Gamma et al. describe patterns as
solutions for recurring problems in a specific context. Aalst et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] used the concept
of patterns for business process modelling and described several patterns for the control
flow perspective. Since then, many patterns were described for different perspectives
of business process modelling, like for the data perspective [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Riehle and
Züllighoven define a pattern more generally as an abstraction of a recurring concrete form in
a specific context [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. A form is a finite number of distinguishable elements and their
relationships [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. A context restricts the possible usage of a form, because the form
has to fit into this specific context. Based on this definition, we define a pattern and an
anti-pattern as follows:
      </p>
      <p>DEFINITION 4.1: (PATTERN). A pattern is an abstraction of a welcome,
recurring, concrete form in a specific context.</p>
      <p>DEFINITION 4.2: (ANTI-PATTERN). An anti-pattern is an abstraction of an
unwelcome, concrete form in a specific context.</p>
      <p>In our work, we use patterns to describe the planned (to-be) control flow, application
configurations and application states for the scheduled maintenance. Patterns are thus
used during the design phase. Anti-patterns are used during the actual
execution of the maintenance to check whether a form of events exists that does not fit the planned
forms. In the following, we restrict and formalize the context of the patterns and
anti-patterns used, as well as their form.</p>
      <p>Fig. 3. Example maintenance plan for configuring the database connection of MyApplication. The places hold the Node Templates MyApplication (type Application; properties User, Password and DB Connection; requirements Database: MyDatabase and WebServer: MyAppServer; capability WebService: MyWebService) and MyDatabase (type Database; properties Type: MySQL, Host: localhost, Port: 1521, Name: XE; requirement Host: DBServer; Instance State Started). The transition Stop is inscribed with Application.InstanceState.state = Started, the transition Configure with Application.InstanceState.state = Stopped ∧ Database.id = Application.Requirements.Database, and the transition Start with Application.Requirements.Database = Database.id ∧ Database.InstanceState.state = Started ∧ Application.InstanceState.state = Stopped. FS1, FS3 and FS6 select the Application with id = “MyApplication”, FS2 changes Application.InstanceState.state to Stopped, FS4/7 selects the Database, and FS5 changes Application.Properties.DB Connection to the JDBC URL built from the Type, Host, Port and Name properties of the Database (jdbc:mysql://localhost:1521/XE).</p>
      <sec id="sec-4-2">
        <title>Context</title>
        <p>
As described in section three, the monitoring approach is based on the comparison
between the events produced by the monitored applications and the pattern instances of the TOSCA
management plan. These two elements form the context of the patterns. We distinguish two
kinds of events in our context: state events and transaction events. Definitions 4.3 and
4.4 formalize state events and transaction events.</p>
        <p>DEFINITION 4.3: (STATE EVENT). A state event is a tuple se=(timestamp, app,
state), where:
• timestamp is the timestamp of the event creation.
• app is the Node Template id of the monitored application.
• state is the actual state of the application. Only values that are defined
in the Node Type of the application by the element Instance States are allowed.</p>
        <p>The set of all state events is defined as SES.</p>
        <p>DEFINITION 4.4: (TRANSACTION EVENT). A transaction event is a tuple te=
(timestamp, st, app, op, prop, value), where:
• timestamp is the timestamp of the event creation.
• st is the Service Template id, which identifies the service the application belongs to.
• app is the Node Template id of the monitored application.
• op describes the operation that was conducted on the application. The value of op
must correspond to one of the values defined in the element operation of
the Node Type of the application.
• prop describes the property that was changed when the operation was executed.
If no property was changed during the operation, prop is null.
• value is the value of the property that was changed. If prop is null, value also has
to be null.</p>
        <p>The set of all transaction events is defined as TES.</p>
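        <p>Definitions 4.3 and 4.4 can be illustrated with a minimal Python sketch. The class names, the sets ALLOWED_STATES and ALLOWED_OPERATIONS (standing in for the Instance States and operations of a Node Type) and the example values are assumptions made for illustration; they are not part of the proof of concept.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical stand-ins for what a Node Type would define.
ALLOWED_STATES = {"Started", "Stopped"}
ALLOWED_OPERATIONS = {"Start", "Stop", "Configure"}


@dataclass(frozen=True)
class StateEvent:
    """se = (timestamp, app, state), following Definition 4.3."""
    timestamp: datetime
    app: str    # Node Template id of the monitored application
    state: str  # must be one of the Instance States of the Node Type

    def __post_init__(self):
        if self.state not in ALLOWED_STATES:
            raise ValueError(f"unknown Instance State: {self.state!r}")


@dataclass(frozen=True)
class TransactionEvent:
    """te = (timestamp, st, app, op, prop, value), following Definition 4.4."""
    timestamp: datetime
    st: str   # Service Template id
    app: str  # Node Template id
    op: str   # operation defined in the Node Type
    prop: Optional[str] = None
    value: Optional[str] = None

    def __post_init__(self):
        if self.op not in ALLOWED_OPERATIONS:
            raise ValueError(f"unknown operation: {self.op!r}")
        # If no property was changed, value has to be null as well.
        if self.prop is None and self.value is not None:
            raise ValueError("value must be null when prop is null")


te = TransactionEvent(datetime(2017, 1, 1, 10, 0), "MyService", "MyApplication",
                      "Configure", "DB Connection",
                      "jdbc:mysql://localhost:1521/XE")
```
]]></preformat>
        <p>Rejecting malformed events at construction time keeps the later pattern checks free of validity tests; whether the prototype validates events this early is not stated in the paper.</p>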
        <p>State events and transaction events represent the actual events during a maintenance.
The corresponding “to be” events are conditions and activities, which can be derived
from a TOSCA management plan. A condition represents a possible transition
inscription, whereas activities represent firing sequences.</p>
        <p>DEFINITION 4.5: (CONDITION). A condition is a tuple (app, op, prop, zapp, state),
where:
• app is the id of the Node Template on which the operation is performed.
• op is the operation that is performed on the Node Template and is restricted by the
Node Type of the Node Template.
• prop is the property of the Node Template that is changed during the operation.
• zapp is the id of the Node Template that has to be in a specific state in order to
perform the operation.
• state describes the state zapp has to be in.</p>
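        <p>Let SM be the set of all maintenance plans. The set of all conditions of a maintenance
plan is defined as <inline-formula><tex-math>SC_i, i \in SM</tex-math></inline-formula>. The set of all transition inscriptions of a maintenance
plan is defined as <inline-formula><tex-math>TI_i, i \in SM</tex-math></inline-formula>. The function <inline-formula><tex-math>f: SC_i \rightarrow TI_i</tex-math></inline-formula> assigns a transition
inscription to each condition.</p>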
        <p>DEFINITION 4.6: (ACTIVITY). An activity is a tuple a=(st, app, op, prop, value,
start, end), where:
• st is the Service Template id, which identifies the service template in the TOSCA
management plan.
• app is the id of the Node Template on which the operation is performed.
• op is the operation that is performed on the Node Template and is restricted by the
Node Type of the Node Template.
• prop is the property of the Node Template that is changed during the operation.
• value is the value of the property that is changed. If prop is null, value also has
to be null.
• start describes the earliest time at which the activity has to start.
• end describes the latest time at which the activity has to end.</p>
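        <p>Let SM be the set of all maintenance plans. The set of all activities of a maintenance plan
is defined as <inline-formula><tex-math>SA_i, i \in SM</tex-math></inline-formula>. The set of all transitions of a maintenance plan is defined as
<inline-formula><tex-math>T_i, i \in SM</tex-math></inline-formula>. The function <inline-formula><tex-math>g: SA_i \rightarrow T_i</tex-math></inline-formula> assigns a transition to each activity.</p>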
        <p>Additionally, for some anti-patterns we need the history of transaction events and
the latest state of an application called the state event history.</p>
        <p>DEFINITION 4.7: (TRANSACTION EVENT HISTORY). A transaction event
history is a selection σ on the set of transaction events that lie within the time scope of the
scheduled maintenance:
<disp-formula><tex-math>TEH := \sigma_{timestamp \geq maintenance\_start \,\wedge\, timestamp \leq maintenance\_end}(TES)</tex-math></disp-formula></p>
        <p>DEFINITION 4.8: (STATE EVENT HISTORY). The state event history SEH stores
the latest state for each application in SES.</p>
        <p>Furthermore, we define three functions: time, countTE and countA.</p>
        <p>DEFINITION 4.9: (TIME). time is a function that returns the current timestamp.</p>
        <p>DEFINITION 4.10: (COUNTTE). countTE(te, TEH) is a function that counts the
number of occurrences of the transaction event te in the transaction event history.</p>
        <p>DEFINITION 4.11: (COUNTA). countA(a, S) is a function that counts the number
of occurrences of an activity a in a set S.</p>
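        <p>Definitions 4.7 to 4.11 can be sketched in Python as follows. The tuple types and function names are illustrative assumptions; in particular, the sketch assumes that two transaction events count as the same occurrence when they agree in every field except the timestamp, which the definitions leave open.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from typing import NamedTuple, Optional


class StateEvent(NamedTuple):
    timestamp: int
    app: str
    state: str


class TransactionEvent(NamedTuple):
    timestamp: int
    st: str
    app: str
    op: str
    prop: Optional[str] = None
    value: Optional[str] = None


def transaction_event_history(tes, maintenance_start, maintenance_end):
    """TEH: all transaction events within the maintenance time scope."""
    return [te for te in tes
            if maintenance_start <= te.timestamp <= maintenance_end]


def state_event_history(ses):
    """SEH: the latest state per application."""
    latest = {}
    for se in sorted(ses, key=lambda e: e.timestamp):
        latest[se.app] = se.state  # later events overwrite earlier ones
    return latest


def count_te(te, teh):
    """countTE: occurrences of te in TEH, ignoring the timestamp (assumption)."""
    key = te[1:]
    return sum(1 for other in teh if other[1:] == key)


def count_a(a, s):
    """countA: occurrences of activity a in the collection s."""
    return Counter(s)[a]


tes = [
    TransactionEvent(5, "MyService", "MyApplication", "Stop"),
    TransactionEvent(10, "MyService", "MyApplication", "Stop"),
    TransactionEvent(20, "MyService", "MyApplication", "Start"),
    TransactionEvent(99, "MyService", "MyApplication", "Stop"),
]
teh = transaction_event_history(tes, maintenance_start=10, maintenance_end=50)
seh = state_event_history([
    StateEvent(12, "MyApplication", "Stopped"),
    StateEvent(8, "MyApplication", "Started"),
])
```
]]></preformat>
        <p>Integer timestamps are used only to keep the example short; any totally ordered timestamp type would work the same way.</p>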
        <p>Having defined the context, we now describe the patterns and anti-patterns.</p>
        <sec id="sec-4-2-1">
          <title>Pattern and Anti-Pattern</title>
          <p>
            In total, we define ten patterns/anti-patterns in order to detect operator errors. These
are NEXT, IMMEDIATELY NEXT, PRECEDENCE, IMMEDIATELY
PRECEDENCE, OCCURRENCE, ALTERNATIVE OCCURRENCE, ABSENCE,
ALTERNATIVE ABSENCE, VALUE and STATE-CONDITION. The first eight
patterns are strongly influenced by the specification patterns of Dwyer et al. [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] and are used
to detect process errors, whereas the VALUE anti-pattern is used to detect
configuration errors. The STATE-CONDITION anti-pattern is used to check whether a resource is in
the planned state so that a task can be performed on it. The following template is used to describe the patterns and
anti-patterns:
• Name: The name of the pattern must be unique and should describe the purpose of
the pattern.
• Description: Describes the form of the pattern that should occur in the
maintenance.
• Instances: Describes how instances of the pattern can be derived from the
maintenance plan.
• Anti-pattern: A description of the corresponding anti-pattern and of the types of
operator errors that can be detected with it. Additionally, we formalize the
conditions that have to be violated in order to detect an operator error.
• Similar pattern: Here, similar patterns are referenced and differences are named.
Due to space limitations, we only present the three patterns and anti-patterns NEXT,
STATE-CONDITION and VALUE in detail.
          </p>
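          <p>Pattern NEXT
• Description: This pattern describes pairs of activities, defining which activity has to
occur after another. The pattern is used for controlling AND-joins, AND-splits and
concurrent sequences in a maintenance plan.
• Instances: To get all instances of this pattern for a TOSCA management plan i, we
create a relation <inline-formula><tex-math>P1_i := SA_i \times SA_i \times SA_i</tex-math></inline-formula> with the tuples <inline-formula><tex-math>(a_{cur}, a_{nex}, a_{far})</tex-math></inline-formula>, where
─ the corresponding transitions of the activities <inline-formula><tex-math>a_{cur}</tex-math></inline-formula> and <inline-formula><tex-math>a_{nex}</tex-math></inline-formula> are connected through
the same place,
─ <inline-formula><tex-math>a_{cur}</tex-math></inline-formula>, <inline-formula><tex-math>a_{nex}</tex-math></inline-formula> and <inline-formula><tex-math>a_{far}</tex-math></inline-formula> have to occur in the same path,
─ <inline-formula><tex-math>a_{far}</tex-math></inline-formula> always has to occur after <inline-formula><tex-math>a_{cur}</tex-math></inline-formula>,
─ <inline-formula><tex-math>a_{far}</tex-math></inline-formula> and <inline-formula><tex-math>a_{cur}</tex-math></inline-formula> may not be connected through the same place.
• Anti-pattern: The anti-pattern makes it possible to detect operator errors of the type “wrong
order”. It can also detect operator errors of the type “syntactical error” if
a configuration parameter was changed in the wrong order.</p>
          <p>An error message is created when a transaction event <inline-formula><tex-math>te_{cur}</tex-math></inline-formula> occurs in the event stream ES
and none of the next events <inline-formula><tex-math>te_{nex}</tex-math></inline-formula> conforms to the next activity <inline-formula><tex-math>a_{nex}</tex-math></inline-formula>,
while one of the next events conforms to an activity <inline-formula><tex-math>a_{far}</tex-math></inline-formula>:
<disp-formula><tex-math>\begin{gathered}
\pi_{app,op,prop}(te_{cur}) \in \pi_{a_{cur}.app,\, a_{cur}.op,\, a_{cur}.prop}(P1_i) \;\succ \\
\pi_{app,op,prop}(te_{nex}) \notin \pi_{app,op,prop}\bigl(\pi_{a_{nex}}(\sigma_{a_{cur}.app = te_{cur}.app \,\wedge\, a_{cur}.op = te_{cur}.op \,\wedge\, a_{cur}.prop = te_{cur}.prop}(P1_i))\bigr) \\
\wedge \\
\pi_{app,op,prop}(te) \in \pi_{app,op,prop}\bigl(\pi_{a_{far}}(\sigma_{a_{cur}.app = te_{cur}.app \,\wedge\, a_{cur}.op = te_{cur}.op \,\wedge\, a_{cur}.prop = te_{cur}.prop}(P1_i))\bigr)
\end{gathered}</tex-math></disp-formula>
• Similar pattern: The pattern IMMEDIATELY NEXT also makes it possible to detect operator
errors of the type “wrong order”. However, IMMEDIATELY NEXT
would create false error messages for concurrent sequences and can therefore only be used
for non-concurrent activities.</p>
          <p>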
Pattern STATE-CONDITION
• Description: This pattern describes the state an application should have so that
an operation can be performed on either the same or another application. Example:
in order to shut down an application server, the database server must be in the state
offline.
• Instances: Instances of this pattern are all conditions <inline-formula><tex-math>SC_i</tex-math></inline-formula> for a maintenance plan i.
• Anti-pattern: This anti-pattern does not actually detect an error as described in
section 2. Instead, it detects faulty prerequisites, which would lead to an
operator error. This is done by comparing the latest state of an application with the
planned state:
<disp-formula><tex-math>\pi_{app,op,prop}(te) \in \pi_{app,op,prop}(SC_i) \;\wedge\; \pi_{zapp,state}\bigl(\sigma_{app = te.app \,\wedge\, op = te.op \,\wedge\, prop = te.prop}(SC_i)\bigr) \setminus \pi_{app,state}(SEH) \neq \emptyset</tex-math></disp-formula>
• Similar pattern: There are no similar patterns for the STATE-CONDITION pattern.</p>
          <p>
Pattern VALUE
• Description: This pattern describes the value of a configuration parameter that
has to be changed during the maintenance.
• Instances: To get all instances of this pattern, a selection on the set of all activities
of the maintenance plan is performed in order to get only those activities that
include a change of a property: <inline-formula><tex-math>P3_i := \pi_{app,op,prop,value}(\sigma_{prop \neq null}(SA_i))</tex-math></inline-formula>.
• Anti-pattern: This anti-pattern makes it possible to detect operator errors of the types
“wrongly executed activity”, “lexical error”, “local inconsistency”, “global
inconsistency” and “typo” by checking the element value of a transaction event te:
<disp-formula><tex-math>\pi_{app,op,prop}(te) \in \pi_{app,op,prop}(P3_i) \;\wedge\; \pi_{app,op,prop,value}(te) \notin P3_i</tex-math></disp-formula>
• Similar pattern: This pattern can be seen as a more detailed version of the
OCCURRENCE pattern; however, the OCCURRENCE pattern only checks for
executed operations and properties, but not for the actual values of the operations.</p>
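          <p>The VALUE and STATE-CONDITION checks described above can be sketched in Python with plain sets standing in for the relational projections and selections. The tuple types, function names and example values are assumptions made for illustration; the proof of concept expresses these checks as Siddhi queries instead.</p>
          <preformat><![CDATA[
```python
from typing import NamedTuple, Optional


class Activity(NamedTuple):
    st: str
    app: str
    op: str
    prop: Optional[str]
    value: Optional[str]
    start: int
    end: int


class Condition(NamedTuple):
    app: str
    op: str
    prop: Optional[str]
    zapp: str
    state: str


class TransactionEvent(NamedTuple):
    timestamp: int
    st: str
    app: str
    op: str
    prop: Optional[str]
    value: Optional[str]


def value_instances(activities):
    """Instances of VALUE: (app, op, prop, value) of activities that change a property."""
    return {(a.app, a.op, a.prop, a.value) for a in activities if a.prop is not None}


def violates_value(te, p3):
    """True if te changes a planned property but with an unplanned value."""
    planned_keys = {inst[:3] for inst in p3}
    key = (te.app, te.op, te.prop)
    return key in planned_keys and key + (te.value,) not in p3


def violates_state_condition(te, conditions, seh):
    """True if a state required for te's operation is not the latest known state."""
    required = {(c.zapp, c.state) for c in conditions
                if (c.app, c.op, c.prop) == (te.app, te.op, te.prop)}
    return bool(required - set(seh.items()))


activities = [
    Activity("MyService", "MyApplication", "Stop", None, None, 0, 10),
    Activity("MyService", "MyApplication", "Configure", "DB Connection",
             "jdbc:mysql://localhost:1521/XE", 10, 20),
]
conditions = [Condition("MyApplication", "Start", None, "TestDatabase", "Started")]
p3 = value_instances(activities)
typo = TransactionEvent(12, "MyService", "MyApplication", "Configure",
                        "DB Connection", "jdbc:mysql://localhost:1512/XE")
start = TransactionEvent(25, "MyService", "MyApplication", "Start", None, None)
```
]]></preformat>
          <p>In this example the event typo carries a mistyped port and is flagged by violates_value, while start is flagged by violates_state_condition as long as the state event history does not list TestDatabase as Started.</p>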
        </sec>
        <sec id="sec-4-2-2">
          <title>Derivation of pattern instances</title>
          <p>
            Pattern instances can be derived from the maintenance plan by simulating it. For this purpose,
the maintenance plan is marked with the Service Template of the IT
service to be maintained. The resulting simulation log is used to create log-based ordering relations and
footprints as they are used in process mining and described in [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ], [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ]. Based on
these ordering relations and the functions described in section 4.1, all pattern instances
of the maintenance plan can be derived automatically.
          </p>
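          <p>The log-based ordering relations mentioned above can be sketched in Python as a footprint computation in the style used in process mining. The function names, the relation symbols and the example traces are assumptions made for illustration; how the prototype derives the pattern instances from the footprint is not shown here.</p>
          <preformat><![CDATA[
```python
from itertools import product


def directly_follows(log):
    """All pairs (a, b) where b directly follows a in at least one trace."""
    return {(trace[i], trace[i + 1])
            for trace in log
            for i in range(len(trace) - 1)}


def footprint(log):
    """Map every ordered activity pair to one of '->', '<-', '||' or '#'."""
    follows = directly_follows(log)
    activities = sorted({a for trace in log for a in trace})
    fp = {}
    for a, b in product(activities, repeat=2):
        ab, ba = (a, b) in follows, (b, a) in follows
        if ab and ba:
            fp[(a, b)] = "||"  # observed in both orders: concurrent
        elif ab:
            fp[(a, b)] = "->"  # a is directly followed by b
        elif ba:
            fp[(a, b)] = "<-"  # b is directly followed by a
        else:
            fp[(a, b)] = "#"   # never directly adjacent
    return fp


# Hypothetical simulation log of the example maintenance plan.
sequential_log = [("Stop", "Configure", "Start")]
fp = footprint(sequential_log)

# Hypothetical log with two concurrent activities B and C.
concurrent_log = [("A", "B", "C", "D"), ("A", "C", "B", "D")]
fp_concurrent = footprint(concurrent_log)
```
]]></preformat>
          <p>Pairs marked as concurrent are the cases for which a pattern such as NEXT is needed rather than IMMEDIATELY NEXT.</p>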
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Implementation</title>
      <p>The architecture of the proof of concept implementation consists of four main
components. The first component is a modelling component, which is used to model
maintenance plans and to derive the pattern instances of a maintenance plan. The modelling
component is implemented in the software tool Horus<sup>1</sup>, which already supports the modelling of generic
XML nets. The extension of the tool for modelling TOSCA Service Templates and
linking them to an XML net is currently under development.</p>
      <p>The second main component consists of the log agents. Log agents are used to fetch every
new log entry of an application, transform the log entry into the format of a transaction
event and send it to the complex event processing engine. In the proof of concept, the log
agents are implemented with Beats and Logstash<sup>2</sup>. Both products are developed for fast
log data extraction. In addition, Logstash contains a powerful regular expression engine,
which supports the transformation of proprietary log entries into the generic format of
transaction events.</p>
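      <p>The transformation performed by a log agent can be illustrated with a small Python sketch. The log line format, the field names and the function to_transaction_event are hypothetical; the proof of concept performs this step with Beats and Logstash rather than with Python.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical log format:
#   <timestamp> service=<st> app=<app> op=<op> [prop=<prop> value=<value>]
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) service=(?P<st>\S+) app=(?P<app>\S+) op=(?P<op>\S+)"
    r"(?: prop=(?P<prop>\S+) value=(?P<value>\S+))?$"
)


def to_transaction_event(line):
    """Return the transaction event fields as a dict, or None for other entries."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None  # not a meaningful transaction event, so it is filtered out
    # Unmatched optional groups (prop, value) are returned as None.
    return match.groupdict()


configure = to_transaction_event(
    "2017-03-01T10:15:00 service=MyService app=MyApplication op=Configure "
    "prop=DBConnection value=jdbc:mysql://localhost:1521/XE"
)
stop = to_transaction_event(
    "2017-03-01T10:14:00 service=MyService app=MyApplication op=Stop"
)
noise = to_transaction_event("2017-03-01T10:14:30 heartbeat ok")
```
]]></preformat>
      <p>Entries that do not match the pattern are dropped, which corresponds to filtering the logs for meaningful transaction events before they reach the complex event processing engine.</p>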
      <p>The third component is an IT infrastructure monitoring tool such as Nagios<sup>3</sup> or
CloudWatch<sup>4</sup>, which checks the state of an application in order to generate the state
events.</p>
      <p>The fourth component is the complex event processing engine, which checks
incoming state and transaction events against the pattern instances of the maintenance plan.
In the proof of concept, the complex event processing system of WSO2<sup>5</sup> is used. All
anti-patterns are implemented as event queries in the event pattern language Siddhi<sup>6</sup>
and have to be implemented only once. In order to check future maintenance plans,
only the corresponding pattern instances have to be transferred to the complex event
processing system. As an example, the anti-pattern NEXT is implemented as the
following Siddhi query:</p>
      <p>/* Select transaction events that match the current activity of a NEXT pattern instance. */
from te [(app == NEXT.appcur and op == NEXT.opcur
and prop == NEXT.propcur) in NEXT] insert into #temp;
/* Attach the expected next activity and the far activity of the matching instance. */
from #temp as t join NEXT as n on t.app == n.appcur and
t.op == n.opcur and t.prop == n.propcur
select t.timestamp, n.appcur, n.opcur, n.propcur,
n.appnex, n.opnex, n.propnex, n.appfar, n.opfar, n.propfar
insert into #temp1;
/* Wait for a later transaction event that matches the far activity. */
from e1=#temp1 -&gt; e2=incoming_te [e1.appfar == e2.app
and e1.opfar == e2.op and e1.propfar == e2.prop]
select e1.timestamp, e1.appcur, e1.opcur, e1.propcur,
e1.appnex, e1.opnex, e1.propnex, e2.timestamp as
timestampfar insert into #temp2;
/* Report an error if TEH holds no next-activity event between the two timestamps. */
from #temp2 [not((appnex == TEH.app and opnex == TEH.op
and propnex == TEH.prop and timestamp &lt; TEH.timestamp
and timestampfar &gt; TEH.timestamp) in TEH)]
select str:concat("The activity ", appnex, ", ", opnex, ", ",
propnex, " was not performed after the activity ",
appcur, ", ", opcur, ", ", propcur, ".") as message insert
into error_message;</p>
      <p>1 www.horus.biz
2 https://elastic.co
3 https://nagios.org
4 https://aws.amazon.com/en/cloudwatch/
5 https://wso2.com/products/complex-event-processor/
6 https://github.com/wso2/siddhi</p>
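      <p>The logic of the NEXT query can also be read as a small batch check over recorded transaction events. The following Python sketch is only an illustration of that reading and not part of the prototype. It assumes that each pattern instance names a current activity, the activity expected next and a later "far" activity, and that an error is reported when the far activity is observed after the current one without the next activity in between. The class and function names are hypothetical.</p>
      <preformat>
```python
from typing import List, NamedTuple


class Activity(NamedTuple):
    app: str
    op: str
    prop: str


class Event(NamedTuple):
    timestamp: int
    activity: Activity


class NextInstance(NamedTuple):
    cur: Activity  # activity that has been observed
    nex: Activity  # activity expected to follow cur
    far: Activity  # later activity that must not occur before nex


def check_next(instances: List[NextInstance], events: List[Event]) -> List[str]:
    """Return one message per cur ... far pair that has no nex in between."""
    messages = []
    ordered = sorted(events, key=lambda e: e.timestamp)
    for inst in instances:
        for i, start in enumerate(ordered):
            if start.activity != inst.cur:
                continue
            # First later occurrence of the far activity, if any.
            far = next((e for e in ordered[i + 1:] if e.activity == inst.far), None)
            if far is None:
                continue
            # Events strictly between the cur event and the far event.
            between = [e for e in ordered
                       if far.timestamp > e.timestamp > start.timestamp]
            if not any(e.activity == inst.nex for e in between):
                messages.append(
                    "The activity " + ", ".join(inst.nex)
                    + " was not performed after the activity "
                    + ", ".join(inst.cur) + "."
                )
    return messages


if __name__ == "__main__":
    stop = Activity("db", "stop", "v1")
    upgrade = Activity("db", "upgrade", "v2")
    start = Activity("db", "start", "v2")
    plan = [NextInstance(cur=stop, nex=upgrade, far=start)]
    # The upgrade step is missing between stop and start.
    print(check_next(plan, [Event(1, stop), Event(3, start)]))
```
      </preformat>
      <p>Unlike the Siddhi query, this sketch works on a finished list of events instead of a stream. In a complex event processing engine, the same condition is evaluated incrementally as events arrive.</p>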
      <p>Apart from the modelling component, all components and Siddhi queries are already
implemented in a prototype.</p>
      <p>
        In the coming months, the whole method will be evaluated. For this purpose,
it is planned to replay common configuration settings such as those described in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. The main evaluation criteria will be the time needed to detect operator
errors, the ratio of detected to injected operator errors and the false positive rate.
      </p>
      <p>
        Related work can be separated into different areas. One area is the
automation of typical operations such as redeployments with integrated error and exception
handling, as provided by popular configuration management tools, e.g. Chef [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
These tools have the disadvantage that they have only local information for error
handling and no global view of the whole maintenance, which may also involve legacy
systems [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        Another area of work is the detection of configuration errors. These approaches can
be divided into rule-based methods and online configuration validation [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. Rule-based
methods try to avoid configuration errors a priori by means of correctness checks. These checks help to
detect errors in planned configurations. However, these approaches do not check whether
the configuration operation itself was executed as planned. Consequently, forgotten configurations
(e.g. because a server was down) or typos made during manual configuration
cannot be detected.
      </p>
      <p>
        The work most closely related to ours is that of Xu et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and Farshchi et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
Both works describe an approach to monitor sporadic operations in cloud environments.
Xu et al. developed a method called “POD-Diagnosis”. They use a process model to
detect operator errors through token replay, checking the conformance of observed
logs with the pre-built model, and an additional fault tree analysis to find the
root cause of the error. In contrast to our work, their model covers only the control flow of the
process, so only the control flow can be checked. Apart from that, our approach requires no additional
fault tree. Farshchi et al. build a regression-based model to find
correlations and causalities between events described in logs and observed metrics of
resources. In their approach, assertions are derived from the regression-based model.
However, they are also limited to control flow. Additionally, sufficient training data is
needed, which in practice limits their approach to automated cloud environments. Our
approach does not require training data and can therefore also be used to monitor manually
executed steps or changes in legacy systems.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper, we describe an approach to detect operator errors during the execution of
maintenance operations. To this end, we define different anti-patterns, which are
implemented as complex event processing queries and check log entries and state
metrics of observed resources in real time against pattern instances of a pre-defined process model.
The process model itself is realized as a TOSCA-based XML net, combining the
modelling of the control flow and the resources. A prototype to check the effectiveness of
our approach is currently under construction. To evaluate the approach, typical
maintenance operations such as the configuration of servers or a rolling
upgrade will be performed. During these maintenance operations, typical errors will be injected on
purpose. The prototype should be able to detect all of these injected errors.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Gunawi</surname>
          </string-name>
          et al.,
          “
          <article-title>What Bugs Live in the Cloud?: A Study of 3000+ Issues in Cloud Systems</article-title>
          ,” in
          <source>Proceedings of the ACM Symposium on Cloud Computing</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seibold</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Kemper</surname>
          </string-name>
          , “
          <article-title>Efficient verification of IT change operations or: How we could have prevented Amazon's cloud outage</article-title>
          ,” presented at the
          <source>Network Operations and Management Symposium (NOMS), 2012 IEEE</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>368</fpage>
          -
          <lpage>376</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dumitraş</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          , “
          <article-title>Why do upgrades fail and what can we do about it?: toward dependable, online upgrades in enterprise system</article-title>
          ,”
          <source>in Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware</source>
          ,
          <year>2009</year>
          , p.
          <fpage>18</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pertet</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          , “
          <article-title>Causes of failure in web applications</article-title>
          ,”
          <source>Parallel Data Lab.</source>
          , p.
          <fpage>48</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Oppenheimer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ganapathi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Patterson</surname>
          </string-name>
          , “Why Do Internet Services Fail, and What Can Be Done About It?,” in
          <source>Proceedings of the 4th Conference on USENIX Symposium on Internet Technologies and Systems - Volume</source>
          <volume>4</volume>
          , Berkeley, CA, USA,
          <year>2003</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>1</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Scott</surname>
          </string-name>
          , “
          <article-title>Making smart investments to reduce unplanned downtime</article-title>
          ,”
          <source>Tactical Guidel. Res. Note</source>
          TG-07-4033, Gart. Group, Stamford, CT,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Elliot</surname>
          </string-name>
          , “
          <article-title>DevOps and the cost of downtime: Fortune 1000 best practice metrics quantified</article-title>
          ,”
          <source>Int. Data Corp. IDC</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vetter</surname>
          </string-name>
          , “Detecting Operator Errors in Cloud Maintenance Operations,” in
          <source>2016 IEEE International Conference on Cloud Computing Technology and Science (CloudCom)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>639</fpage>
          -
          <lpage>644</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Nagaraja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bianchini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Martin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , “
          <article-title>Understanding and Dealing with Operator Mistakes in Internet Services</article-title>
          ,” in
          <source>OSDI '04: 6th Symposium on Operating Systems Design and Implementation</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          , J. Zheng,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. N.</given-names>
            <surname>Bairavasundaram</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Pasupathy</surname>
          </string-name>
          , “
          <article-title>An empirical study on configuration errors in commercial and open source systems</article-title>
          ,”
          <source>in Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Peterson</surname>
          </string-name>
          , “
          <article-title>Petri net theory and the modeling of systems</article-title>
          ,”
          <year>1981</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Gamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Helm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Vlissides</surname>
          </string-name>
          ,
          <article-title>Design patterns: elements of reusable object-oriented software</article-title>
          .
          <source>Pearson Education</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W. M.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Ter Hofstede</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kiepuszewski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Barros</surname>
          </string-name>
          , “
          <article-title>Workflow patterns</article-title>
          ,”
          <source>Distrib. Parallel Databases</source>
          , vol.
          <volume>14</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>5</fpage>
          -
          <lpage>51</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Ter Hofstede</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Edmond</surname>
          </string-name>
          , and W. M. van der Aalst, “
          <article-title>Workflow data patterns</article-title>
          ,” QUT Technical report, FIT-TR-2004-01, Queensland University of Technology, Brisbane,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Ter Hofstede</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Edmond</surname>
          </string-name>
          , and W. M. van der Aalst, “Workflow resource patterns,”
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Riehle</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Züllighoven</surname>
          </string-name>
          , “
          <article-title>Understanding and using patterns in software development</article-title>
          ,”
          <source>TAPOS</source>
          , vol.
          <volume>2</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>13</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Dwyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Avrunin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Corbett</surname>
          </string-name>
          , “
          <article-title>Property specification patterns for finite-state verification</article-title>
          ,”
          <source>in Proceedings of the second workshop on Formal methods in software practice</source>
          ,
          <year>1998</year>
          , pp.
          <fpage>7</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>W.</given-names>
            <surname>Van Der Aalst</surname>
          </string-name>
          ,
          <article-title>Process mining: discovery, conformance and enhancement of business processes</article-title>
          .
          <source>Springer Science &amp; Business Media</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Weidlich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mendling</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Weske</surname>
          </string-name>
          , “
          <article-title>Computation of behavioural profiles of process models</article-title>
          ,”
          <source>Bus. Process Technol., Hasso Plattner Inst. IT-Syst. Eng., Potsdam</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bass</surname>
          </string-name>
          , and others, “
          <article-title>POD-diagnosis: Error diagnosis of sporadic operations on cloud applications</article-title>
          ,” in
          <source>2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>252</fpage>
          -
          <lpage>263</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Chef</surname>
          </string-name>
          , “About Handlers,” 08-Nov-
          <year>2017</year>
          . [Online]. Available: https://docs.chef.io/handlers.html.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Xu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , “Systems Approaches to Tackling Configuration Errors: A Survey,”
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Farshchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-G.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Weber</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Grundy</surname>
          </string-name>
          , “
          <article-title>Metric selection and anomaly detection for cloud operations using log and metric correlation analysis</article-title>
          ,”
          <source>J. Syst. Softw.</source>
          , Mar.
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>