<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Distributed Methods for CBR Systems for Monitoring Business Process Workflows</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ioannis Agorgianitis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miltos Petridis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stelios Kapetanakis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrew Fish</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computing, Engineering and Mathematical Sciences, University of Brighton</institution>
          ,
          <addr-line>Watts Building, Lewes Road, Brighton, BN2 4GJ</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <fpage>122</fpage>
      <lpage>131</lpage>
      <abstract>
        <p>This paper evaluates the potential for distributed business process workflow monitoring and management using the CBR paradigm. Recent developments in distributed computing technologies have shown the potential for efficiency gains from effective distribution. Current models of CBR distribution are discussed and an extension is proposed focusing on the challenge posed by large data volumes. This challenge is shown to be associated with further challenges in data quality and the requirement for real-time processing. A novel architecture for the distribution of CBR systems is proposed. The approach and architecture are evaluated through a set of experiments in the area of business process workflow management. The experiments establish a serial execution baseline and show promising speedup gains, especially at large volumes (exceeding millions) of cases. It is shown that at high enough data volumes there is a clear benefit in distributing some of the early parts of the CBR life cycle at a finer level of data and process granularity. Such an approach seems to maximise the benefit from the use of modern distribution technologies. In conclusion, this paper signposts future areas of research leading to a more generic model that can maximise the efficiency gains of distribution in CBR systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Case-Based Reasoning</kwd>
        <kwd>Distributed Architectures</kwd>
        <kwd>Distributed Case-based Reasoning</kwd>
        <kwd>Business Process Workflows</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Modern industrial environments can be complex, incorporating multiple interrelated
business processes. Increasingly, within organisations, business processes are captured,
monitored and controlled by enterprise software systems. A side effect emerging in
such organisations is the generation, propagation and utilisation of large datasets on a
daily basis. The complexity and interoperability of business processes is responsible
for the migration of standalone and “isolated” bespoke systems to large scale,
distributed ones, sometimes utilising hundreds of thousands of physical computational nodes.</p>
      <p>
        Advanced systems currently exist, which are capable of performing continuous
process monitoring and control, like Distributed Control Systems (DCSs). Such systems
can be responsible for storing, controlling, visualizing and analysing huge amounts of
data related to internal business processes in real time. However, such systems usually
rely upon human monitoring in order to reason upon workflow executions and provide
corrective actions whenever they are required [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Human intervention can introduce high levels of uncertainty due to possible malicious user behaviour, the regular occurrence of human errors, and the absence of experts and/or stakeholders at key business process execution stages. Additionally, such human interventions may not be fully captured by systems, and actions can be based on further communication and/or information exchange outside digital business process monitoring systems.</p>
      <p>
        Recent advances in information systems technologies have provided new
capabilities for data management and exploitation of data, especially based on modern
enterprise architectures and cloud computing. The literature shows a number of successful
implementations related to the diagnosis and monitoring of business workflows using
the CBR paradigm [
        <xref ref-type="bibr" rid="ref14 ref16 ref2">2, 14, 16</xref>
        ]. However, current attempts invariably focus on
small-scale systems, leaving unknowns in terms of scaling and distribution, a frequent
prerequisite for modern industrial implementations.
      </p>
      <p>This work investigates the issues of scalability and distribution in CBR applications specialised in business workflow monitoring. It is structured as follows: Section 2 presents the relevant literature in intelligent business process management; Section 3 discusses the key challenges of distribution, presents possible models to address them and proposes an operations-efficient architecture; and Section 4 presents an evaluation of its validity and performance. Finally, Section 5 presents the conclusions from this work and proposes further work in this area.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        Workflows and business processes are two interrelated concepts. Nowadays business
process definitions have been standardized to a large extent using industry-accepted
standards like the Business Process Model and Notation (BPMN) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the XML Process
Definition Language (XPDL) [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] and the Web Services Business Process Execution
Language (WS-BPEL), an executable language standard introduced by OASIS for
specifying the behavioural aspect of business processes utilising Web Services [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Case-based reasoning [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] has proven to be an efficient mechanism for monitoring
business process workflows [
        <xref ref-type="bibr" rid="ref1 ref14 ref16 ref2">1, 2, 14, 16</xref>
        ] and is potentially a competent model for
tackling uncertainty and fuzziness, making use of past experiences along with extensive
domain knowledge and expertise. Minor et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] have presented a CBR approach for
representation and index-based retrieval of agile workflows, Dijkman et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] have
shown a process model ranking against a repository of process models and Kapetanakis
et al. [
        <xref ref-type="bibr" rid="ref16 ref2">2, 16</xref>
        ] have proposed a generic architecture and framework, for intelligent
monitoring of business workflows using CBR.
      </p>
      <p>
        However, over the past years great advances have been made in the area of highly
distributed systems. Examples include the HTCondor system, which provides multi-scheme utilisation
and management of resources, including exploitation of the processing power of idle machines
and multiple installations and collaborations of HTCondor instances [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], the
Berkeley Open Infrastructure for Network Computing (BOINC) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], which utilizes device
idle time, and general projects for commodity machine utilization like Apache Hadoop
and Spark [
        <xref ref-type="bibr" rid="ref18 ref20 ref21">18, 20, 21</xref>
        ].
      </p>
      <p>
        Plaza and McGinty [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] have proposed a classification for distributed CBR systems,
based on the magnitude of knowledge and processing of data. This work suggested that
the CBR knowledge content for distribution purposes is correlated to the number of
case bases present. Although most CBR systems use a single case base,
multiple case bases increasingly appear in systems due to the complexity and
multi-provenance of data that can be seen in modern enterprise systems [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The integration and verification of such distribution techniques is becoming more and more challenging since it requires increased research effort and labour. Our research work focuses on the identification of a potentially effective approach and architecture for distributed CBR-based implementations for large-scale business process workflow monitoring and management.</p>
    </sec>
    <sec id="sec-3">
      <title>An Enhanced Categorisation of Distributed CBR systems</title>
      <p>
        Our approach is driven by the current state of technology for data-related
operations and, additionally, by the maximisation of operational utilisation across large
data volumes. Our proposed solution to the data volume issue comprises four main
characteristics, building on top of the existing classification of distributed CBR systems
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]: ground-up distribution; agent capability to process large amounts of data; a single
case base distributed on demand; and distributed processing units managed by
agents. These enhance distributed CBR systems [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] with the additional aspect of the data
volume. Apart from the extra dimension of data volume, the initial classification
remains the same.
      </p>
      <p>
        It is worth mentioning at this point that a number of issues could arise when
implementing distribution in a ground-up manner in cases where the algorithms used are
optimised for serial execution (such as graph-based approaches to business process workflow
management and monitoring [
        <xref ref-type="bibr" rid="ref14 ref15 ref16 ref2">2,14,15,16</xref>
        ]). In such cases, alternative algorithms
may be used, or the distribution model may need to be adapted to provide efficiencies
using the most appropriate type and architecture for distribution.
      </p>
      <sec id="sec-3-1">
        <title>3.1 The Distribution Lifecycle of CBR in Business Process Workflows</title>
        <p>Based on our proposed new distributed CBR component for data volume handling, this study proceeds by presenting the areas in which distribution could take place, along with the CBR cycle operations to be conducted at each stage.</p>
        <p>The data for the CBR system could come in various forms (Figure 1), such as raw text data, distributed raw text data (e.g. HDFS), document-based data (e.g. MongoDB, Elastic), and so on. For their distribution to remote processing units (agents), one agent is selected as Coordinator, which may control data dispatch, data redistribution, etc.</p>
        <p>
          Having distributed the case base (raw) data across various nodes based on its volume,
the system proceeds by loading the actual cases of the case base. As already discussed,
business process workflow cases would normally be represented using graph-based
formats [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The process of graph initialisation is a computationally intensive task and,
consequently, a distributed approach could improve performance and avoid bottlenecks
in the cycle execution. The indexing of the data may be implemented in various stages.
First, an initial indexing mechanism should occur at the case creation stage, which is
beyond the current phase of the cycle. Moreover, post-indexing could be applied at case
loading time and/or during the similarity computations. Either approach requires
advanced computations and processing power, which is why indexing is also performed in a
distributed manner.
        </p>
        <p>[Figure 1: architecture for the distributed CBR lifecycle. Labels recoverable from the diagram: data sources Raw Data (e.g. text files), Distributed Raw Data (e.g. HDFS) and Document Data (e.g. Elastic); ETL; Uniform Raw Data; Case Creation; New Case; Coordinator; Case Base Raw Data; Distributed Case Base Raw Data; Split Raw Data; Loaded Case Base; Graphs; Worker; Indexing; Similarities; Distributed Computations; Pre-Processing; Case Representation; Case Retrieval; Case Reuse; Case Revision; Case Retainment; steps numbered 1 to 6.]</p>
        <p>
          The case retrieval phase of the CBR cycle computes suitable similarity measures in
order to select the most similar case. The complexity of the computed similarities varies
depending on the algorithmic approach and techniques used in the process.
Moreover, it is related to the case representation, which in business process workflows is
usually quite complex (graphs). In many cases, graph-based similarity computation
is reported to be an NP-hard problem that requires heuristics
and is only computable due to the relatively small size and complexity of the graphs [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
As a result, the distribution of processing can reduce performance bottlenecks within
the system. Figure 1 shows an architecture based on this approach that allows the
distribution of key elements of the CBR lifecycle for business workflow monitoring
systems. This approach and architecture have been adopted and implemented on a number
of workflow monitoring systems in order to evaluate them. The
experiments and results of the evaluation are presented in the next section.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>
        For the evaluation of this work we focused on verifying the proposed CBR
categorisation with respect to the data volume aspect. The experimental part involved two
distinct phases. First, a similarity algorithm was developed, specifically designed for
isomorphic and acyclic graphs [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Then, a number of experiments were conducted on
a specific workflow monitoring domain. The experimental runs involved a wide range
of case numbers, from 20 to 10<sup>6</sup>.
      </p>
      <sec id="sec-4-1">
        <title>4.1 The CBR Domain of Business Process Workflow Monitoring</title>
        <p>For the needs of our evaluation, a business workflow monitoring system was chosen from the retail industry, involving new order generation, order preparation, passing through the various departments and, finally, the dispatch and delivery of the goods. Its business process definition can be seen in Figure 2. The domain knowledge was acquired through past working experience within the described domain throughout all departments of the workflow lifecycle. A key characteristic of this system was the strictly defined time frames between the various actions of the described workflow as presented in the BPMN.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 The Links Isomorphic Graph Similarity Algorithm</title>
        <p>The selected business process produces large data volumes, leading to case bases with millions of cases. The domain in question had very large numbers of workflow instances (necessary for our evaluation purposes) as well as a fine-grained business process which could be easily represented in a graph-based format. The latter was important since the element under investigation was not the complexity of the graph representation but the effect of an increased number of stored cases.</p>
        <p>
          The business process (Figure 2) was composed of a fixed number of available
actions, all of which should be present in any instance in order to have a “completed” workflow
instance. Moreover, each action should take place in a given order,
which means that actions were connected with each other in a specific way. For
example, it was impossible to have an “order dispatched” before an “order generation”.
As a result, workflow instances of the selected BPMN could be regarded as isomorphic
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], given that the number of actions (nodes) and the way that actions were
connected (links) were known, fixed and unchanged. Workflow instances of the
experiments were known to be isomorphic and acyclic. In this respect, a similarity algorithm
could be developed capable of measuring similarities among isomorphic graphs. The idea
behind the proposed algorithm is that, since all
workflow instances have the same number of nodes and the nodes are
connected in the same way, the similarity between two given graphs can be calculated
by measuring the distances between the corresponding links of those graphs.
        </p>
        <p>[Figure 2: BPMN definition of the retail order workflow. Lanes recoverable from the diagram: Orders Log System, Invoice Department, Small Goods Department, Tobaccos Department, Delivery Department. Activities and events: Start Order; Generate Invoice; Record new Invoice Generation; Record Invoice dispatch to Small Goods; Invoice Small Goods Packaging; Record Invoice dispatch to Tobaccos; Invoice Tobaccos Packaging; Record Packaged Order arrival to Delivery; Order Loading to Tracks; Record Start of Delivery; Order Dispatch; Record Order Arrival to Retail Point; Order Received; Logging Datastorage.]</p>
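        <p>The link-distance idea described in this section can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each link carries a single numeric value (here, minutes elapsed between the two connected actions), that the per-link distance is the absolute difference of those values, and that distances are capped and normalised by a hypothetical <monospace>max_distance</monospace> parameter. The action names and values are illustrative only.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, Tuple

Link = Tuple[str, str]  # (source action, target action)


def link_similarity(a: Dict[Link, float], b: Dict[Link, float],
                    max_distance: float) -> float:
    """Return a similarity in [0, 1] for two graphs that share the same links."""
    if a.keys() != b.keys():
        raise ValueError("graphs are not isomorphic: link sets differ")
    if not a:
        return 1.0  # two empty graphs are trivially identical
    total = 0.0
    for link, value in a.items():
        # Per-link distance, capped so one outlier cannot exceed a full mismatch.
        distance = min(abs(value - b[link]), max_distance)
        total += distance / max_distance
    return 1.0 - total / len(a)


# Hypothetical link values: minutes elapsed between two connected actions.
case_a = {("generate_invoice", "packaging"): 30.0,
          ("packaging", "order_dispatch"): 60.0}
case_b = {("generate_invoice", "packaging"): 45.0,
          ("packaging", "order_dispatch"): 60.0}

print(link_similarity(case_a, case_b, max_distance=60.0))
```
]]></preformat>
        <p>Under these assumptions each pairwise comparison is linear in the number of links and independent of every other comparison, so comparisons can be evaluated on separate workers.</p>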
      </sec>
      <sec id="sec-4-3">
        <title>4.3 Experimental Design</title>
        <p>The distributed CBR lifecycle was evaluated using 7 auto-generated datasets. Each dataset contained log entries with information related to actions that occurred on workflow instances. Each completed workflow instance represented a new case for our experiments. A typical log entry comprised various comma-separated values, such as case index, event occurrence date, event name (e.g. invoice generation), personnel name, execution time and reasoning for delays, if any.</p>
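        <p>As an illustration of how such log entries might be turned into cases, the sketch below groups comma-separated entries by case index. The column order, the timestamp format, the numeric execution time and the sample rows (including the personnel names) are assumptions made for the example and are not taken from the datasets used in the experiments.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import csv
from collections import defaultdict
from io import StringIO

# Assumed column order, following the fields listed in the text.
FIELDS = ["case_index", "event_date", "event_name",
          "personnel", "execution_time", "delay_reason"]


def group_log_entries(text):
    """Group comma-separated log entries by case index, keeping file order."""
    cases = defaultdict(list)
    for row in csv.DictReader(StringIO(text), fieldnames=FIELDS):
        row["execution_time"] = float(row["execution_time"])
        cases[row["case_index"]].append(row)
    return dict(cases)


# Invented sample entries; an empty last field means no delay was recorded.
sample = (
    "17,2015-07-01 09:00,invoice generation,J. Smith,12.5,\n"
    "17,2015-07-01 09:20,small goods packaging,A. Jones,30.0,staff shortage\n"
    "18,2015-07-01 09:05,invoice generation,J. Smith,10.0,\n"
)
cases = group_log_entries(sample)
print(sorted(cases))
```
]]></preformat>
        <p>Each value in <monospace>cases</monospace> is the ordered list of events of one workflow instance, which is the unit that would subsequently be loaded as a graph.</p>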
        <p>The data generator application was developed in full cooperation with experts from the organization owning the business process definition. It introduced random delays in the execution of actions and provided a random reason each time a delay was introduced within a workflow instance. Following extensive discussion with business stakeholders, a number of guidelines were established in terms of delay occurrences. Delays were classified into 4 main categories (0, 20, 40 and 60%). Business experts then provided reasoning for each delay category.</p>
        <p>A number of business rules were established based on domain knowledge and expertise. As an example, it was known that summer months were hectic due to increased demand and, as a result, there was generally an overall increase in delays (10-30%). This knowledge and expertise were incorporated in the data generator application, which was responsible for assigning random delays to each action along with random reasoning. A fluctuation factor was also applied to the random delays in order to provide an additional level of realism.</p>
        <p>The serial execution experiments used SQL Server as the medium to store the case
base whereas the distributed approach used text files. A delay threshold of 50% was
defined which classified any workflow instance with an execution time above the
threshold as delayed.</p>
        <p>The actual evaluation of the distributed CBR lifecycle was conducted by developing
a basic implementation of a k-NN classification algorithm in order to classify any new
case. The classification was delivered by a simple voting mechanism of the k most
similar cases. There were 3 categories of experimental runs.</p>
        <p>The first was the serial execution path, where case loading, retrieval, adaptation and classification were performed by a conventional serial program written in Python with the case base stored in MS SQL Server 2014.</p>
        <p>The two additional categories of experimental runs are a materialisation of the proposed distributed CBR lifecycle, where case loading, retrieval, adaptation and classification take place in a distributed manner. The first distributed implementation performs exactly the same steps as the serial execution run in order to classify the new case coming into the system, whereas the second is an optimised version of the first that reduces the number of data structures used throughout the CBR phases.</p>
        <p>All experimental runs were conducted on the same machine with 8 cores at 2.4 GHz
and 16 GB RAM (DDR3 1066 MHz).</p>
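        <p>The k-NN classification with simple voting described in this section could look roughly like the following serial sketch. The function names, the toy similarity measure (which stands in for a graph similarity) and the small case base are hypothetical and serve only to show the voting step.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def knn_classify(new_case, case_base, similarity, k=3):
    """Classify new_case by simple voting among the k most similar stored cases."""
    ranked = sorted(case_base,
                    key=lambda item: similarity(new_case, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def toy_similarity(a, b):
    # Placeholder measure on total execution time; not a graph similarity.
    return 1.0 / (1.0 + abs(a - b))


# Hypothetical case base: (total execution time, label).
case_base = [(100.0, "on time"), (105.0, "on time"), (98.0, "on time"),
             (160.0, "delayed"), (170.0, "delayed")]

print(knn_classify(165.0, case_base, toy_similarity))
print(knn_classify(101.0, case_base, toy_similarity))
```
]]></preformat>
        <p>The sketch ranks the whole case base with a full sort, which is the step whose cost grows with the number of stored cases.</p>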
      </sec>
      <sec id="sec-4-4">
        <title>4.4 Experimental Results</title>
        <p>
          The experimental phase was composed of 21 experimental runs of the CBR cycle. The
initial 7 followed the serial execution approach, which was also used as the baseline
measure. Then, there were two batches of 7 runs, one for the initial distributed approach and
one for an optimised version of the first.
          <table-wrap id="tab1">
            <label>Table 1</label>
            <caption><p>CBR cycle execution time in seconds (only the rows for the two largest case bases are recoverable).</p></caption>
            <table>
              <thead>
                <tr><th>Stored cases</th><th>Serial execution</th><th>Distributed</th><th>Optimised distributed</th></tr>
              </thead>
              <tbody>
                <tr><td>5 × 10<sup>6</sup></td><td>1847.182</td><td>278.475</td><td>73.432</td></tr>
                <tr><td>10<sup>7</sup></td><td>2370.668</td><td>595.346</td><td>140.628</td></tr>
              </tbody>
            </table>
          </table-wrap>
Serial execution times for case bases larger than 10<sup>6</sup> stored cases start to increase
considerably (Table 1). The increase in question is
strictly related to the underlying sorting algorithm. The sorting algorithm used in all
cases was Timsort, whose implementation varies between the serial and distributed
approaches. Altering the algorithm in question is beyond the scope of this work;
the experiments rely on the Python and PySpark implementations [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ].
        </p>
        <p>As indicated in Table 1, both distributed lifecycle approaches outperform the serial execution, especially for large numbers of stored cases. The first attempt begins to provide performance gains (speedup &gt; 1) at around 10<sup>6</sup> stored cases, whereas the second one does so at some point after 100,000. This is because distribution always comes with overheads related to the orchestration and setup of the distribution itself.</p>
        <p>
          In terms of speedup [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], the optimised distributed approach achieved a
speedup of about 25, which is extremely high (Figure 3). Nonetheless, we have to
consider that this version of the code not only uses distribution for case loading,
similarity computations, indexing and retrieval but also changes the implementation logic
by reducing the data structures used throughout the CBR cycle. Moreover, it is
important to note that the serial and distributed approaches use two
completely different storage media. The former used a relational database,
whereas the latter used text files with degradation of data. The use of text files by itself
increased performance, since text storage is faster in retrieval operations, all the more so in
our case, in which data partitioning takes place in both distributed approaches.
        </p>
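        <p>The speedup figures can be checked from the execution times reported in this section for 5 × 10<sup>6</sup> and 10<sup>7</sup> stored cases, taking speedup as serial time divided by distributed time. The sketch assumes that the three timings per row are, in order, the serial, initial distributed and optimised distributed runs; that ordering is inferred from the reported speedup of about 25 and is not stated explicitly.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def speedup(serial_seconds, distributed_seconds):
    """Speedup of a distributed run relative to the serial baseline."""
    return serial_seconds / distributed_seconds


# Execution times in seconds: (serial, distributed, optimised distributed).
# The column assignment is an assumption, see the accompanying text.
timings = {
    "5e6": (1847.182, 278.475, 73.432),
    "1e7": (2370.668, 595.346, 140.628),
}

for stored_cases, (serial, dist, optimised) in timings.items():
    print(stored_cases,
          round(speedup(serial, dist), 2),
          round(speedup(serial, optimised), 2))
```
]]></preformat>
        <p>With these inputs the optimised approach reaches a speedup of roughly 25 at 5 × 10<sup>6</sup> cases and roughly 17 at 10<sup>7</sup> cases, which matches the drop in speedup at the largest case base.</p>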
        <p>Finally, the speedup decreases for very large numbers of stored cases (Table 2, 10<sup>7</sup> stored cases experiment). This is strictly related to the data partitioning approach used in both distributed approaches. The partitioning of the data, and as a result the processing, was performed in a fixed way based on the dataset size. This is not the optimal scenario, and further research and development should take place in the domain of dynamic, data-size-aware partitioning algorithms for large datasets. This issue is compounded by the fact that we ran our experiments with a fixed number of computational cores.</p>
        <p>[Figure: CBR Cycle Execution Time plotted against Number of Stored Cases. Recoverable labels: “Serial Execution (Python &amp; MS SQL Server)”, “100.000 Batches”.]</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and future work</title>
      <p>This research has argued that the current perspective on distribution in CBR does not take into consideration the importance of data volume as a key prerequisite in the integration of distribution in CBR for business process workflows. We proposed a new categorisation of distributed CBR systems with a new dimension, that of data volume. Our research shows that gains from distribution can be found in the parts of the CBR cycle where data volume could generate deficiencies. In this respect, the initial CBR stages can be massively distributed so as to enhance the performance of CBR systems. An approach and associated architecture were proposed, and a number of experiments were conducted to evaluate them. The experimental results show that distribution does increase the performance of the CBR cycle, reaching high speedup factors, but only at large numbers of stored cases. This is because current technology and available frameworks allow the development of highly optimized serial executions, such as ones with extensive in-memory exploitation, as used in our experiments to establish baseline measures.</p>
      <p>Further research could address the distribution techniques and approaches to be used in case representation and case retrieval. The current approach focuses on data volume; further work could focus on the other associated challenges posed by large volumes of data. This work aims at the production of a more generic distribution model and architecture. Additionally, it is expected that more research areas and challenges will emerge in this area, including distributed data and process pipelines and dynamic, data-size-aware partitioning algorithms for large datasets.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kapetanakis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petridis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knight</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bacon</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>A CBR Approach for the Monitoring of Business Workflows</article-title>
          ,
          <source>18th Int Conf on CBR, ICCBR 2010</source>
          , Alessandria, Italy, LNAI (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kapetanakis</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petridis</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bacon</surname>
            <given-names>L.</given-names>
          </string-name>
          , “
          <article-title>Providing explanations for the intelligent monitoring of business workflows using case-based reasoning</article-title>
          ”,
          <year>2010</year>
          , Proceedings of the Fifth International Workshop on Explanation-aware Computing,
          <source>ExaCt</source>
          <year>2010</year>
          , Lisbon, Portugal
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Teodorescu</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petridis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          : “
          <article-title>An agent based framework for Multiple, Heterogeneous Case Based Reasoning</article-title>
          ”, in
          <source>Proceedings of ICCBR2013</source>
          , Saratoga Springs, NY, Springer LNAI,
          <year>2013</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Object Management Group:
          <source>BPMN Version 2.0: OMG Specifications</source>
          , January,
          <year>2011</year>
          , http://www.omg.org/spec/BPMN/2.0/, accessed August 2015
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Workflow Management Coalition:
          <source>Workflow Standard, Process Definition Interface, XML Process Definition Language, Version 2.2</source>
          , August,
          <year>2012</year>
          , http://www.xpdl.org/standards/xpdl2.2/XPDL%202.2%20(2012-08-30).pdf, accessed August 2015
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <source>XML Process Definition Language (XPDL): A standard of the Workflow Management Coalition (WfMC)</source>
          , http://www.xpdl.org/index.html, accessed August 2015
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. OASIS:
          <source>Web Services Business Process Execution Language, Version 2.0</source>
          , April
          <year>2007</year>
          , http://docs.oasis-open.org/wsbpel/2.0/OS/wsbpel-v2.0-OS.html, accessed August 2015
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Aamodt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Plaza</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , “
          <article-title>Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches</article-title>
          ”, March
          <year>1994</year>
          ,
          <source>Artificial Intelligence Communications</source>
          , Vol.
          <volume>7</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>52</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Python.org, Mercurial Repositories: listsort, https://hg.python.org/cpython/file/default/Objects/listsort.txt, accessed March 2016
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Databricks:
          <article-title>Spark the fastest open source engine for sorting a petabyte</article-title>
          , October
          <year>2014</year>
          ,
          <string-name>
            <given-names>Reynold</given-names>
            <surname>Xin</surname>
          </string-name>
          , https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html, accessed March 2016
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Ruohonen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          “
          <source>Graph Theory</source>
          ”,
          <year>2008</year>
          , Tampere University of Technology, Chapter 1, Section 5
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>El-Nashar</surname>
            ,
            <given-names>A. I.</given-names>
          </string-name>
          “
          <article-title>To parallelize or not to parallelize, speed up issue</article-title>
          ”,
          <year>2011</year>
          ,
          <source>International Journal of Distributed and Parallel Systems (IJDPS)</source>
          , Vol.
          <volume>2</volume>
          , No.
          <issue>2</issue>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Plaza</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>McGinty</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , “
          <article-title>Distributed case-based reasoning</article-title>
          ”,
          <year>2006</year>
          ,
          <source>Knowledge Engineering Review</source>
          , Vol.
          <volume>20</volume>
          :
          <issue>3</issue>
          , pp.
          <fpage>261</fpage>
          -
          <lpage>265</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Minor</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tartakovski</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bergmann</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , “
          <article-title>Representation and structure-based similarity assessment for Agile workflows</article-title>
          ”,
          <year>2007</year>
          ,
          <string-name>
            <surname>Weber</surname>
            ,
            <given-names>R.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richter</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          (eds.)
          <source>CBR Research and Development, Proceedings of the 7th Int Conf on CBR, ICCBR 2007</source>
          , Belfast. LNAI, vol.
          <volume>4626</volume>
          , pp.
          <fpage>224</fpage>
          -
          <lpage>238</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Dijkman</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Banuelos</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , “
          <article-title>Graph matching algorithms for business process model similarity search</article-title>
          ”,
          <year>2009</year>
          ,
          <string-name>
            <surname>Dayal</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eder</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (eds.),
          <source>Proceedings of the 7th International Conference on Business Process Management</source>
          . LNCS, vol.
          <volume>5701</volume>
          , pp.
          <fpage>48</fpage>
          -
          <lpage>63</lpage>
          . Springer, Berlin
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Kapetanakis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petridis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , “
          <article-title>Evaluating a Case-Based Reasoning Architecture for the Intelligent Monitoring of Business Workflows</article-title>
          ”,
          <year>2014</year>
          ,
          <source>Successful Case-based Reasoning Applications-2, Studies in Computational Intelligence</source>
          <volume>494</volume>
          , Springer-Verlag Berlin Heidelberg
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Thain</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tannenbaum</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Livny</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          “
          <article-title>Distributed Computing in Practice: The Condor Experience</article-title>
          ”,
          <year>2005</year>
          ,
          <source>Concurrency and Computation: Practice and Experience</source>
          , Vol.
          <volume>17</volume>
          , No.
          <issue>2-4</issue>
          , pp.
          <fpage>323</fpage>
          -
          <lpage>356</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Zaharia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chowdhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shenker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoica</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          “
          <article-title>Spark: Cluster Computing with Working Sets</article-title>
          ”,
          <source>Proceedings of the 2nd USENIX conference</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>10</fpage>
          -
          <lpage>10</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          , “
          <article-title>BOINC: A System for Public-Resource Computing and Storage</article-title>
          ”,
          <year>2004</year>
          ,
          <source>Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20. Spark:
          <source>Lightning-fast cluster computing</source>
          , http://spark.apache.org/, accessed August 2015
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Zaharia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chowdhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dave</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shenker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Stoica</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , “
          <article-title>Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing</article-title>
          ”,
          <year>2012</year>
          ,
          <source>Proceedings of the NSDI</source>
          , pp.
          <fpage>2</fpage>
          -
          <lpage>2</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>