<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mining Duplicate Tasks from Discovered Processes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Borja Vazquez-Barreiros</string-name>
          <email>borja.vazquez@usc.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Mucientes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Lama</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro de Investigacion en Tecnolox as da Informacion (CiTIUS) Universidade de Santiago de Compostela</institution>
          ,
          <addr-line>Santiago de Compostela</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>78</fpage>
      <lpage>82</lpage>
      <abstract>
        <p>Including duplicate tasks in the mining process is a challenge that hinders the process discovery as algorithms need an extra e ort to nd out which events of the log belong to which transitions. To face this problem, we propose an approach that uses the local information of the log to enhance an already mined model by performing a local search over the potential tasks to be duplicated. This proposal has been validated over 36 di erent solutions, improving the nal model in 35 out of 36 of the cases.</p>
      </abstract>
      <kwd-group>
        <kwd>Process mining</kwd>
        <kwd>process discovery</kwd>
        <kwd>duplicate tasks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The notion of duplicate tasks |or activities| refers to situations in which
multiple tasks in the process have the same label. This kind of behavior is useful
when i) a particular task is used in di erent contexts in a process and ii) to
enhance the comprehensibility of a model. Typically, duplicate tasks are recorded
with the same label in the log and, hence, they hinders the discovery of the
model that better ts the log, as algorithms need an extra e ort to nd out
which events of the log belong to which transitions. There are several techniques
allowing to mine duplicate tasks [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6 ref7">2,3,4,5,6,7</xref>
        ], however, or the heuristics rules
used to detect the duplicate tasks are not su ciently general for all the logs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
or they have to deal with a large search space, increasing the time needed for
these algorithms [
        <xref ref-type="bibr" rid="ref3 ref5 ref6">3,5,6</xref>
        ].
      </p>
      <p>In this paper we present a novel proposal to tackle duplicate tasks. The
proposal starts from an already mined model without duplicate tasks, and uses
the local information of the log and the retrieved process to improve the model
through a local search over the potential duplicate tasks.</p>
    </sec>
    <sec id="sec-2">
      <title>Local search algorithm</title>
      <p>
        Algorithm 1 describes the proposed approach to tackle duplicate tasks. The rst
step is the discovery of the potential duplicate activities. We used the heuristics
de ned in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to reduce the search space by stating that two tasks with the same
      </p>
      <p>// Retrieved by a process discovery technique.</p>
      <p>Algorithm 1: Local search Algorithm.</p>
      <p>
        input: A log L
1 ind0 initial solution(L)
2 potentialDuplicates ;
3 foreach activity t in the log L do
4 if max(min(jt &gt;L t0j; jt0 &gt;L tj); 1) &gt; 1 then
5 potentialDuplicates potentialDuplicates [ t
indbest ind0
potentialDuplicatesL2L = potentialDuplicatesL2L [ t00 where
t00 2= potentialDuplicates and t &gt;L t00
label cannot share the same input and output dependencies. Within this context,
the duplicate tasks are locally identi ed based on the follows relation (&gt;L),
where the upper bound for an activity t is the minimum of the number of tasks
that directly precede t in the log and the number of tasks that directly follow t.
This de nition can be formalized as [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: max(min(jt &gt;L t0j; jt0 &gt;L tj); 1). If for a
task t the upper bound is greater than 1, then t is considered as a potential task
for being duplicated and, hence, it is added to potentialDuplicates (Alg.1:3-5).
      </p>
      <p>After nding the potential duplicates, the algorithm splits the input and
output dependencies of the activities of the model into multiple tasks with the same
label through the function localSearch (Alg.1:7). In this step, the algorithm
calculates the input and output combinations for each activity in potentialDuplicates
(Alg. 1:10-11) through the function CalculateCombinations (Alg.2). Within this
function, the algorithm rst nds all the subsequences in the log L that match
the pattern t1tt2 where t1 2 I (t) and t2 2 O(t) in the model (Alg.2:2) |being
I (t) and O(t) the inputs and outputs, respectively, of t. Then, based on these
Algorithm 2: Algorithm to compute the combinations of a task.
7
8
9
10
11
12
13
14
15
subsequences, the combinations are created following three rules (Alg.2:4-15).
First, given two subsequences t1tt2 and t3tt4, if t1 = t3, then we merge both
subsequences into a new combination (Alg.2:4-7). Later, given two di erent
combinations c and c0, if they share the same output, i.e., c:output = c0:output, these
two combinations are merged (Alg.2:9-11). Finally, if the intersection between
two combinations is not the empty set, we have to record which elements are
shared by both combinations (Alg.2:12-15).</p>
      <p>
        After creating all the possible combinations, for each combination c (Alg.1:12),
the algorithm creates a new task t0 equal to the original activity t of the current
model (Alg.1:13). Then, it removes from I(t) all the tasks shared with c:inputs,
but keeping the tasks that are in c:sharedInputs (Alg.1:14). On the other hand,
for the new task t0, it retains only the elements in I(t0) that are contained in
c:inputs (Alg.1:15). The same process is applied for the outputs of both t and
t0 but with c:outputs and c:sharedOutputs (Alg.1:16-1:17). If both the inputs
and outputs of these tasks are not empty (Alg.1:18), they are included in ind0
(Alg. 1:19). Otherwise the model goes back to its previous state and tries with
a new combination. If the new task is included, the model is repaired (Alg.1:20)
and the unused arcs are removed (Alg.1:21). In order to evaluate the models
(Alg.1:22), we based the quality of a solution on three criteria: tness replay,
precision and simplicity. To measure these criteria we used the hierarchical
metric de ned in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. If the new model is better, the best individual indbest is
replaced with ind0 (Alg.1:26). Otherwise the model goes back to its previous
state and repeats the process with a new combination.
      </p>
      <p>
        The main drawback of the heuristic followed to detect the possible
duplicate tasks of the log (Alg.1:3 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) is that it does not cover all the search space,
particularly with tasks involved in a length-two-loop situation, as it breaks the
rule of two tasks sharing the same input and output dependencies. To solve this,
we have to make all the process iterative: when for a task t, max(min(jt &gt;L
t0j; jt0 &gt;L tj); 1) is greater than 1, i.e, t is detected as a duplicate activity, the
upper bound for all the tasks t0 that directly follow t must be updated, because
these tasks will now have multiple tasks with the same label as input. Hence, if
a task t is correctly duplicated in the model (Alg.1:26), we add the tasks that
directly follow t |and that weren't detected as possible duplicated tasks in the
rst step| into potentialDuplicatesL2L (Alg.1:27). Therefore, the last step of
the algorithm (Alg.1:31) involves a new execution of the function localSearch
(Alg.1:7) but with potentialDuplicatesL2L instead of potentialDuplicates. In
this second and nal execution, the subsequences are obtained from the process
model |note that in the rst execution the subsequences were extracted from
the log. Therefore, the algorithm parses the solution, checking which one of the
activities with the same label t0 2 I(t) were executed just before t and which
activities t00 2 O(t) were executed after t. Finally, it creates the combinations
based on this information.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Experimentation</title>
      <p>
        The validation of the presented approach has been done with several synthetic
logs from [
        <xref ref-type="bibr" rid="ref5 ref7">5,7</xref>
        ]. We used ProDiGen [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and HM [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] over these set of logs to
retrieve the initial solutions. On the other hand, the quality of the models was
measured taking into account three metrics: tness replay (C) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], precision
(P) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and simplicity (S) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] . Table 1 shows the results retrieved before
applying the presented approach |the raw solutions mined with ProDiGen and
HM| and after the local search. Moreover, they show information about which
algorithm retrieves better results for each metric |highlighted in grey| and
which solutions are equal to the original model |highlighted in italics.
      </p>
      <p>After applying our approach over the solutions, the proposed local search
was able to enhance the results in 35 out of 36 of the cases. More speci cally,
the algorithm was able to i) signi cantly improve the precision, and ii) to reduce
the complexity of the di erent models by splitting the behavior of the overly
connected nodes. Furthermore, our approach was able to retrieve the original
model in 25 out of 36 cases.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>We have presented an approach to tackle duplicate tasks in an already discovered
model. Our proposal takes as starting point a model without duplicate tasks and
its respective log, and based on the local information of the log and the causal
dependencies of the input mined model, it improves the comprehensibility of the
solution. The presented approach has been validated with 36 di erent models
with duplicate tasks. Results conclude that this local search is able to detect all
the potential duplicate tasks in the log, and enhance the comprehensibility of
the nal model, by improving its tness replay, precision and simplicity.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported by the Spanish Ministry of Economy and
Competitiveness under project TIN2014-56633-C3-1-R, and the Galician Ministry of
Education under the projects EM2014/012 and CN2012/151.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Adriansyah</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Munoz-Gama</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carmona</surname>
            , J., van Dongen,
            <given-names>B.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          :
          <article-title>Alignment based precision checking</article-title>
          .
          <source>In: BPM</source>
          . (
          <year>2012</year>
          )
          <volume>137</volume>
          {
          <fpage>149</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Broucke</surname>
            ,
            <given-names>S.K.V.</given-names>
          </string-name>
          :
          <article-title>Advances in Process Mining</article-title>
          .
          <source>PhD thesis</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Buijs</surname>
            ,
            <given-names>J.C.A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Dongen</surname>
            ,
            <given-names>B.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          :
          <article-title>Quality dimensions in process discovery: The importance of tness, precision, generalization and simplicity</article-title>
          .
          <source>International Journal of Cooperative Information Systems</source>
          <volume>23</volume>
          (
          <issue>1</issue>
          ) (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Carmona</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cortadella</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kishinevsky</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A region-based algorithm for discovering petri nets from event logs</article-title>
          .
          <source>In: BPM</source>
          . Springer (
          <year>2008</year>
          )
          <volume>358</volume>
          {
          <fpage>373</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>de Medeiros</surname>
          </string-name>
          , A.:
          <article-title>Genetic Process Mining</article-title>
          .
          <source>PhD thesis</source>
          , TU/e (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Goedertier</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martens</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanthienen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baesens</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Robust process discovery with arti cial negative events</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>10</volume>
          (
          <year>2009</year>
          )
          <volume>1305</volume>
          {
          <fpage>1340</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Process mining: Extending -algorithm to mine duplicate tasks in process logs</article-title>
          .
          <source>In: Advances in Web and Network Technologies, and Information Management</source>
          . (
          <year>2007</year>
          )
          <volume>396</volume>
          {
          <fpage>407</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Rozinat</surname>
          </string-name>
          , A.,
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          :
          <article-title>Conformance checking of processes based on monitoring real behavior</article-title>
          .
          <source>Information Systems</source>
          <volume>33</volume>
          (
          <issue>1</issue>
          ) (
          <year>2008</year>
          )
          <volume>64</volume>
          {
          <fpage>95</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Sanchez-Gonzalez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garca</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendling</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruiz</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Piattini: Prediction of business process model quality based on structural metrics</article-title>
          .
          <source>In: Conceptual Modeling ER 2010</source>
          . Volume
          <volume>6412</volume>
          . (
          <year>2010</year>
          )
          <volume>458</volume>
          {
          <fpage>463</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Vazquez-Barreiros</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mucientes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lama</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>ProDiGen: Mining complete, precise and minimal structure process models with a genetic algorithm</article-title>
          .
          <source>Information Sciences</source>
          <volume>294</volume>
          (
          <year>2015</year>
          )
          <volume>315</volume>
          {
          <fpage>333</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Weijters</surname>
          </string-name>
          , A.,
          <string-name>
            <surname>van der Aalst</surname>
          </string-name>
          , W.M.P.,
          <string-name>
            <surname>de Medeiros</surname>
          </string-name>
          , A.:
          <article-title>Process mining with the heuristics miner-algorithm</article-title>
          .
          <source>Technische Universiteit Eindhoven</source>
          <volume>166</volume>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>