<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Label-independent feature engineering-based clustering in Public Administration Event Logs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Flavio Corradini</string-name>
          <email>flaviocorradini@unicam.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Caterina Luciani</string-name>
          <email>caterinaluciani@unicam.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Morichetta</string-name>
          <email>andreamorichetta@unicam.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Piangerelli</string-name>
          <email>marcopiangerelli@unicam.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Polini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Label-independent clustering, Process Mining, Complexity</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Camerino</institution>
          ,
          <addr-line>Via Andrea D'Accorso, 16, 62032 Camerino</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>Process mining algorithms infer business models by analyzing Log files derived from the execution of business activities in organizations. In this paper, a label-independent clustering methodology is proposed. It allows an analysis completely agnostic with respect the nature and domain knowledge of the process Logs. The methodology is totally data-driven and it is based on features that do not depend on activity labels and do not need model extraction at all, thus not requiring the four quality dimensions of a mining discovery algorithm to be satisfied. Due to its independence from asset labels, the methodology is very flexible and applicable in diferent scenarios. The methodology was tested on the process logs of a municipality of twenty thousand inhabitants showing good performances when evaluated using a mining discovering algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        All organisations, be they governmental, non-profit or corporate, are characterised by processes
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. As pointed out by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], process analysis can be used to achieve standardisation: for example,
in the case of merging and acquisitions of companies, it may be important to recognise similar
processes and give a common description of them in terms of models. However, the processes
are not always modelled or the model does not necessarily correspond to the actual process
performed. Process mining [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a technique that allows the process models to be derived from
the logs produced by information systems. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], process mining techniques are used to visually
compare two logs. This methodology has been applied by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to compare public administration
logs. The authors showed that an activity is only present in logs acquired after 2017, perhaps as
a result of a new law coming into force. However, the approach has two limitations: only two
logs can be compared at a time and the methodology requires homogeneity between the labels
of the two logs to be efective. Particularly the latter, is a stringent requirement in reality, where
the labels may be diferent either because of recording errors, or because they are modelled with
diferent names [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In the context of model comparison, the problem of labels was addressed
nEvelop-O
and several solutions were found, based on measures of syntactic and semantic similarity [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the authors highlight that there is still a lack of methods for comparing logs underlying
models with activities from diferent universes. The authors propose a methodology of log
comparison that does not take into account labels, comparing the temporal distribution of
activities in the two logs. However, the timestamp is not always available in logs and the
authors do not develop a clustering methodology. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] the authors develop a log clustering
methodology that strictly depends on labels limiting its applicability to logs containing similar
labels. Here we present a label-independent clustering methodology, which allows an analysis
agnostic to the nature and domain knowledge of the process log collection. The methodology
is based on feature engineering that does not depend on activity labels and does not require
model extraction in any way. This results in a fully data-driven approach, which does not need
to take into account the four quality dimensions of a mining discovery algorithm [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Due to its
independence from asset labels, the methodology is applicable both in the case of clustering
collections of logs representing variants of the same process and in the case of collections
referring to several processes, even diferent in scope. It is therefore a flexible instrument, ready
to respond to the needs of the process analyst. The methodology was tested on the process
logs of a municipality of twenty thousand inhabitants. The rest of the paper is organised as
follows. Section 2 introduces the proposed data-driven methodology, while Section 3 illustrates
the main results applied to real logs. Finally, Section 4 concludes the paper by touching upon
directions for future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>
        The aim of this work is to propose a very simple and completely data-driven, label independent
methodology for clustering Logs in an agnostic way, complementary to process mining. In
order to do that we followed the general guidelines for performing data analysis that were
developed during the Da.Re. project [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. According to this methodology, the archetypal data
lifecycle is a four-step process: 1) searching for relevant data, which consists in providing clean
and useful data; 2) data preparation or feature engineering, in which data are modified and
treated in such a way to be “good and ready” for the analysis; 3) data analysis, consisting in
the definition of the most suitable learning algorithm and 4) data visualization, for presenting
them in an impacting way. In our methodology, starting from Logs, we extracted features from
data and then we used them to setting up an unsupervised clustering algorithm. The features
were selected to be label-independent, or agnostic, i.e. those features do not depend on activity
names. The approach is fully data-driven: logs can be compared without having to extrapolate
a model and without having to take into account the four quality dimensions of the discovery
algorithms. The minimum requirement for a Log to be eligible for process mining is that each
event must be both case/activity-related [11]. The same assumption has been made for feature
extraction, thus creating a common starting point for the two approaches to be compared. The
methodology is based on 4 main features: the Lempel-Ziv complexity of a Log (the number
of distinct substrings that can be found in it Lempel, (1976) [12]), the average number of the
predecessors of activities that are not “start” activities ( ), the average number of successors of
activities that are not “end” activities ( ), and, finally, the number of unique activities in the
()
||||||||
Considering the sequence   
Log. The presented features are selected to be the minimal set useful for describing Logs and
balancing global (Lempel-Ziv) and local ( and  ) information, following the Occam’s razor
principle to not add complexity.
      </p>
      <p>Lempel-Ziv Complexity. The Lempel-Ziv complexity (LZC) is the number of distinct
substrings which can be found in a string [12]. Actually, given a sequence of length 
containing the symbols from an alphabet Ω, the idea is to parse the sequence into a number
of distinct strings, by considering as a new word any subsequence never met before.
, the final LZ algorithm output is:
, with a complexity () = 9 . To perform the LZC
algorithm, all Log traces were merged into a single sequence, and a delimiter was inserted between
each trace. The LZ algorithm was modified so that only substrings between two delimiters
were detected. This was necessary to prevent the detection of non-existent behaviour, and to
limit the maximum length of detected substrings, which is otherwise potentially infinite and
completely dependent on the number of log traces and their sorting.</p>
      <p>Unique activities. Unique activities,  , are defined as the total number of distinct activities in
the Log. Considering, for example, the following Log: ℒ 1 = , , 
then we
can compute,  (ℒ 1) = 5</p>
      <p>and  . For each activity in the sequence Log, a predecessor and successor set is derived.
Considering the Log ℒ 1 two sets are generated: successors =  ∶ , ,  ∶ , ,  ∶ , , ,  ∶,  ∶
predecessors =  ∶,  ∶,  ∶ , ,  ∶ , ,  ∶ , 
. Instead, with  / we indicate the average
number of direct distinct predecessors/successors not considering the starting and ending
activities. Formally they are defined as  =</p>
      <p>∑=1  ⋅</p>
      <p>
        ,  =
∑=1  ⋅   where  = number of activities
of the Log without start (end) events,  = number of Unique activities of the Log and  and 
are respectively the number of diferent predecessors and successors of the activity  . Hence,
taking into account successors and predecessors sets,  = 7/5 and  = 6/5 . Clustering. Clusters
are homogeneous groups of objects that are grouped on the base of a similarity criterion, i.e.
a similarity function, in such a way that the similarity among objects belonging to the same
cluster is greater than the one among objects belonging to diferent clusters. K-Means algorithm,
equipped with a Euclidean similarity function, was used for clustering; the optimal number of
clusters was assessed by using the Silhouette Index. Process clustering is a well-established
problem in business process management. In particular, the authors in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] develop a
methodology based on ad-hoc similarity that allows the identification of variants in a large collection of
logs. However, the methodology is based on the assumption that the activity labels belong to
the same universe and are therefore comparable across diferent logs.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>In order to show the feasibility of the approach, we applied the described methodology to a
collection of execution Logs provided by a European company. Those Logs refer to public
administration processes of a municipality of about twenty thousand inhabitants. The whole
dataset is composed of about 40 logs. After an inspection of the Logs (preprocessing), 29 Logs,
which met the minimum quality requirements for analysis, were used. In addition, only the
minimum requirements, i.e. each activity is linked to an id case and a unique indicator of
process instantiation, have been extracted from Logs for the feature engineering. Identified
features were calculated and normalised using the min-max scaling method. Cluster analysis
identifies 12 clusters, with a Silhouette Index of 0.54. The clusters seem to group processes with
comparable control flows, e.g. the cluster in Figures 1, 2 3 and 4. The processes in Figure 1 show
linear control flow and have more activities than the models, also linear, in Figure 2. Figure 3
shows that also complex processes with numerous activities and non-linear control flow are
detected by our methodology.</p>
      <p>Clustering was also able to discern models with the same number of activities but diferent
control flow. In Figure 4a, a three-activity model and a four-activity model are clustered together.
In fact, both have the same number of end events and therefore similar control flow. This is due
to an emergent property of the two features  and  : if  cannot reveal the part of the model
preceding a start activity,  will be able to reveal the contribution of the arc and vice versa. The
conjunction of the two features is also able to take into account the contribution of multiple
starts and ends. Model b in Fig. 4 also has four activities, but there are two loops, which is why
it is clustered alone.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and future works</title>
      <p>In this paper, an agnostic clustering methodology of process logs was presented, based on
features that do not depend on activity labels. The methodology, which is completely
datadriven, is applicable both in the analysis of collections of processes with the same purpose
(variability analysis) and in the analysis of processes with diferent purposes. The methodology
proposed should be considered complementary to process mining, since it is suitable to manage
and analyze big amount of logs coming from diferent organizations, overcoming the problems
related to the syntactic and semantic comparison of labels. The methodology was applied to
clustering processes (with diferent purposes) of a European municipality of twenty thousand
inhabitants. The logs were then discovered with process mining techniques and the elements
of each group were compared to each other. The identified clusters highlight the potential of
the methodology, which can be applied to a wide range of datasets. In the future, we aim to
apply it to other known datasets or application fields [ 13], to evaluate further features, while
maintaining the simplicity criterion adopted in the selection that was made in this work. Finally,
further investigation on topology-based clustering approaches, could be of interest [14, 15].
M. Amador, in: Big data: Business, technology, education, and science: Big data (ubiquity
symposium), Ubiquity 2018, 2018, p. 1–13.
[11] W. van der Aalst, Process mining: data science in action, volume 2, Springer, Heidelberg,
2016.
[12] A. Lempel, J. Ziv, On the complexity of finite sequences, IEEE Transactions on information
theory 22 (1976) 75–81.
[13] F. Corradini, F. Marcantoni, A. Morichetta, A. Polini, B. Re, M. Sampaolo, Enabling auditing
of smart contracts through process mining, in: From Software Engineering to Formal
Methods and Tools, and Back, Springer, Cham, 2019, p. 467–480.
[14] M. Rucco, A. Mamuye, M. Piangerelli, M. Quadrini, L. Tesei, E. Merelli, Survey of topdrim
applications of topological data analysis, in: CEUR Workshop Proceedings, volume 1748,
RWTH Aachen University, Aachen, Germany, 2016, p. 1814.
[15] M. Piangerelli, S. Maestri, E. Merelli, Visualising 2-simplex formation in metabolic reactions,
Journal of Molecular Graphics and Modelling 97 (2020) 107576.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dumas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mendling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Reijers</surname>
          </string-name>
          ,
          <article-title>Fundamentals of business process management</article-title>
          , volume
          <volume>1</volume>
          , Springer, Heidelberg,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schoknecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Thaler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fettke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oberweis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Laue</surname>
          </string-name>
          ,
          <article-title>Similarity of business process models-a state-of-the-art analysis</article-title>
          ,
          <source>ACM Computing Surveys (CSUR</source>
          <volume>50</volume>
          (
          <year>2017</year>
          )
          <fpage>1</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>W. van der Aalst</surname>
          </string-name>
          ,
          <article-title>Process mining: Overview and opportunities</article-title>
          ,
          <source>ACM Transactions on Management Information Systems (TMIS</source>
          <volume>3</volume>
          (
          <year>2012</year>
          )
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bolt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Leoni</surname>
          </string-name>
          , W. van der Aalst,
          <article-title>A visual approach to spot statistically-significant diferences in event logs based on process metrics</article-title>
          ,
          <source>in: International Conference on Advanced Information Systems Engineering</source>
          , Springer, Cham,
          <year>2016</year>
          , p.
          <fpage>151</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Corradini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Luciani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Morichetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polini</surname>
          </string-name>
          ,
          <article-title>Process variance analysis and configuration in the public administration sector</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Suriadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Andrews</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hofstede</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wynn</surname>
          </string-name>
          ,
          <article-title>Event log imperfection patterns for process mining: Towards a systematic approach to cleaning event logs</article-title>
          ,
          <source>Information systems 64</source>
          (
          <year>2017</year>
          )
          <fpage>132</fpage>
          -
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Richter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zellner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Azaiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Winkel</surname>
          </string-name>
          , T. Seidl,
          <article-title>Liproma: label-independent process matching</article-title>
          ,
          <source>in: International Conference on Business Process Management</source>
          , Springer, Cham,
          <year>2019</year>
          , p.
          <fpage>186</fpage>
          -
          <lpage>198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Corradini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Luciani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Morichetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Piangerelli</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Polini,  −
          <article-title>: A dissimilarity measure for public administration process logs</article-title>
          ,
          <source>in: International Conference on Electronic Government</source>
          , Springer, Cham,
          <year>2021</year>
          , p.
          <fpage>301</fpage>
          -
          <lpage>314</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Buijs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dongen</surname>
          </string-name>
          , W. van der Aalst,
          <article-title>On the role of fitness, precision, generalization and simplicity in process discovery</article-title>
          , in: OTM Confederated International Conferences”
          <article-title>On the Move to Meaningful Internet Systems”</article-title>
          , Springer, Berlin, Heidelberg,
          <year>2012</year>
          , p.
          <fpage>305</fpage>
          -
          <lpage>322</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , L. Tesei,
          <string-name>
            <given-names>M.</given-names>
            <surname>Piangerelli</surname>
          </string-name>
          , E. Merelli,
          <string-name>
            <given-names>R.</given-names>
            <surname>Paci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Stojanovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Leitão</surname>
          </string-name>
          , J. Barbosa,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>