<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Emails Analysis for Business Process Discovery</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nassim LAGA</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marwa ELLEUCH</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Walid GAALOUL</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oumaima ALAOUI ISMAILI</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Orange Labs</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Telecom SudParis</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>54</fpage>
      <lpage>70</lpage>
      <abstract>
        <p>Most often, process mining consists of discovering models of actual processes from structured event logs. However, some business processes (BP), or at least some parts of them, are not necessarily supported by an information system (IS), and consequently do not leave any structured event log. Applying traditional process mining techniques would therefore yield only a partial view of such processes. In such situations, process actors often rely on communication tools to collaboratively execute their business processes. However, given the unstructured nature of the traces left by communication tools, process mining techniques cannot be applied directly; it is first necessary to generate structured event logs by recognizing the process-related items (activities, actors, instances, etc.). In this paper, we address this challenge in order to mine business processes from email exchange traces. We introduce an approach that minimizes the effort users spend managing the growing volume of exchanged emails: it enables them to collaboratively and gradually build an annotated corpus of messages, and it automatically classifies these messages into process, instance and activity IDs using machine learning techniques. Compared to related work, we facilitate the task of obtaining annotated datasets and we investigate the use of email exchange histories, correspondents, references and named entities for building clustering and classification features. The proposed approach is evaluated through a proof of concept and experiments on an email dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Process mining</kwd>
        <kwd>Business process management</kwd>
        <kwd>Clustering</kwd>
        <kwd>Supervised learning</kwd>
        <kwd>Named entities</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Process mining consists in extracting useful knowledge from event logs
generated by a variety of software tools hosted in Information Systems (IS) [
        <xref ref-type="bibr" rid="ref21 ref23">21,23</xref>
        ].
It enables business experts to discover new processes, new practices, as well
as BP limitations. However, most process mining techniques assume that: (1) event logs have a
structured format, and (2) the business process is entirely executed in the IS,
so that event logs contain the trace of all executed business tasks.
In practice, several business activities may be carried out through informal means,
such as communication tools (e.g. email exchange, instant messaging). As a consequence,
the traces of these activities are absent from traditional event logs and, in
addition, are often unstructured. An important task when mining processes from
unstructured log data is therefore to convert the data
into structured event logs that are compatible with the available process mining
techniques. Constructing a structured event log from the
unstructured trace of a given communication channel consists mainly in recognizing
the process-related items (process name, activity and instance).
      </p>
      <p>
        Until recently, only a few studies have properly addressed the recognition
of all of this process-related information. Most of them focus on unstructured email
exchange traces and mine business processes by using learning [
        <xref ref-type="bibr" rid="ref10 ref17 ref4 ref7 ref8 ref9">8,4,17,10,7,9</xref>
        ] or
pattern matching [
        <xref ref-type="bibr" rid="ref1 ref11">11,1</xref>
        ] techniques. These proposals suffer from the following
limitations. First, they require considerable human intervention in the form of
time-consuming tasks. In pattern matching based approaches, patterns are
defined manually. In supervised learning based approaches, a huge
human effort is required to build a training dataset: in most cases, a large amount
of unorganized and unlabeled data has to be annotated manually. Even if
unsupervised learning techniques can be introduced as a possible alternative that avoids
preparing such a corpus [
        <xref ref-type="bibr" rid="ref8 ref9">8,9</xref>
        ], some manual tasks are still needed:
labeling or correcting the generated clusters, and tuning the
parameters of parametric algorithms such as k-means and hierarchical clustering.
Second, they tend to generate unreliable business process models. This can
result from: (1) relying only on clustering techniques, which is error prone
[
        <xref ref-type="bibr" rid="ref8 ref9">8,9</xref>
        ], and (2) using features that are not discriminative enough to recognize some
business process related information [
        <xref ref-type="bibr" rid="ref10 ref17 ref4 ref7">4,17,10,7</xref>
        ]. Discriminative features are the
most relevant variables for clustering, and they depend on the type of
knowledge that we want to extract. In the case of process mining, existing works
mostly exploit the entire content of the unstructured email data (subject
or email body) to build some of their learning features [
        <xref ref-type="bibr" rid="ref12 ref8 ref9">12,8,9</xref>
        ]. These features
are then used for all kinds of BP knowledge extraction tasks. Typically, the
textual data of emails may contain key terms that help to separate emails
according to their BP, for example. However, emails that belong to the same
BP but to different instances are likely to share the same BP
key terms, which means that using their entire text probably increases the
confusion between instance clusters. On the other hand, named entities (e.g. person and
company names) and references (e.g. customer reference, product reference)
probably differ from one instance to another even within the same BP,
which means that they can contribute considerably to separating emails
into instances.
      </p>
      <p>In this paper, we address these challenges in order to mine business processes
from email logs. We consider that human intervention is often required to
generate reliable business process models, but we aim to minimize it. We therefore propose
an approach that enables users to collaboratively and gradually build an
annotated corpus of messages and that automatically classifies these messages into process,
instance and activity IDs using machine learning techniques. Our proposal differs
from existing works by: (1) reducing the human effort required to obtain an
annotated dataset through a collaborative and progressive learning approach;
(2) investigating the use of the email exchange history, the correspondent
entities of its participants, and the references and named entities found in the email body for building
clustering and classification features; and (3) applying a fast and non-parametric
clustering algorithm for process instance detection.</p>
      <p>The rest of the paper is organized as follows. Section 2 gives
an overview of our contributions, the overall algorithm, and the functional entities.
Section 3 details them. In Section 4, we validate the proposals through a
proof of concept and its application to detect some processes (e.g. a hiring process
and a patent application process) taken from real data. We discuss the related
work in Section 5 and conclude with a summary and perspectives in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Overview</title>
      <p>Our proposal combines classification and clustering techniques for mining
process models from email exchange traces. To achieve this task, a structured event
log that contains process, activity and process instance labels must be
generated. In this paper, we assume that each email is related to one activity, one
process, and one instance. Our approach is summarized in Figure 1 and is a
sequential combination of the following steps:</p>
      <p>Step 1: Process and activity label generation: This step is initially done
manually and collaboratively by users. Then, a predictive model is trained
gradually on the resulting annotated data to recognize process and activity
names. The classification features used in the training phase are
built from the following email parts: subject, content, exchange history and
correspondent entities of the email participants. Once the predictions are
reliable enough, process and activity labels are generated
automatically by the resulting predictive models.</p>
      <sec id="sec-2-1">
        <title>Step 2: Process instance detection</title>
        <p>The purpose of this step is to detect
the process instance related to each email. A clustering algorithm is applied at
this step. The distance matrix it uses is defined as a weighted sum of sub-distances
related to the following email parts: (1) time, (2) correspondent entities of the email
participants, and (3) content and subject reduced to references and named
entities.</p>
        <p>Step 3: Event log generation: The goal is to generate time-ordered
per-process event sets. Each event represents an email with its timestamp and its
activity, process and instance ID labels.</p>
        <p>Step 4: Process model discovery: Any process discovery algorithm can be
applied here to mine the business process models.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Approach</title>
      <sec id="sec-3-1">
        <title>Step 1: Process and activity labels generation</title>
        <p>The goal of this step is to associate process and activity labels with new incoming
emails. More formally, it is a function F : EM → P × A where EM denotes a set
of emails e, P denotes a set of process names and A denotes a set of activities.</p>
        <p>Fig. 1. Process Mining From Email Logs: Main Steps</p>
        <p>Algorithm 1 Process and activity labeling of one incoming email</p>
        <preformat>1: procedure tagEmail(e, aEL, pEL, pET, emC, erC, clfs, minB)
       ▷ e: the email to be annotated
       ▷ aEL: the list of annotated emails
       ▷ pEL: the pseudo event log
       ▷ pET: the error rate threshold
       ▷ emC: the email count since model train (it is initialized outside the procedure)
       ▷ erC: the errors count since model train (it is initialized outside the procedure)
       ▷ clfs: the classifiers
       ▷ minB: the minimum size of the new data to train new models</preformat>
        <p>This approach consists in waiting until a batch of new observations has been obtained
and then training the already existing models on this whole batch. The pseudocode
of this step is described in Algorithm 1.</p>
        <p>First, the process and activity names associated with each email e are
predicted using the existing models (line 8). Then, if the user provides manual
annotations (process and activity names), which can happen when the annotations are
missing or wrongly predicted, the error rate is checked: if it is higher than the
threshold (pET) and a minimum amount of data has been collected (minB), the
models (process and activity classifiers) are trained again (lines 10→16). Finally, the predicted, or
corrected, process and activity names associated with the email are saved in the
pseudo event log (pEL) dataset (line 20).</p>
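        <p>The labeling loop described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the toy KeywordClassifier, the state dictionary that carries emC and erC between calls, the manual argument, and the choice of comparing emC with minB are all assumptions of this sketch.</p>
        <preformat><![CDATA[
```python
from collections import Counter, defaultdict


class KeywordClassifier:
    """Toy stand-in for a trained model: votes with the labels seen per subject word."""

    def __init__(self):
        self.votes = defaultdict(Counter)

    def fit(self, emails, labels):
        self.votes.clear()
        for email, label in zip(emails, labels):
            for word in email["subject"].lower().split():
                self.votes[word][label] += 1

    def predict(self, email):
        tally = Counter()
        for word in email["subject"].lower().split():
            tally.update(self.votes.get(word, {}))
        return tally.most_common(1)[0][0] if tally else None


def tag_email(e, aEL, pEL, pET, state, clfs, minB, manual=None):
    """Label one email and retrain when the error rate since the last training exceeds pET.

    state holds emC (emails since last training) and erC (errors since then).
    manual is an optional (process, activity) pair supplied by the user.
    """
    predicted = (clfs["process"].predict(e), clfs["activity"].predict(e))
    state["emC"] += 1
    final = predicted
    if manual is not None:
        final = manual
        aEL.append((e, manual))
        if manual != predicted:
            state["erC"] += 1
        error_rate = state["erC"] / state["emC"]
        # Assumption: the batch size is measured with the email counter emC.
        if error_rate > pET and state["emC"] >= minB:
            emails = [email for email, _ in aEL]
            clfs["process"].fit(emails, [label[0] for _, label in aEL])
            clfs["activity"].fit(emails, [label[1] for _, label in aEL])
            state["emC"] = state["erC"] = 0
    # The predicted or corrected labels always end up in the pseudo event log.
    pEL.append({"email": e, "process": final[0], "activity": final[1]})
    return final
```
]]></preformat>
        <p>In this sketch, a correction by the user both enriches the annotated list and counts as an error, so a freshly deployed model is retrained as soon as enough emails have been seen.</p>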
        <p>To learn or update predictive models, we need a training
dataset converted into an (X, Y) pair, where X is a matrix whose rows hold
the feature values of each sample and Y is a vector of the
corresponding targets. The following paragraphs detail our
classification features and the preprocessing steps applied to generate the matrix X.</p>
        <p>a) Defining classification features: We select
the attributes used to build our classification features according to the
prediction task. To predict activities, we use the correspondent entities of the
email participants (FROM, TOs, and CCs) and the email subject (SUB), and we
add the content because the correspondent and subject parts alone are not precise
enough to recognize activities. We
also handle the case where the email contains one short sentence. This type
of email can be sent, for example, to confirm or deny what has been said in
previous emails. In this case, we also use the email exchange history (HistExch),
because the business goal of such emails is hard to understand
without analyzing the history of the discussion. To recognize business
processes, we use only the correspondent entities and the subject, because
we noticed that the content degrades the recognition accuracy when it is
introduced with the same preprocessing steps as in the activity prediction phase.</p>
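        <p>The attribute selection above can be illustrated with a minimal Python sketch. It assumes emails are dictionaries with the hypothetical keys subject and content, and it takes as defaults the values reported in the validation section (short emails have fewer than 40 words, and the history is built from the four previous emails). Correspondent entities are left out here because they enter the model through the presence matrix rather than through text.</p>
        <preformat><![CDATA[
```python
def activity_text(email, history, short_limit=40, history_size=4):
    """Text used for activity prediction: subject, content and, for short emails, recent history."""
    parts = [email["subject"], email["content"]]
    if len(email["content"].split()) < short_limit:
        # Short email: add the content of the most recent emails of the discussion.
        parts.extend(h["content"] for h in history[-history_size:])
    return " ".join(parts)


def process_text(email):
    """Text used for process prediction: the subject only."""
    return email["subject"]
```
]]></preformat>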
        <p>b) Defining preprocessing steps: We summarize these steps as follows:</p>
        <p>Replace Particular Expressions: Using regular expressions, we detect
particular expressions within the subject (and the body in the case of activity
prediction), e.g. HTTP links, phone numbers, special references if known, and file
names with their extension, and replace them with a special tag (e.g.
HTTP LINK, PHONE NUMBER, IS REF, and PPT FILENAME).</p>
        <p>Remove Person Names: We detect all the correspondents’ first names and
last names, and then remove them from the textual data.</p>
        <p>Lemmatize The Text: This method reduces the dimension of the resulting
matrix (Eq (2)). It consists in reducing the different forms of a word to a single
form (e.g. the words “thinks” and “thinking” are reduced to the single form “think”).</p>
        <p>Remove Stop Words: This function removes the most common
words in the emails, which can be distracting and non-informative. We used the
NLTK library (https://www.nltk.org), enriched with additional words detected during the data
exploration step (e.g. regards, hello, Outlook prefixes, etc.).</p>
        <p>Generate 1-gram, 2-gram Vocabulary: We first split the remaining text into
a list of words and detect the different sequential combinations in which each item
has a size of 1 to 2 words (1-gram, 2-gram). Then, we remove the most and the
least frequent terms to improve the generalization of our predictive models. In fact,
sparse terms can generate wrong associations, and overly common words do not
carry relevant information to differentiate between email intents.</p>
        <p>Generate TFIDF (Term frequency-inverse document frequency): This function
generates a TFIDF matrix which encodes the frequency of 1-gram and 2-gram
terms in an email with respect to the rest of the corpus. The size of this matrix is
(N, T), where N is the total number of emails and T is the total number
of different 1-gram and 2-gram terms.</p>
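        <p>The preprocessing chain above can be approximated with the Python standard library alone, as in the sketch below. It is an illustration under several assumptions: the regular expressions, the tag spellings HTTP_LINK and PHONE_NUMBER, the tiny stop-word list and the plain tf × log(N/df) weighting are choices of this sketch, and lemmatization and person-name removal are omitted because they need external resources.</p>
        <preformat><![CDATA[
```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "to", "for", "of", "hello", "regards"}  # illustrative subset
TAGS = {"HTTP_LINK", "PHONE_NUMBER"}


def normalize(text):
    """Replace particular expressions with tags, lowercase words, drop stop words."""
    text = re.sub(r"https?://\S+", " HTTP_LINK ", text)
    text = re.sub(r"\+?\d[\d .-]{7,}\d", " PHONE_NUMBER ", text)
    tokens = re.findall(r"[A-Za-z_]+", text)
    tokens = [t if t in TAGS else t.lower() for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens):
    """Return the 1-grams and 2-grams of a token list."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tfidf(docs, min_df=0.0, max_df=1.0):
    """Build a TF-IDF matrix, keeping terms whose document frequency lies in [min_df, max_df]."""
    grams = [ngrams(normalize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for g in grams for term in set(g))
    vocab = sorted(t for t, c in df.items() if min_df <= c / n <= max_df)
    matrix = []
    for g in grams:
        counts = Counter(g)
        total = len(g) or 1
        matrix.append([counts[t] / total * math.log(n / df[t]) for t in vocab])
    return vocab, matrix
```
]]></preformat>
        <p>The min_df and max_df arguments play the role of the cut-offs that discard the least and the most frequent terms.</p>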
        <p>Generate Interaction Matrix: In order to emphasize the contribution of the
email correspondent entities in defining process and activity names, we build a
presence matrix (PM) reflecting for each email the interlocutors’ entities (sender
entity, recipient entity, and copied entity).</p>
        <p>Definition 1 Let E be the set of emails of length N, L be the list of different
entities of length M, C be a list where each element corresponds to the list of
participant correspondents’ entities of each email, PM be a matrix whose columns
represent L and rows represent E. PM can be defined as follows:
<disp-formula><tex-math><![CDATA[PM(i, j) = \begin{cases} 1 & \text{if } L[j] \in C[i] \\ 0 & \text{otherwise} \end{cases}, \quad (i, j) \in [0, N-1] \times [0, M-1] \qquad (1)]]></tex-math></disp-formula></p>
        <p>GenerateTheFinalMatrix: This function generates the X matrix, which is a
weighted concatenation of the PM and TFIDF matrices:
<disp-formula><tex-math><![CDATA[X = \left[\, \beta_1\, TFIDF_{(N \times T)} \quad \beta_2\, PM_{(N \times M)} \,\right] \qquad (2)]]></tex-math></disp-formula>
where the weights β1 and β2 are defined empirically.</p>
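        <p>As a small illustration of Definition 1 and Eq (2), the sketch below builds the presence matrix and the weighted concatenation with plain Python lists. The function names and the toy entity names used with it are assumptions of this sketch.</p>
        <preformat><![CDATA[
```python
def presence_matrix(C, L):
    """PM[i][j] = 1 if entity L[j] takes part in email i, else 0 (Definition 1)."""
    return [[1 if entity in participants else 0 for entity in L] for participants in C]


def weighted_concat(tfidf, pm, beta1, beta2):
    """Row-wise concatenation of beta1 * TFIDF and beta2 * PM (Eq. 2)."""
    return [
        [beta1 * v for v in t_row] + [beta2 * v for v in p_row]
        for t_row, p_row in zip(tfidf, pm)
    ]
```
]]></preformat>
        <p>Each row of the result has T + M columns, so the classifier sees the textual and the correspondent features side by side.</p>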
      </sec>
      <sec id="sec-3-2">
        <title>Step 2: Process Instance Detection</title>
        <p>The goal of this step is to detect each specific occurrence, or execution, of the same
business process, known as a process instance. We use the pseudoEventLog
dataset (pEL) generated as an output of step 1, which contains the list of emails
correctly annotated with the corresponding process and activity names. We first
divide it into per-process groups. Then, we apply a clustering technique to each
resulting group to detect the clusters that correspond to process instances.</p>
        <p>Clustering is the task of grouping a set of objects in such a way that objects
in the same group (called a cluster) have similar properties or features,
while objects in different groups have highly dissimilar ones. A
clustering algorithm takes as input a similarity matrix M which defines the similarity
between each pair of emails. This matrix is formally specified in Definition 2.</p>
        <p>Definition 2 Let N be the total number of emails, E be the set of emails in our
corpus and f : E × E → R be the similarity function that calculates a similarity
value between two emails. The similarity matrix M can be defined as a square
matrix of size N × N where
<disp-formula><tex-math><![CDATA[M[i, j] = f(E[i], E[j]), \quad i, j \in [1, N]]]></tex-math></disp-formula>
In our case, the similarity function f is a distance function defined by Eq (3).</p>
        <p>Our clustering phase goes mainly through the following sub-steps: (1) Identify
clustering features. (2) Generate the similarity matrix. (3) Apply a clustering
algorithm.</p>
        <p>a) Identify clustering features: To build our clustering features, we
focus the analysis on: (1) subject and content reduced to references and named
entities, (2) time, and (3) correspondent entities. In fact, we believe that emails
belonging to the same process instance are likely to have close reception dates and
to share the same named entities, the same references and the same
correspondent entities (from, dest, and CC).</p>
        <p>A named entity is a real-world object, such as a person, location,
organization or product, that can be denoted with a proper name. References
are pieces of information generated by the business applications used to execute some
process tasks (e.g. customer number, purchase request ID). The purpose
of reducing the email to references and named entities is to keep only the
contextual data that is significant for the business process instances. In
fact, the entire content of these email parts often contains additional vocabulary,
which degrades instance detection accuracy.</p>
        <p>
          The named entities are detected through two methods which are
complementary in our case. We first use the Polyglot NER method [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Named-entity
recognition (NER) is a subtask of information extraction. It seeks to locate and
classify named entity mentions in unstructured text into pre-defined categories
such as person names, organizations, time expressions, monetary values and
percentages. Unlike the other existing pipelines (NLTK, Stanford, OpenNLP),
in which most languages are unsupported, Polyglot NER is a multilingual named
entity recognition tool that supports 40 major languages. It is automatic but
not complete (some named entities are not detected). To overcome this
limitation, we explicitly express the patterns that it does not detect. As for reference detection,
we inject specific regular expressions.
        </p>
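        <p>Reference detection with injected regular expressions could look like the sketch below. The two reference formats (CUST- followed by six digits, PR followed by eight digits) are purely hypothetical examples; real patterns depend on the business applications in use.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical reference formats, given only for illustration.
REFERENCE_PATTERNS = {
    "customer_number": re.compile(r"\bCUST-\d{6}\b"),
    "purchase_request": re.compile(r"\bPR\d{8}\b"),
}


def extract_references(text):
    """Return the set of (kind, value) references found in an email text."""
    return {
        (kind, match.group(0))
        for kind, pattern in REFERENCE_PATTERNS.items()
        for match in pattern.finditer(text)
    }
```
]]></preformat>
        <p>The resulting set can be merged with the named entities returned by the NER tool before instances are compared.</p>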
        <p>b) Similarity function: Our similarity function is a distance function
defined as a weighted sum of sub-distances related to our clustering features.
It has the following formula:
<disp-formula><tex-math><![CDATA[f(x, y) = w_0 D_C(x, y) + w_1 D_T(x, y) + w_2 D_{NE}(x, y) \qquad (3)]]></tex-math></disp-formula></p>
        <p>Where the weights w0, w1 and w2 are defined empirically according to each
process type, and DC(x, y), DT(x, y) and DNE(x, y) are defined as follows:</p>
        <p>- DC(x, y) is the correspondent distance between two emails x and y. We
define it through the Jaccard index of the correspondent entity sets of their
interlocutors, which is equal to the cardinality of their intersection divided by the
cardinality of their union. More formally, let C(x) be the list of correspondents
of the email x, and C(y) be the list of correspondents of the email y. DC(x, y) is
then defined as follows:
<disp-formula><tex-math><![CDATA[D_C(x, y) = \frac{|C(x) \cap C(y)|}{|C(x) \cup C(y)|} \qquad (4)]]></tex-math></disp-formula></p>
        <p>- DT(x, y) is the time distance between emails x and y. The emails belonging
to the same process instance are likely to have close reception dates. We assume
that the inter-arrival duration follows an exponential law and is expressed in
number of days. Consequently, we define the distance through the following
formula:
<disp-formula><tex-math><![CDATA[D_T(x, y) = 1 - e^{-\lambda (ts(x) - ts(y))} \qquad (5)]]></tex-math></disp-formula>
where ts(x) and ts(y) refer to the timestamps of x and y, and λ is the time, expressed in
number of days, which separates two email arrivals. We estimate this value
by:
<disp-formula><tex-math><![CDATA[\lambda = \frac{date_{max} - date_{min}}{\text{number of emails}} \qquad (6)]]></tex-math></disp-formula></p>
        <p>- DNE(x, y) is the distance related to the named entities and the references
present in the subject and the content of the email x and those present in email
y. Emails belonging to the same process instance are likely to share the same
named entities and references. We define DNE in the same way as DC, through the Jaccard index:
<disp-formula><tex-math><![CDATA[D_{NE}(x, y) = \frac{|NE(x) \cap NE(y)|}{|NE(x) \cup NE(y)|} \qquad (7)]]></tex-math></disp-formula>
where NE(x) is the set of named entities and references present in email x, and
NE(y) is the set of named entities and references present in email y.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Step 3: Event Logs Generation</title>
        <p>Once the process and activity names are identified and the emails belonging
to different process instances are grouped, a dataset labelled with process,
activity, timestamp and instance ID labels can be obtained. The event log generation
function aims to construct, from this new dataset, time-ordered per-process event
sets. Algorithm 2 summarizes the steps to generate them.</p>
        <preformat>Algorithm 2 Event logs generation
1: procedure GenerateEventLog(pEL)        ▷ pEL: the pseudo event log
2:     pEL_ordered ← TimeBasedOrdering(pEL)
3:     EMP ← PerProcessSplitting(pEL_ordered)
4:     EMPI ← InstanceDetection(EMP)
5:     InstanceIdentifierGeneration(EMPI)
6:     EventLogs ← PerProcessGrouping(EMPI)
7:     return EventLogs</preformat>
        <p>From the pseudo event
log obtained from step 1, we generate a time-ordered event log using the email
timestamp variable (line 2). Then, we split it into several subsets, each one
containing the emails of a single process (line 3). For each process subset, we apply
a clustering algorithm in order to detect instance groups and we associate
a unique identifier with each group (lines 4 and 5). Finally, all these groups are regrouped
into a per-process dataset (line 6) so that a process mining algorithm can be
applied.</p>
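        <p>The steps of Algorithm 2 can be sketched in Python as follows. The clustering step is abstracted as a detect_instances callable that returns one cluster label per email; this callable, the dictionary keys and the instance identifier format are assumptions of the sketch.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict
from datetime import datetime


def generate_event_logs(pEL, detect_instances):
    """Return {process: [event, ...]}, time-ordered, each event carrying an instance ID.

    detect_instances(emails) must return one cluster label per email.
    """
    ordered = sorted(pEL, key=lambda ev: ev["timestamp"])    # line 2
    per_process = defaultdict(list)                          # line 3
    for ev in ordered:
        per_process[ev["process"]].append(ev)
    event_logs = {}
    for process, emails in per_process.items():
        labels = detect_instances(emails)                    # line 4
        event_logs[process] = [                              # lines 5 and 6
            {**ev, "instance_id": f"{process}-{label}"}
            for ev, label in zip(emails, labels)
        ]
    return event_logs                                        # line 7
```
]]></preformat>
        <p>The input events are copied rather than modified, so the pseudo event log can be reused when the clustering configuration changes.</p>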
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Validation</title>
      <p>We validate our contribution through a proof of concept and experiments
carried out on our dataset of 1026 emails and on the email environment of
two employees (Microsoft Outlook as the email client, and Microsoft Exchange as
the email server). In these experiments we succeeded in discovering two processes:
(1) a hiring process and (2) a patent application process. In this section we detail
only the discovery of the hiring process.</p>
      <sec id="sec-4-1">
        <title>Proof Of Concept</title>
        <p>Our tool is implemented through three components:</p>
        <p>a) Frontend component: This component is a Microsoft Outlook 2010
plugin developed in C#. It has four main functionalities.
First, it enables users to manually annotate their emails. To this end, it provides a
GUI in which the user selects the email and the related process and
activity names. This association is sent to the backend (SetTag interface) as a JSON
object containing the email parameters and the associated process and activity
names. Second, it captures incoming and outgoing messages, constructs a JSON
object for each email, sends it to the backend for analysis (GetTag interface),
retrieves the results (a JSON object representing the detected process name and
activity name) and associates the tags with the email. Third, it displays
the process and activity related to each email along with the email. Finally,
the plugin provides users with advanced functionalities such as email search by
related business process and activity.</p>
        <p>b) Backend component: The backend component is implemented through
three HTTP interfaces (SetTag, GetTag, and GetAllProcesses), with a MySQL
database containing two tables: the training dataset table and the event logs table. The
training dataset contains the following columns (id, source, destination, cc,
subject, content, received date, process name, activity name). The event log dataset
contains the following columns (id, source, destination, cc, subject, content,
received date, process name, activity name, instance id).</p>
        <p>- SetTag Interface: This interface is used to enrich the learning database.
It is invoked by the frontend when the user manually annotates an
email. It receives the JSON object representing the email and the associated
process and activity names. This information is inserted into the training dataset
table as well as into the event log table, in which we set the instance id column
to NULL, as we do not know yet to which instance the email belongs.</p>
        <p>- GetTag Interface: This interface is invoked by the Microsoft Outlook plugin
each time the user sends or receives an email. It receives a JSON object
representing an email, analyzes it using the trained predictive models, and returns
the predicted process and activity name IDs. The result is also inserted
into the event log table.</p>
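        <p>The storage side of the SetTag interface can be mimicked with the sketch below. It uses an in-memory SQLite database instead of MySQL so that it runs without a server, and the table names, column spellings and JSON field names are assumptions derived from the column list given in the text.</p>
        <preformat><![CDATA[
```python
import json
import sqlite3

COLUMNS = ["source", "destination", "cc", "subject", "content",
           "received_date", "process_name", "activity_name"]


def create_tables(db):
    """Create the training dataset table and the event logs table."""
    cols = ", ".join(f"{c} TEXT" for c in COLUMNS)
    db.execute(f"CREATE TABLE training_dataset (id INTEGER PRIMARY KEY, {cols})")
    db.execute(f"CREATE TABLE event_logs (id INTEGER PRIMARY KEY, {cols}, instance_id TEXT)")


def set_tag(db, payload):
    """Store a manual annotation; instance_id stays NULL until instances are detected."""
    email = json.loads(payload)
    values = [email.get(c) for c in COLUMNS]
    names = ", ".join(COLUMNS)
    marks = ", ".join("?" for _ in COLUMNS)
    db.execute(f"INSERT INTO training_dataset ({names}) VALUES ({marks})", values)
    db.execute(f"INSERT INTO event_logs ({names}, instance_id) VALUES ({marks}, NULL)", values)
    db.commit()
```
]]></preformat>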
        <p>To build the machine learning models, a multinomial version of the logistic
regression (LR) classifier is employed. It estimates the parameters of a logistic
model for a multiclass prediction task. Stochastic Gradient Descent (SGD) is
used as the optimizer in the training phase. It can converge faster than
batch training because it performs updates more frequently, and it has
been successfully applied to the large-scale and sparse machine learning problems
often encountered in text classification and natural language processing.</p>
        <p>- GetProcesses Interface: This interface enables the email client user to
display existing tags (process names and activity names). It supports the user in
enriching the learning database. Basically, this interface is invoked
when the user is about to manually annotate an email. It retrieves the
list of annotations available in the training dataset table. For each process name,
we also retrieve the list of associated activities.</p>
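        <p>To make the classifier used behind the GetTag interface more concrete, here is a minimal pure-Python sketch of multinomial logistic regression trained with SGD, with one gradient step per sample. The learning rate, number of epochs and function names are assumptions of this sketch, and it is a teaching aid rather than the implementation used in the proof of concept.</p>
        <preformat><![CDATA[
```python
import math
import random


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_sgd(X, y, n_classes, epochs=200, lr=0.5, seed=0):
    """Fit multinomial logistic regression with one gradient step per sample."""
    rng = random.Random(seed)
    n_features = len(X[0])
    W = [[0.0] * (n_features + 1) for _ in range(n_classes)]  # last column is the bias
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x = X[i] + [1.0]
            probs = softmax([sum(w * v for w, v in zip(row, x)) for row in W])
            for k in range(n_classes):
                # Gradient of the cross-entropy loss with respect to the scores.
                grad = probs[k] - (1.0 if y[i] == k else 0.0)
                W[k] = [w - lr * grad * v for w, v in zip(W[k], x)]
    return W


def predict(W, x):
    """Return the index of the class with the highest score."""
    x = x + [1.0]
    scores = [sum(w * v for w, v in zip(row, x)) for row in W]
    return max(range(len(W)), key=scores.__getitem__)
```
]]></preformat>
        <p>Because the weights are updated after every sample, new annotated emails can be folded in batch by batch without restarting from scratch.</p>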
        <p>c) Instance separation component: To detect instances, we applied the
clustering algorithm HDBSCAN (Hierarchical Density-Based Spatial Clustering
of Applications with Noise). It extends DBSCAN by converting it into a
hierarchical clustering algorithm, and then uses a technique to extract a flat clustering
based on the stability of clusters. The choice of HDBSCAN is justified by two
reasons: (1) HDBSCAN requires little human intervention: it is a fast
algorithm that, in particular, does not require the number of clusters to be
specified in advance. (2) HDBSCAN can generate clusters of different sizes,
shapes and densities, which can enhance clustering accuracy.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Experimentations</title>
        <p>We carried out experiments to justify our choices in each step, to
evaluate their performance and to present the results obtained.</p>
      </sec>
      <sec id="sec-4-3">
        <title>a) Validation of process and activity prediction</title>
        <p>We manually annotated 1026 emails to obtain a correctly annotated email corpus with process and
activity IDs. There are 13 processes (e.g. Hiring, PatentApplication, Command,
ConferenceParticipation, travel expense refund, etc.) and 116 activities. Taking
the example of the hiring process, we identified the hiring steps achieved through
emails: “describe the position”, “publish the position”, “receive applications”,
“setting the interviews”, “asking for documents”, and “notifying the decision”.</p>
        <p>The performance of the process and activity prediction phase highly depends on
the choice of the machine learning algorithms and the data preprocessing actions.
To select them, we tested different techniques until good prediction
performance was reached. Better performance is observed when:
- Using only the subject and the correspondent entities for process prediction.
- Applying the preprocessing steps detailed in Section 3.1 (b), considering that
the most frequent terms in the documents have a frequency greater than 5% and
the least frequent ones have a frequency lower than 0.1%.
- Assuming that short emails contain fewer than 40 words and that the email exchange
history is constructed from the four previous emails.
- Setting the weights of Eq (2) as follows: β1 = 0.8 and β2 = 0.2.
- Using LR with the SGD optimizer for training predictive models. This result
was obtained after testing two other prediction algorithms (Random Forest (RF)
and Support Vector Machine (SVM)). The performance of each one was
estimated using 5-fold cross-validation and the F1 score.
This score is a measure that combines precision and recall. Precision is known
as the positive predictive value, while recall is called the sensitivity of the classifier.
Mathematically, the F1 score is defined as:
<disp-formula><tex-math><![CDATA[F1\,Score = \frac{2 \times precision \times recall}{precision + recall} \qquad (8)]]></tex-math></disp-formula></p>
        <p>The obtained F1 scores are summarized as follows: 0.8072 for Random Forest,
0.8626 for LR with SGD and 0.8094 for SVM.</p>
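        <p>As a worked illustration of Eq (8), the sketch below computes precision, recall and the F1 score for one class from lists of true and predicted labels. How the scores are averaged over classes is not stated in the text, so this sketch deliberately handles a single class.</p>
        <preformat><![CDATA[
```python
def f1_score(y_true, y_pred, positive):
    """F1 score of the class `positive`, following Eq. (8)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```
]]></preformat>
        <p>For example, with three true Hiring emails of which two are recognized and no false alarm, precision is 1, recall is 2/3 and the F1 score is 0.8.</p>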
      </sec>
      <sec id="sec-4-4">
        <title>b) Validation of Process Instance Detection</title>
        <p>In this subsection, we
evaluate the performance of our selected clustering technique, HDBSCAN, on
our dataset of 180 emails related to a hiring process.</p>
        <p>
          We manually built the email clusters related to the process instances
and obtained 11 clusters. To compare this manual clustering with the
results of HDBSCAN, we computed the Adjusted Mutual Information<sup>4</sup> (AMI)
metric [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. We tested different values of the weights related to the distance matrix
computation (Eq (3)), and we obtained an optimal configuration with
w0 = 1/2, w1 = 1/4 and w2 = 1/4. The AMI value obtained with this
configuration is 0.86.
        </p>
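        <p>The distance matrix of Eq (3) can be assembled as in the sketch below. Only the time sub-distance is spelled out; the correspondent and named-entity sub-distances are passed in as functions. The absolute value on the time gap, the ts key and the function names are assumptions of this sketch, and the weights 1/2, 1/4 and 1/4 used in the comment are this sketch's reading of the reported configuration.</p>
        <preformat><![CDATA[
```python
import math
from datetime import datetime


def estimate_lambda(emails):
    """(latest date - earliest date) / number of emails, expressed in days."""
    dates = [e["ts"] for e in emails]
    return (max(dates) - min(dates)).total_seconds() / 86400 / len(emails)


def time_distance(x, y, lam):
    """1 - exp(-lambda * gap), with the gap in days (absolute value assumed)."""
    gap_days = abs((x["ts"] - y["ts"]).total_seconds()) / 86400
    return 1 - math.exp(-lam * gap_days)


def distance_matrix(emails, sub_distances, weights):
    """M[i][j] = weighted sum of the sub-distances between emails i and j (Eq. 3)."""
    return [
        [sum(w * d(x, y) for w, d in zip(weights, sub_distances)) for y in emails]
        for x in emails
    ]

# Example call, with dc and dne standing for the correspondent and named-entity sub-distances:
# distance_matrix(emails, [dc, dt, dne], [0.5, 0.25, 0.25])
```
]]></preformat>
        <p>The resulting matrix can then be handed to any clustering routine that accepts precomputed distances.</p>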
      </sec>
      <sec id="sec-4-5">
        <title>c) Validation of Process Model Discovery</title>
        <p>
          To validate this step, we applied the heuristic miner algorithm [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] on the automatically generated
event log of the hiring process. Figure 2 shows the theoretical and the detected
models. The behavior captured in the event log largely conforms to the
theoretical one. Nevertheless, two types of discrepancy can be
detected at low frequency: (1) unfitting model behavior, i.e., behavior
observed in the theoretical model that is not allowed by the captured one
(e.g., “Welcome Procedure” is performed after “Decision Notification”); (2)
additional model behavior, i.e., behavior allowed in the captured model
that does not exist in the theoretical one (e.g., “Welcome Procedure” is done
first, or asking for “Hiring Documents” or “Decision Notifying” is done
before “Interview Setting”). These discrepancies can be caused
by errors accumulated through our log building system, by the mining
technique that we used, or by a real difference between the process as observed
in the emails and the related theoretical BP.
        </p>
        <p>
          <sup>4</sup> AMI measures the similarity between two partitions, here the HDBSCAN
partition and the real partition. Its value is at most 1; it is close to 1 when
the two partitions match strongly and close to 0 (possibly slightly negative) when
they match weakly.
        </p>
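        <p>For readers unfamiliar with the heuristic miner, the sketch below illustrates its basic dependency measure between two distinct activities a and b, (|a&gt;b| - |b&gt;a|) / (|a&gt;b| + |b&gt;a| + 1), where |a&gt;b| counts how often b directly follows a in the log [23]. The four traces are hypothetical and only reuse activity names of the hiring process; thresholds, loops and the other refinements of the algorithm are left out.</p>
        <preformat>
```python
from collections import Counter


def directly_follows(traces):
    """Count how often activity b directly follows activity a in the log."""
    counts = Counter()
    for trace in traces:
        counts.update(zip(trace, trace[1:]))
    return counts


def dependency(counts, a, b):
    """Basic dependency measure between two distinct activities a and b."""
    ab, ba = counts[(a, b)], counts[(b, a)]
    return (ab - ba) / (ab + ba + 1)


# Hypothetical event log: one list of activities per process instance.
traces = [
    ["Interview Setting", "Decision Notification", "Hiring Documents", "Welcome Procedure"],
    ["Interview Setting", "Decision Notification", "Hiring Documents", "Welcome Procedure"],
    ["Interview Setting", "Decision Notification", "Welcome Procedure"],
    ["Decision Notification", "Interview Setting", "Hiring Documents", "Welcome Procedure"],
]
counts = directly_follows(traces)
print(dependency(counts, "Interview Setting", "Decision Notification"))  # 0.4
print(dependency(counts, "Hiring Documents", "Welcome Procedure"))       # 0.75
```
        </preformat>
        <p>A low-frequency reversed ordering, such as the fourth trace, lowers the dependency value without removing the relation, which is how infrequent deviations can remain visible in the discovered model.</p>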
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Related Works</title>
      <p>Until recently, only a few studies have considered business process mining from
email exchanges. They have mainly focused on activity and process name
recognition, process instance detection, and process discovery. They require
human intervention and may generate unreliable business process models. In this
section, we discuss them in three categories.</p>
      <sec id="sec-5-1">
        <title>Non-learning based methods</title>
        <p>
          One of the first proposals for mining business processes from emails assumes
that the associated business process is explicitly included in the email subject
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Such an approach requires significant human intervention and involvement.
Indeed, email interlocutors must include the business process name and related
attributes in the email subject, which is not realistic.
        </p>
        <p>
          Another proposal [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] assumes that the manual task is an association of (1)
the classical manual task of the BPMN 2.0 specification [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and (2) a set of semantic
patterns that make it possible to validate whether a given communication content is
part of a business process, an activity and a given business process instance. The
limitation of such a system is the need to anticipate and manually define all semantic
patterns related to each task, which is time-consuming and not scalable.
        </p>
        <p>
          E-Mail Mining [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] is a method for the semi-automatic discovery of
knowledge-intensive processes. From a set of emails belonging to a BP, e-Mail Mining
aims to discover the knowledge embedded in the execution of its activities.
This knowledge consists of: (1) BP participants and their social interactions; (2)
relevant terms related to the BP domain; (3) BP activities defined by
three elements: actors, candidate actions and parameters. Relevant BP
activities are selected manually from a list of candidate activities. The latter are
generated by splitting emails into sentences and assuming that each
sentence is composed of: (1) a noun phrase that describes either the agent
performing an action or the resource that receives the effect of the executed
action; (2) a verb phrase that describes the activity performed by the agent. This
approach makes an interesting contribution to the field of process mining from
emails: it allows the detection of multiple activities in one email as well
as the metadata embedded in the execution of a given BP. However, it requires
manual tasks during its execution (e.g., selecting relevant activities or a sample
of emails related to one BP). Moreover, it treats emails as storytelling textual
data when mining the candidate activities, whereas emails do not have the same
structure as narrative textual data (which generally describes activities in a more
formal way than emails). For instance, the proposal does not seem to handle
passive-voice sentences where actors do not appear or where their positions are
switched with those of resources.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Act theory based methods</title>
        <p>
          This category deals with activity name recognition using methods based on
speech act theory. The idea behind this theory is to classify emails according to the
sender’s intent [
          <xref ref-type="bibr" rid="ref18 ref19">19,18</xref>
          ]. Two classifications of speech acts are
used: (1) illocutionary act classes: Assertive, Commissive, Directive,
Expressive and Declaration; (2) speech act verbs: Propose, Request, Deliver, Commit, etc.
Some proposals fix the email speech acts in advance and then apply a supervised
learning algorithm to classify emails as containing or not containing each specific
act [
          <xref ref-type="bibr" rid="ref17 ref4">4,17</xref>
          ]. Other works treat process detection as a conversation-finding
problem, such as [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], which first classifies emails as business process related
or not. The business-oriented email messages are then
grouped into threads to detect conversations, using a refined version of the Vector
Space Model and a semantic similarity measure. Finally, the interactions in each
conversation are labeled by applying the classification of illocutionary acts [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
        <p>
          An iterative relational learning approach for email task management was
also suggested [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. It exploits the mutual performance improvement between
the extraction of speech acts and the identification of related emails. In fact,
after initializing both of them using automatic methods, a supervised learning
algorithm (SMO implementation of SVM [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]) is applied to incoming emails:
it takes related emails into account as a feature to recognize the corresponding
speech acts. Then, a relational learning technique [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] is exploited: it similarly
uses speech acts as a feature to predict relations between the incoming and the
existing messages.
        </p>
        <p>All of these works require labeled data to train statistical
speech act recognizers, which calls for substantial human intervention. Furthermore,
business process tasks differ from one process to another. Thus, fixing a single
list of activities in advance degrades the quality of the generated
business process models.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Unsupervised learning based methods</title>
        <p>
          To minimize human intervention and avoid preparing a labeled
dataset, some proposals integrate unsupervised learning
techniques into their approaches to mine business processes from emails [
          <xref ref-type="bibr" rid="ref5 ref6 ref8 ref9">8,9,6,5</xref>
          ]. One
of these proposals identifies the process and instance clusters by
applying a bottom-up hierarchical clustering method [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. To find process
groups, the distance used combines the subject and body attributes. Then, to
detect instances, the timestamp attribute is added. For the process activity
identification phase, the K-means algorithm is adopted. The approach proposes
a customization method to set the number of clusters (K) and the initial centers on the
basis of the instance clusters obtained from the previous step. Then, it applies
a distance formula that takes into account the semantic similarity of the
words present in the subjects and the bodies.
        </p>
        <p>
          Another approach proposes a two-step algorithm to discover the processes and
activities from emails [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Hierarchical clustering is applied sequentially: first to
deduce process clusters and then to deduce sub-clusters corresponding to activity
types. The similarity measure is based on the word2vec method, which aims to
exploit the hidden semantic relations between words in email contents
and subjects. An activity labeling technique is also proposed: it recommends to
the user the most frequent contiguous sequence of n items in an activity
type cluster.
        </p>
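        <p>The labeling idea can be illustrated with a short sketch that returns the most frequent contiguous sequence of n items in a cluster. It assumes that the items are lower-cased, whitespace-separated words, which may differ from the tokenization used in [9]; the function name and the three example subjects are hypothetical.</p>
        <preformat>
```python
from collections import Counter


def most_frequent_ngram(texts, n):
    """Return the most frequent contiguous sequence of n words, or None."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of n tokens over the text.
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not counts:
        return None
    ngram, _ = counts.most_common(1)[0]
    return " ".join(ngram)


# Hypothetical email subjects grouped in one activity-type cluster.
cluster = [
    "Please schedule the interview for Monday",
    "Can we schedule the interview tomorrow",
    "I will schedule the interview room",
]
label = most_frequent_ngram(cluster, 3)
print(label)  # schedule the interview
```
        </preformat>
        <p>The returned sequence is only a suggestion: the user remains free to accept or rename the proposed activity label.</p>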
        <p>
          These studies aim to minimize human intervention. However, they have
some limitations. First, hierarchical clustering requires human effort to
tune its parameters, and its quality depends heavily on how these
parameters are set [
          <xref ref-type="bibr" rid="ref15 ref3">15,3</xref>
          ]. Second, it is computationally expensive. Hence, applying the
same algorithm twice in the same method increases the computational
complexity and the execution time [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Third, the quality of activity identification depends heavily
on the instance clustering phase [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Since the K-means algorithm
is sensitive to its initial centers, poor instance detection can
lead to poor convergence. Finally, the automatic generation of labels
considerably increases the risk of errors and misinterpretation [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>
          MailOfMine [
          <xref ref-type="bibr" rid="ref5 ref6">6,5</xref>
          ] proposes to mine artful business processes and to define
them with a “declarative” approach. It starts from some assumptions
that map email structures to BP structures: (1) each conversation represents an activity
trace; (2) each activity is a set of elementary tasks deduced from
conversation key parts; (3) each process is composed of a set of activities. The MailOfMine
approach basically consists of: (1) Applying a similarity clustering
algorithm three times: to cluster emails into conversation threads, to cluster these threads
into activity types and, finally, to cluster the key parts of each activity into task types.
During the clustering process, the email body, the names of attached files and some
email header fields are taken into consideration. (3) Applying a supervised
learning process to assign activities to different processes. (4) Automatically labeling
activity tasks, with the possibility of customizing them, and manually
assigning activity and process names. (5) Mining constraints between tasks (activities)
within each activity (process). This work has the advantage of
discovering BPs at different levels of granularity (process, subprocess or activity,
task) and describing them with a declarative approach, which is more flexible than
the classical imperative approach. Nevertheless, it suffers from some
limitations: (1) Its execution time can grow considerably when it is applied to a large number
of traces containing various task and activity types; it is linear (respectively quadratic)
in the number (respectively size) of traces. (2) Considerable human
intervention is required to manually assign activity and process names and to
initiate the activity classification step.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion &amp; Discussion</title>
      <p>In this paper, we proposed a solution for mining business processes from email
logs and evaluated it through a proof-of-concept implementation and
experiments on our dataset.</p>
      <p>To build a training dataset from an existing corpus of emails, we proposed a
collaborative and iterative approach, implemented through graphical
interfaces and automatic prediction functionalities. This has the advantage of
encouraging users to take part in building an annotated dataset, since it
facilitates the tagging task and minimizes the required effort. Consequently, the
training dataset is built gradually and is available immediately, without
requiring a lot of time and human resources. However, this approach
still relies on human involvement. Moreover, tagging the same dataset
collaboratively can produce samples that belong to the same cluster but carry
different annotations. Therefore, a tag normalization step is required.</p>
      <p>
        The prediction entity is based on a supervised learning technique. To
build classification features, we investigated, depending on the prediction goal,
some or all of these variables: correspondent entities of the email participants, subject,
content and exchange history. Our experiments revealed that email contents
degrade the performance of process name recognition. Even if this finding
seems to contradict existing works [
        <xref ref-type="bibr" rid="ref8 ref9">8,9</xref>
        ], it can be justified by the nature of
our dataset and our preprocessing steps.
      </p>
      <p>Our instance detection approach differs from related works by using a fast,
non-parametric clustering algorithm that is robust to noise and
can generate clusters of different shapes, sizes and densities. Moreover,
we reduced the body and the subject to references and named entities for
clustering emails into BP instances. To improve the detection quality of named
entities and references, we explicitly defined the patterns that were not detected. This
yields good performance; however, it is manual and
consequently costly. We assumed that three variables (timestamp,
correspondent entities and named entities) can help separate BP instances.
In practice, the contribution of each variable depends on the nature of the BP. This
is why we introduced a weight for each variable, to be tuned by
users according to their expertise. For instance, the timestamp variable can help
separate instances of a BP with time constraints (e.g., the accounting
closing process, which is carried out regularly on scheduled dates), whereas for
a BP whose instances are independent of time, the same variable seems to have
no effect.</p>
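      <p>The role of the weights can be illustrated with a small sketch. It is not the distance of Eq (3), which is not reproduced in this section: it assumes, purely for illustration, a weighted sum of a normalized time gap and two Jaccard distances on correspondent entities and named entities. All function names, the 30-day horizon, the weight values and the two example emails are hypothetical.</p>
      <preformat>
```python
from datetime import datetime


def jaccard_distance(a, b):
    """1 - |intersection| / |union| of two sets (0.0 when both are empty)."""
    union = a.union(b)
    if not union:
        return 0.0
    return 1.0 - len(a.intersection(b)) / len(union)


def time_distance(t1, t2, horizon_days=30.0):
    """Time gap in days, scaled to [0, 1] with an assumed 30-day horizon."""
    gap_days = abs((t1 - t2).total_seconds()) / 86400.0
    return min(gap_days / horizon_days, 1.0)


def email_distance(e1, e2, w_time, w_corr, w_entities):
    """Assumed weighted sum of the three per-variable distances."""
    return (w_time * time_distance(e1["sent"], e2["sent"])
            + w_corr * jaccard_distance(e1["correspondents"], e2["correspondents"])
            + w_entities * jaccard_distance(e1["entities"], e2["entities"]))


# Two hypothetical emails about the same candidate, sent 15 days apart.
first = {"sent": datetime(2019, 1, 1),
         "correspondents": {"hr@example.org", "manager@example.org"},
         "entities": {"Alice Martin"}}
second = {"sent": datetime(2019, 1, 16),
          "correspondents": {"hr@example.org", "it@example.org"},
          "entities": {"Alice Martin"}}

d = email_distance(first, second, w_time=0.2, w_corr=0.3, w_entities=0.5)
print(round(d, 3))  # 0.3
```
      </preformat>
      <p>Lowering the time weight, as one might for a BP whose instances are independent of time, brings these two emails closer together, while a BP with strict scheduling would call for a larger time weight.</p>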
      <p>
        In our approach, we have handled some research questions related to BP
mining from emails by supposing that one email can be assigned to one process,
one activity and one instance. In practice, messaging systems such as email allow
users to discuss BP issues informally, without respecting such constraints:
in one email, a user can discuss more than one activity and more than one instance.
This kind of challenge was addressed in previous works such as [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] by assuming
that activities are expressed, and generally correlated with their metadata, at
the sentence level. In future work, we plan to study these challenges. We also plan
to further automate the BP discovery pipeline, since the current approach still
requires human involvement. Finally, we suggest employing semantic similarity
measures to construct learning features based on email contents.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Emailanalyzer: an e-mail mining plug-in for the prom framework</article-title>
          .
          <source>BPM Center Report BPM-07-16</source>
          , BPMcenter.org (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Al-Rfou</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al.:
          <article-title>Polyglot-ner: Massive multilingual named entity recognition</article-title>
          .
          <source>In: SIAM International Conference on Data Mining</source>
          . pp.
          <fpage>586</fpage>
          -
          <lpage>594</lpage>
          . SIAM (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ciosici</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          :
          <article-title>Improving quality of hierarchical clustering for large data series</article-title>
          .
          <source>arXiv preprint arXiv:1608.01238</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>W.W.</given-names>
          </string-name>
          , et al.:
          <article-title>Learning to classify email into “speech acts”</article-title>
          .
          <source>In: Empirical Methods in Natural Language Processing</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Di Ciccio</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mecella</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>MINERful, a mining algorithm for declarative process constraints in MailOfMine</article-title>
          .
          <source>Department of Computer and System Sciences Antonio Ruberti Technical Reports</source>
          <volume>4</volume>
          (
          <issue>3</issue>
          ) (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Di Ciccio</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mecella</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scannapieco</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zardetto</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Catarci</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>MailOfMine - analyzing mail messages for mining artful collaborative processes</article-title>
          .
          <source>In: International Symposium on Data-Driven Process Discovery and Analysis</source>
          . pp.
          <fpage>55</fpage>
          -
          <lpage>81</lpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Jeong</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>Semi-supervised speech act recognition in emails and forums</article-title>
          .
          <source>In: Empirical Methods in Natural Language Processing</source>
          . vol.
          <volume>3</volume>
          , pp.
          <fpage>1250</fpage>
          -
          <lpage>1259</lpage>
          . Association for Computational Linguistics (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Jlailaty</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <article-title>A framework for mining process models from emails logs</article-title>
          .
          <source>arXiv preprint arXiv:1609.06127</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Jlailaty</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <article-title>Mining business process activities from email logs</article-title>
          .
          <source>In: Cognitive Computing (ICCC)</source>
          . pp.
          <fpage>112</fpage>
          -
          <lpage>119</lpage>
          . IEEE (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Khoussainov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kushmerick</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Email task management: An iterative relational learning approach</article-title>
          . In: CEAS (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Laga</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , et al.:
          <article-title>Communication-based business process task detection-application in the crm context</article-title>
          .
          <source>In: Enterprise Distributed Object Computing Workshop (EDOCW)</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . IEEE (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Mavaddat</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>Facilitating business process discovery using email analysis</article-title>
          .
          <source>In: The First International Conference on Business Intelligence and Technology</source>
          .
          <publisher-name>Citeseer</publisher-name>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <article-title>Business Process Model and Notation (BPMN) version 2.0</article-title>
          .
          <source>OMG Specification</source>
          , Object Management Group pp.
          <fpage>22</fpage>
          -
          <lpage>31</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Neville</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jensen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Iterative classification in relational data</article-title>
          .
          <source>In: Learning Statistical Models from Relational Data</source>
          . pp.
          <fpage>13</fpage>
          -
          <lpage>20</lpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Oyang</surname>
            ,
            <given-names>Y.J.</given-names>
          </string-name>
          , et al.:
          <article-title>Characteristics of a hierarchical data clustering algorithm based on gravity theory</article-title>
          .
          <source>Tech. rep.</source>
          ,
          <source>Technical Report of NTUCSIE 02-01</source>
          .(Available at http://mars. csie. ntu. edu . . . (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Platt</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          :
          <article-title>Fast training of support vector machines using sequential minimal optimization</article-title>
          .
          <source>Advances in kernel methods</source>
          pp.
          <fpage>185</fpage>
          -
          <lpage>208</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Qadir</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riloff</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Classifying sentences as speech acts in message board posts</article-title>
          .
          <source>In: Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>748</fpage>
          -
          <lpage>758</lpage>
          . Association for Computational Linguistics (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Searle</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <article-title>A taxonomy of illocutionary acts (</article-title>
          <year>1975</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Searle</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <article-title>Speech acts: An essay in the philosophy of language</article-title>
          , vol.
          <volume>626</volume>
          . Cambridge university press (
          <year>1969</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>D.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santoro</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baião</surname>
            ,
            <given-names>F.A.</given-names>
          </string-name>
          :
          <article-title>Discovering collaborative knowledge-intensive processes through e-mail mining</article-title>
          .
          <source>Journal of Network and Computer Applications</source>
          <volume>36</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1451</fpage>
          -
          <lpage>1465</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Van Der Aalst</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , et al.:
          <article-title>Process mining manifesto</article-title>
          .
          <source>In: International Conference on Business Process Management</source>
          . pp.
          <fpage>169</fpage>
          -
          <lpage>194</lpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Vinh</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , et al.:
          <article-title>Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>11</volume>
          (Oct),
          <fpage>2837</fpage>
          -
          <lpage>2854</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Weijters</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Process mining with the heuristics miner-algorithm</article-title>
          .
          <source>Technische Universiteit Eindhoven, Tech. Rep. WP 166</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>34</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>