<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mining Business Process Information from Email Logs for Business Process Models Discovery</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Diana Jlailaty</string-name>
          <email>diana.al-jlailaty@dauphine.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniela Grigori</string-name>
          <email>daniela.grigori@dauphine.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Khalid Belhajjame</string-name>
          <email>khalid.belhajjame@dauphine.fr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université Paris-Dauphine, PSL Research University</institution>
          ,
          <addr-line>CNRS, [UMR 7243], LAMSADE, 75016 Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Université Paris-Dauphine, PSL Research University</institution>
          ,
          <addr-line>CNRS, [UMR 7243], LAMSADE, 75016 Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Université Paris-Dauphine, PSL Research University</institution>
          ,
          <addr-line>CNRS, [UMR 7243], LAMSADE, 75016 Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Information exchanged in email texts usually concerns complex events or business processes in which the entities exchanging emails collaborate to achieve the processes' final goals. Thus, the flow of information in sent and received emails constitutes an essential part of such processes, i.e. the tasks or the business activities. An email can be harvested to understand the undocumented business process information it contains. Our goal in this work is to recast emails into a resource of business-oriented information. We describe a framework composed of several analytical approaches that extract this kind of information from email logs, i.e. that transform an email log into an event log. The efficiency of all approaches is studied through experiments on the open Enron email dataset.</p>
        <p>Index Terms: Email Analysis, Business Process Models, Text Mining, Process Instances, Business Activities</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>Email is by and large the first and most popular
professional communication and social medium<sup>1</sup>. It is a
reliable, confidential, fast, free and easily accessible form of
communication. Exchanging emails becomes essential when
carrying out tasks in organizational processes requires the
involvement of multiple individuals. Assigning tasks, asking
for more information, reporting results: all these activities are
enacted via email messages. Therefore, such email messages
necessarily contain process-related information that refers to
the business process under execution.</p>
      <p>However, email analysis from a Business Process
Management (BPM) perspective has not been thoroughly studied in the
literature. Some existing works identify email activities
from a predefined set of activities [5],
[4], [3]. The email analyzer developed by van der Aalst and Nikolov
[6] requires user intervention to extract a process
instance from an email log. Hence, to our knowledge,
no previous work has tackled
the problem of extracting business process information from
emails automatically, without any a priori knowledge, for the
purpose of business process model discovery.</p>
      <p>In this work, we aim to analyze the unstructured data in
emails to harvest the undocumented business process
information they contain, i.e. to transform email logs into event logs. In an event log, events
are characterized by a set of attributes. Each event corresponds
to an activity (associated with an activity label) that is executed in
a process (associated with a Process Identifier), and multiple
events (ordered by their timestamps) can be linked together as
a process instance (associated with a Process Instance Identifier).
Transforming email logs into event logs allows us to produce
business process models using the available process mining
tools. The produced business process models provide a
clear overview of the processes and activities in a user's
email log, which facilitates the organization and retrieval of
emails. We develop a framework that includes
several approaches, contributing the following:
        <list list-type="bullet">
          <list-item><p>An approach that automatically finds, for each email,
the business process topic it belongs to, i.e. extraction of
the Process Identifier (ProcessID) for each email.</p></list-item>
          <list-item><p>A process instance discovery approach that
automatically finds the business process instance an email
belongs to, i.e. extraction of the Process Instance Identifier
(ProcessInstanceID) for each email.</p></list-item>
          <list-item><p>An approach that automatically extracts multiple business
activities from emails and annotates the elicited
activities, i.e. extraction of the activity labels in an email.</p></list-item>
          <list-item><p>A preliminary approach that estimates the real
occurrence time of an event or email activity, i.e. extraction of
the activity occurrence timestamp.</p></list-item>
        </list>
The efficiency of all the above approaches is evaluated
using multiple email folders from the Enron email dataset<sup>2</sup>.</p>
      <p>The rest of this paper is organized as follows. Section II briefly
reviews the related work. Section III gives an overview of the overall
framework. Sections IV, V, VI, and
VII explain the phases of our framework. Finally, Section VIII
concludes the work.</p>
    </sec>
    <sec id="sec-2">
      <title>II. RELATED WORK</title>
      <p>The common objective of the related works presented in this
section is to categorize emails into a set of classes (folders,
topics, importance, main activities). Alsmadi
et al. [1] use a large set of emails for the purpose of
folder classification. Five classes are proposed to label the
nature of emails: Personal, Job, Profession, Friendship, and
Others. Yoo et al. [7] develop a personalized
email prioritization method using a supervised classification
framework. The goal is to model personal priorities over email
messages, and to predict importance levels for new messages
using standard Support Vector Machines (SVMs) as classifiers.
Bekkerman et al. [2] represent emails as
bags of words (vectors of word counts) to classify them into a
predefined set of classes (folders). Faulring et
al. [5] classify tasks contained within email sentences
into a predefined set of 8 task classes.</p>
      <p>In our work, we overcome the limitation of specifying
a predefined set of process topic and activity classes. Our
approach is able to cope with the diversity of business topic
and activity types that can exist in emails. In contrast to some
previous works, we automatically discover and label all topic
and activity types present in an email. In addition, instead
of dealing with a single activity type per email, we work on
discovering multiple business-oriented activities in an email.</p>
      <fn-group>
        <fn id="fn1"><label>1</label><p>http://onlinegroups.net/blog/2014/03/06/use-email-for-collaboration/</p></fn>
        <fn id="fn2"><label>2</label><p>https://www.cs.cmu.edu/~enron/</p></fn>
      </fn-group>
    </sec>
    <sec id="sec-3">
      <title>III. FRAMEWORK OVERVIEW</title>
      <p>In this section, we present the overall framework developed
in this paper. It is composed of three main components. Figure
1 shows an overview of the components of the framework. The
framework takes the email log as input. The first component
is Process Topic Discovery, where each email is associated
with a business process topic. The second component is
Process Instances Discovery, where each email is associated
with a business process instance. The third component is
Process Activities Discovery, where each email is associated
with a set of process activities.</p>
    </sec>
    <sec id="sec-4">
      <title>IV. PROCESS TOPIC DISCOVERY</title>
      <sec id="sec-4-1">
        <title>A. Email Log Preprocessing</title>
        <p>Each email in an email log is represented by a set of attributes
describing it: email subject, sender, receiver, email body, and
email timestamp. Because email text falls under the
category of unstructured data, considerable effort is needed
to preprocess this data into fixed fields so that it can be
queried, quantified, and analyzed with data mining techniques.
The following main steps are applied in this component:
          <list list-type="order">
            <list-item><p>Data Cleansing: removing stopwords, extra whitespace, and
punctuation, and stemming.</p></list-item>
            <list-item><p>Data Representation: representing email bodies and
subjects as numerical Term Frequency-Inverse
Document Frequency (TF-IDF) matrices that can be used in the
analysis.</p></list-item>
            <list-item><p>Verb-Nouns Extraction: we consider verb-noun
pairs to be likely candidates for business
activities.</p></list-item>
          </list>
        </p>
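        <p>The cleansing and representation steps above can be illustrated with a minimal sketch in plain Python. The stopword list, the regular-expression tokenizer, and the shifted idf term are assumptions made for this example only; stemming and verb-noun extraction are not shown, and the sketch is not the implementation used in our experiments.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption for this sketch).
STOPWORDS = {"the", "a", "an", "to", "for", "of", "and", "is", "in", "on", "please"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens only, and remove stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def tfidf_matrix(documents):
    """Return (vocabulary, rows) with one TF-IDF row per document."""
    tokenized = [tokenize(d) for d in documents]
    vocabulary = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(w for doc in tokenized for w in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1
        row = []
        for w in vocabulary:
            tf = counts[w] / total
            # Shifted idf (assumption): terms found in every document keep a non-zero weight.
            idf = math.log(n_docs / df[w]) + 1.0
            row.append(tf * idf)
        rows.append(row)
    return vocabulary, rows


emails = [
    "Please schedule the meeting for Monday.",
    "Meeting room booked for the conference call.",
    "Submit the funding application for the conference.",
]
vocabulary, matrix = tfidf_matrix(emails)
```
]]></preformat>
        <p>Each row of <monospace>matrix</monospace> is the vector of one email over the shared vocabulary; such rows are the kind of numerical input that the similarity computations of the next section operate on.</p>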
      </sec>
      <sec id="sec-4-2">
        <title>B. Clustering Emails According to their Process Topics</title>
        <p>In this section, our objective is to group the emails into
clusters according to the process model they concern.
For instance, the emails of a researcher typically
concern different processes such as
scheduling a meeting or organizing a conference.</p>
        <p>
          Since in our case we do not have any a priori knowledge
about the business process topics available in the email log,
we use an unsupervised machine learning method, namely
clustering. In particular, hierarchical clustering is used to group
emails into clusters based on their similarity. The clusters are fused
agglomeratively (using complete linkage)
according to the chosen similarity measure. We
try two different methods for measuring semantic similarity
between the selected features of emails: (1) Latent Semantic
Analysis, which maps words occurring in emails into
concepts, and (2) Word2vec, which is used to calculate the
similarities between emails according to the context of their
words. The hierarchy is cut such that emails belonging to the same
process model are clustered together. The output of this cut
is a set of clusters {PC<sub>1</sub>, PC<sub>2</sub>, PC<sub>3</sub>, ..., PC<sub>n</sub>}, where each
cluster PC<sub>i</sub> contains a set of emails related to the same process
model topic. PC<sub>i</sub>, and consequently the emails contained in it,
is associated with a ProcessID.
        </p>
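        <p>As a rough illustration of agglomerative clustering with complete linkage, the sketch below merges the two closest clusters until the closest pair is farther apart than a cut threshold. The toy vectors, the cosine distance, and the threshold value are assumptions for this example; in our framework the vectors would come from Latent Semantic Analysis or Word2vec, which are not reproduced here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; a zero vector is treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def complete_linkage(cluster_a, cluster_b, vectors):
    """Cluster distance = largest pairwise distance between their members."""
    return max(cosine_distance(vectors[i], vectors[j])
               for i in cluster_a for j in cluster_b)


def agglomerative_clusters(vectors, cut):
    """Merge the closest pair of clusters until that pair is farther than `cut`."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = complete_linkage(clusters[a], clusters[b], vectors)
                if best is None or best[0] > d:
                    best = (d, a, b)
        d, a, b = best
        if d > cut:
            break  # cutting the hierarchy at this height
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy email vectors (assumption): emails 0 and 1 share a topic, as do emails 2 and 3.
email_vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 0.1, 1.0],
    [0.0, 0.2, 0.9],
]
process_clusters = agglomerative_clusters(email_vectors, cut=0.5)
```
]]></preformat>
        <p>Each resulting list of email indices plays the role of one cluster PC<sub>i</sub>; the position of a cluster in <monospace>process_clusters</monospace> could serve as its ProcessID.</p>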
      </sec>
    </sec>
    <sec id="sec-5">
      <title>V. PROCESS INSTANCES DISCOVERY</title>
      <p>In process mining, there are two main terms: the business
process model and the business process instance. A process
instance is a specific occurrence or execution of a business
process model. Each email log can contain processes of
different topics. Let us suppose that the "meeting
scheduling" process topic exists in an email log. An employee may exchange
emails with two different entities for scheduling different
meetings. Thus, multiple email exchanges take place for this
purpose. These email exchanges represent the different
executions or occurrences. In this section, we analyze
email texts to identify, for each email, the process instance
it belongs to. This is mainly done by choosing attributes
and features from email texts that help distinguish
between emails of different process instances and group
emails of the same process instance. We work on choosing the best
distance function, consisting of a combination of attributes,
for clustering emails into process instances.</p>
      <p>For discovering business process instances from email logs,
we start from the previously obtained process topic clusters
{PC<sub>1</sub>, PC<sub>2</sub>, PC<sub>3</sub>, ..., PC<sub>n</sub>}, where each cluster PC<sub>i</sub>
represents a business process topic. We aim to deduce, for each
business process topic cluster, the set of process instances it
contains.</p>
      <sec id="sec-5-1">
        <title>A. Defining an Appropriate Distance Function</title>
        <p>In order to separate emails belonging to the same
process model into different process instances, we apply a
subclustering step on the already obtained process topic clusters.
We illustrate the steps of this phase and the definition of the
distance function with an example concerning
applications for mission funding.</p>
        <p>Example: Suppose that one of the obtained clusters
contains emails about all applications of Ph.D. students for the
"missions funding" process topic. Emails of the same process
instance are expected to revolve around some common names.
Take as an example the emails exchanged between a student
and the secretary when applying for funding to attend the
BDCSIntell 2019 conference in Versailles. Most of these email
bodies and subjects will include the named entities
"BDCSIntell", "Versailles" or "Paris". We claim that these named
entities can help discover which emails are related
to the same process execution (same mission application).
However, in some cases named entities are not sufficient to
distinguish instances. Suppose two emails are about applying
for a travel grant for the same conference, BDCSIntell, but in two
different years, 2018 and 2019. These two emails should
belong to different process instances. Using only the named
entities, they would be considered as belonging to
the same process instance. Thus, we add another
attribute, which indicates the time at which
an email was sent. Although the named entities and the email timestamp
provide a good indication for separating emails into
different instances, these two
attributes are not always sufficient. Suppose two different
students are applying to the same conference, BDCSIntell, in
the same year, 2019. The named entities and the timestamps
of the emails of these students will be similar; however,
these emails belong to different process instances (one for each
student). Therefore, we add a third attribute, the
sender/receiver of an email, which can separate emails in the
described case. We define the distance function as follows: we
first define the similarity function and then derive the distance
function.</p>
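        <p>
          <disp-formula id="eq1">
            <label>(1)</label>
            <tex-math><![CDATA[\mathrm{Distance}(E_{ij}, E_{ik}) = 1 - \bigl( w_1 \times \mathrm{Sim}(NE_{ij}, NE_{ik}) + w_2 \times \mathrm{Sim}(T_{ij}, T_{ik}) + w_3 \times \mathrm{Sim}(SR_{ij}, SR_{ik}) \bigr)]]></tex-math>
          </disp-formula>
where E<sub>ij</sub> and E<sub>ik</sub> are two different emails j and k in
the same process model cluster PC<sub>i</sub>. (NE<sub>ij</sub>, T<sub>ij</sub>, SR<sub>ij</sub>) and
(NE<sub>ik</sub>, T<sub>ik</sub>, SR<sub>ik</sub>) are the named entities of the subjects
and bodies, the timestamps, and the sender/receiver of emails E<sub>ij</sub> and
E<sub>ik</sub>, respectively. The weights w<sub>1</sub>, w<sub>2</sub>, w<sub>3</sub> (w<sub>1</sub> + w<sub>2</sub> + w<sub>3</sub> = 1)
represent the relative importance of the named entities, timestamps
and sender/receiver of emails, respectively.
        </p>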
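        <p>The distance function above leaves the individual similarity functions open. The following sketch shows one possible instantiation under stated assumptions: Jaccard overlap for named entities and for sender/receiver sets, and a linear decay over a 30-day horizon for timestamps. These choices, the weight values, and the dictionary keys are hypothetical and serve only to make the formula concrete.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from datetime import datetime


def jaccard(set_a, set_b):
    """Set overlap in [0, 1]; two empty sets are treated as dissimilar."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def time_similarity(t_a, t_b, horizon_days=30.0):
    """Linear decay (assumption): 1 for equal timestamps, 0 at or beyond the horizon."""
    gap_days = abs((t_a - t_b).total_seconds()) / 86400.0
    return max(0.0, 1.0 - gap_days / horizon_days)


def distance(email_a, email_b, w1=0.5, w2=0.25, w3=0.25):
    """Distance = 1 - weighted similarity, with w1 + w2 + w3 = 1."""
    if abs(w1 + w2 + w3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    similarity = (
        w1 * jaccard(email_a["entities"], email_b["entities"])
        + w2 * time_similarity(email_a["time"], email_b["time"])
        + w3 * jaccard(email_a["people"], email_b["people"])
    )
    return 1.0 - similarity


# Toy emails modelled on the mission funding example (values are illustrative).
e1 = {"entities": {"BDCSIntell", "Versailles"}, "time": datetime(2019, 3, 1),
      "people": {"student_a", "secretary"}}
e2 = {"entities": {"BDCSIntell", "Versailles"}, "time": datetime(2019, 3, 4),
      "people": {"student_a", "secretary"}}
e3 = {"entities": {"BDCSIntell", "Versailles"}, "time": datetime(2019, 3, 2),
      "people": {"student_b", "secretary"}}

same_student = distance(e1, e2)
other_student = distance(e1, e3)
```
]]></preformat>
        <p>With these toy values, the two emails of the same student are closer than the emails of two different students applying to the same conference, which is the separation that the sender/receiver attribute is meant to provide.</p>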
      </sec>
      <sec id="sec-5-2">
        <title>B. Clustering Emails Into Process Instances</title>
        <p>Using the above distance function, we calculate the distances
between all pairs of emails. Hierarchical
clustering is then applied, which organizes the emails in a
hierarchical structure. We try several cuts on the obtained
hierarchy and choose the one that provides the best clustering
quality (according to the clustering quality measures mentioned
in the experimentation section). Each of the obtained clusters
IC<sub>i</sub> contains emails belonging to the same process instance.
Every cluster is assigned a process instance identifier.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>VI. PROCESS ACTIVITIES DISCOVERY</title>
      <sec id="sec-6-1">
        <p>A Business Process Model is composed of a set of Business
Activities enacted in a specific sequence to achieve a business
goal. Each email is sent with the aim of requesting, canceling, or
confirming a specific task or set of tasks. One of the main
attributes in the event log is the activity label. Therefore, in this
section, we work on extracting activity labels from email logs.
There are two main hypotheses for the extraction of business
process activities from email logs. The first hypothesis is that
each email contains one and only one activity, which is not always
true. Therefore, we adopt the second hypothesis, in which
we assume that an email can contain zero, one or more activities.</p>
        <p>We build an approach that takes as input an email log and
produces the set of business activity types it contains.</p>
      </sec>
      <sec id="sec-6-2">
        <title>A. Relevant Sentences Extraction</title>
        <p>
          The goal of this phase is to extract from each email the
sentences that contain business activities or information about
activities. To identify relevant sentences in an email, we use
a classification technique that associates each email sentence
with one of the following labels: Relevant or Non-Relevant.
We characterize each email sentence by a set of features
that describe it: (1) Sentence position, (2) Sentence length,
(3) Number of named entities, (4) Cohesion with the centroid
sentence, (5) Dissimilarity with greeting phrases, and
(6) Similarity of the verb-nouns with
process-oriented activities extracted from a repository of process
models of different domains.
        </p>
        <p>We build the training data by labelling the sentence
feature vectors. An expert decides whether each sentence is
meaningful from a business-oriented perspective. The training
dataset is used to train the classification model. Different
classification techniques are tried in order to obtain the best classification
results.</p>
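        <p>To make the notion of a sentence feature vector more concrete, the sketch below computes three of the listed features (position, length, and dissimilarity with greeting phrases) using simple heuristics. The greeting word list, the normalization of the position, and the function name are assumptions of this example; the remaining features require named entity recognition and a process model repository and are not shown.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical greeting vocabulary (assumption for this sketch).
GREETINGS = {"hi", "hello", "dear", "regards", "thanks", "best"}


def sentence_features(sentences, index):
    """Return a small feature dictionary for sentences[index] within one email."""
    text = sentences[index].lower().replace(",", " ").replace(".", " ")
    words = text.split()
    length = len(words)
    # Relative position: 0.0 for the first sentence, 1.0 for the last one.
    position = index / max(1, len(sentences) - 1)
    greeting_ratio = sum(w in GREETINGS for w in words) / max(1, length)
    return {
        "position": position,
        "length": length,
        "greeting_dissimilarity": 1.0 - greeting_ratio,
    }


sentences = [
    "Hi Anna,",
    "Please send the signed funding form by Friday.",
    "Best regards, Tom",
]
features = [sentence_features(sentences, i) for i in range(len(sentences))]
```
]]></preformat>
        <p>Feature dictionaries of this kind, once labelled Relevant or Non-Relevant by an expert, would form the training set of the sentence classifier.</p>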
      </sec>
      <sec id="sec-6-3">
        <title>B. Activity Types Discovery</title>
        <p>The business activities elicited in the previous step can
be further processed to organize them by activity type.
Hierarchical clustering is applied to the sentences containing
process-oriented verb-noun pairs (i.e. activities). The similarity
between two sentences is calculated as the cosine similarity
between the Word2vec vectors of their verb-noun pairs.
This phase yields a set of clusters {AC<sub>i</sub>},
where each cluster contains sentences from different emails
that share the same activity type. For each cluster, we
choose the top N verb-noun pairs mentioned in the activity
cluster (for example, N = 3). One of these
verb-noun candidates can then be chosen by an expert as the label
of the cluster.</p>
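        <p>The two computations used in this phase, cosine similarity between verb-noun vectors and the selection of the top N label candidates of a cluster, can be sketched as follows. The three-dimensional vectors are hand-made stand-ins for Word2vec embeddings, and the function names are hypothetical; this is an illustration under those assumptions, not our implementation.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_candidates(cluster_pairs, top_n=3):
    """Return the top_n most frequent verb-noun pairs of one activity cluster."""
    return [pair for pair, _ in Counter(cluster_pairs).most_common(top_n)]


# Toy vectors standing in for Word2vec embeddings of verb-noun pairs (assumption).
toy_vectors = {
    ("schedule", "meeting"): [0.9, 0.1, 0.0],
    ("arrange", "meeting"): [0.8, 0.2, 0.1],
    ("submit", "application"): [0.0, 0.2, 0.9],
}
same_type = cosine_similarity(toy_vectors[("schedule", "meeting")],
                              toy_vectors[("arrange", "meeting")])
other_type = cosine_similarity(toy_vectors[("schedule", "meeting")],
                               toy_vectors[("submit", "application")])

# Verb-noun pairs found in the sentences of one activity cluster.
cluster = [
    ("schedule", "meeting"), ("arrange", "meeting"), ("schedule", "meeting"),
    ("plan", "meeting"), ("schedule", "meeting"), ("arrange", "meeting"),
]
candidates = label_candidates(cluster, top_n=2)
```
]]></preformat>
        <p>The expert would then pick one of the returned candidates as the label of the activity cluster.</p>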
      </sec>
    </sec>
    <sec id="sec-7">
      <title>VII. TEMPORAL FEATURES EXTRACTION</title>
      <p>Each email is exchanged at a specific timestamp. As a first
hypothesis, one can consider that the timestamp of an email
activity is the same as that of the email it belongs to, which
is not always true. An email may contain business activities
that were already carried out, activities in progress, or activities
that will be carried out in the future. Therefore, we extract two kinds of
temporal relations:
        <list list-type="order">
          <list-item><p>Between the email activities and the email timestamp:
the objective is to temporally locate an email activity
relative to the timestamp of the email in which it occurs.
We divide the relation between the
activity and the email timestamp into categories.
Possible categories are Before, in which the activity
occurs before the email sending time; Overlap, in which
the activity occurs at the time the email is sent; and After,
in which the activity will occur after the email sending
time.</p></list-item>
          <list-item><p>Between the email activities themselves: the objective
here is to extract the temporal relations among the
email activities using the temporal expressions in the email.</p></list-item>
        </list>
      </p>
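      <p>The first kind of relation can be sketched as a simple comparison, assuming that an occurrence time has already been estimated for the activity (for instance from temporal expressions in the email, a step not shown here). The one-hour tolerance that defines Overlap and the function name are assumptions of this example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from datetime import datetime, timedelta


def temporal_relation(activity_time, email_time, tolerance=timedelta(hours=1)):
    """Classify an activity as Before, Overlap, or After the email sending time."""
    delta = activity_time - email_time
    # Activities within the tolerance window count as simultaneous with the email.
    if abs(delta) <= tolerance:
        return "Overlap"
    return "Before" if delta.total_seconds() < 0 else "After"


sent = datetime(2019, 3, 1, 10, 0)
relations = [
    temporal_relation(sent - timedelta(days=2), sent),      # already carried out
    temporal_relation(sent + timedelta(minutes=20), sent),  # in progress
    temporal_relation(sent + timedelta(days=7), sent),      # planned for later
]
```
]]></preformat>
      <p>The resulting category could then be stored alongside the activity label when the event log is assembled.</p>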
    </sec>
    <sec id="sec-8">
      <title>VIII. CONCLUSIONS AND PERSPECTIVES</title>
      <p>
        Throughout this paper, we described the
main approaches that constitute the proposed framework.
The components of the framework mainly
address: (1) business process topic discovery for emails,
(2) business process instance discovery for emails, (3)
business activity discovery, and (4) preliminary estimation of the
activity occurrence timestamp.
      </p>
      <p>The obtained results open several perspectives,
such as building a recommendation system
that recommends activities based on received emails, or
enabling incremental learning in our system.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Izzat</given-names>
            <surname>Alsmadi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ikdam</given-names>
            <surname>Alhami</surname>
          </string-name>
          .
          <article-title>Clustering and classification of email contents</article-title>
          .
          <source>Journal of King Saud University-Computer and Information Sciences</source>
          ,
          <volume>27</volume>
          (
          <issue>1</issue>
          ):
          <fpage>46</fpage>
          -
          <lpage>57</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Ron</given-names>
            <surname>Bekkerman</surname>
          </string-name>
          .
          <article-title>Automatic categorization of email into folders: Benchmark experiments on enron and sri corpora</article-title>
          .
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Vitor R.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          and
          <string-name>
            <given-names>William W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          .
          <article-title>Improving email speech acts analysis via n-gram selection</article-title>
          .
          <source>In Proceedings of the HLT-NAACL 2006 Workshop on Analyzing Conversations in Text and Speech</source>
          , pages
          <fpage>35</fpage>
          -
          <lpage>41</lpage>
          . Association for Computational Linguistics,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>William W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Vitor R.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tom M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          .
          <article-title>Learning to classify email into speech acts</article-title>
          .
          <source>In Proceedings of Empirical Methods in Natural Language Processing</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Faulring</surname>
          </string-name>
          , Brad Myers, Ken Mohnkern, Bradley Schmerl, Aaron Steinfeld, John Zimmerman, Asim Smailagic, Jeffery Hansen, and Daniel Siewiorek.
          <article-title>Agent-assisted task management that reduces email overload</article-title>
          .
          <source>In Proceedings of the 15th international conference on Intelligent user interfaces</source>
          , pages
          <fpage>61</fpage>
          -
          <lpage>70</lpage>
          . ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Wil M. P.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andriy</given-names>
            <surname>Nikolov</surname>
          </string-name>
          .
          <article-title>Emailanalyzer: an e-mail mining plug-in for the prom framework</article-title>
          .
          <source>BPM Center Report BPM-07-16</source>
          , BPMCenter.org,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Shinjae</given-names>
            <surname>Yoo</surname>
          </string-name>
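          ,
          <string-name>
            <given-names>Yiming</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Frank</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Il-Chul</given-names>
            <surname>Moon</surname>
          </string-name>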
          .
          <article-title>Mining social networks for personalized email prioritization</article-title>
          .
          <source>In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>967</fpage>
          -
          <lpage>976</lpage>
          . ACM,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>