=Paper=
{{Paper
|id=Vol-2622/paper1
|storemode=property
|title=Mining Business Process Information from Email Logs for Business Process Models Discovery
|pdfUrl=https://ceur-ws.org/Vol-2622/paper1.pdf
|volume=Vol-2622
|authors=Diana Jlailaty,Daniela Grigori,Khalid Belhajjame
|dblpUrl=https://dblp.org/rec/conf/bdcsintell/JlailatyGB19
}}
==Mining Business Process Information from Email Logs for Business Process Models Discovery==
<pdf width="1500px">https://ceur-ws.org/Vol-2622/paper1.pdf</pdf>
<pre>
   Mining Business Process Information from Email
    Logs for Business Process Models Discovery
                   Diana Jlailaty                                        Daniela Grigori                              Khalid Belhajjame
          Université Paris-Dauphine,                           Université Paris-Dauphine,                     Université Paris-Dauphine,
           PSL Research University,                             PSL Research University,                      PSL Research University,
       CNRS, [UMR 7243], LAMSADE,                            CNRS, [UMR 7243], LAMSADE,                    CNRS, [UMR 7243], LAMSADE,
             75016 Paris, France                                   75016 Paris, France                           75016 Paris, France
         diana.al-jlailaty@dauphine.fr                         daniela.grigori@dauphine.fr                  khalid.belhajjame@dauphine.fr


   Abstract—Exchanged information in emails’ texts is usually                    are characterized by some attributes. Each event corresponds
concerned by complex events or business processes in which                       to an activity (associated to an activity label) that is executed in
the entities exchanging emails are collaborating to achieve the                  the process (associated to a Process Identifier), where multiple
processes’ final goals. Thus, the flow of information in the
sent and received emails constitutes an essential part of such                   events (ordered by their timestamps) can be linked together as
processes i.e. the tasks or the business activities. An email can be             a process instance (associated to a Process Instance Identifier).
harvested for understanding the undocumented business process                    Transforming email logs into event logs allows us to produce
information it contains. Our goal in this work is to recast emails               business process models using the available process mining
into a resource of business-oriented information. We describe a                  tools. The produced business process models can provide a
framework that is constituted of several analytical approaches
able to extract such kind of information from email logs i.e.                    clear overview on the processes and the activities in a user
transforming an email log into an event log. The efficiency of all               email log which facilitates the organization and retrieval of
approaches is studied by applying different experiments on the                   emails. In this work, we develop a framework that includes
open Enron email dataset.                                                        several approaches that contribute the following:
   Index Terms—Email Analysis, Business Process Models, Text
                                                                                    • An approach that can automatically find, for each email,
Mining, Process Instances, Business Activities
                                                                                      the business process topic it belongs to i.e. extraction of
                       I. INTRODUCTION                                                the Process Identifier (ProcessID) for each email.
   Email is by and large the first and the most popular                             • A process instance discovery approach that can auto-

professional communication and social medium 1 . It is a                              matically find the business process instance an email
reliable, confidential, fast, free and easily accessible form of                      belongs to i.e. extraction of the Process Instance Identifier
communication. Exchanging emails becomes essential when                               (ProcessInstanceID) for each email.
applying tasks in organizational processes necessitates the                         • An approach that automatically extracts multiple business

involvement of multiple individuals. Assigning tasks, asking                          activities from emails and that annotates the elicited
for more information, reporting results - all these activities are                    activities i.e. extraction of the activity labels in an email.
enacted via email messages. Therefore, such email messages                          • A preliminary approach that can estimate the real occur-

necessarily contain process-related information that refer to                         rence time of an event or email activity i.e. extraction of
the business process under execution.                                                 the activity occurrence timestamp.
   However, email analysis from a Business Process Manage-                          • The efficiency of all the above approaches is evaluated

ment (BPM) perspective has not been thoroughly studied in the                         using multiple email folders from Enron email dataset 2 .
literature. Some of the existing works allow the identification                     In this paper, we first start by providing a brief study on
of email activities among a predefined set of activities [5],                    the related works in section II. An overview on the overall
[4], [3]. The email analyzer developed by Van der Aalst                          framework is presented in section III. Sections IV, V, VI, and
[6] requires user intervention to extract a process                              VII explain the phases of our framework. Finally, the work is
instance from an email log. Hence, to the best of our                            concluded in section VIII.
knowledge, no previous work has tackled
                                                                                                         II. RELATED WORK
the problem of extracting business process information from
emails automatically without any a priori knowledge for the                         The common objective of the related works presented in this
goal of business process models discovery.                                       section is to categorize emails into a set of classes (folders,
   In this work, we aim to analyze the unstructured data in                      topics, importance, main activities). In the work of Alsmadi
emails to harvest the undocumented business process infor-                       et al. [1], a large set of emails is used for the purpose of
mation from email logs i.e. event logs. In an event log, events                  folder classifications. Five classes are proposed to label the
  1 http://onlinegroups.net/blog/2014/03/06/use-email-for-collaboration/           2 https://www.cs.cmu.edu/~enron/


 Copyright © 2019 for this paper by its authors. Use permitted under Creative
 Commons License Attribution 4.0 International (CC BY 4.0).


nature of emails: Personal, Job, Profession, Friendship, and        queried, quantified, and analyzed with data mining techniques.
Others. Another work by Yoo et al. [7] develops a personalized      The main steps applied in this component are:
email prioritization method using a supervised classification         1) Data Cleansing: removing stopwords, whitespace, and
framework. The goal is to model personal priorities over email           punctuation, and stemming.
messages, and to predict importance levels for new messages           2) Data Representation: representing email bodies and
using standard Support Vector Machines (SVMs) as classifiers.            subjects as numerical Term Frequency-Inverse Document
Bekkerman et al. [2] represent emails as bag-of-words                    Frequency (TF-IDF) matrices that can be used in the
(vectors of word counts) to classify them into a predefined              analysis.
set of classes (folders). Faulring et al. [5] classify tasks          3) Verb-Nouns Extraction: we consider that the verb-noun
contained within email sentences into a predefined set of 8              pairs are likely candidates for business
task classes.                                                            activities.
   In our work, we overcome the limitation of specifying
a predefined set of process topic and activity classes. Our         B. Clustering Emails According to their Process Topics
approach is able to cope with the diversity of business topic
                                                                       In this section, our objective is to group the emails into
and activity types that can exist in emails. In contrast to some
                                                                    clusters according to what process model they are concerned
previous works, we automatically discover and label all topic
                                                                    by. If we fetch the emails of a researcher, we can detect that
and activity types presented in an email. In addition, instead
                                                                    his/her emails are concerned by different processes such as
of dealing with a single activity type per email, we work on
                                                                    scheduling a meeting or organizing a conference.
discovering multiple business-oriented activities in an email.
                                                                       Since, in our case, we do not have any a priori knowledge
                III. FRAMEWORK OVERVIEW                             about the business process topics available in the email log,
   In this section, we present the overall framework developed      we use an unsupervised machine learning method which is the
in this paper. It is composed of three main components. Figure      clustering. In particular, hierarchical clustering is used to group
1 shows an overview of the components of the framework. The         emails in clusters based on their similarity. Agglomeratively
framework takes as an input the email log. The first component      (using the complete linkage), the clusters are fused together
is for Process Topic Discovery where each email is associated       according to the chosen similarity measurement technique. We
to a business process topic. The second component is for            try two different methods for semantic similarity measurement
Process Instances Discovery where each email is associated          between the selected features of emails (1) Latent Semantic
to a business process instance. The third component is for          Analysis which is used to map words occurring in emails into
Process Activities Discovery where each email is associated         concepts, and (2) Word2vec which is used to calculate the
to a set of process activities.                                     similarities between emails according to the context of their
                                                                    words. The hierarchy is cut such that emails belonging to the same
                                                                    process model are clustered together. The output of this cut
                                                                    is a set of clusters {PC1, PC2, PC3, ..., PCn}, where each
                                                                    cluster PCi contains a set of emails related to the same process
                                                                    model topic. PCi and subsequently the emails contained in it
                                                                    are associated with a ProcessID.
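
The clustering step of Section IV-B can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in plain TF-IDF cosine distance for the LSA and Word2vec similarities discussed above, and the stopword list, the cut value 0.8, the toy emails, and all function names are assumptions made for illustration only.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list (an assumption, not the paper's list).
STOPWORDS = {"a", "an", "the", "is", "are", "on", "to", "we", "can", "for", "of", "and"}


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [
        [t for t in re.findall(r"[a-z]+", d.lower()) if t not in STOPWORDS]
        for d in docs
    ]
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    n_docs = len(docs)
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            t: (c / len(toks)) * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, c in counts.items()
        })
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors (1.0 if either is empty)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def complete_linkage_clusters(vectors, cut):
    """Agglomerative clustering; stop once the closest pair of clusters exceeds `cut`."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > 1:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[a], clusters[b]) > cut:
            break
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters


emails = [
    "Can we schedule the meeting on Monday?",
    "The meeting on Monday is moved to room B",
    "The conference paper submission is due on Friday",
    "Reviewers are assigned to the conference paper",
]
process_clusters = complete_linkage_clusters(tfidf_vectors(emails), cut=0.8)
print(process_clusters)  # each inner list holds the indices of one topic cluster
```

Each resulting cluster would then receive its own ProcessID; in practice the cut level has to be tuned on the data rather than fixed in advance.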

                                                                                V. PROCESS INSTANCES DISCOVERY
                                                                       In process mining, there are two main terms: the business
                                                                    process model and the business process instance. A process
                                                                    instance is a specific occurrence or execution of a business
                                                                    process model. Each email log can contain processes of dif-
                                                                    ferent topics. Let us suppose that in an email log, the "meeting
                                                                    scheduling" process topic exists. An employee may exchange
                                                                    emails with two different entities for scheduling different
                   Fig. 1. The overall framework.                   meetings. Thus, multiple email exchanges take place for this
                                                                    purpose. These email exchanges represent the different
              IV. PROCESS TOPIC DISCOVERY                           executions or occurrences. In this section, we work on analyzing
                                                                    email texts for identifying for each email the process instance
A. Email Log Preprocessing                                          it belongs to. This is mainly done by choosing attributes
   Each email in an email log is represented by some attributes     and features from email texts that help in distinguishing
describing it: email subject, sender, receiver, email body, and     between emails of different process instances and in grouping
email timestamp. Knowing that email text falls under the            emails of the same process instance. We work on choosing the best
category of unstructured data, one must take considerable time      distance function that consists of a combination of attributes
to preprocess this data with fixed fields so that they can be       for clustering emails into process instances.
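
As a rough sketch of the kind of distance function this section builds up to (Equation 1 below: one minus a weighted sum of named-entity, timestamp, and sender/receiver similarities), the following shows one possible instantiation. The paper does not specify the individual similarity functions, so the Jaccard overlap, the linear time decay with a 30-day horizon, the weight values, and the dictionary layout of an email are all assumptions.

```python
from datetime import datetime


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty (no evidence)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def time_similarity(t1, t2, horizon_days=30.0):
    """Linear decay from 1 (same instant) to 0 (horizon_days or more apart)."""
    gap_days = abs((t1 - t2).total_seconds()) / 86400.0
    return max(0.0, 1.0 - gap_days / horizon_days)


def email_distance(e1, e2, weights=(0.5, 0.25, 0.25)):
    """One minus the weighted sum of the three similarities (cf. Equation 1)."""
    w1, w2, w3 = weights
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9, "weights must sum to 1"
    similarity = (
        w1 * jaccard(e1["entities"], e2["entities"])
        + w2 * time_similarity(e1["time"], e2["time"])
        + w3 * jaccard(e1["participants"], e2["participants"])
    )
    return 1.0 - similarity


# Toy emails mirroring the "missions funding" example (hypothetical data).
a = {"entities": {"BDCSIntell", "Versailles"}, "time": datetime(2019, 10, 1),
     "participants": {"student1", "secretary"}}
b = {"entities": {"BDCSIntell", "Versailles"}, "time": datetime(2019, 10, 3),
     "participants": {"student1", "secretary"}}
c = {"entities": {"BDCSIntell", "Versailles"}, "time": datetime(2019, 10, 2),
     "participants": {"student2", "secretary"}}

d_same_student = email_distance(a, b)
d_other_student = email_distance(a, c)
print(d_same_student, d_other_student)
```

With these assumed choices, the two emails of the same student end up closer to each other than emails of different students applying to the same conference, which is the behaviour the sender/receiver attribute is meant to provide.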


   For discovering business process instances from email logs,                                          B. Clustering Emails Into Process Instances
we start from the previously obtained process topic clusters                                               Using the above distance function, we calculate distances
{PC1, PC2, PC3, ..., PCn}, where each cluster PCi represents                                            between all pairs of emails. Accordingly, hierarchical
a business process topic. We aim to deduce for each                                                     clustering is applied where we get the emails distributed on a
business process topic cluster, the set of process instances it                                         hierarchical structure. We tried several cuts on the obtained hi-
contains.                                                                                               erarchy. We choose the one which provides the best clustering
                                                                                                        quality (according to clustering quality measures mentioned
A. Defining an Appropriate Distance Function                                                            in the experimentation section). Each of the obtained clusters
   In order to separate emails belonging to the same pro-                                               ICi contains emails belonging to the same process instance.
cess model into different process instances, we apply a sub-                                            Every cluster is provided a process instance identifier.
clustering step on the already obtained process topics clusters.
                                                                                                                    VI. PROCESS ACTIVITIES DISCOVERY
We illustrate the steps of this phase and the distance func-
tion definition by using an example which is concerned by                                                  A Business Process Model is composed of a set of Business
applications for missions funding.                                                                      Activities enacted in a specific sequence to achieve a business
      Example: Suppose that one of the obtained clusters                                                goal. Each email is sent for the aim of requesting, canceling,
contains emails about all applications of Ph.D students for                                             confirming a specific task or set of tasks. One of the main
"missions funding" process topic. Emails of the same process                                            attributes in the event log is the activity label. Therefore, in this
instance are supposed to revolve around some common names.                                              section, we work on extracting activity labels from email logs.
Take as an example the emails exchanged between a student                                               There are two main hypotheses for the extraction of business
and the secretary for applying for funding to attend the                                                process activities from email logs. The first hypothesis is that
BDCSIntell 2019 conference in Versailles. Most of these emails'                                         each email contains one and only one activity, which is not always
bodies and subjects will include the named entities "BDCSIntell"                                        true. Therefore, we propose the second hypothesis in which
or "Versailles" or "Paris". We claim that these named                                                   we assume that an email can contain 0, 1 or more activities.
entities can be helpful in discovering which emails are related                                            We build an approach that takes as input an email log and
to the same process execution (same mission application).                                               produces the set of business activity types it contains.
However, in some cases named entities will not be sufficient to                                         A. Relevant Sentences Extraction
distinguish instances. Suppose two emails are about applying
                                                                                                           The goal of this phase is to extract from each email the
to a travel grant for the same conference BDCSIntell but in two
                                                                                                        sentences that contain business activities or information about
different years 2018 and 2019. These two emails are supposed
                                                                                                        activities. To identify relevant sentences in an email, we use
to belong to different process instances. Using only the named
                                                                                                        a classification technique that associates each email sentence
entities, these two emails will be considered belonging to
                                                                                                        with one of the following labels Relevant or Non-Relevant.
the same process instance. Thus, we decide to add another
                                                                                                        We characterize each email sentence by a set of features
attribute which gives an indication about the time of sending
                                                                                                        that describe it: (1) Sentence position, (2) Sentence length,
an email. Although the named entities and the email timestamp
                                                                                                        (3) Number of named entities, (4) Cohesion with centroid
have provided a good indication for separating emails into
                                                                                                        sentence, (5) Dissimilarity with greeting phrases, and
different instances, some cases have proven that these two
                                                                                                        (6) Similarity of the verb-nouns with process-oriented
attributes are not always sufficient. Suppose two different
                                                                                                        activities extracted from a repository of process
students are applying to the same conference BDCSIntell in
                                                                                                        models of different domains.
the same year 2019. The named entities and the timestamps
                                                                                                           We build the training data by labelling the sentence
of the emails of these students will be similar, however,
                                                                                                        feature vectors. An expert decides whether the sentence is
these emails belong to different process instances (for two
                                                                                                        meaningful from a business-oriented perspective. The training
different students). Therefore, we add a new attribute which is
                                                                                                        dataset is used to train the classification model. Different clas-
sender/receiver of an email which can separate emails as in the
                                                                                                        sification techniques are used to obtain the best classification
described case. We define the distance function as follows: we
                                                                                                        results.
first define the similarity function and then derive the distance
function.                                                                                               B. Activity Types Discovery
Distance(E_{ij}, E_{ik}) = 1 - (w_1 \times Sim(NE_{ij}, NE_{ik})                                           The business activities elicited in the previous steps can
      + w_2 \times Sim(T_{ij}, T_{ik}) + w_3 \times Sim(SR_{ij}, SR_{ik}))   (1)                        be further processed to organize them by their activity types.
where E_{ij} and E_{ik} are two different emails j and k in                                             Hierarchical clustering is applied to sentences containing
the same process model cluster C_i. (NE_{ij}, T_{ij}, SR_{ij}) and                                      process-oriented verb-noun pairs (i.e. activities). The similarity
(NE_{ik}, T_{ik}, SR_{ik}) are the named entities of the subjects                                       between two sentences is calculated using the cosine similarity
and bodies, timestamps and sender/receiver of emails E_{ij} and                                         between the Word2vec vectors of their verb-noun pairs (i.e.
E_{ik} respectively. Weights w_1, w_2, w_3 (w_1 + w_2 + w_3 = 1)                                        activities). This phase will give as a result a set of clusters
represent the relative importance of named entities, timestamps                                         where each cluster contains sentences from different emails
and sender/receiver of emails, respectively.                                                            but with the same activity type ({ACi}). For each cluster, we


choose the top N verb-noun pairs mentioned in the activity                    [7] Shinjae Yoo, Yiming Yang, Frank Lin, and Il-Chul Moon. Mining social
cluster (for example N can be equal to 3). Then one of these                      networks for personalized email prioritization. In Proceedings of the
                                                                                  15th ACM SIGKDD international conference on Knowledge discovery
verb-noun candidates can be chosen by an expert as a label                        and data mining, pages 967–976. ACM, 2009.
for the cluster.
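
The label-candidate step described above (top N verb-noun pairs per activity cluster) can be sketched as follows. This is an illustration under stated assumptions: the clustering itself is taken as already done, and the function name `label_candidates`, the tuple representation of a verb-noun pair, and the toy cluster are hypothetical.

```python
from collections import Counter


def label_candidates(cluster_pairs, top_n=3):
    """Return the top_n most frequent verb-noun pairs of one activity cluster."""
    counts = Counter((verb.lower(), noun.lower()) for verb, noun in cluster_pairs)
    return [" ".join(pair) for pair, _ in counts.most_common(top_n)]


# Verb-noun pairs extracted from the sentences of one activity cluster (toy data).
activity_cluster = [
    ("schedule", "meeting"),
    ("Schedule", "meeting"),
    ("arrange", "meeting"),
    ("schedule", "meeting"),
    ("arrange", "meeting"),
    ("plan", "call"),
]
candidates = label_candidates(activity_cluster, top_n=3)
print(candidates)
```

The expert then picks one of the returned candidates as the activity label for the cluster.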
           VII. TEMPORAL FEATURES EXTRACTION
   Each email is exchanged at a specific timestamp. As a first
hypothesis, one can consider that the timestamp of an email
activity is the same as that of the email it belongs to, which
is not always true. An email may contain business activities
that were already performed, activities in progress, or activities
that will be performed in the future. Therefore, we extract two
temporal relations:
   1) Between the email activities and the email timestamp:
       the objective is to temporally locate an email activity
       relative to the timestamp of the email in which it occurs.
       To be precise, we divide the relation between the
       activity and the email timestamp into categories.
       Possible categories are Before, in which the activity
       occurs before the email sending time; Overlap, in which
       the activity occurs at the time the email is sent; and After,
       in which the activity will occur after the email sending
       time.
   2) Between the email activities themselves: the objective
       here is to extract the temporal relations among the
       activities of an email using the email temporal expressions.
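
The first relation above (Before, Overlap, After) can be sketched as a simple comparison, assuming the occurrence time of the activity has already been resolved from the temporal expressions in the email. The one-hour tolerance used to decide what counts as Overlap, the enum, and the function name `relate` are assumptions and do not come from the paper.

```python
from datetime import datetime, timedelta
from enum import Enum


class TemporalRelation(Enum):
    BEFORE = "Before"
    OVERLAP = "Overlap"
    AFTER = "After"


def relate(activity_time, email_time, tolerance=timedelta(hours=1)):
    """Locate an activity relative to the sending time of its email."""
    if activity_time < email_time - tolerance:
        return TemporalRelation.BEFORE
    if activity_time > email_time + tolerance:
        return TemporalRelation.AFTER
    return TemporalRelation.OVERLAP


email_time = datetime(2001, 5, 14, 10, 0)
print(relate(datetime(2001, 5, 11, 9, 0), email_time).value)    # activity already done
print(relate(datetime(2001, 5, 14, 10, 30), email_time).value)  # activity in progress
print(relate(datetime(2001, 5, 16, 14, 0), email_time).value)   # activity planned
```

The harder part of the problem, resolving expressions such as "last Friday" or "next week" into an activity time, is left out of this sketch.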
           VIII. CONCLUSIONS AND PERSPECTIVES
   Throughout the sections of this paper, we described the
main approaches that constitute the framework presented in
this work. The components of the framework are mainly
concerned by: (1) business process topic discovery for emails,
(2) business process instances discovery for emails, (3) busi-
ness activities discovery, and (4) preliminary estimation of the
activity occurrence timestamp.
   Several directions for future work follow from the obtained
results, such as building a recommendation system that can
recommend activities based on received emails, or enabling
incremental learning in our system.
                             REFERENCES
[1] Izzat Alsmadi and Ikdam Alhami. Clustering and classification of email
    contents. Journal of King Saud University-Computer and Information
    Sciences, 27(1):46–57, 2015.
[2] Ron Bekkerman. Automatic categorization of email into folders:
    Benchmark experiments on Enron and SRI corpora. 2004.
[3] Vitor R Carvalho and William W Cohen. Improving email speech acts
    analysis via n-gram selection. In Proceedings of the HLT-NAACL 2006
    Workshop on Analyzing Conversations in Text and Speech, pages 35–41.
    Association for Computational Linguistics, 2006.
[4] William W. Cohen, Vitor R. Carvalho, and Tom M. Mitchell. Learning to
    classify email into speech acts. In Proceedings of Empirical Methods
    in Natural Language Processing, 2004.
[5] Andrew Faulring, Brad Myers, Ken Mohnkern, Bradley Schmerl, Aaron
    Steinfeld, John Zimmerman, Asim Smailagic, Jeffery Hansen, and Daniel
    Siewiorek. Agent-assisted task management that reduces email overload.
    In Proceedings of the 15th international conference on Intelligent user
    interfaces, pages 61–70. ACM, 2010.
[6] Wil MP van der Aalst and Andriy Nikolov. EmailAnalyzer: an e-mail
    mining plug-in for the ProM framework. BPM Center Report
    BPM-07-16, BPMcenter.org, 2007.



</pre>