=Paper=
{{Paper
|id=Vol-2622/paper1
|storemode=property
|title=Mining Business Process Information from Email Logs for Business Process Models Discovery
|pdfUrl=https://ceur-ws.org/Vol-2622/paper1.pdf
|volume=Vol-2622
|authors=Diana Jlailaty,Daniela Grigori,Khalid Belhajjame
|dblpUrl=https://dblp.org/rec/conf/bdcsintell/JlailatyGB19
}}
==Mining Business Process Information from Email Logs for Business Process Models Discovery==
Diana Jlailaty, Daniela Grigori, Khalid Belhajjame
Université Paris-Dauphine, PSL Research University, CNRS, [UMR 7243], LAMSADE, 75016 Paris, France
diana.al-jlailaty@dauphine.fr, daniela.grigori@dauphine.fr, khalid.belhajjame@dauphine.fr
Abstract—Information exchanged in email texts usually concerns complex events or business processes in which the entities exchanging emails collaborate to achieve the processes' final goals. Thus, the flow of information in the sent and received emails constitutes an essential part of such processes, i.e., the tasks or business activities. An email can be harvested to understand the undocumented business process information it contains. Our goal in this work is to recast emails into a resource of business-oriented information. We describe a framework composed of several analytical approaches able to extract this kind of information from email logs, i.e., to transform an email log into an event log. The efficiency of all approaches is studied through experiments on the open Enron email dataset.

Index Terms—Email Analysis, Business Process Models, Text Mining, Process Instances, Business Activities

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

I. INTRODUCTION

Email is by and large the first and most popular professional communication and social medium¹. It is a reliable, confidential, fast, free, and easily accessible form of communication. Exchanging emails becomes essential when carrying out tasks in organizational processes requires the involvement of multiple individuals. Assigning tasks, asking for more information, reporting results: all these activities are enacted via email messages. Therefore, such email messages necessarily contain process-related information that refers to the business process under execution.

¹ http://onlinegroups.net/blog/2014/03/06/use-email-for-collaboration/

However, email analysis from a Business Process Management (BPM) perspective has not been thoroughly studied in the literature. Some existing works allow the identification of email activities among a predefined set of activities [5], [4], [3]. The email analyzer developed by Van der Aalst [6] requires user intervention to extract a process instance from an email log. Hence, until recently and to the best of our knowledge, none of the previous works has tackled the problem of extracting business process information from emails automatically, without any a priori knowledge, for the goal of business process model discovery.

In this work, we aim to analyze the unstructured data in emails to harvest the undocumented business process information from email logs, i.e., to transform them into event logs. In an event log, events are characterized by some attributes. Each event corresponds to an activity (associated with an activity label) that is executed in the process (associated with a Process Identifier), where multiple events (ordered by their timestamps) can be linked together as a process instance (associated with a Process Instance Identifier). Transforming email logs into event logs allows us to produce business process models using the available process mining tools. The produced business process models can provide a clear overview of the processes and activities in a user's email log, which facilitates the organization and retrieval of emails. In this work, we develop a framework that includes different approaches contributing the following:

• An approach that can automatically find, for each email, the business process topic it belongs to, i.e., extraction of the Process Identifier (ProcessID) for each email.
• A process instance discovery approach that can automatically find the business process instance an email belongs to, i.e., extraction of the Process Instance Identifier (ProcessInstanceID) for each email.
• An approach that automatically extracts multiple business activities from emails and annotates the elicited activities, i.e., extraction of the activity labels in an email.
• A preliminary approach that can estimate the real occurrence time of an event or email activity, i.e., extraction of the activity occurrence timestamp.
• The efficiency of all the above approaches is evaluated using multiple email folders from the Enron email dataset².

² https://www.cs.cmu.edu/~enron/

In this paper, we first provide a brief study of related work in Section II. An overview of the overall framework is presented in Section III. Sections IV, V, VI, and VII explain the phases of our framework. Finally, the work is concluded in Section VIII.

II. RELATED WORK

The common objective of the related works presented in this section is to categorize emails into a set of classes (folders, topics, importance, main activities). In the work of Alsmadi et al. [1], a large set of emails is used for the purpose of folder classification. Five classes are proposed to label the
nature of emails: Personal, Job, Profession, Friendship, and Others. Another work, by Yoo et al. [7], develops a personalized email prioritization method using a supervised classification framework. The goal is to model personal priorities over email messages and to predict importance levels for new messages using standard Support Vector Machines (SVMs) as classifiers. Bekkerman et al. [2] represent emails as bags of words (vectors of word counts) to classify them into a predefined set of classes (folders). Faulring et al. [5] classify tasks contained within email sentences into a predefined set of 8 task classes.

In our work, we overcome the limitation of specifying a predefined set of process topic and activity classes. Our approach is able to cope with the diversity of business topic and activity types that can exist in emails. In contrast to some previous works, we automatically discover and label all topic and activity types present in an email. In addition, instead of dealing with a single activity type per email, we work on discovering multiple business-oriented activities in an email.

III. FRAMEWORK OVERVIEW

In this section, we present the overall framework developed in this paper. It is composed of three main components. Figure 1 shows an overview of the components of the framework. The framework takes the email log as input. The first component is for Process Topic Discovery, where each email is associated with a business process topic. The second component is for Process Instances Discovery, where each email is associated with a business process instance. The third component is for Process Activities Discovery, where each email is associated with a set of process activities.

Fig. 1. The overall framework.

IV. PROCESS TOPIC DISCOVERY

A. Email Log Preprocessing

Each email in an email log is represented by some attributes describing it: email subject, sender, receiver, email body, and email timestamp. Since email text falls under the category of unstructured data, considerable effort is needed to preprocess this data into fixed fields so that it can be queried, quantified, and analyzed with data mining techniques. The main steps applied in this component are:

1) Data Cleansing: removing stopwords, whitespace, and punctuation, or applying stemming.
2) Data Representation: representing email bodies and subjects as numerical Term Frequency-Inverse Document Frequency matrices that can be used in the analysis.
3) Verb-Noun Extraction: we consider verb-noun pairs to be likely candidates for business activities.

B. Clustering Emails According to their Process Topics

In this section, our objective is to group the emails into clusters according to the process model they are concerned with. If we fetch the emails of a researcher, we can detect that his/her emails are concerned with different processes, such as scheduling a meeting or organizing a conference.

Since in our case we do not have any a priori knowledge about the business process topics available in the email log, we use an unsupervised machine learning method, namely clustering. In particular, hierarchical clustering is used to group emails into clusters based on their similarity. The clusters are fused together agglomeratively (using complete linkage) according to the chosen similarity measurement technique. We try two different methods for measuring semantic similarity between the selected features of emails: (1) Latent Semantic Analysis, which is used to map words occurring in emails into concepts, and (2) Word2vec, which is used to calculate the similarities between emails according to the context of their words. The hierarchy is cut such that emails belonging to the same process model are clustered together. The output of this cut is a set of clusters $\{PC_1, PC_2, PC_3, \ldots, PC_n\}$, where each cluster $PC_i$ contains a set of emails related to the same process model topic. $PC_i$, and subsequently the emails contained in it, are associated with a ProcessID.

V. PROCESS INSTANCES DISCOVERY

In process mining, there are two main terms: the business process model and the business process instance. A process instance is a specific occurrence or execution of a business process model. Each email log can contain processes of different topics. Let us suppose that the "meeting scheduling" process topic exists in an email log. An employee may exchange emails with two different entities for scheduling different meetings. Thus, multiple email exchanges take place for this purpose. These email exchanges represent the different executions or occurrences. In this section, we analyze email texts to identify, for each email, the process instance it belongs to. This is mainly done by choosing attributes and features from email texts that help in distinguishing between emails of different process instances and in grouping those of the same process instance. We work on choosing the best distance function, consisting of a combination of attributes, for clustering emails into process instances.
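
To make the idea of a combined distance concrete, the following Python sketch follows the form of Eq. (1) in Section V-A, read here as one minus a weighted sum of similarities over named entities, timestamps, and sender/receiver. The paper does not specify how each Sim term is computed, so the choices below are assumptions made only for illustration: Jaccard overlap for the two set-valued attributes and a linearly decaying similarity for timestamps. The names `Email`, `jaccard`, `time_similarity`, `horizon_days`, the default weights, and the sample emails are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Email:
    named_entities: frozenset  # NE: named entities found in subject and body
    timestamp: datetime        # T: sending time of the email
    participants: frozenset    # SR: sender and receiver addresses


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as identical (assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def time_similarity(t1, t2, horizon_days=365.0):
    """Assumed linear decay: 1 for equal timestamps, 0 at or beyond the horizon."""
    gap_days = abs((t1 - t2).total_seconds()) / 86400.0
    return max(0.0, 1.0 - gap_days / horizon_days)


def distance(e1, e2, w1=0.4, w2=0.3, w3=0.3):
    """Distance = 1 - weighted similarity, with w1 + w2 + w3 = 1."""
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9, "weights must sum to 1"
    sim = (w1 * jaccard(e1.named_entities, e2.named_entities)
           + w2 * time_similarity(e1.timestamp, e2.timestamp)
           + w3 * jaccard(e1.participants, e2.participants))
    return 1.0 - sim


# Hypothetical emails mirroring the mission-funding example of Section V-A.
ne = frozenset({"BDCSIntell", "Versailles"})
student1 = frozenset({"student1@example.org", "secretary@example.org"})
student2 = frozenset({"student2@example.org", "secretary@example.org"})
e1 = Email(ne, datetime(2019, 3, 1), student1)
e2 = Email(ne, datetime(2019, 3, 3), student1)  # same application, two days later
e3 = Email(ne, datetime(2018, 3, 1), student1)  # same conference, previous year
e4 = Email(ne, datetime(2019, 3, 2), student2)  # same conference and year, other student

for other in (e2, e3, e4):
    print(round(distance(e1, other), 3))
```

With these assumed similarity functions, the follow-up email of the same application ends up much closer to `e1` than the email from a different year or from a different student, which is the behaviour the three attributes are meant to produce. The weights would still have to be tuned on real data.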
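
Instance discovery takes the process topic clusters of Section IV-B as its input. As a rough illustration of how such clusters could be produced, the sketch below builds TF-IDF vectors and applies agglomerative clustering with complete linkage, cutting the hierarchy at a distance threshold. It is not the authors' implementation: plain TF-IDF cosine distance stands in for the Latent Semantic Analysis and Word2vec similarities used in the paper, and the function names, the smoothed IDF formula, the threshold value, and the sample emails are assumptions.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed inverse document frequency (one common variant).
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors


def cosine_distance(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return 1.0 - dot / norm if norm else 1.0


def complete_linkage(vectors, threshold):
    """Merge the two closest clusters until their distance exceeds the cut."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, x, y = min((linkage(clusters[x], clusters[y]), x, y)
                      for x in range(len(clusters))
                      for y in range(x + 1, len(clusters)))
        if d > threshold:
            break
        merged = clusters.pop(y)  # y > x, so index x is unaffected
        clusters[x].extend(merged)
    return clusters


# Hypothetical email bodies covering two process topics.
emails = [
    "Can we schedule a meeting on Monday to discuss the budget",
    "Monday does not work, can we schedule the meeting on Tuesday",
    "Call for papers: the conference submission deadline is extended",
    "Reminder: conference submission deadline for papers is Friday",
]
topic_clusters = complete_linkage(tfidf_vectors(emails), threshold=0.8)
print(topic_clusters)
```

Each resulting cluster plays the role of one $PC_i$, and its index can serve as the ProcessID of the emails it contains. In practice the cut would be chosen with a clustering quality measure rather than a fixed threshold.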
For discovering business process instances from email logs, we start from the previously obtained process topic clusters $\{PC_1, PC_2, PC_3, \ldots, PC_n\}$, where each cluster $PC_i$ represents a business process topic. We aim to deduce, for each business process topic cluster, the set of process instances it contains.

A. Defining an Appropriate Distance Function

In order to separate emails belonging to the same process model into different process instances, we apply a sub-clustering step on the already obtained process topic clusters. We illustrate the steps of this phase and the distance function definition using an example concerned with applications for mission funding.

Example: Suppose that one of the obtained clusters contains emails about all applications of Ph.D. students for the "missions funding" process topic. Emails of the same process instance are supposed to revolve around some common names. Take as an example the emails exchanged between a student and the secretary to apply for funding to attend the BDCSIntell 2019 conference in Versailles. Most of these emails' bodies and subjects will include the named entities "BDCSIntell", "Versailles", or "Paris". We claim that these named entities can be helpful in discovering which emails are related to the same process execution (same mission application). However, in some cases named entities will not be sufficient to distinguish instances. Suppose two emails are about applying for a travel grant for the same conference, BDCSIntell, but in two different years, 2018 and 2019. These two emails are supposed to belong to different process instances. Using only the named entities, these two emails will be considered as belonging to the same process instance. Thus, we decide to add another attribute that gives an indication of the time at which an email was sent. Although the named entities and the email timestamp provide a good indication for separating emails into different instances, some cases have shown that these two attributes are not always sufficient. Suppose two different students are applying to the same conference, BDCSIntell, in the same year, 2019. The named entities and the timestamps of the emails of these students will be similar; however, these emails belong to different process instances (for two different students). Therefore, we add a new attribute, the sender/receiver of an email, which can separate emails as in the described case. We define the distance function as follows: we first define the similarity function and then derive the distance function.

$$\mathrm{Distance}(E_{ij}, E_{ik}) = 1 - \big( w_1 \times \mathrm{Sim}(NE_{ij}, NE_{ik}) + w_2 \times \mathrm{Sim}(T_{ij}, T_{ik}) + w_3 \times \mathrm{Sim}(SR_{ij}, SR_{ik}) \big) \tag{1}$$

where $E_{ij}$ and $E_{ik}$ are two different emails $j$ and $k$ in the same process model cluster $PC_i$; $(NE_{ij}, T_{ij}, SR_{ij})$ and $(NE_{ik}, T_{ik}, SR_{ik})$ are the named entities of the subjects and bodies, the timestamps, and the sender/receiver of emails $E_{ij}$ and $E_{ik}$, respectively. Weights $w_1, w_2, w_3$ ($w_1 + w_2 + w_3 = 1$) represent the relative importance of named entities, timestamps, and sender/receiver of emails, respectively.

B. Clustering Emails Into Process Instances

Using the above distance function, we calculate distances between all pairs of emails. Accordingly, hierarchical clustering is applied, distributing the emails over a hierarchical structure. We tried several cuts on the obtained hierarchy and chose the one that provides the best clustering quality (according to clustering quality measures mentioned in the experimentation section). Each of the obtained clusters $IC_i$ contains emails belonging to the same process instance. Every cluster is assigned a process instance identifier.

VI. PROCESS ACTIVITIES DISCOVERY

A Business Process Model is composed of a set of Business Activities enacted in a specific sequence to achieve a business goal. Each email is sent with the aim of requesting, canceling, or confirming a specific task or set of tasks. One of the main attributes in the event log is the activity label. Therefore, in this section, we work on extracting activity labels from email logs. There are two main hypotheses for the extraction of business process activities from email logs. The first hypothesis is that each email contains one and only one activity, which is not always true. Therefore, we propose the second hypothesis, in which we assume that an email can contain zero, one, or more activities. We build an approach that takes an email log as input and produces the set of business activity types it contains.

A. Relevant Sentences Extraction

The goal of this phase is to extract from each email the sentences that contain business activities or information about activities. To identify relevant sentences in an email, we use a classification technique that associates each email sentence with one of the following labels: Relevant or Non-Relevant. We characterize each email sentence by a set of features that describe it: (1) sentence position, (2) sentence length, (3) number of named entities, (4) cohesion with the centroid sentence, (5) dissimilarity with greeting phrases, and (6) similarity of the verb-noun pairs with process-oriented activities extracted from a repository of process models of different domains.

We build the training data by labelling the sentence feature vectors. An expert decides whether the sentence is meaningful from a business-oriented perspective. The training dataset is used to train the classification model. Different classification techniques are used to obtain the best classification results.

B. Activity Types Discovery

The business activities elicited in the previous steps can be further processed to organize them by their activity types. Hierarchical clustering is applied to sentences containing process-oriented verb-noun pairs (i.e., activities). The similarity between two sentences is calculated using the cosine similarity between the Word2vec vectors of their verb-noun pairs. This phase results in a set of clusters, where each cluster contains sentences from different emails but with the same activity type ($\{AC_i\}$). For each cluster, we
choose the top N verb-noun pairs mentioned in the activity cluster (for example, N can be equal to 3). Then one of these verb-noun candidates can be chosen by an expert as a label for the cluster.

VII. TEMPORAL FEATURES EXTRACTION

Each email is exchanged at a specific timestamp. As a first hypothesis, one can consider that the timestamp of an email activity is the same as that of the email it belongs to, which is not always true. An email may contain business activities that were already carried out, activities in progress, or activities that will be carried out in the future. Therefore, we extract the following temporal relations:

1) Between the email activities and the email timestamp: the objective is to temporally locate an email activity relative to the timestamp of the email in which it occurs. To be accurate, we can divide the relation between the activity and the email timestamp into different categories. Possible categories are Before, in which the activity occurs before the email sending time; Overlap, in which the activity occurs at the time the email is sent; and After, in which the activity will occur after the email sending time.
2) Between the email activities themselves: the objective here is to extract the intra-email temporal relations between the email activities using the temporal expressions in the email.

VIII. CONCLUSIONS AND PERSPECTIVES

Throughout the sections of this paper, we described the main approaches that constitute the framework presented in this work. The components of the framework are mainly concerned with: (1) business process topic discovery for emails, (2) business process instance discovery for emails, (3) business activity discovery, and (4) preliminary estimation of the activity occurrence timestamp.

There are several potential perspectives based on the obtained results, such as building a recommendation system that can recommend activities based on received emails, or enabling incremental learning in our system.

REFERENCES

[1] Izzat Alsmadi and Ikdam Alhami. Clustering and classification of email contents. Journal of King Saud University-Computer and Information Sciences, 27(1):46–57, 2015.
[2] Ron Bekkerman. Automatic categorization of email into folders: Benchmark experiments on Enron and SRI corpora. 2004.
[3] Vitor R. Carvalho and William W. Cohen. Improving email speech acts analysis via n-gram selection. In Proceedings of the HLT-NAACL 2006 Workshop on Analyzing Conversations in Text and Speech, pages 35–41. Association for Computational Linguistics, 2006.
[4] William W. Cohen, Vitor R. Carvalho, and Tom M. Mitchell. Learning to classify email into speech acts. In Proceedings of Empirical Methods in Natural Language Processing, 2004.
[5] Andrew Faulring, Brad Myers, Ken Mohnkern, Bradley Schmerl, Aaron Steinfeld, John Zimmerman, Asim Smailagic, Jeffery Hansen, and Daniel Siewiorek. Agent-assisted task management that reduces email overload. In Proceedings of the 15th International Conference on Intelligent User Interfaces, pages 61–70. ACM, 2010.
[6] Wil M. P. van der Aalst and Andriy Nikolov. EMailAnalyzer: an e-mail mining plug-in for the ProM framework. BPM Center Report BPM-07-16, BPMcenter.org, 2007.
[7] Shinjae Yoo, Yiming Yang, Frank Lin, and Il-Chul Moon. Mining social networks for personalized email prioritization. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 967–976. ACM, 2009.