<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Activity Discovery Tool From Unstructured Data To Enhance Process Mining (Extended Abstract)</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Marwa</forename><surname>Elleuch</surname></persName>
							<email>marwa1.elleuch@orange.com</email>
							<affiliation key="aff0">
								<orgName type="laboratory">Orange Labs</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Christophe</forename><surname>Maillard</surname></persName>
							<email>christophe.maillard@orange.com</email>
							<affiliation key="aff0">
								<orgName type="laboratory">Orange Labs</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Olivier</forename><surname>Graille</surname></persName>
							<email>olivier.graille@orange.com</email>
							<affiliation key="aff0">
								<orgName type="laboratory">Orange Labs</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sonia</forename><surname>Laurent</surname></persName>
							<email>sonia.laurent@orange.com</email>
							<affiliation key="aff0">
								<orgName type="laboratory">Orange Labs</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Oumaima</forename><forename type="middle">Alaoui</forename><surname>Ismaili</surname></persName>
							<email>oumaima.alaouiismaili@orange.com</email>
							<affiliation key="aff0">
								<orgName type="laboratory">Orange Labs</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Philippe</forename><surname>Legay</surname></persName>
							<email>philippe.legay@orange.com</email>
							<affiliation key="aff0">
								<orgName type="laboratory">Orange Labs</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Activity Discovery Tool From Unstructured Data To Enhance Process Mining (Extended Abstract)</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">EA7E41B5D83FB62B24972A4591077047</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T20:02+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Activity discovery</term>
					<term>Unstructured textual records</term>
					<term>Communication logs</term>
					<term>Process mining</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Free, unstructured textual records of communications between process actors are currently not considered by process mining tools. The confidentiality constraints on these records make them difficult to process and integrate into process mining studies conducted on real data. This paper introduces the Activity Discovery Tool, which locally analyses, in an unsupervised way, the communication records of a process actor (or a restricted set of process actors) and converts them into a structured log. This log can be shared to complete the partial view of process executions obtained from structured traces. We show, through an example scenario, how the results generated by this tool can enhance process mining.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Today, process mining tools can only be applied to event logs in a structured format. Free textual records that capture the interactions and communications of process actors are generally ignored unless they are converted into a structured format. However, such records are highly valuable for enriching existing process knowledge or discovering new process fragments <ref type="bibr" target="#b0">[1]</ref>. One of the main constraints on handling these unstructured records (e.g. emails, comments on incident tickets) is their confidentiality. Taking emails as an example, process actors rarely agree to share the textual content of their emails so that it can be analysed centrally. For other types of free textual records, such as comments on incident tickets, the right to access and handle the records is generally restricted to a set of process actors. Indeed, they are considered sensitive data that could disclose strategic aspects of an organization if widely shared. Therefore, they cannot be processed outside the organization (in the case of incident tickets) or outside the process actor's machine (in the case of emails).</p><p>To handle these confidentiality restrictions (at the level of an individual or a group of actors), we propose in this paper ADT (the Activity Discovery Tool), which locally analyses the free textual communication records of a process actor or a restricted group of actors. The tool implements and extends recent work <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b0">1]</ref>. It aims to reduce the textual records, in an unsupervised way, into a structured event log reporting the relevant activities performed. These events are generated so that they can be shared to complete other traces of the same process (obtained from other information systems or other process actors) without disclosing the confidential textual content of the handled records. In what follows, we give an overview of related work, describe the main functionalities of the tool, provide an example scenario related to the incident management process, discuss the maturity of the tool and conclude with future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Some related works rely mainly on supervised approaches (e.g. <ref type="bibr" target="#b2">[3]</ref>), which limits their applicability across scenarios. Tools designed in the same context at most help employees manage ongoing activities, e.g., by summarizing the activities contained in received emails <ref type="bibr" target="#b3">[4]</ref> or displaying the realization status of activities <ref type="bibr" target="#b4">[5]</ref>. A recent study <ref type="bibr" target="#b0">[1]</ref> shows that the solution implemented in our tool addresses several challenges simultaneously compared to existing works, yielding a richer event log that captures, in addition to activity names, (i) their speech acts, (ii) business data and (iii) several activities per textual segment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Main Functionalities</head><p>ADT is an office application that allows process actors to analyse their communication records in order to reduce them into an event log. It provides three main functionalities, summarized in Figure <ref type="figure" target="#fig_0">1</ref> (the functionalities colored in gray are to be carried out outside the tool):</p><p>A-Unsupervised discovery of recurrent activities: The tool first discovers the actors' recurrent activities by implementing and extending the approach proposed in <ref type="bibr" target="#b1">[2]</ref>. Basically, it first discovers recurrent expressions that potentially reflect how business actors express their recurrent activities in their communication records. These expressions are then grouped into activities while considering: (i) rephrasing relations, and (ii) synonymy constraints, to keep apart expressions that differ and could refer to contradictory actions. To facilitate their exploration, the activities are then grouped into coarser topics and action types. Activities of the same action type (e.g. 'replace card', 'change fiber') share terms referring to the same coarser action (e.g. 'change', 'replace'). Activities grouped into the same topic (e.g. 'replace card', 'delete card') share terms referring to the same manipulated artifact (e.g. 'card').</p><p>B-Validation phase: It allows process actors to intervene after the activities are discovered to: (i) discard those that are judged confidential, and (ii) validate the sharing of the others.</p><p>C-Generate anonymous event log: The tool generates an event log to be shared, following the structure proposed in <ref type="bibr" target="#b0">[1]</ref>. Each textual record is first reduced to the set of activity occurrences whose labels were validated for sharing. Each activity occurrence (i.e. event) is characterized by the following attributes: activity name, activity speech act, business data, communication record attributes (i.e. ID, timestamp, sender, receivers and conversation ID), an action type and a topic. The tool gives access either to this event log, for further adaptation and customization (e.g. by business experts), or to its anonymous version. To obtain the anonymous version, sensitive data in each event (i.e. business data values, sender and receivers) are hashed (so that identical values can still be matched) and salted to make the hashes harder to crack. In this way, the textual content of the communication records is not shared. Only the information relevant to business processes is shared, so that it can be centrally analysed and merged with (i) other event logs generated from the communication records of other process actors, or (ii) the structured part of the same records (e.g. IDs of incident tickets standing in for the process instance information).</p></div>
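<div xmlns="http://www.tei-c.org/ns/1.0"><p>The hashing and salting of sensitive event attributes can be illustrated with a minimal Python sketch. It is not ADT's actual implementation: the hash function (SHA-256), the use of a single random salt per exported log, the dictionary field names, the sample values and the helper names pseudonymize and anonymize_event are all assumptions made for illustration.</p>

```python
import hashlib
import secrets


def pseudonymize(value, salt):
    """Return the hex SHA-256 digest of salt + value (hypothetical helper)."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def anonymize_event(event, salt):
    """Copy an event and hash its sensitive attributes: sender, receivers
    and business data values. The other attributes are left unchanged."""
    anonymous = dict(event)
    anonymous["sender"] = pseudonymize(event["sender"], salt)
    anonymous["receivers"] = [pseudonymize(r, salt) for r in event["receivers"]]
    anonymous["business_data"] = {
        key: pseudonymize(val, salt) for key, val in event["business_data"].items()
    }
    return anonymous


# One salt for the whole export (assumption), kept on the actor's machine.
salt = secrets.token_bytes(16)

# Illustrative event; field names and values are made up for this sketch.
event = {
    "record_id": "c-001",
    "conversation_id": "ticket-42",
    "timestamp": "2023-05-04T10:15:00",
    "sender": "alice",
    "receivers": ["bob", "carol"],
    "activity_name": "replace card",
    "speech_act": "request",
    "business_data": {"card": "X12"},
    "action_type": "replace",
    "topic": "card",
}

anonymous = anonymize_event(event, salt)
```

<p>Because the same salt is reused within one export, identical senders, receivers or business data values map to the same digest and can still be matched across events, while the original values do not appear in the shared log.</p></div>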
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Scenario example</head><p>The scenario example relates to the incident management process. We have at our disposal a log capturing the comments exchanged, inside incident tickets, between a restricted set of actor groups. This log has two parts: (i) a structured part recording the names of the actors sending comments, timestamps, human duration, and ticket IDs, and (ii) an unstructured part containing the free textual content of the sent comments.</p><p>Using our tool, the comments of the concerned set of actors are analysed and reduced into events recording the activities that occurred. This log was then shared with business experts, who analysed it in order to enrich the structured part of the event log and load it into Celonis, used as the process mining tool. We show in the demo how the additional attributes extracted from the unstructured part of the log enabled us to implement additional interfaces within Celonis for enriching: (i) the process actor perspective, and (ii) the filtering criteria to detect tickets containing incorrect activations of one actor.</p><p>1) Enrich the process actor perspective: At each actor activation, it becomes feasible to observe the detailed activities and generate a synthesis of the scope of the mentioned ones. This allows for the identification of instances where an action (i) was documented by an actor other than the one who performed it, or (ii) involved handling equipment of a specific technical scope.</p><p>2) Enrich filtering criteria: Given a process actor expertGroup1 and other actors of different technical domains (i.e. A, B, C and D), the goal is to identify: (i) the tickets in which expertGroup1 of domains A, C and D was incorrectly activated, since these tickets were resolved by actors of domain B, and (ii) the actors involved in such incorrect activations. Based only on the structured part of the tickets, we could select those where expertGroup1 was activated and domain B was activated last, after expertGroup1 of the other domains. However, the main limitation of this method is that actors of domain B are sometimes not explicitly activated in the tickets: they do not send comments, so they cannot be detected from the set of senders. They are only reported in comments sent by other actors, which refer to their interventions with a corrective action. The demo shows how, with the activities detected by ADT, it was possible to identify 50 additional tickets containing incorrect activations of expertGroup1 (representing around 23% of the total tickets). This helps us to: (i) quantify more precisely the human time lost by expertGroup1: the total duration increased 2.5 times, as the additional tickets include longer tickets that could correspond to anomalies with significant calculated lost time, and (ii) implement sequential precedence constraints for detecting additional tickets where actors of domain B were involved in such incorrect activations (i.e. 49 tickets representing 34% of the tickets matching this case).</p></div>
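<div xmlns="http://www.tei-c.org/ns/1.0"><p>The difference between filtering on the structured part alone and filtering with ADT events can be sketched in Python as follows. This is only a simplified illustration, not the Celonis interfaces from the demo: the field names (sender_domain, actor_domain), the 'corrective' action type label, the sample data and the function name are hypothetical, and the sketch ignores the ordering of activations.</p>

```python
def tickets_resolved_by_domain_b(comments, adt_events):
    """Return two sets of ticket IDs where expertGroup1 was activated and
    domain B intervened: one found from senders only, one also using
    ADT events that report a corrective action by a domain B actor."""
    activated = {c["ticket_id"] for c in comments if c["sender"] == "expertGroup1"}
    b_as_sender = {c["ticket_id"] for c in comments if c["sender_domain"] == "B"}
    b_mentioned = {
        e["ticket_id"]
        for e in adt_events
        if e["action_type"] == "corrective"
        and e["business_data"].get("actor_domain") == "B"
    }
    structured_only = activated.intersection(b_as_sender)
    with_adt = activated.intersection(b_as_sender.union(b_mentioned))
    return structured_only, with_adt


# Structured part: who sent a comment in which ticket (made-up data).
comments = [
    {"ticket_id": "T1", "sender": "expertGroup1", "sender_domain": "A"},
    {"ticket_id": "T1", "sender": "team_b", "sender_domain": "B"},
    {"ticket_id": "T2", "sender": "expertGroup1", "sender_domain": "A"},
    {"ticket_id": "T2", "sender": "team_c", "sender_domain": "C"},
    {"ticket_id": "T3", "sender": "team_b", "sender_domain": "B"},
]

# Events extracted from the free text: in T2, another actor reports a
# corrective action performed by domain B, which never sent a comment.
adt_events = [
    {"ticket_id": "T2", "action_type": "corrective",
     "business_data": {"actor_domain": "B"}},
]

structured_only, with_adt = tickets_resolved_by_domain_b(comments, adt_events)
additional = with_adt - structured_only
```

<p>In this toy data, ticket T2 is only found once the events extracted from the comments are taken into account, which mirrors the kind of additional tickets reported in the scenario.</p></div>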
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Maturity and available resources</head><p>ADT is available for installation within our organization. The front-end is implemented with the Angular and Electron frameworks. The back-end is implemented in Python. The implemented solution was validated in <ref type="bibr" target="#b0">[1]</ref> on the public Enron email dataset, where its performance is reported, and the obtained activities were made publicly available (i.e. see this link). We also conducted tests on other datasets, such as incident ticket comments and the emails of employees of our organization. We validated the results with business experts, who reported that the major advantage of the tool is its ability to generate first results in an unsupervised way (that is, without human intervention) that are interpretable and can be adapted to enrich other event logs. We provide the following elements:</p><p>• Documentation explaining how the tool is installed and run within our organization. However, we were not able to provide access to external collaborators.</p><p>• A public video illustrating how our tool serves the described scenario example (Section 4).</p><p>• A guide to access the implemented Celonis interfaces.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion and future work</head><p>In this paper, we presented ADT, which analyses the free textual communication records of business actors to enrich event logs for process mining. We intend to leverage the studied use cases to communicate across all divisions within our organization and visually demonstrate potential gains, in order to enhance collective efficiency in other use cases. We aim to assess the extent to which these efforts can be replicated for other processes and to customize the developed tool to make it increasingly versatile whenever needed. In future work, we aim to cover various types of communication records. Additionally, we aim to investigate the following points:</p><p>• Improve the anonymization functionality to consider the privacy risk of sharing sequences of events rather than only individual events <ref type="bibr" target="#b5">[6]</ref>. This could be done by allowing users to check the sequence of activities that occurred and edit confidential sub-sequences that do not seem sensitive when activities are viewed individually.</p><p>• Extend the format of the generated event log to support recent formats, mainly the Object-Centric Event Log (OCEL) <ref type="bibr" target="#b6">[7]</ref>.</p><p>• Study how generative AI could enhance ADT's performance. With currently available public models, substantial RAM and GPU resources are needed (e.g. 80 GB for MosaicML MPT-30B and at least 30 GB for Llama 2 70B after quantization). This makes their integration into our tool, as an office application, infeasible. Sending confidential data off the user's machine to be processed in an external data center (as is the case with ChatGPT) is also infeasible, as explained before.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Pipeline for event log mining using ADT</figDesc><graphic coords="2,89.29,430.17,416.70,128.14" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We would like to thank Alain Bouchard, David Menchi, Frédéric Bastard and Marjorie Deshayes, the experts in the studied process, for their invaluable assistance, for the time they devoted to addressing our inquiries, for testing the tool we made available to them, and for placing their trust in us. This collaborative effort was instrumental in achieving promising results.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Process fragments discovery from emails: Functional, data and behavioral perspectives discovery</title>
		<author>
			<persName><forename type="first">M</forename><surname>Elleuch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Ismaili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Laga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Gaaloul</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Systems</title>
		<imprint>
			<biblScope unit="page">102229</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Discovering activities from emails based on pattern discovery approach</title>
		<author>
			<persName><forename type="first">M</forename><surname>Elleuch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Ismaili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Laga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Gaaloul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Benatallah</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Business Process Management Forum: BPM Forum 2020</title>
				<meeting><address><addrLine>Seville, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">September 13-18, 2020</date>
			<biblScope unit="page" from="88" to="104" />
		</imprint>
	</monogr>
	<note>Proceedings 18</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Event log construction from customer service conversations using natural language inference</title>
		<author>
			<persName><forename type="first">C</forename><surname>Kecht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Egger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Kratsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Röglinger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2021 3rd International Conference on Process Mining (ICPM), IEEE</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="144" to="151" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Task-focused summarization of email</title>
		<author>
			<persName><forename type="first">S</forename><surname>Corston-Oliver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ringger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gamon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Campbell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Text Summarization Branches Out</title>
				<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="43" to="50" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Automatically classifying emails into activities</title>
		<author>
			<persName><forename type="first">M</forename><surname>Dredze</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kushmerick</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 11th international conference on Intelligent user interfaces</title>
				<meeting>the 11th international conference on Intelligent user interfaces</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="70" to="77" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Libra: High-utility anonymization of event logs for process mining via subsampling</title>
		<author>
			<persName><forename type="first">G</forename><surname>Elkoumy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dumas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2022 4th International Conference on Process Mining (ICPM), IEEE</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="144" to="151" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Ocel: A standard for object-centric event logs</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F</forename><surname>Ghahfarokhi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Berti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">M</forename><surname>Van Der Aalst</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Advances in Databases and Information Systems</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="169" to="175" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
