Activity Discovery Tool From Unstructured Data To Enhance Process Mining (Extended Abstract)

Marwa Elleuch1,*,†, Christophe Maillard1,†, Olivier Graille1,†, Sonia Laurent1,†, Oumaima Alaoui Ismaili1,† and Philippe Legay1,†

1 Orange Labs, France

Abstract

The free and unstructured textual records of communications between process actors are currently not considered by process mining tools. The confidentiality constraints on these records make them difficult to process and to integrate into process mining studies conducted on real data. This paper introduces the Activity Discovery Tool, which locally analyses, in an unsupervised way, the communication records of a process actor (or a restricted set of process actors) to convert them into a structured log. This log can be shared to complete the partial view of process executions obtained from structured traces. We show, through an example scenario, how the results generated by this tool can enhance process mining.

Keywords

Activity discovery, Unstructured textual records, Communication logs, Process mining

1. Introduction

Nowadays, process mining tools can be applied only to event logs with a structured format. The free textual records that capture the interactions and communications of process actors are generally ignored unless they are converted into a structured format. However, such records are of great importance for enriching existing process knowledge or discovering new process fragments [1]. One of the main constraints on handling these unstructured records (e.g. emails, comments in incident tickets) is their confidentiality. Taking emails as an example, process actors rarely agree to share the textual content of their emails so that it can be analysed centrally. For some other types of free textual records, such as comments in incident tickets, the right to access and handle the records is generally restricted to a set of process actors.
In fact, they are considered sensitive data that could disclose strategic aspects of an organization if they are widely shared. Therefore, it is not possible to process them outside the organization (as in the case of incident tickets) or outside the process actor's machine (as in the case of emails).

ICPM Doctoral Consortium and Demo Track 2023, ICPM 2023, Rome, Italy, September 23 - September 27, 2023
* Corresponding author.
† These authors contributed equally.
Email: marwa1.elleuch@orange.com (M. Elleuch); christophe.maillard@orange.com (C. Maillard); olivier.graille@orange.com (O. Graille); sonia.laurent@orange.com (S. Laurent); oumaima.alaouiismaili@orange.com (O. A. Ismaili); philippe.legay@orange.com (P. Legay)
ORCID: 0000-0002-0877-7063 (M. Elleuch); 0009-0004-3028-3758 (O. A. Ismaili)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

To handle these confidentiality restrictions (at the level of an individual actor or of a group of actors), we propose in this paper ADT (the Activity Discovery Tool), which locally analyses the free textual communication records of a process actor or a restricted group of actors. The tool implements and extends recent work [2, 1]. It aims to reduce the textual records, in an unsupervised way, to a structured event log reporting the relevant performed activities. These events are generated in such a way that they can be shared to complete other traces of the same process (obtained from other information systems or other process actors) without disclosing the confidential textual content of the handled records.
In what follows, we give an overview of related work, describe the main functionalities of the tool, provide an example scenario related to the incident management process, discuss the maturity of the tool, and conclude with future work.

2. Related Work

Some related works are mainly based on supervised approaches (e.g. [3]), which limits their applicability to varied scenarios. Tools designed in the same context at most help employees manage ongoing activities, e.g., by summarizing activities included in received emails [4] or displaying the realization status of activities [5]. A recent study [1] shows that the solution implemented in our tool addresses several challenges simultaneously compared to existing work, yielding a richer event log that captures, in addition to activity names, (i) their speech acts, (ii) business data and (iii) several activities per textual segment.

3. Main Functionalities

ADT is an office application that allows process actors to analyse their communication records in order to reduce them to an event log. It provides three main functionalities, summarized in Figure 1 (the functionalities colored in gray are performed outside the tool):

Figure 1: Pipeline for event log mining using ADT

A- Unsupervised discovery of recurrent activities: The tool first discovers the recurrent activities of the process actors by implementing and extending the approach proposed in [2]. It starts by discovering recurrent expressions that potentially reflect how business actors express their recurrent activities in their communication records. These expressions are then grouped into activities while considering: (i) rephrasing relations, and (ii) synonymy constraints, in order to separate expressions that differ and could refer to contradictory actions. To facilitate their exploration, the activities are then grouped into coarser topics and action types. Activities of the same action type (e.g.
’replace card’, ’change fiber’) share terms referring to the same coarser action (e.g. ’change’, ’replace’). Activities grouped into the same topic (e.g. ’replace card’, ’delete card’) share terms referring to the same manipulated artifact (e.g. ’card’).

B- Validation phase: Once activities are discovered, process actors can intervene to: (i) discard those that are judged confidential, and (ii) validate the sharing of the others.

C- Generate anonymous event log: The tool generates an event log to be shared, following the structure proposed in [1]. Each textual record is first reduced to the set of activity occurrences whose labels were validated for sharing. Each activity occurrence (i.e. event) is characterized by the following attributes: activity name, activity speech act, business data, communication record attributes (i.e. ID, timestamp, sender, receivers and conversation ID), an action type and a topic. The tool gives access either to this event log, for further adaptation and customization (e.g. by business experts), or to its anonymous version. To obtain the anonymous version, sensitive data in each event (i.e. business data values, sender and receivers) are hashed (so that identical values can still be matched) and salted to make the hashes harder to crack. In this way, the textual content of the communication records is not shared. Only the information relevant to business processes is shared, so that it can be centrally analysed and merged with (i) other event logs generated from the communication records of other process actors, or (ii) the structured part of the same records (e.g. IDs of incident tickets serving as the process instance information).

4. Scenario example

The example scenario relates to the incident management process. We have at our disposal a log capturing the comments exchanged, inside incident tickets, between a restricted set of actor groups.
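For events derived from such ticket comments, the salted hashing of functionality C can be sketched as follows. This is a minimal illustration under stated assumptions, not ADT's actual implementation: the dictionary keys, the use of SHA-256, and the choice of a single random salt per shared log (so that identical values keep mapping to the same token) are all hypothetical.

```python
import hashlib
import secrets

# Hypothetical keys: the paper lists these attributes but not how they are named.
SENSITIVE_FIELDS = ("sender", "receivers", "business_data")


def pseudonymize(value: str, salt: bytes) -> str:
    """Return a salted SHA-256 digest; equal inputs give equal tokens."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def anonymize_event(event: dict, salt: bytes) -> dict:
    """Copy an event, replacing sensitive values with salted hashes."""
    out = dict(event)
    for field in SENSITIVE_FIELDS:
        value = event.get(field)
        if value is None:
            continue
        if isinstance(value, str):
            out[field] = pseudonymize(value, salt)
        else:
            # Assumed to be a list of strings (e.g. receivers, data values).
            out[field] = [pseudonymize(v, salt) for v in value]
    return out


# Assumption: one secret salt per shared log, kept on the actor's machine.
salt = secrets.token_bytes(16)
events = [
    {"activity": "replace card", "speech_act": "inform", "sender": "alice",
     "receivers": ["bob"], "business_data": ["card-42"], "ticket_id": "T1"},
    {"activity": "change fiber", "speech_act": "request", "sender": "alice",
     "receivers": ["carol"], "business_data": ["fiber-7"], "ticket_id": "T1"},
]
anonymous_log = [anonymize_event(e, salt) for e in events]
```

In this sketch, activity names and ticket IDs stay readable so the shared log can still be merged with structured traces, while the same sender hashes to the same token in every event.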
This log has two parts: (i) a structured part recording the names of the actors sending comments, timestamps, human duration, and ticket IDs, and (ii) an unstructured part containing the free textual content of the sent comments. Using our tool, the comments of the concerned set of actors are analysed and reduced to events recording the activities that occurred. The resulting event log was then shared with business experts, who analysed it in order to enrich the structured part of the log and load it into Celonis as a process mining tool. We show in the demo how the additional attributes extracted from the unstructured part of the log enabled us to implement additional interfaces within Celonis for enriching: (i) the process actor perspective, and (ii) the filtering criteria used to detect tickets containing incorrect activations of one actor.

1) Enrich the process actor perspective: At each actor activation, it becomes possible to observe the detailed activities and to generate a synthesis of the scope of the mentioned ones. This makes it possible to identify instances where an action (i) was documented by an actor other than the one who performed it, or (ii) manipulated a material of a specific technical scope.

2) Enrich filtering criteria: Given a process actor expertGroup1 and other actors of different technical domains (i.e. A, B, C and D), the goal is to identify: (i) the tickets with an incorrect activation of expertGroup1 of domain A, C and D because they were resolved by actors of domain B, and (ii) the actors involved in such incorrect activations. Based only on the structured part of the tickets, we could select those where expertGroup1 was activated and domain B was the last one activated compared to expertGroup1 of other domains. However, the main limitation of this method is that actors of domain B are sometimes not explicitly activated in the tickets; they do not send comments, so they cannot be detected from the set of senders.
They are only reported in comments sent by other actors referring to their interventions with a corrective action. The demo shows how, with the activities detected by ADT, it was possible to identify 50 additional tickets containing incorrect activations of expertGroup1 (representing around 23% of the total tickets). This helps us to: (i) identify more precisely the human time lost by expertGroup1, i.e. a total duration increased 2.5 times, as the additional tickets include longer tickets that could potentially correspond to anomalies with significant calculated lost time, and (ii) implement precedence sequential constraints for detecting additional tickets where actors of domain B were involved in such incorrect activations (i.e. 49 tickets representing 34% of tickets validating such case).

5. Maturity and available resources

ADT is available for installation within our organization. The front-end is implemented with the Angular and Electron frameworks. The back-end is implemented in Python. The implemented solution was validated in [1] using the public Enron email dataset, where the performance is reported, and the obtained activities were made publicly available (i.e. see this link). We also conducted tests on other datasets, such as incident ticket comments and the emails of employees of our organization. We validated the results with business experts, who reported that the major advantage of the tool is its ability to generate first results in an unsupervised way (i.e. without human intervention) that are interpretable and can be adapted to enrich other event logs. We provide the following elements:

• A documentation explaining how the tool is installed and run within our organization. However, we were not able to provide access for external collaborators.
• A public video illustrating how our tool serves the described example scenario (Section 4).
• A guide to access the implemented Celonis interfaces.

6. Conclusion and future work

In this paper, we presented ADT, which analyses the free textual communication records of business actors to enrich event logs for process mining. We intend to leverage the studied use cases to communicate across all departments of our organization and to demonstrate visually the potential gains, so as to enhance collective efficiency in other use cases. We aim to assess the extent to which these efforts can be replicated for other processes, and to customize the developed tool to make it increasingly versatile whenever needed. In future work, we aim to cover various types of communication records. Additionally, we aim to investigate the following points:

• Improve the anonymization functionality to consider the privacy risk of sharing sequences of events rather than only individual events [6]. This would allow users to check the sequence of the activities that occurred and to edit confidential sub-sequences that do not seem sensitive when looking at individual activities.
• Extend the format of the generated event log to support recent formats, mainly the Object-Centric Event Log (OCEL) [7].
• Study how generative AI could enhance the performance of ADT. With the currently publicly available models, large resources in terms of RAM and GPU are needed (e.g. 80 GB for MosaicML MPT-30B and at least 30 GB for Llama 2 70B after quantization). This makes their integration into our tool, as an office application, infeasible. Moving confidential data out of the user’s machine to be processed in an external data center (as in the case of ChatGPT) is also infeasible, as explained before.

Acknowledgments

We would like to thank Alain Bouchard, David Menchi, Frédéric Bastard and Marjorie Deshayes, the experts in the studied process, for their invaluable assistance, for the time they devoted to addressing our inquiries, for testing the tool we made available to them, and for placing their trust in us. This collaborative effort was instrumental in achieving promising results.
References

[1] M. Elleuch, O. Alaoui Ismaili, N. Laga, W. Gaaloul, Process fragments discovery from emails: Functional, data and behavioral perspectives discovery, Information Systems (2023) 102229.
[2] M. Elleuch, O. Alaoui Ismaili, N. Laga, W. Gaaloul, B. Benatallah, Discovering activities from emails based on pattern discovery approach, in: Business Process Management Forum: BPM Forum 2020, Seville, Spain, September 13–18, 2020, Proceedings 18, Springer, 2020, pp. 88–104.
[3] C. Kecht, A. Egger, W. Kratsch, M. Röglinger, Event log construction from customer service conversations using natural language inference, in: 2021 3rd International Conference on Process Mining (ICPM), IEEE, 2021, pp. 144–151.
[4] S. Corston-Oliver, E. Ringger, M. Gamon, R. Campbell, Task-focused summarization of email, in: Text Summarization Branches Out, 2004, pp. 43–50.
[5] M. Dredze, T. Lau, N. Kushmerick, Automatically classifying emails into activities, in: Proceedings of the 11th international conference on Intelligent user interfaces, 2006, pp. 70–77.
[6] G. Elkoumy, M. Dumas, Libra: High-utility anonymization of event logs for process mining via subsampling, in: 2022 4th International Conference on Process Mining (ICPM), IEEE, 2022, pp. 144–151.
[7] A. F. Ghahfarokhi, G. Park, A. Berti, W. M. van der Aalst, OCEL: A standard for object-centric event logs, in: European Conference on Advances in Databases and Information Systems, Springer, 2021, pp. 169–175.