Business Process Representation Learning

Peter Pfeiffer1,2
1 German Research Center for Artificial Intelligence (DFKI), Campus D3_2, 66123 Saarbrücken, Germany
2 Saarland Informatics Campus, 66123 Saarbrücken, Germany

Abstract

Data stored in information systems and gathered from the execution of business processes is a rich source of information. Process mining aims to extract knowledge from such data, usually captured in event logs, in order to understand and improve business processes. From a data-science perspective, event log data is a very interesting yet complex data modality. It not only describes the process from the control-flow perspective, but also contains additional information such as the entities and organizations involved, temporal aspects, and much more. While there is a lot of work on applying existing machine learning techniques to event log data to solve specific problems, little work has focused on how to learn from such data effectively. This work presents the idea of developing representation learning models for event logs, i.e., neural-network-based methods specifically designed for this data modality, which learn generic and rich representations of events and cases. The representations are expected to be used for solving different business problems, such as process prediction, anomaly detection, or other process mining tasks, more efficiently and effectively.

Keywords

Business Process Data, Representation Learning, Process Analytics

1. Motivation and Problem Description

Extracting and gaining knowledge from data of business process executions, e.g., provided by information systems, has gained a lot of attention over the past years. In the research field of process mining, a large variety of methods to analyse such data have been developed, where the standard data format used to store such information and perform analyses on is the event log [1].
While the majority of process mining methods extract hand-crafted features from the information in the event log, like footprint matrices that describe how activities relate to each other, some use machine learning (ML) methods based on neural networks to make predictions about the future state of running process instances [2]. On other data modalities like images or text, it has been shown that neural-network-based methods perform equally well as, or outperform, traditional feature-based methods in image [3] or language understanding tasks [4]. For event log data, this has mainly been shown for process prediction [2, 5, 6]. Recently, a study showed that process discovery can be solved with graph neural networks [7], reaching performance comparable to feature-based methods. However, many problems in process mining still rely on hand-crafted instead of learned features.

BPM 2022 Best Dissertation Award, Doctoral Consortium, and Demonstration & Resources Track
peter.pfeiffer@dfki.de (P. Pfeiffer), ORCID 0000-0002-0224-4450 (P. Pfeiffer)

Part of the success of neural-network-based methods on other data modalities like images and text is due to their ability to learn and generate a rich representation of the concepts in the data in the form of a feature vector, which can then be used for a variety of tasks. For instance, neural-network-based language models are trained to fill gaps in sentences and to predict whether two sentences follow each other, which teaches them to learn effective representations of words and sentences [4] without relying on labels created by domain experts. Afterwards, they can be fine-tuned on a variety of specific tasks using labeled data.
Such approaches belong to representation learning, a field of machine learning that deals with learning a representation of data that "makes it easier to extract useful information when building classifiers or other predictors" [8]. After pre-training with such a self-supervised training method, the same model can be used to solve various downstream tasks utilizing the learned representation. If the pre-trained model produces "good" representations, it is sufficient to add simple, task-specific neural networks to solve certain problems. The idea behind such pre-training is that the features learned thereby are useful for the downstream tasks. Thus, representation learning aims at incorporating features into the representation that are useful across different settings and tasks, enabling transfer learning, domain adaptation and multi-task learning [9]. This two-stage strategy, which has led to great success in machine learning, makes solving many tasks more effective, as indicated by high accuracy on various downstream tasks. One example is BERT [4], a transformer-based model that was pre-trained on a very large corpus of textual data in a self-supervised fashion to generate representations of words and sentences. The learned representations have been used for various downstream tasks and set new state-of-the-art results in question answering, sentence classification and other tasks that require reasoning over words and sentences. For other data modalities, approaches that adapt these ideas [10, 11] perform equally well as or better than existing methods. For data describing business processes, there is no comparable approach that can produce representations effective for solving different tasks. As the modality of event data is different from text or images, pre-trained models like BERT cannot be applied directly.
While textual data can often be described as a single sequence of words, a case consists of events with attributes that describe what action has been performed at what time (and is not necessarily a sequence). Additional attributes can be added, e.g., who performed the activity or which objects were involved. Events can have different numbers of attributes, which can be of different types, e.g., categorical, numerical and temporal, and have different scales. Furthermore, some attributes change with each event while others are fixed for the whole case. These characteristics make it challenging to learn representations from event log data with neural networks. Furthermore, we argue that it is difficult to process such data with existing network architectures like LSTMs or BERT, as they do not fully adapt to and account for its characteristics. In order to enable ML-based analysis of event log data on a large scale, more research is required on how to learn rich representations of the concepts describing business processes, for instance, how to make such networks aware of what event logs consist of by learning the concepts of events, cases and processes. Existing neural-network-based approaches for event logs either focus on solving one task, like next step prediction [2] or anomaly detection [12], or learn representations using only a subset of the information available in the event log. For instance, representation learning approaches for cases [13, 14] or events [15] include categorical attributes, but no numerical or temporal information. This makes such approaches less generic, as they fail to include all relevant information found in the event log and only work for certain tasks. Thus, the representations learned by such approaches are task-specific.

Figure 1: Overview of the intended solution. Left: Pre-training on large sets of business processes and event logs. Right: Application of the pre-trained model on one specific event log solving different tasks.
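To illustrate this heterogeneity, the following minimal sketch contrasts event log data with a plain token sequence. All case identifiers, attribute names, and values here are invented for illustration, not taken from any concrete dataset.

```python
from datetime import datetime, timedelta

# Hypothetical mini event log: two cases of a fictitious order process.
t0 = datetime(2022, 1, 3, 9, 0)
event_log = {
    "case-1": [
        {"activity": "Create Order", "resource": "Alice", "amount": 120.0,
         "timestamp": t0},                       # categorical, numerical, temporal attributes
        {"activity": "Approve Order", "resource": "Bob",
         "timestamp": t0 + timedelta(hours=2)},  # fewer attributes than the first event
    ],
    "case-2": [
        {"activity": "Create Order", "resource": "Alice", "amount": 80.0,
         "timestamp": t0 + timedelta(days=1)},
    ],
}

# Unlike a sentence (one homogeneous token sequence), each event mixes
# attribute types, and events within a case may carry different attributes.
attrs_per_event = [sorted(e) for e in event_log["case-1"]]
print(attrs_per_event)
# [['activity', 'amount', 'resource', 'timestamp'], ['activity', 'resource', 'timestamp']]
```

A neural network for this modality therefore has to handle variable attribute sets and mixed types per event, which a fixed-vocabulary sequence model does not do out of the box.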
Finding ways to learn a generic representation is one of the main objectives of representation learning and would, applied to event logs, allow such methods to be used for more tasks. Some work [14, 16, 17] indicates that approaches able to learn rich representations of cases or events, containing and combining information from multiple event or case perspectives, could be helpful for solving certain tasks on such data. However, this is hard to achieve using hand-crafted features. This PhD project is about developing methods that can learn rich representations of the different concepts in data gathered from the execution of business processes, to be used for solving analytical tasks more effectively. Thus, the research problem is to develop specialized neural network architectures and training objectives for learning from such data, as well as applications and assessments that demonstrate and measure the effectiveness of the representations: for instance, how to adapt successful training procedures from other modalities to event log data, and how to design encoding approaches that preserve the structure and semantics of the data. Furthermore, we want to investigate what characteristics the representations should have to be effective. In general, a good representation is one that makes solving subsequent tasks easier [9]. Some general-purpose priors exist which are not task-specific and can be followed when developing representation learning methods [8]. However, as the representations in this work are expected to be effective for solving process mining tasks, they should have characteristics that support problem solving on event data. The following reflect desirable characteristics from our current point of view, but a more systematic approach of collecting them and aligning them with process mining tasks is planned as part of the project.
First, representations should be effective for solving different tasks and trainable with small quantities of manually labeled data. Furthermore, the approach should be able to learn representations for events as well as for cases, which requires a hierarchical perception of the data. Event representations are expected to contain features describing the semantics of the activity and all its attributes, its position in the case, as well as its context, i.e., the events nearby. Similarly, representations of cases should aggregate the semantics of all events as well as behavioural characteristics of the case itself. Another desired property is domain adaptation, i.e., that representations can also be created for event logs on which the model was not trained. However, while textual data and the features describing it do not change too much within one language, different processes might exhibit very different behaviour, which makes it challenging to transfer features learned from one set of event logs to another. Thus, we want to investigate which features are transferable, or how to transfer them with little effort. Some process mining tasks might ultimately require an additional training phase, because the generic characteristics learned are not sufficient or the required features are specific to the event log; in that case, domain knowledge (labeled data) or a different training objective is needed. For instance, for case classification tasks it could be helpful if the case representation contained information about which cases fit the underlying process and which do not. Such features are difficult to learn generically and might require an additional training phase or labeled data on the event log of interest. What makes representation learning different from other machine learning tasks is that no objective or target function exists which can be directly optimized during training to obtain the desired characteristics.
Rather, training methods have to be developed that "shape" the representations into the desired form. Nevertheless, pre-training to learn certain generic characteristics has been shown to be beneficial on other modalities [9], which makes investigating this idea for event data interesting.

2. Methodology and Techniques

The research project follows the design science methodology [18], including exploratory phases within its cycles that steer the development of a representation learning approach for business process data. The problem and opportunity identification, i.e., to design a representation learning approach for event log data that makes event log analysis more efficient, is part of the relevance cycle. Furthermore, acceptance criteria must be defined. To investigate which characteristics the representations should have to be effective, we follow the general priors for good representations [8] and systematically collect the tasks and problems the approach will be applied to. From these observations we will derive the features that the representations must contain for process mining tasks. As discussed earlier, it is very hard to directly measure whether the representations contain the desired characteristics. Instead, representation learning approaches are usually evaluated on a variety of different tasks: if they are able to solve the tasks accurately, the representations are considered effective. We follow the same evaluation procedure by assessing whether the learned event and case representations can be used to solve different process mining tasks, like process prediction or anomaly detection, with performance similar to or better than existing algorithms, measured in terms of established metrics like precision, recall and accuracy. Furthermore, we try to assess how the learned representations can be utilized for tasks not considered so far.
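The evaluation protocol described above can be sketched as follows: freeze the learned case representations, fit a deliberately simple task head on them, and report the standard metrics. The 2-D "case representations", the anomaly-detection framing, and the nearest-centroid head below are invented toy stand-ins, not the project's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy frozen "case representations": anomalous cases shifted away from normal ones.
normal = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
anomal = rng.normal(loc=3.0, scale=0.5, size=(10, 2))
X = np.vstack([normal, anomal])
y = np.array([0] * 50 + [1] * 10)   # 1 = anomalous case

# Nearest-centroid probe: if the representations are rich, such a trivial
# head should already separate the classes well.
centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)

tp = int(((pred == 1) & (y == 1)).sum())
fp = int(((pred == 1) & (y == 0)).sum())
fn = int(((pred == 0) & (y == 1)).sum())
accuracy = float((pred == y).mean())
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(accuracy, precision, recall)
```

In the actual evaluation, the same scheme would be applied per task (prediction, anomaly detection, etc.) against existing algorithms on common benchmark logs.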
State-of-the-art machine learning techniques will be used, i.e., neural network architectures and training methods for learning representations, combined with knowledge from the process mining field. Experience from both domains is combined to develop new methods to process and learn from event log data. Thereby, additions are made to the knowledge base in both domains, i.e., in learning (hierarchical) representations of complex data modalities as well as in how to learn and use such representations to solve process mining tasks. In the design cycle, new representation learning approaches are developed in an iterative and exploratory fashion. Knowledge from the corresponding domains is used to enhance the approach and to test its effectiveness on various tasks. Each representation learning method consists of a neural network architecture and a self-supervised pre-training phase; both need to be combined appropriately to create valuable feature vectors. Techniques applied throughout the design include network architectures like transformer models [19], specifically customized transformer architectures for modalities like time series [10], which are further extended to appropriately capture the concepts in event logs. For training, self-supervised learning techniques similar to BERT are adopted, e.g., reconstructing missing parts of the input or predicting characteristics of higher-level concepts in the data. They will be combined with process analytics knowledge to find effective learning objectives that fit business process data.

3. Solutions and Results

Results achieved so far in this PhD project include the Multi-Perspective Process Network (MPPN) [20], which learns representations of cases.
With respect to the network architecture and training method, an image encoding approach for time series was applied that allows categorical, numerical and temporal information in the event log to be processed in the same way, using a self-supervised pre-training phase and an architecture based on convolutional neural networks. Instead of training embeddings for the different attributes in the event log each time, we transform all perspectives into distinct image representations and use pre-trained convolutional neural networks to extract features describing each perspective. We first pre-train MPPN on the next event prediction task by predicting several attribute values of the next event at once. Thereby, it learns the general characteristics of the event log, i.e., how activities and attributes influence each other, before being applied to a certain task. It has been demonstrated [20] that the learned representations are effective for solving different process prediction tasks and, without additional training, suitable for case retrieval, i.e., retrieving contextually similar cases to a case of interest. Nevertheless, MPPN has some shortcomings. The main disadvantages are that a lot of semantic information is lost when transforming perspectives to 2D images and that it does not learn representations for events. Losing the semantic information, e.g., activity and attribute names and values, makes the approach less generic, as it requires training one MPPN per event log, which prevents domain adaptation. In the next iteration, a new approach following the two-stage scheme illustrated in Figure 1 is being developed that overcomes the issues of MPPN. Again, the model will be pre-trained on a self-supervised task to generate representations which can afterwards be utilized to solve different problems.
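MPPN's image encoding is based on Gramian Angular Fields [20]. The following is a minimal sketch of that transformation for a single numerical perspective; the CNN feature extraction, the handling of categorical/temporal perspectives, and the pre-training are omitted, and the sample values are invented.

```python
import numpy as np

def gasf(series: np.ndarray) -> np.ndarray:
    """Gramian Angular Summation Field: encode a 1-D series as a 2-D image.

    The series is rescaled to [-1, 1], mapped to polar angles via arccos,
    and the image entry (i, j) is cos(phi_i + phi_j).
    """
    lo, hi = series.min(), series.max()
    x = 2.0 * (series - lo) / (hi - lo) - 1.0   # rescale to [-1, 1]
    phi = np.arccos(np.clip(x, -1.0, 1.0))      # angle per time step
    return np.cos(phi[:, None] + phi[None, :])  # n x n "image"

# Toy numerical perspective, e.g. an amount attribute over 5 events.
img = gasf(np.array([10.0, 30.0, 20.0, 50.0, 40.0]))
print(img.shape)  # (5, 5)
```

Rendering every perspective as such an image is what lets a single pre-trained CNN backbone process otherwise incomparable attribute types uniformly, at the cost of discarding the names and symbolic values noted above.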
The new approach learns representations of events and cases simultaneously, comprising as much of the semantic information found in the event log as possible, using a new architecture that is more efficient and flexible. In the pre-training phase, the objective is to train the model 𝑁 to recognize and interpret the different concepts in the data and how they relate to each other by predicting different characteristics of business processes on event and case level. This involves, e.g., reconstructing missing event values in order to learn features on event level, and predicting case or process characteristics to learn higher-level features. By including and encoding as much semantic information from the event log as possible, such as attribute and activity names, the representations should be rich in information and the approach generic. This is achieved by splitting each event into distinct tokens, where each token carries a certain part of the information [19, 10], which enables a more flexible and generic way of processing and learning from event data. For instance, the activity in each event is encoded as a token by contextualizing the activity's name and combining that with the contextualized attribute name. The same is done for other attributes that contain semantic information. Depending on the attribute type, appropriate encoding techniques are used to transform the information into tokens, which serve as input to the neural network. Similar to the positional encoding used in BERT [19], an "event encoding" is added that indicates which tokens belong to the same event. Thus, each token contains information about which attribute it represents, which value/content the attribute has, and to which event it belongs.
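The token-based encoding and the reconstruction-style objective described above can be sketched as follows. The `Token` structure, the attribute names, and the `[MASK]` convention are illustrative simplifications assumed for this sketch; the actual approach additionally contextualizes names with learned embeddings.

```python
from dataclasses import dataclass

@dataclass
class Token:
    attribute: str   # which attribute this token represents
    value: object    # the attribute's value/content
    event_idx: int   # "event encoding": which event the token belongs to

def tokenize_case(case: list) -> list:
    """Split each event into one token per attribute, tagged with its event index."""
    return [Token(attr, val, i)
            for i, event in enumerate(case)
            for attr, val in event.items()]

case = [
    {"activity": "Create Order", "resource": "Alice", "amount": 120.0},
    {"activity": "Approve Order", "resource": "Bob"},
]
tokens = tokenize_case(case)
print(len(tokens))  # 5 tokens: 3 for event 0, 2 for event 1

# Reconstruction-style pre-training objective: hide one value, keep it as target.
target = tokens[0].value      # "Create Order"
tokens[0].value = "[MASK]"    # the model should predict the hidden value
```

Because events with different attribute sets simply yield different numbers of tokens, this encoding sidesteps the fixed-schema assumption of per-attribute embedding layers.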
Using this token-based encoding approach, which embeds information about how event log data is structured into the input given to 𝑁, allows data from the event log to be processed generically and 𝑁 to interpret the information as intended by the data structure. Additionally, special tokens like the 𝐶𝐿𝑆 token in BERT are added, which can be used to solve classification tasks on cases or events. Splitting and encoding the data into tokens allows processing them with transformer-based architectures [19, 10]. After pre-training 𝑁 on a large set of synthetic and real-world business processes and event logs, it will be fine-tuned or directly applied on different tasks to measure its effectiveness. For some tasks, no fine-tuning with labeled data may be required, for instance unsupervised ones such as anomaly detection [12] or clustering-based tasks like behaviour mining [17] or event abstraction [21, 16]. For other tasks, the representations need to be fine-tuned, e.g., for process prediction as demonstrated in previous work [20], and additional labeled data might be required, as indicated by the dotted arrow in Figure 1. In order to demonstrate that the representations are helpful, the performance will be compared to existing techniques using common datasets and standardised evaluation procedures, for example next-step and outcome prediction tasks on the BPIC event logs, or classifying traces into fitting and unfitting ones using the data provided by the process discovery challenge (PDC)1. A dedicated representation learning benchmark is also planned, which combines different process mining tasks and datasets on which to test and compare approaches.

1 https://www.tf-pm.org/competitions-awards/discovery-contest

4.
Conclusion

Designing customized neural-network-based architectures and training methods for event logs that are generic and learn to interpret the modality, i.e., learn representations of events and cases including their semantics, is novel. The idea differs from existing approaches, which apply network architectures developed for other data modalities, like LSTMs, and training strategies like next step prediction to event log data, in that we aim for approaches that learn the concepts in the data and can be used for various tasks. By separating the training from the application phase, representations can be learned once and applied for solving different tasks. This is expected to be more effective than designing different approaches for different problems. Furthermore, having a rich representation of events and cases makes problem solving easier, as a simple classifier with a few samples can be sufficient for solving a specific task. Using a feature vector representation also brings some limitations, as the feature vectors produced by 𝑁 are not as explainable and understandable by humans as hand-crafted features. Furthermore, pre-training a representation learning model requires a lot of data and careful parameter optimization. Not all tasks in all domains benefit from using one representation, which might also apply to this project. However, once the model is pre-trained, it can be used on different datasets and easily fine-tuned to various tasks. The results achieved so far indicate that the representations can work for different predictive tasks as well as for retrieval. In the future, we expect to gain insights into how the architecture of the neural network, the encoding methods and the training objective have to be designed to learn effective representations of event logs, and which features are useful for solving process mining tasks.
As the data modality of event logs is challenging, learning representations from them is an interesting problem not only from a process mining but also from a machine learning perspective. It could enable solving other process mining tasks that still rely on hand-crafted features more effectively, enabling ML-based analysis in process mining on a larger scale.

References

[1] W. M. P. van der Aalst, Process Mining: A 360 Degree Overview, Springer International Publishing, Cham, 2022, pp. 3–34. doi:10.1007/978-3-031-08848-3_1.
[2] J. Evermann, J.-R. Rehse, P. Fettke, Predicting process behaviour using deep learning, Decision Support Systems 100 (2017) 129–140.
[3] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems 25 (2012).
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, 2019.
[5] N. Tax, I. Teinemaa, S. J. van Zelst, An interdisciplinary comparison of sequence modeling methods for next-element prediction, Software and Systems Modeling 19 (2020) 1345–1365.
[6] W. Kratsch, J. Manderscheid, M. Röglinger, J. Seyfried, Machine learning in business process monitoring: A comparison of deep learning and classical approaches used for outcome prediction, Business & Information Systems Engineering 63 (2021) 261–276. doi:10.1007/s12599-020-00645-0.
[7] D. Sommers, V. Menkovski, D. Fahland, Process discovery using graph neural networks, in: 3rd International Conference on Process Mining (ICPM), IEEE, 2021, pp. 40–47.
[8] Y. Bengio, A. Courville, P. Vincent, Representation learning: A review and new perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence (2013) 1798–1828.
[9] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016. URL: http://www.deeplearningbook.org.
[10] G.
Zerveas, S. Jayaraman, D. Patel, A. Bhamidipaty, C. Eickhoff, A transformer-based framework for multivariate time series representation learning, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery, 2021, pp. 2114–2124. doi:10.1145/3447548.3467401.
[11] A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, J. Carreira, Perceiver: General perception with iterative attention, in: International Conference on Machine Learning, PMLR, 2021, pp. 4651–4664.
[12] T. Nolle, A. Seeliger, M. Mühlhäuser, BINet: Multivariate business process anomaly detection using deep learning, in: Business Process Management, Springer International Publishing, 2018, pp. 271–287.
[13] S. Luettgen, A. Seeliger, T. Nolle, M. Mühlhäuser, Case2vec: Advances in representation learning for business processes, in: Process Mining Workshops, ICPM 2020, Springer International Publishing, 2020, pp. 162–174.
[14] A. Seeliger, S. Luettgen, T. Nolle, M. Mühlhäuser, Learning of process representations using recurrent neural networks, in: International Conference on Advanced Information Systems Engineering, Springer International Publishing, 2021, pp. 109–124.
[15] P. De Koninck, S. vanden Broucke, J. De Weerdt, act2vec, trace2vec, log2vec, and model2vec: Representation learning for business processes, in: Business Process Management, Springer International Publishing, 2018, pp. 305–321.
[16] A. Rebmann, Abstracting low-level event data for meaningful process analysis, in: Proceedings of the Demonstration & Resources Track, Best BPM Dissertation Award, and Doctoral Consortium at BPM 2021, co-located with the 19th International Conference on Business Process Management, 2021.
[17] L. Abb, C. Bormann, H. van der Aa, J. R. Rehse, Trace clustering for user behavior mining, in: 30th European Conference on Information Systems (ECIS 2022), 2022.
[18] A. R.
Hevner, A three cycle view of design science research, Scandinavian Journal of Information Systems 19 (2007) 4.
[19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[20] P. Pfeiffer, J. Lahann, P. Fettke, Multivariate business process representation learning utilizing gramian angular fields and convolutional neural networks, in: Business Process Management, Springer International Publishing, 2021, pp. 327–344. doi:10.1007/978-3-030-85469-0_21.
[21] S. J. van Zelst, F. Mannhardt, M. de Leoni, A. Koschmider, Event abstraction in process mining: literature review and taxonomy, Granular Computing 6 (2021) 719–736. doi:10.1007/s41066-020-00226-2.