Towards Retrograde Process Analysis in Running Legacy Applications

Marius Breitmayer, Lisa Arnold and Manfred Reichert
Institute of Databases and Information Systems, Ulm University, Germany
{marius.breitmayer,lisa.arnold,manfred.reichert}

Abstract. Process mining algorithms are highly dependent on the existence and quality of event logs. In many cases, however, software systems (e.g., legacy systems) do not leverage workflow engines capable of producing high-quality event logs for process mining algorithms. As a result, the application of process mining algorithms is drastically hampered for such legacy systems. The generation of suitable event data from running legacy software systems, therefore, would foster approaches such as process mining, data-based process documentation, and process-oriented software migration of legacy systems. This paper discusses the need for dedicated event log generation approaches in this context.

Keywords: legacy systems, process mining, code analysis, event log

1 Introduction

Software applications are implemented to address the needs of users, use cases, and business processes. However, the majority of common software systems (e.g., legacy systems or individual software solutions) have not been designed with the goal to provide high-quality process-related event logs that allow for comprehensive process analyses and visualizations with modern process mining tools. Relevant questions emerging in legacy software modernization projects include, for example, how the process implemented by the legacy software system is structured (Process Discovery) or to what extent its execution deviates from a predefined to-be process (Conformance Checking). Currently, there exist three basic approaches to obtain process models:

1. Log analysis uses existing logs (e.g., event logs) to reconstruct the implemented process based on audit or workflow data.
Consequently, the quality of the resulting process model is directly correlated with both the existence and quality of corresponding event logs [2,3]. However, the vast majority of individual applications and legacy systems are unable to provide appropriate event logs. Moreover, even database-centric applications typically do not provide transaction-level audit data. Consequently, there has been no effective entry point for process mining in such systems so far.

2. Interviews may be conducted to discover the desired process model as perceived by key users and process owners [9]. Additionally, data models may be parsed to identify effects of processes on corresponding data. Analyzing such data models enables assumptions on the underlying processes. This approach, however, is very time-consuming and prone to both misunderstandings and misconceptions. In addition, interviews do not ensure completeness of the relevant processes and their various aspects, as they often neglect exceptions or specific process perspectives (e.g., data, time).

3. Pattern recognition attempts to identify typical process patterns in various data pools using algorithms from the field of artificial intelligence [1]. The algorithms require a deep analysis and learning phase prior to their application to the raw data. This is a time-consuming, cost-intensive, and fuzzy approach, which is therefore hardly pursued.

Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

In the context of legacy systems, however, none of the presented approaches is easily applicable. All three approaches have in common that the business processes (and event logs) implemented by the legacy software systems need to be represented accurately. Since most individual software solutions do not necessarily use process engines capable of delivering suitable process data, alternative approaches are required.
One approach to tackle this challenge is to observe process participants during process execution and to record their interactions with the software system, resulting in a fine-grained documentation.

Section 2 describes the proposed solution approach. Section 3 discusses related work. Finally, Section 4 provides a summary and outlook.

2 Solution Approach

A human-centered business process can be defined as a sequence of user interactions with a software application, where each interaction is subject-bound (i.e., part of the same transaction). In legacy systems, such processes can be initiated and terminated by suitable actions (e.g., pre-defined key combinations or menu items). When such actions are added to an event stream together with the associated application object (e.g., an order identified by its unique order number), process mining tools subsequently obtain process-related event logs as input. The collected event data may then constitute the basis for a plethora of use cases, such as process documentation, process mining, and process-oriented cost estimations for modernizing legacy software systems (i.e., software migration).

We aim to create different logging variants for existing legacy production systems:

1. Dedicated recording documents existing processes by assigning related program components. Users may determine the start and end of the recording using predefined key combinations, thus precisely delimiting all activities that constitute the recorded process (or the considered process part).

2. Silent recording tracks the entire usage of the application from the first login until closing the application. A decision can be made as to whether this should be done for all sessions or only for selected user sessions (e.g., only sessions of users from a certain department). Furthermore, it may be configured which information is stored (e.g., to ensure compliance with data protection requirements).
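The two recording variants can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names Event, Recorder, track, allowed_users, and store_user are hypothetical and chosen only for this example. The sketch assumes that every tracked interaction carries an application object identifier (used as case identifier), an activity label, and a user.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Set


@dataclass
class Event:
    """One entry of the event stream (hypothetical structure)."""
    case_id: str            # application object, e.g. an order number
    activity: str           # user interaction or program unit that was executed
    user: Optional[str]     # None if user data must not be stored
    timestamp: datetime


@dataclass
class Recorder:
    """Illustrative recorder supporting 'dedicated' and 'silent' recording."""
    mode: str = "dedicated"
    allowed_users: Optional[Set[str]] = None  # None means: record all sessions
    store_user: bool = True                   # False for data protection reasons
    recording: bool = False
    events: List[Event] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.mode not in ("dedicated", "silent"):
            raise ValueError("mode must be 'dedicated' or 'silent'")
        # Silent recording is active for the whole session.
        if self.mode == "silent":
            self.recording = True

    def start(self) -> None:
        """Would be bound to a predefined key combination or menu item."""
        self.recording = True

    def stop(self) -> None:
        self.recording = False

    def track(self, case_id: str, activity: str, user: str) -> bool:
        """Append an event if recording is active; return whether it was stored."""
        if not self.recording:
            return False
        if self.allowed_users is not None and user not in self.allowed_users:
            return False
        stored_user = user if self.store_user else None
        self.events.append(
            Event(case_id, activity, stored_user, datetime.now(timezone.utc))
        )
        return True
```

In dedicated mode, only interactions between start() and stop() end up in the event stream; in silent mode, everything is recorded, optionally restricted to selected users and without storing the user name.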
To minimize the performance effects of these recordings on running applications, we rely on existing logging mechanisms of the application infrastructure. For Oracle applications using a WebLogic Server, for example, Oracle Diagnostic Logging (ODL) offers extensive possibilities to manage application information via the administration console. Among others, Oracle logger classes (e.g., Application Development Framework) may use this information through ODL handlers [15]. In Single Page Applications (e.g., the Oracle JavaScript Extension Toolkit, JET), the primary object is known; however, the context between multiple process steps may get lost due to the loose coupling of user sessions and services. Even applications based on Oracle Forms allow adding appropriate message calls for each PL/SQL unit.

Using existing system logging functionality, the recording quality is significantly increased compared to purely mining the data model, as user interactions can be unambiguously linked to the process, program code, and associated data.

Fig. 1 depicts the approach. In a first step, we identify relevant objects using information from the database and the source code of the application. However, especially in databases of legacy systems, assumptions such as good normalization or even the existence of foreign key constraints are often not applicable. The reason for this is that in many cases the logic is represented in the source code of applications rather than the database. By combining knowledge from the database (e.g., create, read, update, and delete operations) and corresponding source code (e.g., code fragments corresponding to such operations), we are able to tackle this issue.
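As a rough illustration of combining database and source code knowledge, the following Python sketch scans code units for SQL statements and records which unit performs which create, read, update, or delete operation on which known table. The regular expressions, the function name correlate_crud, and the sample inputs are assumptions made for this example only; real legacy code (e.g., PL/SQL with dynamic SQL) would require a proper parser rather than pattern matching.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Simplified patterns for static SQL (assumption: one table name per match,
# no schema prefixes, no dynamic SQL).
CRUD_PATTERNS = {
    "create": re.compile(r"\bINSERT\s+INTO\s+(\w+)", re.IGNORECASE),
    "read": re.compile(r"\bSELECT\b.*?\bFROM\s+(\w+)", re.IGNORECASE | re.DOTALL),
    "update": re.compile(r"\bUPDATE\s+(\w+)", re.IGNORECASE),
    "delete": re.compile(r"\bDELETE\s+FROM\s+(\w+)", re.IGNORECASE),
}


def correlate_crud(
    code_units: Dict[str, str], known_tables: Set[str]
) -> Dict[str, Dict[str, Set[str]]]:
    """Map table -> CRUD operation -> names of code units performing it.

    code_units maps a (hypothetical) unit name to its source text;
    known_tables holds lower-case table names taken from the database schema.
    """
    result: Dict[str, Dict[str, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for unit_name, source in code_units.items():
        for operation, pattern in CRUD_PATTERNS.items():
            for match in pattern.finditer(source):
                table = match.group(1).lower()
                # Only keep matches that refer to a table present in the schema.
                if table in known_tables:
                    result[table][operation].add(unit_name)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {table: dict(ops) for table, ops in result.items()}


# Hypothetical code fragments of a legacy application.
units = {
    "save_order": "INSERT INTO orders (id) VALUES (:1); "
                  "UPDATE customers SET last_order = :1 WHERE id = :2",
    "show_order": "SELECT * FROM orders WHERE id = :1",
    "purge_order": "DELETE FROM orders WHERE id = :1",
}
crud_map = correlate_crud(units, {"orders", "customers"})
```

The resulting mapping links each table to the code units that touch it, which is the kind of correlation needed even when the schema lacks foreign key constraints.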
After having identified process-relevant objects in both source code and database, we correlate them and add code tracking capabilities to the legacy system using, for example, the possibilities mentioned previously. This then enables the generation of event logs from either dedicated or silent recording. These event logs may then be used during analysis.

[Figure omitted. Labeled components: Source code (programming languages, scripting languages); Database (schema, instances, distribution); Parser; AST (Abstract Syntax Tree); Code Tracker (pre-installation step); Legacy System; Production System; Event Stream Data (e.g., event stream, XES); Real-Time Data Synchronization; Real-Time Meta Data; API; Repository; DWH; Cube; ASCII Files; Process Visualization.]

Fig. 1. General approach

When analyzing event logs generated from such legacy systems, a valuable insight can be gained that the three approaches described in Section 1 are unable to provide: if certain entries in the event stream are missing when comparing the event stream with the source code, this indicates that the corresponding process steps, although implemented and present, have never been used. This information is essential when removing technical debt and modernizing legacy systems [8].

3 Related Work

This work is related to the research areas of process mining, event log generation, and code analysis. Process mining [2] provides techniques to discover business process models from event logs [16,12], to evaluate conformance between process event logs and models [6], and to enhance processes [3]. Existing process discovery approaches mainly focus on the control flow perspective while the data perspective is mostly neglected [13]. The latter is of particular interest for meaningful process analysis and improvements (e.g., legacy system migration to new software architectures).

Event log generation is concerned with the generation of event logs based on various sources.
In [11,4], approaches to record user activities based on desktop actions (e.g., for robotic process automation) are presented. Our approach is also able to correlate such desktop actions with the corresponding source code fragments and database operations, allowing for a more detailed event log generation. The case study presented in [14] discusses the generation of event logs from a real-world data warehouse of a large U.S. health system. While some challenges (e.g., correlating events) may also arise in the context of legacy systems, we plan to minimize required domain expert interviews by automatically extracting domain knowledge from the source code.

Code analysis comprises traditional analysis (e.g., style checking or data flow analysis [10]) and profiling (e.g., CEGAR [7] and BMC [5]) which, combined with process knowledge, yield great potential for software improvement and migration.

4 Conclusion and Outlook

This paper emphasizes the need for spending research efforts on the recording of high-quality event data in legacy systems.