Process Mining Model to Guarantee the Privacy of Personal Data in the Healthcare Sector

Sebastian Saavedra 1, José Llatas 1 and Jimmy Armas-Aguirre 1,2

1 Universidad Peruana de Ciencias Aplicadas, Lima, Perú
2 Pontificia Universidad Católica del Perú, Lima, Perú

Abstract
In this paper, we propose a model to guarantee the privacy of patient data in critical processes in the healthcare sector through the application of process mining. Process mining is a discipline that discovers process models by analyzing event logs in order to identify bottlenecks and establish alternatives to improve their performance. In healthcare institutions, process mining is used to improve critical processes. However, event logs containing confidential patient data are not protected when process mining and data visualization are applied. This increases the risk that sensitive data is stolen and, therefore, that patients are harmed. The proposed model masks event logs containing sensitive data so that the data are inaccessible when process mining is applied. The model comprises four main stages: 1. target definition and data transformation; 2. data masking; 3. inspection and pattern analysis; 4. application of process mining techniques and data visualization. The model was validated using data from an appointment request process of a state health organization in Lima, Peru. Preliminary results showed that all event logs containing sensitive data were protected, flow compliance increased by 68%, and average processing time decreased by 89.4%.

Keywords
Process mining, Healthcare, Data privacy

1. Introduction
The healthcare sector is among the three sectors with the highest number of data breaches and security incidents. In 2016 it was the most affected sector, with 116 incidents, representing 37.2% of all incidents, while the second most affected sector reported only 34 incidents [1].
The World Economic Forum shows in its 2020 Global Risks Report that digital data theft and the risk of cyberattacks on critical infrastructure (including infrastructure in the healthcare sector) were among the top 10 risks most likely to occur in that year [2].

Process mining is a very useful technique for discovering real process models by analyzing event logs. Because of its benefits, many institutions from different business areas use it to optimize their processes. However, as an emerging technique, process mining also faces challenges that have not yet been solved. One of them is addressing security and privacy issues when applying it [3]. The challenge is greater in the healthcare sector, where Electronic Medical Records (EMR) are the most important asset because of the detailed data they contain about patients.

This paper evaluates the creation of a reference model to ensure the privacy of sensitive data in the appointment process, supported by Process Mining and Data Visualization. We expect this model to reduce the existing security gaps in Process Mining in the healthcare sector.

CISETC 2021: International Congress on Educational and Technology in Sciences, November 16-18, 2021, Chiclayo, Peru
EMAIL: u201613718@upc.edu.pe (A. 1); u201617237@upc.edu.pe (A. 2); jimmy.armas@upc.pe, a20184064@pucp.edu.pe (A. 3,4)
ORCID: 0000-0002-7521-4981 (A. 1); 0000-0001-8121-3160 (A. 2); 0000-0002-1176-8969 (A. 3,4)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

This paper is structured as follows: we first review Process Mining models in the healthcare sector and then describe the proposed model as a solution to the problem. Finally, we present conclusions and recommendations based on the results obtained in a case study.

2. State of the Art: Process Mining Models
A three-step framework for privacy preservation during the application of process mining is presented in [4]. In the first step, sensitive information is protected; in the second step, privatized metadata is created; in the third step, process mining is applied to this metadata. However, no case study was carried out to validate the variables of the framework, and the authors note that the effects of applying data transformation methods to preserve privacy in the event logs of healthcare organizations should be investigated.

In [5], a five-phase reference model is developed for the evaluation of operational variables in healthcare using process mining and data visualization. In the first phase, data mining is performed. In the second phase, the event logs are processed; they are analyzed through process mining in the third phase and represented in dashboards using data visualization techniques in the fourth phase. Finally, the results are evaluated in the fifth phase. This model allows the effects of applying process mining to healthcare records to be identified, but it does not include techniques or practices that preserve the privacy of these event logs.

In [6], a protection model for event data privacy is designed using differential privacy, which allows public information about a dataset to be shared without compromising the sensitive data of individuals. This model protects sensitive data by exposing it only through queries, so that process analysts do not have direct access to it. However, the authors show that data protection applies to only one of the three activities of process mining, namely process discovery, and that it does not extend to compliance verification and process improvement.
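To make the query-based idea behind differential privacy more concrete, the following minimal sketch answers counting queries over an event log with Laplace noise. It is not the model of [6]: the function names, the toy event log format (pairs of case ID and activity), and the choice of the Laplace mechanism for counts are assumptions made purely for illustration.

```python
import random
from collections import defaultdict


def laplace_noise(scale, rng):
    # The difference of two independent Exp(1) draws follows Laplace(0, 1);
    # multiplying by `scale` gives Laplace(0, scale).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_case_counts(event_log, epsilon, rng):
    """For each activity, count the distinct cases that contain it and
    return the count perturbed with Laplace noise.

    event_log: iterable of (case_id, activity) pairs (hypothetical format).
    epsilon:   privacy budget spent on EACH released count.
    """
    cases_per_activity = defaultdict(set)
    for case_id, activity in event_log:
        cases_per_activity[activity].add(case_id)

    # Adding or removing one case changes each count by at most 1,
    # so every single count has sensitivity 1 and noise scale 1 / epsilon.
    # Releasing k counts consumes k * epsilon by sequential composition.
    scale = 1.0 / epsilon
    return {
        activity: len(cases) + laplace_noise(scale, rng)
        for activity, cases in cases_per_activity.items()
    }
```

The analyst only sees the noisy counts, never the underlying rows. A smaller `epsilon` adds more noise and gives stronger protection at the cost of accuracy.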
The work in [7] details an analysis of a series of tools used to carry out cyberattacks on healthcare institutions in order to identify the most appropriate defensive techniques. These techniques do not include the protection of processes optimized through process mining.

In short, the literature reviewed includes models and frameworks focused on process mining applied to the healthcare sector, but they do not satisfactorily cover the privacy aspect, either because it is not addressed at all [5] or because data is not protected throughout the entire process of applying process mining [6].

3. Data privacy process model: proposed solution

3.1. Description of the proposal
The model presented in Figure 1 comprises four phases, taking as a reference the method of [8]. This model ensures the privacy of sensitive data that could identify patients whose event logs are in the database used for Process Mining and Data Visualization. It is based on the regulations defined by Ministerial Resolution No. 688 - 2020 MINSA [9].

Figure 1: Reference Model for the privacy of patient personal data in processes using Process Mining

3.2. Phases of the model

3.2.1. Objectives and processing
The objectives of the project are defined, and the data are processed based on them. This phase has three main sub-phases: definition of objectives, extraction, and transformation. In the first sub-phase, different indicators, such as time, quality and cost, must be taken into consideration. Once the general and specific objectives have been defined, questions should be derived from them. These questions establish a way to evaluate the objectives, specifically their progress and fulfillment. The second sub-phase involves the extraction of event logs from various sources, which will be used later for the application of Process Mining.
In the last sub-phase, the event logs are cleaned, integrated, and checked for quality in order to obtain a single merged database showing the ID, the activity, and the timestamp.

3.2.2. Data privacy
This phase is based on [10]: all transformed event logs go through a masking process to ensure the privacy of sensitive data that could lead to patient identification. This phase also comprises three sub-phases: find, evaluate, and protect. The first sub-phase identifies the sensitive data within the event logs. This analysis produces a list of the data that will be masked to maintain the privacy of the patients. The second sub-phase aims to identify the optimal masking algorithm for the event logs, since each attribute calls for a specific form of masking. In the last sub-phase, the chosen technique is executed for each of the attributes. The event logs are then ready for use in Process Mining.

3.2.3. Inspection of event logs and patterns
In this phase, a first impression of the event logs is obtained, and different statistics are collected to summarize the pattern they follow. The number of cases and events, their duration, resources, patterns, and event frequencies are inspected to get a quick view of the process and to understand it completely.

3.2.4. Process Mining and Data Visualization
This phase applies the different process mining techniques to the protected event logs in order to obtain information about the process and its compliance, and to adapt the data so that they can be correctly understood by non-expert users. This phase comprises five sub-phases: discovery, verification, data visualization techniques, evaluation of results, and improvement. In the first sub-phase, the real flow of the process is discovered as recorded by the event logs within the tool used.
In the second sub-phase, inconsistencies between the initially designed process and the event logs obtained are detected. In the third sub-phase, the visualization techniques to be applied to the different attributes of the logs are defined, seeking the best representation of these attributes in order to generate relevant information for the evaluation of the process. In the fourth sub-phase, the results presented through the different visualization techniques are evaluated in order to propose improvements or corrections that will help to optimize the current process. In the last sub-phase, after the results have been analyzed and the indicators that answer the questions raised in the first phase have been measured, improvement opportunities are identified to start a new cycle of the model.

4. Case Study: Experimentation

4.1. Organization
Following the best model validation practices outlined in [11], which show that successful model validation requires that all its steps are fulfilled, the model was validated by processing, securing, and analyzing a dataset from the appointment process of a public health institution in Lima, Peru.

4.2. Validation Process

4.2.1. Definition of objectives and process
First, as part of the model’s objectives and processing phase, the objectives of the project were defined through the formulation of questions, together with variables and indicators to answer them (see Table 1).

Table 1
Objectives and Indicators
Objective | Question | Variable | Indicator
Know the performance of ESSALUD’s appointment process | How is the process going? | Process integrity | Number of cases in the process
(same objective) | What are the most common flows? | Process flow compliance | Percentage of occurrence of the most frequent flow
(same objective) | What are the most limited flows? | Process flow compliance | Percentage of occurrence of each alternative flow
(same objective) | To what degree is the optimal process flow met? | Process flow compliance | Percentage of occurrence of the optimal flow
(same objective) | What are the bottlenecks in the process? | Process flow compliance | Average waiting time between activities
Know the level of protection of sensitive data in the appointment process | How many event logs with sensitive data are protected? | Probability of event logs theft | Number of protected records

Finally, a BPMN diagram of the process was made for later comparison with the model discovered during the process mining application (see Figure 2).

Figure 2: Diagram of the appointment process

4.2.2. Extract data
Continuing with the objectives and processing phase, and with the support of the health institution’s staff, three event databases were obtained in Excel: Request, Granting, and Appointments. All of them had an identifier named “ACTO_MEDICO”, which later allowed the databases to be consolidated.

4.2.3. Process data
First, the three extracted databases were cleaned to eliminate null, incomplete, and inconsistent data that could negatively affect the reliability and accuracy of the results. Examples include reserved appointments that the patient did not attend, which leave a gap in the Attention field, and appointment confirmations recorded later than the attention itself because of human error by workers. Then, the three databases were integrated into a single database through the “ACTO_MEDICO” field mentioned above; however, this integrated database still did not meet the minimum characteristics required of an event log. Finally, Python 3.9 with the Pandas library was used to generate the event logs with the required fields (ID, activity, and timestamp). Each event log was composed of four activities: Request, Grant, Appointment, and Attention.

4.2.4. Data masking
As part of the data privacy phase, the event logs were masked in three steps.
First, all event logs containing DPS (Personal Health Data) were identified; as stated in Ministerial Resolution N° 688-2020 MINSA, these data are highly confidential. Then, the masking technique was evaluated for each field based on the IMETU (Identify, Map, Execute, Test, and Utilize) masking framework. Finally, the techniques were applied to the database stored in Excel (see Table 2).

Table 2
Protected Event Logs
DPS of event log | Example | Selected masking | Masked DPS
National Identification Card (DNI) | 12345678 | Remove last 4 characters | 1234####
Patient | Rafael Pedro Ramirez Vela | Use Excel Kutools Add-in | 9E7475F70KJYdCys5AoqckhcnuSvaXqQG8m0mTQi7HZwh/R87cQ=
Age | 44 | Increase value by 20 | 64
Sex | M | The value "*" will be taken | *
Physician’s National Identification Card | 87654321 | Remove last 4 characters | 8765####
Physician | Javier Mateo Lopez Zarate | Use Excel Kutools Add-in | 8BCF7BD70KJYdCys5AoPCfW+5Ua3dYGICQMd45wV9eF9JJ7SIETuSC4TWOiD+w==

4.2.5. Process Mining Application
As part of the event log and pattern inspection phase, the masked data were loaded into the Celonis platform to get an overview of the process using the metrics it provides, such as daily cases and events, average process time, and bottlenecks. Next, the process mining phase proceeded with the discovery of the process model through the Celonis Overview tool, which shows the discovered model with all its deviations. Then, the designed process model was loaded and compared with the discovered model in the verification step. In the first data load (Figure 3), this verification was 12% of event logs. Following the continuous improvement approach of the model, problems in the data were identified and corrective actions were taken, such as using the Excel DATEDIFF function to validate the correct sequence of dates.
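The data preparation described in Sections 4.2.3 to 4.2.5 can be sketched in pandas as follows. This is only an illustration and not the authors' code. Apart from "ACTO_MEDICO" and the four activity names, every column name is hypothetical; the layout with one timestamp column per activity is an assumption; and SHA-256 hashing stands in for the Excel Kutools add-in, whose exact algorithm the paper does not state.

```python
import hashlib

import pandas as pd

# Activity order taken from Section 4.2.3.
ACTIVITIES = ["Request", "Grant", "Appointment", "Attention"]


def build_event_log(merged: pd.DataFrame) -> pd.DataFrame:
    """Reshape one-row-per-case data into (ID, activity, timestamp) events."""
    log = merged.melt(
        id_vars="ACTO_MEDICO",
        value_vars=ACTIVITIES,
        var_name="activity",
        value_name="timestamp",
    )
    log = log.dropna(subset=["timestamp"]).rename(columns={"ACTO_MEDICO": "ID"})
    log["timestamp"] = pd.to_datetime(log["timestamp"])
    return log.sort_values(["ID", "timestamp"]).reset_index(drop=True)


def mask_attributes(df: pd.DataFrame) -> pd.DataFrame:
    """Apply masking rules in the spirit of Table 2 (hypothetical columns)."""
    out = df.copy()
    # Hide the last 4 characters of the identification number.
    out["DNI"] = out["DNI"].astype(str).str[:-4] + "####"
    # Assumption: a one-way SHA-256 digest replaces the add-in's output.
    out["PATIENT"] = out["PATIENT"].map(
        lambda name: hashlib.sha256(name.encode("utf-8")).hexdigest()
    )
    out["AGE"] = out["AGE"] + 20  # shift the age by 20
    out["SEX"] = "*"              # replace the value with "*"
    return out


def cases_out_of_order(merged: pd.DataFrame) -> list:
    """Return the case IDs whose activity timestamps are not in sequence.

    Cases with a missing timestamp are also returned, because comparisons
    against a missing value evaluate to False.
    """
    ts = merged[ACTIVITIES].apply(pd.to_datetime)
    ok = pd.Series(True, index=merged.index)
    for prev, nxt in zip(ACTIVITIES, ACTIVITIES[1:]):
        ok &= ts[nxt] >= ts[prev]
    return merged.loc[~ok, "ACTO_MEDICO"].tolist()
```

The date check plays the role of the spreadsheet-based sequence validation mentioned above: cases it returns would be corrected or removed before the next load.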
Figure 3: Safety verification of the first load

In the second load (see Figure 4), the verification was 80%, but the diagram obtained looked forced: Celonis did not correctly order the activities that occurred on the same day because the timestamp lacked the correct time. The date and time fields were therefore unified in Excel so that the timestamp included the time.

Figure 4: Safety verification of the second load

In the third load (Figure 5), satisfactory results were obtained, so we proceeded with the next phase. The discovered model is shown in Figure 6.

Figure 5: Safety verification of the third load

Figure 6: Model discovered in the third load

4.2.6. Data Visualization and Decision Making
Using the Celonis Studio and Celonis Business Views tools, dashboards were built to help non-expert users understand the analysis. They used traditional charts such as pie charts, to show the distribution of cases by medical service, and bar charts, to show the average process time in days by age group. Subsequently, the Celonis results were evaluated and improvement actions were determined, such as defining start and end times for the activities and creating a variable in the confirmation activity that shows that the appointment has been attended.

Regarding the results, all event logs that contained DPS were protected, complying with Ministerial Resolution N° 688-2020 MINSA, while the average time of the process was reduced by 89.4% and the percentage of compliance with the flow increased by 68%. This indicates that masking data to ensure its privacy does not prevent an effective process mining analysis, nor does it affect the reliability and accuracy of the analysis.

5. Conclusions and perspectives
In this paper, we proposed a reference model to ensure the privacy of confidential patient data in the health appointment process using Process Mining. The model was applied in an operational context to answer questions that help to understand the behavior of the process and to find improvements. A total of 55,249 event logs were reviewed for the case study. All confidential records were masked, ensuring their privacy; process compliance increased by 68% and the average execution time decreased by 89.4%. The model therefore not only ensures the privacy of confidential records in the event logs, but also has a positive impact on the process.

We recommend evaluating the addition of a data protection and governance phase that includes the definition of roles and authentication mechanisms to reinforce the protection of sensitive information recorded in the healthcare sector.

As future work, we recommend improving the quality of the data recorded in the databases by periodically cleaning null or empty data, incorrect dates, and inconsistent data. Such data can reduce the accuracy of the analysis and its results, with the risk that process improvement efforts are focused in the wrong direction. The need to define start and end times for the care activity was also noted. This would make it possible to justify the number of appointments to be carried out in a period or for a specific service, and to know the number of resources that can be allocated to minimize time.

6. References
[1] Hurst W., Boddy A., Merabti M., & Shone N. (2020). Patient Privacy Violation Detection in Healthcare Critical Infrastructures: An Investigation Using Density-Based Benchmarking. Future Internet, 12(6), 100. Retrieved from http://dx.doi.org/10.3390/fi12060100
[2] Banco Interamericano de Desarrollo (BID), Organización de los Estados Americanos (OEA). (2020).
Reporte Ciberseguridad 2020: Riesgos, avances y el camino a seguir en América Latina y el Caribe.
[3] Van Der Aalst, W. et al. (2011). Process mining manifesto. In International Conference on Business Process Management (pp. 169-194). Springer, Berlin, Heidelberg. Retrieved from https://doi.org/10.1007/978-3-642-28108-2_19
[4] Pika, A., Wynn, M. T., Budiono, S., ter Hofstede, A. H., van der Aalst, W. M., & Reijers, H. A. (2019). Towards privacy-preserving process mining in healthcare. In International Conference on Business Process Management (pp. 483-495). Springer, Cham. Retrieved from https://doi.org/10.1007/978-3-030-37453-2_39
[5] Aguirre, J. A., Torres, A. C., & Pescoran, M. E. (2019). Evaluation of operational process variables in healthcare using process mining and data visualization techniques. Health, 7, 19. Retrieved from http://dx.doi.org/10.18687/LACCEI2019.1.1.286
[6] Mannhardt, F., Koschmider, A., Baracaldo, N., Weidlich, M., & Michael, J. (2019). Privacy-preserving process mining. Business & Information Systems Engineering, 61(5), 595-614. Retrieved from https://doi.org/10.1007/s12599-019-00613-3
[7] Ibarra, J., Jahankhani, H., & Kendzierskyj, S. (2019). Cyber-physical attacks and the value of healthcare data: facing an era of cyber extortion and organised crime. In Blockchain and Clinical Trial (pp. 115-137). Springer, Cham. Retrieved from https://doi.org/10.1007/978-3-030-11289-9_5
[8] Ibarra, J., Jahankhani, H., & Kendzierskyj, S. (2019). Cyber-physical attacks and the value of healthcare data: facing an era of cyber extortion and organised crime. In Blockchain and Clinical Trial (pp. 115-137). Springer, Cham. Retrieved from https://doi.org/10.1007/978-3-030-11289-9_5
[9] Ministerio de Salud (MINSA). (2020). Resolución Ministerial 688 – 2020.
[10] Ali, O., & Ouda, A. (2016, October). A classification module in data masking framework for business intelligence platform in healthcare. In 2016 IEEE 7th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON) (pp. 1-8). IEEE. doi: 10.1109/IEMCON.2016.7746327
[11] Anderson, M. P., & Woessner, W. W. (1992). The role of the postaudit in model validation. Advances in Water Resources, 15(3), 167-173.