Model-informed LIME Extension for Business Process Explainability

Guy Amit1,†, Fabiana Fournier1,†, Shlomit Gur1,† and Lior Limonad1,†
1 IBM Research - Haifa, Israel

Abstract
Our focus in this work is on the adaptation of eXplainable AI techniques for the interpretation of business process execution results. Such adaptation is required because conventional employment of these techniques involves a surrogate machine learning model that is trained on historical process execution logs. Being a data-driven surrogate, its faithfulness to the real business process model affects the adequacy of the explanations derived from it; hence, native use of such techniques is not guaranteed to adhere to the target business process being explained. We present a business-process-model-driven approach that extends LIME, a conventional machine-learning-model-agnostic eXplainable AI tool, to cope with business process constraints, and that is replicable and reproducible. Our results show that our extended LIME approach produces correct and significantly more adequate explanations than the ones given by LIME as-is.

Keywords
eXplainable Artificial Intelligence, Business Process, Augmented Business Process Management System, Machine Learning, Situation-Aware eXplainability

PMAI@IJCAI22: International IJCAI Workshop on Process Management in the AI era, July 23, 2022, Vienna, Austria
† These authors contributed equally.
guy.amit@ibm.com (G. Amit); fabiana@il.ibm.com (F. Fournier); shlomit.gur@ibm.com (S. Gur); liorli@il.ibm.com (L. Limonad)
0000-0001-6569-1023 (F. Fournier); 0000-0001-5174-3689 (S. Gur); 0000-0002-4784-2147 (L. Limonad)

1. Introduction

The wide penetration of Artificial Intelligence (AI) into Business Process Management Systems (BPMSs) brings along many benefits and is considered by many as "the next disruptive technology that will touch almost all business process activities performed by humans" [1, 2]. However, human operators' distrust of AI comes with adoption hesitance, especially when the focus of the automation involves the replacement of human labor. Contemporary studies show that eXplainable Artificial Intelligence (XAI) is one of the major factors for user adoption of AI-based systems. For example, the recent "Global AI Adoption Index 2021" study [3], conducted by Morning Consult, reports that more than 90 percent of companies using AI say their ability to explain how it arrived at a decision is critical. The need to provide explanations is also reinforced by recent regulations. For example, the General Data Protection Regulation (GDPR) [4] in Europe enforces "the right [...] to obtain an explanation of the decision reached", and the European Commission recently published an ethics guideline for trustworthy AI (High-Level Expert Group on AI, 2019), where the need to explain a model's decisions is deemed an essential requirement for establishing trust between human users and AI systems. Similar legislation is also established in the US, such as the California Consumer Privacy Act (CCPA) of 2020 [5].
In their position paper [6], 14 researchers from nine different institutions laid the groundwork for a new generation of BPMSs that are augmented with AI, which they coined "Augmented Business Process Management Systems (ABPMSs)". The goal of these ABPMSs is to enhance the execution of traditional business processes with novel AI-based capabilities, making them more adaptable, proactive, explainable, and context-sensitive. One of the main characteristics of ABPMSs is their ability to explain and reason about process executions. This is vital for the people who engage with an ABPMS to be convinced that the system they operate can be relied upon, and to act accordingly. However, finding an adequate explanation is not easy, because it requires understanding the situational conditions in which specific decisions were made during process enactments. Frequently, explanations cannot be derived from "local" inference (e.g., the current task or decision in a business process) but require reasoning about situation-wide contextual conditions that are relevant to the current step and derive from actions taken in the past. The authors of this manifesto identify "Situation-Aware eXplainability (SAX)" as one of the most prominent research challenges that ABPMSs call for. SAX entails ongoing tracking of inferential associations between subsequent enactments as a basis for gaining confidence in the system's ability to provide trustworthy explanations.

Previous work regarding explainability in the context of business processes (BPs) calls for "process-aware explainability". Jan et al. [3] relate to explainability as being local to the context in which AI operates along the overall (global) reasoning process and motivate the need for process-aware explanations. This is reinforced in a more recent paper [7], which shows that native use of an XAI tool such as Local Interpretable Model-agnostic Explanations (LIME) [8] may produce misleading and even wrong explanations, as it does not take BPs' constraints into account. Our proposed approach extends their work by applying a BP-model-driven approach, providing a comprehensive methodology that extends LIME to cope with BPs' constraints. Moreover, it is replicable and reproducible in ABPMSs, as a first step towards fully SAX-capable BPMSs.

2. Background

The use of ML models has become more widespread in recent years [9, 10, 11]. Applications include domains that could greatly affect individuals as well as society as a whole, such as medical diagnostics, credit scoring, recidivism prediction, and autonomous driving. At the same time, advancements in the ML field have increased the complexity of ML models, often at the expense of explainability, leading to "black box" ML models. Consequently, the need to explain ML models and their predictions has also grown, fueled in part by legislation, but also by incentives from the user's or stakeholder's point of view (e.g., justifying the ML model and gaining domain insight) and from the developer's point of view (e.g., evaluating and improving the ML model). Although some ML models are inherently explainable (e.g., decision trees and linear models) and their internal logic and predictions are interpretable (i.e., can be easily understood by humans), more complex models (e.g., ensemble models and deep learning models) require external explanation frameworks, namely XAI, in order to be human-understandable [9, 10].
XAI frameworks are predominantly post-hoc [9, 10]; that is, they are applied to the ML model after its training has been completed. Context-wise, they can be divided into global, local, and hybrid explanations [10, 12, 13]. Global explanations attempt to explain the ML model's internal logic, local explanations try to explain the ML model's prediction for a single input instance, and hybrid approaches vary (e.g., explaining the ML model's internal logic for a subspace of the input space).

Our main focus in this work is on the adaptation of ML-model-agnostic post-hoc local XAI frameworks that are compatible with tabular data for the interpretation of BP execution results. Such adaptation is needed because conventional employment of XAI requires assessing a variety of model inputs against the resulting outputs via model replays. Such replays are not pragmatic using the real-world BP model, as the execution of each case takes a long time, may involve human decision-making that applies tacit knowledge, and may cost money. Hence, the use of XAI requires the development of a surrogate ML model (e.g., a decision tree) that represents the real one. These surrogate ML models are typically trained using historical process execution logs. Being a data-driven surrogate, its faithfulness to the real BP model may be lacking, which in turn affects the adequacy of the explanations derived from it. In this work, we show how additional knowledge that can be derived from the original BP model promotes the adequacy of such explanations, while employing a conventional ML-model-agnostic XAI tool.

The most commonly used ML-model-agnostic post-hoc local XAI frameworks compatible with tabular data are LIME [8] and SHapley Additive exPlanations (SHAP) [14]. LIME is based on the interpretability of linear models, whereas SHAP borrows from the domain of Game Theory and is based on the concept of Shapley additive values. Although different, the two share one important concept: both frameworks rely on sampling data points (by way of feature perturbations), thereby providing a local neighborhood around the examined sample. Similar to [7], we focus on LIME, because its code is easier to modify than the code of SHAP. In addition, data-driven inter-feature dependencies are not considered in LIME, which is a benefit of using it over SHAP in our case, since we prefer to have such dependencies inferred from the BP model rather than [partially] elicited from the dataset.

2.1. LIME: Local Interpretable Model-agnostic Explanation

LIME can be used to generate both local and global explanations; in this work we focus on the local explanation. LIME works under the assumption that within a local neighborhood of a sample instance, the ML model's predictions depend on a subset of features. This assumption leads to explanations that are locally faithful; that is, features that are important in the prediction for one instance may not be important in the prediction for another instance or in the global scope of the data set.

The LIME explanation process begins by generating a sample neighborhood, which is a set of k data points {x̄_i}, i = 1, ..., k, around an instance x. Numeric features are sampled using a continuous distribution (e.g., Gaussian) around a feature value, which can be either the value of the instance to be explained (local neighborhood) or the global expected value of the feature, based on the training set (global neighborhood). Categorical features are sampled according to each category's frequencies in the training data. It is important to note that inter-feature dependencies are not considered in this sampling approach, which is a critical limitation of LIME from a BP point of view. After the sampling process, the ML model to be explained, ℳ(·), is used to provide predictions: ȳ_i = ℳ(x̄_i). The resulting labeled data set {(x̄_i, ȳ_i)}, i = 1, ..., k, is then used to train a sparse linear model f(·) with a weighted loss function, which can be formulated as:

ℒ = Σ_{i=1}^{k} 𝒟(x, x̄_i) (f(x̄_i) − ȳ_i)² + Ω(f)

where 𝒟(·, ·) is a distance function and Ω(·) is a function ensuring the sparsity of the model's weights. The weights of the resulting f(·) reflect the behavior of ℳ(·) within the sample neighborhood and therefore serve as a local feature importance metric.
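To make these steps concrete, the following is a minimal sketch of a LIME-style local explanation: Gaussian perturbation of numeric features, proximity weighting, and a weighted regularized linear fit. It illustrates the technique under our own simplifications (all names are ours, the feature-selection step is omitted, and a Ridge regressor, LIME's default surrogate, stands in for the sparse linear model); it is not the LIME library code.

```python
import numpy as np
from sklearn.linear_model import Ridge


def lime_style_explanation(predict_fn, x, X_train, k=5000, kernel_width=0.75):
    """Minimal sketch of a LIME-style local explanation for one instance x.

    predict_fn: callable mapping an (n, d) array to an (n,) array of scores
                (the black-box model M to be explained).
    x:          1-D array of length d, the instance to explain.
    X_train:    training data, used here only for per-feature scale.
    Returns the coefficients of a weighted linear surrogate, which act as
    local feature importances.
    """
    rng = np.random.default_rng(0)
    sigma = X_train.std(axis=0) + 1e-12                 # per-feature scale

    # 1. Sample a neighborhood {x_bar_i} around x (Gaussian perturbations).
    X_bar = x + rng.normal(size=(k, x.shape[0])) * sigma

    # 2. Label the neighbors with the model to be explained: y_bar_i = M(x_bar_i).
    y_bar = predict_fn(X_bar)

    # 3. Weight each neighbor by its proximity to x (the distance term D).
    d = np.linalg.norm((X_bar - x) / sigma, axis=1)
    weights = np.exp(-(d ** 2) / (kernel_width ** 2))

    # 4. Fit a weighted, regularized linear surrogate f.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(X_bar, y_bar, sample_weight=weights)
    return surrogate.coef_
```

The coefficients returned by such a surrogate are what LIME reports as the local feature importances.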
3. Illustrative Example

Figure 1: BP model in BPMN of the loan application process.

Figure 1 depicts a loan application BP in which each loan application goes through a predefined set of tasks (e.g., verify amount and credit check) and decision gateways (e.g., amount >= 1000), resulting in either the acceptance or the rejection of the loan application. We follow the Business Process Model and Notation (BPMN)1 to model the loan application. In our work we assume that all traces are complete; that is, we have a complete sequence of the BP from start to finish, and we do not have partial traces or "in-flight" process executions.

Given the BP in figure 1, sets of rules are derived from it systematically, as listed below (a sketch of how such rules can be encoded as executable checks is given at the end of this section). We denote each task execution in the BP model by a corresponding Boolean variable, named after the task's capitalised name, that is by default initialised to false and is assigned true upon task execution. Internal BP variables are variables associated with conditional splits and are denoted by a lowercase naming convention.

1 https://www.bpmn.org/

1. Rules derived from the gateways:
   a) amount ≥ 1000 & Credit_check
   b) amount < 1000 & Risk_assessment
   c) risk < 0.6 & Novice_agent
   d) risk ≥ 0.6 & Skilled_agent
   e) credit_score < 700 & Skilled_agent
   f) credit_score ≥ 700 & Novice_agent
2. Activity indicators:
   a) Verify_amount & amount ≠ NaN
   b) Credit_check & credit_score ≠ NaN
   c) Risk_assessment & risk ≠ NaN
   d) credit_score ⊕ risk
   e) Skilled_agent & accept ≠ NaN
   f) Novice_agent & accept ≠ NaN
3. Entailment (backtracking) rules:
   a) [accept ≠ NaN] |= [Skilled_agent ‖ Novice_agent]
   b) Skilled_agent |= [Credit_check ‖ Risk_assessment]
   c) Novice_agent |= [Credit_check ‖ Risk_assessment]
   d) Credit_check |= Verify_amount
   e) Risk_assessment |= Verify_amount
   f) Verify_amount |= Review_loan_application
4. The note regarding the instruction to novice agents during holiday time: approve all loan applications with a credit score that is 10% lower than the usual threshold.
5. The note regarding the acceptance rates of the agents implies that:
   a) |{x ∈ X | (x_Skilled_agent = true) ∩ (x_accept = true)}| / |{x ∈ X | x_Skilled_agent = true}| = 0.95
   b) |{x ∈ X | (x_Skilled_agent = false) ∩ (x_accept = false)}| / |{x ∈ X | x_Skilled_agent = false}| = 0.01
   where X denotes the collective set of reviewed application traces.

It is worth noting that in this work we only consider the basic sequencing of activities, as expressed in the 'entailment' segment of the rules; we acknowledge that such requirements could be further extended to express more complicated timing constraints associated with the arrow semantics to be adhered to.
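As a concrete illustration, the rules above can be turned into executable checks over a single trace record. The following is a minimal sketch, assuming a trace is a dict with the keys of the unified representation used in section 5.1 (amount, credit_score, risk, Is_credit, Is_skilled, accept) and NaN for outputs of activities that were not executed; the function names are ours, not the repository's.

```python
import math


def _missing(v):
    """True for the NaN placeholder used for values of skipped activities."""
    return isinstance(v, float) and math.isnan(v)


def adheres_to_bp_rules(t):
    """Check a completed trace t (a dict) against the derived BP rules.

    In the unified representation, Is_credit stands for Done_credit_check and
    Is_skilled for Done_skilled_agent, so the entailment rules 3.b-3.c reduce
    to the exclusive-or check of rule 2.d.
    """
    checks = [
        # 1.a / 1.b: the amount gateway selects credit check vs. risk assessment.
        (t["amount"] >= 1000) == t["Is_credit"],
        # 2.b / 2.c: the executed activity is the one whose output is present.
        t["Is_credit"] == (not _missing(t["credit_score"])),
        t["Is_credit"] == _missing(t["risk"]),
        # 2.d: exactly one of credit_score and risk is present.
        _missing(t["credit_score"]) != _missing(t["risk"]),
        # 1.c-1.d: the risk gateway routes to the skilled or the novice agent.
        _missing(t["risk"]) or ((t["risk"] >= 0.6) == t["Is_skilled"]),
        # 1.e-1.f: the credit-score gateway routes likewise.
        _missing(t["credit_score"]) or ((t["credit_score"] < 700) == t["Is_skilled"]),
        # A complete trace must carry a Boolean decision (cf. rule 3.a).
        t["accept"] in (True, False),
    ]
    return all(checks)
```

A pruning step (section 4.2.2) then simply keeps the neighbors for which such a check returns True.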
4. Methodology

In this section we elaborate on the steps to be executed in order to systematically augment the LIME Tabular Explainer to respect the constraints imposed by the BP model.

While LIME [8] is an open-source framework, it does not directly expose the feature perturbation process to its users. To test the effect of manipulating the neighborhood created by LIME's perturbations, we used in-house code that modified a particular step in the LIME Tabular Explainer's process. We let the explainer generate a neighborhood, which we then exported in order to (1) edit and prune the neighborhood (see sections 4.2.1 and 4.2.2, respectively), (2) only prune it, or (3) leave it as-is. Next, we imported the resulting neighborhood back into the explainer and let LIME continue its process as-is, computing local feature importance based on this neighborhood (a schematic sketch of this wrapping is given at the end of section 4.2). Variation (3) (keeping the neighborhood as-is) was introduced to guarantee a fair comparison to LIME without BP-awareness. The results (explanations) for this variation were comparable to the results of a LIME Tabular Explainer without exporting and importing the neighborhood (results omitted). Our code is available at: https://github.com/IBM/SAX

4.1. Handling Missing Values

When encoding a BP data set, some values may be missing (e.g., if the amount in the loan application is less than $1000, then there is no credit score). However, many ML models, including scikit-learn's [15] decision tree classifier, which we used in the current work (with default parameters), cannot handle missing values. Additionally, in the unmodified LIME Tabular Explainer, the distribution parameters (e.g., mean and standard deviation for Gaussian distributions) per feature are computed based on the training data and cannot handle missing values either. Therefore, as a first step, all missing values were replaced with −1's in the data, thereby affecting the perceived distribution of certain features. To eliminate this undesired effect, we modified the extended version of LIME to compute the distribution parameters based only on valid values (which in our case were values greater than or equal to 0).

4.2. Translation of Business Process Rules to Code

Our code implementation employs a combination of two strategies: editing and pruning (see sections 4.2.1 and 4.2.2, respectively). The rules in our illustrative example were translated to code as follows:

1. Rules derived from the gateways were used in both editing and pruning as-is.
2. The activity indicators are inherent in the data representation (under the assumption of complete and error-free traces), with the exception of rules 2.b, 2.c, and 2.d, which were used in both editing and pruning.
3. The entailment rules are also inherent in the data representation, with the exception of rules 3.b and 3.c, which were used only in pruning, as they do not provide actionable rules regarding which of the two options should have taken place (a credit check or a risk assessment).
4. Rules that may cater to situational conditions with attributes that are not explicit in the BP model (e.g., Is_holiday) are not considered in the scope of the current methodology and were therefore omitted from the translation to code. We plan to accommodate such "contextual annotation" rules in future work.
5. The rules pertaining to the acceptance rates apply to the global population, but may not hold in local neighborhoods within the population, and therefore were not used in the editing or in the pruning.
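The wrapping of the explainer described above can be summarized by the following sketch. It is a schematic outline under our assumptions, not the code in the repository; the function and parameter names (bp_aware_neighborhood, edit_fn, prune_fn) are illustrative.

```python
def bp_aware_neighborhood(raw_neighborhood, reference, edit_fn=None, prune_fn=None):
    """Post-process a neighborhood exported from the LIME Tabular Explainer.

    raw_neighborhood: list of perturbed instances generated by LIME.
    reference:        the instance being explained.
    edit_fn:          callable implementing the editing strategy of
                      section 4.2.1 (takes a neighbor and the reference,
                      returns the edited neighbor), or None to skip editing.
    prune_fn:         callable returning True for BP-adhering neighbors
                      (section 4.2.2), or None to skip pruning.
    The returned neighborhood is imported back into the explainer, which
    then fits its local surrogate on it as usual.
    """
    neighborhood = list(raw_neighborhood)
    if edit_fn is not None:                      # variation (1): edit ...
        neighborhood = [edit_fn(n, reference) for n in neighborhood]
    if prune_fn is not None:                     # ... and/or prune (variations 1-2)
        neighborhood = [n for n in neighborhood if prune_fn(n)]
    return neighborhood                          # variation (3): returned as-is
```

Passing both callables corresponds to extended LIME (editing and pruning), passing only prune_fn to the pruning-only variation, and passing neither to LIME as-is.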
4.2.1. Editing the Neighbors According to the Business Process Rules

As previously mentioned, LIME's perturbations may result in instances that do not adhere to the BP rules. Therefore, we edited the neighborhood instances produced by LIME in accordance with the process flow in figure 1, so that they match the reference sample's BP model route. As a first processing step, we set to NaN, in all neighbors, every feature that has a NaN placeholder in the reference instance, thereby forcing the neighbors to belong to the same BP model route. Next, editing may apply either of two operators: (1) toggling the value of a Boolean feature, and (2) assigning the NaN placeholder to a numeric feature. In our illustrative example, after the NaN correction we enforced BP rules 1.a and 1.b, then 2.b-2.d, and then either 1.c-1.d or 1.e-1.f.

4.2.2. Pruning the Neighborhood According to the Business Process Rules

In order to create a BP-adhering neighborhood, instances that do not conform to the BP rules should be removed. Following the editing step, certain instances may still not conform to some of the BP rules (e.g., have neither credit_score nor risk). Therefore, pruning was performed after editing. In our illustrative example, we checked the instances against BP rules 1.a-1.f, 2.b-2.d, and 3.b-3.c. If any of these BP rules were violated, the instance was excluded from the local neighborhood that was later used by LIME for the explanation.

5. Evaluation

5.1. Synthetic Data

Overall, our raw data set's structure conformed to the tuple:

t = ⟨x_amount, x_credit_score, x_risk, x_Done_credit_check, x_Done_risk_assessment, x_Done_skilled_agent, x_Done_novice_agent, x_Done_accept⟩

In accordance with figure 1, we populated the raw data set X with 1000 instances of type t (|X| = 1000), with features drawn at random from Gaussian distributions: ∀x ∈ X, x_amount ∼ 𝒩(1000, 200), x_credit_score ∼ 𝒩(700, 200), and x_risk ∼ 𝒩(0.6, 0.15). ∀x ∈ X, if x_amount ≥ 1000, we set x_risk = NaN, x_Done_credit_check = true, and x_Done_risk_assessment = false; otherwise we set x_credit_score = NaN, x_Done_credit_check = false, and x_Done_risk_assessment = true. Next, ∀x ∈ X, if x_credit_score ≥ 700, we set x_Done_skilled_agent = false and x_Done_novice_agent = true; otherwise, we set x_Done_skilled_agent = true and x_Done_novice_agent = false. Similarly, ∀x ∈ X, if x_risk ≥ 0.6, we set x_Done_skilled_agent = true and x_Done_novice_agent = false; otherwise, we set x_Done_skilled_agent = false and x_Done_novice_agent = true. To remove redundancy in the representation, we unified mirroring features: ∀x ∈ X, x_Is_credit = x_Done_credit_check and x_Is_skilled = x_Done_skilled_agent, and then we removed the features Done_risk_assessment, Done_credit_check, Done_skilled_agent, and Done_novice_agent. Finally, to implement the rejection rates of the two agent types (95% for skilled agents and 1% for novice agents), we normalized the credit and risk scores to make them comparable; for details, refer to the code on GitHub (see section 4). A sketch of this generation procedure is given below.
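The following is a minimal sketch of how one synthetic trace can be drawn according to the procedure above; the function name is ours, and the final accept value, which our implementation derives from the normalized credit and risk scores, is intentionally left out.

```python
import numpy as np


def generate_trace(rng):
    """Sketch of drawing one synthetic loan-application trace (section 5.1).

    The numeric features follow the stated Gaussians, and the route
    indicators follow the gateways of figure 1. The 'accept' outcome is
    assigned afterwards according to the agent-specific acceptance/rejection
    rates; that score-normalization step is not reproduced here.
    """
    amount = rng.normal(1000, 200)
    credit_score = rng.normal(700, 200)
    risk = rng.normal(0.6, 0.15)

    is_credit = bool(amount >= 1000)           # gateway: amount >= 1000
    if is_credit:
        risk = float("nan")                    # no risk assessment on this route
        is_skilled = bool(credit_score < 700)  # gateway: credit score < 700
    else:
        credit_score = float("nan")            # no credit check on this route
        is_skilled = bool(risk >= 0.6)         # gateway: risk >= 0.6

    return {"amount": amount, "credit_score": credit_score, "risk": risk,
            "Is_credit": is_credit, "Is_skilled": is_skilled}


# Example:
# rng = np.random.default_rng(0)
# X = [generate_trace(rng) for _ in range(1000)]
```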
An output example of this process can be seen in Table 1.

Table 1: Example BP records. 'amount' indicates the amount of the loan request. 'credit_score' is the applicant's credit score in case a credit check was performed (see 'Is_credit') and NaN otherwise. 'risk' is the applicant's risk score in case a risk assessment was performed (see 'Is_credit') and NaN otherwise. 'Is_credit' indicates whether the BP included a credit check (true) or a risk assessment (false). 'Is_skilled' indicates whether the BP was processed by a skilled agent (true) or a novice agent (false). 'accept' indicates whether the loan request was approved (true) or rejected (false). The explanations for the first four records correspond to a, b, c, and d in Figure 2, in order.

amount      credit_score  risk     Is_credit  Is_skilled  accept
$1024.7716  422.4298      NaN      true       true        false
$927.0857   NaN           0.40616  false      false       true
$1245.4166  520.2818      NaN      true       true        false
$1290.8104  528.5517      NaN      true       true        false

5.2. Evaluation Criteria

After setting a random seed (for reproducibility), we split the data at random into training and test sets: 800 of the records were used for training and the rest (200 records) were used for testing. We examined three variations of the modified LIME Tabular Explainer, corresponding to the three variations presented in section 4: (1) with editing and pruning, referred to from here on as "extended LIME", (2) with pruning only, and (3) LIME as-is. For each of these three variations we evaluated the neighborhoods' BP-adherence and the explainability adequacy, as follows.

5.2.1. Evaluating the Neighborhood's BP-Adherence

We evaluated the quality of the feature perturbations by the adherence of the resulting neighbors to the BP rules extracted in section 4.2. For this comparison, we only considered LIME with editing only (A) and LIME as-is (B), since after the pruning step 100% of the instances adhere to the BP rules. To this end, we generated 5000 neighbors per test instance and computed the percentage of neighbors that adhered to the BP rules for each of the two variations. Using LIME as-is, only the test instance to be explained adhered to the BP rules; due to the way LIME operates, none of the generated neighbors adhered to BP rule 2.d. Conversely, the percentage of BP-adhering neighbors using LIME with editing only varied greatly. These percentages did not follow a normal distribution (p << 0.01 in scipy's [16] normality test), so we used a stochastic dominance test [17, 18] with α = 0.01 to check for a significant difference in the rate of BP-adhering neighbors between the two variations (a sketch of this comparison is given at the end of section 5.2). We repeated the process using four additional random seeds.

5.2.2. Evaluating the Explainability Adequacy

Two of the authors, with prior experience in BP modeling, were assigned to independently assess the explanations. For the empirical evaluation, we arbitrarily chose the test set of one of the five random seeds (seed 3). We extracted only the instances in which the top-importance feature differed between extended LIME and LIME as-is (see Table 2). This set comprised 49 application traces. For each application trace, the raters were asked to express what they anticipated a proper explanation would consist of. This included their anticipated outcome (i.e., acceptance or rejection of the loan request) and the feature they considered to have the greatest contribution to that outcome. After the raters evaluated the traces separately, an inter-rater reliability measure (Cronbach's alpha) was computed, reflecting the level of agreement between the two raters. Since none of the neighbors adhered to the BP rules without editing, the pruning-only variation was not relevant to this comparison, leaving only two variations: extended LIME and LIME as-is. After dropping all instances that were in dispute between the two raters (two in total), we used the remaining 47 in-consensus instances to compare the raters' anticipated explanations with the two relevant variations.
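The per-seed comparison of section 5.2.1 can be illustrated by the following minimal sketch, using SciPy's normality test and the one-sided Mann-Whitney U test [17]; the refined dominance analysis of [18] and the gap estimation are not reproduced here, and the function name is ours.

```python
from scipy import stats


def compare_adherence_rates(rates_edited, rates_as_is, alpha=0.01):
    """Sketch of comparing BP-adherence rates across the two variations.

    rates_edited, rates_as_is: per-test-instance fractions of BP-adhering
    neighbors (out of 5000 generated neighbors each) for LIME with editing
    only and for LIME as-is, respectively.
    """
    # Check whether the adherence rates are normally distributed.
    _, p_normal = stats.normaltest(rates_edited)

    # One-sided test: are the edited rates stochastically greater?
    _, p_greater = stats.mannwhitneyu(rates_edited, rates_as_is,
                                      alternative="greater")

    return {"normality_p": p_normal,
            "stochastically_greater": p_greater < alpha}
```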
6. Results

6.1. Business-Process-Adherence of Neighbors

For each of the five random seeds that were used, we ran the stochastic dominance test, comparing the percentage of BP-adhering neighbors with and without editing. The results show that the percentage of BP-adhering LIME-based neighbors with editing only (A) is indeed stochastically greater than with LIME as-is (B), with p < 0.01. Moreover, they show it is stochastically greater with a gap of at least 0.40 (40% of the LIME-generated neighborhood), with p < 0.01. This means that with BP-aware editing, with high probability, at least 40% of the LIME-based neighbors become BP-adhering.

6.2. Explainability Adequacy

As manual rating is labor-intensive, we focused on test instances that had a different top-importance feature in their LIME as-is explanation as compared to their extended LIME explanation. Out of the 200 test instances of one of the random seeds (seed 3), the two explanations provided different top-importance features for 49 instances (see Table 2). We then asked the two raters to assess these instances. The two independent raters agreed on 47 out of the 49 instances (96% agreement). The Cronbach's alpha reliability levels were α = .937 for the process results and α = .939 for the top-importance feature.

Table 2: Cross-method agreement regarding the top-importance feature. Rows give the top feature according to LIME as-is; columns give the top feature according to extended LIME.

LIME \ Extended LIME   credit_score  risk  amount  Is_skilled  Is_credit  Total
credit_score                     67     0       0           0          0     67
risk                              0    84       0           0          0     84
amount                            0     3       0           0          0      3
Is_skilled                       31    15       0           0          0     46
Is_credit                         0     0       0           0          0      0
Total                            98   102       0           0          0    200

Figure 2: Top-importance features by method for example processes. a, b, c, and d correspond to the complete processes described by the four records in Table 1, in order. In red and blue are the importance values assigned to each feature by LIME and extended LIME (with pruning and editing), respectively. The amplitude, regardless of sign, indicates the importance assigned to the feature. * Top-importance features according to rater 1. ^ Top-importance features according to rater 2.

Shown in figure 2 are the explanations for four of the traces that were included in the assessment. Instance (a) shows an explanation where the raters agreed on the top-importance feature being credit_score (denoted by * and ^ for raters 1 and 2, respectively). The same feature was also deemed most important by the extended LIME variation. The LIME as-is variation, however, concluded that Is_skilled was the top-importance feature in its explanation. Instance (b) shows a similar situation to (a), where both raters agreed on risk as the top-importance feature, which was also in agreement with the extended LIME variation, but not with LIME as-is. Instances (c) and (d) reflect the two test instances where the raters did not agree on the top-importance feature (i.e., the two instances that we dropped from the overall set of 49 instances).

We assessed the adequacy of the explanations based on all 47 instances that were in consensus among the raters. Comparing the top-importance feature according to the raters in these instances, it matched the LIME as-is explanation in 11 instances (23%), while it matched the extended LIME explanation in 36 instances (77%). For the other random seeds, there were 114, 148, 118, and 128 instances (out of 200 test instances each) for which the two explanation methods (LIME as-is and extended LIME) provided the same top-importance feature.

7. Discussion and Future Research

We presented a BP-model-driven methodology that augments LIME to produce correct and significantly more adequate explanations than the explanations given by LIME as-is. Contemporary data-driven approaches that infer inter-feature dependencies to provide explanations in the context of BPMSs suffer from the 'cold-start' limitation (i.e., the inability to give explanations prior to having sufficient historical process execution traces).
Our proposed BP-model-driven approach overcomes this limitation.

Our work opens a variety of potential extensions for future research. First, we tested our hypotheses on a relatively small and simple BP with complete traces. More realistic scenarios, in which we also consider "in-flight" instances and in which the BP rules are automatically generated and tested, will make our approach more generic, robust, and scalable. Second, we merely include sequencing constraints as BP-model derivation rules; the inclusion of timestamps and more sophisticated semantics will enrich the scope of our methodology. Third, we do not address global distribution adjustments to the local neighborhood distribution. For example, in our illustrative example (figure 1), we have annotations regarding the global acceptance/rejection rates for the two agent types; these were left out of the scope of our current work but could be part of the local explanations given to specific instances. Fourth, to be fully situation-aware, an ABPMS should also include reasoning about conditional circumstances and implicit rules that are not part of the (explicit) BP model. For example, we suggest such a temporal rule annotated for the activity "Novice agent" in figure 1, which relates to implicit temporal changes in the BP execution ("during holiday time").

Acknowledgement

The research was conducted with the support of the State of Israel's Ministry of Aliyah and Integration, Center for Integration in Science.

References

[1] D. Fahland, M. Weidlich, Scenario-based process modeling with Greta, in: Proceedings of the Business Process Management 2010 Demo Track, volume 615, CEUR-WS.org, Hoboken, NJ, USA, 2010, pp. 1-6.
[2] A. Deutsch, R. Hull, Y. Li, V. Vianu, Automatic verification of database-centric systems, ACM SIGLOG News 5 (2018) 37-56.
[3] S. T. Jan, V. Ishakian, V. Muthusamy, AI trust in business processes: The need for process-aware explanations, Proceedings of the AAAI Conference on Artificial Intelligence 34 (2020) 13403-13404.
[4] GDPR, General Data Protection Regulation, 2018. https://gdpr-info.eu/.
[5] CCPA, The California Consumer Privacy Act, 2018. https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=201720180AB375.
[6] M. Dumas, et al., Augmented business process management systems: A research manifesto, 2022. arXiv:2201.12855.
[7] S. Upadhyay, V. Isahagian, V. Muthusamy, Y. Rizk, Extending LIME for business process automation, 2021. arXiv:2108.04371.
[8] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?" Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135-1144.
[9] C. Meske, E. Bunde, J. Schneider, M. Gersch, Explainable artificial intelligence: objectives, stakeholders, and future research opportunities, Information Systems Management 39 (2022) 53-63.
[10] A. Adadi, M. Berrada, Peeking inside the black-box: a survey on explainable artificial intelligence (XAI), IEEE Access 6 (2018) 52138-52160.
[11] S. Verma, A. Lahiri, J. P. Dickerson, S.-I. Lee, Pitfalls of explainable ML: An industry perspective, 2021. arXiv:2106.07758.
[12] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Computing Surveys (CSUR) 51 (2018) 1-42.
[13] J.-R. Rehse, N. Mehdiyev, P. Fettke, Towards explainable process predictions for Industry 4.0 in the DFKI-Smart-Lego-Factory, KI - Künstliche Intelligenz 33 (2019) 181-187.
[14] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, Advances in Neural Information Processing Systems 30 (2017).
[15] F. Pedregosa, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825-2830.
[16] P. Virtanen, et al., SciPy 1.0: Fundamental algorithms for scientific computing in Python, Nature Methods 17 (2020) 261-272.
[17] H. B. Mann, D. R. Whitney, On a test of whether one of two random variables is stochastically larger than the other, The Annals of Mathematical Statistics 18 (1947) 50-60.
[18] R. Dror, S. Shlomov, R. Reichart, Deep dominance - how to properly compare deep neural models, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 2773-2785.