Justifications Derived from Inconsistent Case Bases Using Authoritativeness

Joeri G. T. Peters¹,², Floris J. Bex¹,³ and Henry Prakken¹,⁴

¹ Department of Information and Computing Sciences, Utrecht University, the Netherlands
² National Police Lab AI, Netherlands National Police, the Netherlands
³ Tilburg Institute for Law, Technology and Society, Tilburg University, the Netherlands
⁴ Faculty of Law, University of Groningen, the Netherlands

1st International Workshop on Argumentation for eXplainable AI (ArgXAI, co-located with COMMA '22), September 12, 2022, Cardiff, UK
j.g.t.peters@uu.nl (J. G. T. Peters); f.j.bex@uu.nl (F. J. Bex); h.prakken@uu.nl (H. Prakken)

Abstract
Post hoc analyses are used to provide interpretable explanations for machine learning predictions made by an opaque model. We modify a top-level model (AF-CBA) that uses case-based argumentation as such a post hoc analysis. AF-CBA justifies model predictions on the basis of an argument graph constructed using precedents from a case base. The effectiveness of this approach is limited when faced with an inconsistent case base, which is frequently encountered in practice. Reducing an inconsistent case base to a consistent subset is possible but undesirable. By altering the approach's definition of best precedent to include an additional criterion based on an expression of authoritativeness, we allow AF-CBA to handle inconsistent case bases. We experiment with four different expressions of authoritativeness using three different data sets in order to evaluate their effect on the explanations generated, in terms of the average number of precedents and the number of inconsistent a fortiori forcing relations.

Keywords
Justifications, Inconsistent case bases, Authoritativeness

1. Introduction

Both machine learning (ML) and rule-based classification approaches involve a trade-off between accuracy and transparency, specifically the ability of end-users to understand decisions (class predictions) [1]. Deep neural networks in particular tend to produce predictions with a high degree of accuracy at the cost of transparency, due to their technical complexity. However, perceived complexity may vary with a person's level of understanding, so even much simpler approaches might be thought of as relatively opaque by some people. Another reason for poor transparency can be proprietary protection of the approach, which can render even a relatively simple approach opaque. Regardless of the underlying reason, the term 'black box' is often used to refer to a particularly opaque approach [1, 2]. A black box model is more difficult to trust: it is harder to see its shortcomings, including biases and ethical concerns, which is why Explainable Artificial Intelligence (XAI) is aimed at increasing the transparency of black box models [3]. In the case of a binary classification problem in ML, this entails that we can explain why one class label was predicted by the model and not the other in a particular instance.

Methods for explaining ML decisions vary in a number of respects.
A distinction can be made between methods that generate local explanations (explaining individual instances) and those that generate global explanations (explaining a whole model). Some methods have access to the learnt model, while others are model-agnostic. We use the term 'justifications' for explanations generated without model access to signify that such explanations do not explain exactly how a decision was reached; instead, they explain the assumptions under which the model's decision can be justified. Justifications are thus not intended as separate predictions. This is related to the notion of post hoc analysis, which implies that an explanation is produced after the fact [1]. In this paper, we are concerned with local justifications produced by a model-agnostic, post hoc analysis for binary classification.

One way to justify a classifier's prediction is to show a case that is most similar to the one whose class is being predicted (the focus case). Pointing out a similar case constitutes an argument that requires no knowledge of ML to understand. To this end, Prakken & Ratsma [4] draw on AI & law research to propose a top-level model using case-based argumentation (CBA) to explain black-box predictions, based on Horty's model of a fortiori reasoning [5, 6] and inspired by CATO [7], hereafter referred to as 'A Fortiori Case-Based Argumentation' (AF-CBA). As ML classification is a supervised approach, there exists a training set used to train a classifier. AF-CBA requires that this set be accessible. AF-CBA produces a human-interpretable justification of that classifier's binary prediction for a focus case by treating this training set as a case base (CB) and comparing precedents and their outcomes from that CB to the focus case. The underlying a fortiori assumption of AF-CBA is that the focus case should have the same outcome as a precedent case if the differences between these cases only serve to add further support for that same outcome [4].

When asked to provide a justification, AF-CBA constructs an argument graph through a grounded argument game consisting of a fixed set of allowed moves. A proponent defends why the focus case should receive the same outcome as a best precedent (a most similar case) and the opponent argues against this. In doing so, they cite examples and counterexamples from the CB. There are distinguishing moves that set cases apart and moves to downplay these differences. When a precedent case has no relevant differences with the focus case, deciding for the focus case is said to be 'forced'. The effectiveness of AF-CBA hinges in part on the distance measure between cases and any feature selection technique used to promote an interpretable argument graph [4].

CBs may contain inconsistencies. Because training sets are used as CBs, they constitute annotated data, i.e. data instances labelled by a person or process with the intention of allowing the ML model to learn to perform the same classification task. Annotators (people who label data to this end) produce a labelled data set specifically for the purpose of training a model, but may not necessarily be fully consistent when doing so [8]. Multiple annotators might disagree or an annotator can make an occasional mistake, thus leading to an inconsistent case base. Labels may also be produced by decision makers: people who produce labels as part of their role in some decision process, such as judges who decide on court cases, with their verdict being the label that is stored in a body of case law.
This can also lead to contradictory classifications, as case law can contain conflicting opinions and interpretations. Finally, the feature vector itself may be a subset of all relevant details, thereby potentially lacking the data necessary to discriminate between seemingly similar cases [9]. These sources of noise make the labelling seem inconsistent, since identical feature vectors might receive conflicting labels. Under the a fortiori assumption, this notion of inconsistency becomes even broader: a case which is at least as good as another yet receives the opposite outcome is a source of inconsistency. For these reasons, CB consistency is generally not a safe assumption in practice.

AF-CBA does not strictly require that the CB be consistent, but inconsistencies are often due to exceptional cases (with a surprising outcome), and these can be problematic for the explanation because the focus case is then forced for both outcomes. In experiments by Prakken & Ratsma [4], significant portions of a CB had to be ignored (by removing a minimal number of cases when instantiating the CB) in order to make them consistent: 0.32%, 11.35% and 3.20% for three different inconsistent data sets. We would preferably use the whole training set as a CB, without having to take this consistent subset to circumvent the problem. The problem is exacerbated by feature selection techniques, which would otherwise benefit the simplicity of AF-CBA's explanations. In conclusion, CB consistency forms a problematic constraint for AF-CBA.

In this paper, we present a modification of AF-CBA that takes into account the degree (which we call 'authoritativeness') to which the CB is consistent with regard to a best precedent. This measure is used to prevent inconsistent forcing by modifying the selection of best precedents to cite, as it makes intuitive sense to cite the cases with the highest authoritativeness. We investigate the desirability of this modification through exploratory experiments with several alternative ways of quantifying authoritativeness, demonstrating it to have a beneficial effect on AF-CBA without adversely affecting its explanations.

The rest of this paper is structured as follows. We describe AF-CBA and its background in Section 2. We consider how to address the problem of inconsistency in Section 3. We subsequently experiment with our proposed solution in Section 4. We discuss the results and future work in Section 5.

2. Case-Based Argumentation

In this section, we present the CBA framework by Prakken & Ratsma (with some differences in notation) for explanations with dimensions [4]. As our running example, we make use of the Telco Customer Churn data set [10], which describes the customers of a telecommunications provider and whether or not they churned (switched providers). Table 1 describes the dimensions ('features' in ML) used. The optional superscript arrow reflects the tendency of a dimension, i.e. whether a higher value promotes a result of 1 for the class label. Here, only the dimension high cost (d4↑) makes it likelier for a customer to churn; the other three dimensions make it less likely for a customer to do so.

Table 1: The dimensions used in the Churn example.
Dimension   Name        Description
d1↓         Gift        Whether the customer has received a gift from the provider
d2↓         Present     Whether the customer was present during the last organised event
d3↓         Website     The number of times the customer logged into their profile
d4↑         High cost   Whether the customer is in a high-cost category

Table 2: A fictional example based on the Churn data set with a CB consisting of only two cases and a new (focus) case.

Customer          d1↓   d2↓   d3↓   d4↑   Label (churn)
Alice              1     0     5     0     0
Bob                1     1     3     1     1
Charlie (focus)    0     1     3     0     ?

We take Table 2 as our example CB. Let us presume that Alice and Bob are previous customers and Charlie is a new customer whose predicted outcome we want to justify (the focus case). Formally, we denote this as follows. Let o and o′ be the two possible outcomes of a case in the CB. The variables s and s̄ denote the two sides, meaning that s = o if s̄ = o′ and vice versa. A dimension is defined as a tuple d = (V, ≤_o, ≤_o′), with value set V and two partial orderings ≤_o and ≤_o′ on V, such that v ≤_o v′ iff v′ ≤_o′ v for v, v′ ∈ V. A value assignment is a pair (d, v). We denote the value x of dimension d for case c ∈ CB as v(d, c) = x. Value assignments to all dimensions d ∈ D (where D is nonempty) constitute a fact situation F. A case is defined as c = (F, outcome(c)) for such a fact situation and an outcome(c) ∈ {o, o′}. In this context, a case base CB specifically refers to the set of cases with value assignments for D. We denote the fact situation of a case c as F(c). In the rest of this paper, we assume that any two fact situations assign values to the same set D.

Say we have an ML model that predicts Charlie will stay. An explanation of this outcome could be that it is forced by another case. We model Horty's [5] a fortiori reasoning using Definitions 1 and 2, meaning that the outcome of a focus case is forced if there is a precedent with the same outcome where all their differences make the focus case even stronger for that outcome.

Definition 1 (Preference relation for fact situations). Given two fact situations F and F′, F ≤_s F′ iff v ≤_s v′ for all (d, v) ∈ F and (d, v′) ∈ F′.

Definition 2 (Precedential constraint). Given case base CB and fact situation F, deciding F for s is forced iff CB contains a case c = (F′, s) such that F′ ≤_s F.

A fact situation could be forced for both s and s̄, which brings us to the following definition of CB consistency:

Definition 3 (Case base consistency). A case base CB is consistent iff it does not contain two cases c = (F, s) and c′ = (F′, s̄) such that F ≤_s F′. Otherwise it is inconsistent.
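To make Definitions 1–3 concrete, the following sketch gives one possible reading in Python for numeric dimensions whose tendencies are known (as in Table 1). It is only an illustration of the definitions, not the authors' implementation, and all names in it are our own.

```python
# A case is a pair (fact_situation, outcome); a fact situation maps dimension names to values.
# TENDENCY records which outcome a higher value favours (the arrows in Table 1).
TENDENCY = {"gift": 0, "present": 0, "website": 0, "high_cost": 1}

def at_least_as_good(v, w, dim, s):
    """v <=_s w: value w is at least as good as value v for side s on dimension dim."""
    return v <= w if TENDENCY[dim] == s else v >= w

def fs_at_least_as_good(F, G, s):
    """Definition 1 (F <=_s G): G is at least as good as F for s on every dimension."""
    return all(at_least_as_good(F[d], G[d], d, s) for d in F)

def forced(cb, F, s):
    """Definition 2: deciding F for s is forced iff some precedent (G, s) satisfies G <=_s F."""
    return any(out == s and fs_at_least_as_good(G, F, s) for G, out in cb)

def consistent(cb):
    """Definition 3: no case's facts are at least as good, for its own outcome,
    in a case that was decided the other way."""
    return not any(out1 != out2 and fs_at_least_as_good(F1, F2, out1)
                   for F1, out1 in cb for F2, out2 in cb)

# Table 2: Alice stayed (0), Bob churned (1); Charlie is the focus case.
alice = ({"gift": 1, "present": 0, "website": 5, "high_cost": 0}, 0)
bob = ({"gift": 1, "present": 1, "website": 3, "high_cost": 1}, 1)
charlie = {"gift": 0, "present": 1, "website": 3, "high_cost": 0}
cb = [alice, bob]

print(forced(cb, charlie, 0), forced(cb, charlie, 1))  # False False: neither outcome is forced
print(consistent(cb))                                  # True
```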
An explanation takes the form of an argument game for grounded semantics [11] played between a proponent and an opponent of an outcome, in which they take turns to attack the other's last argument. Since neither of the cases in Table 2 is identical to the focus case, it is not forced, and the proponent and opponent argue about the outcome. An argument is justified if the proponent has a winning strategy, meaning the opponent runs out of moves. The proponent starts by citing a best precedent. This is a case which has the outcome for which the proponent is arguing and has a minimal subset of relevant differences with the focus case. The relevant differences between two cases are determined according to Definition 4 (as presented in [4]).

Definition 4 (Differences between cases). Let c = (F(c), outcome(c)) and f = (F(f), outcome(f)) be two cases. The set D(c, f) of differences between c and f is:
1. D(c, f) = {(d, v) ∈ F(c) | v(d, c) ≰_s v(d, f)} if outcome(c) = outcome(f) = s;
2. D(c, f) = {(d, v) ∈ F(c) | v(d, c) ≱_s̄ v(d, f)} if outcome(c) ≠ outcome(f) and outcome(c) = s.

So one relevant difference between Charlie (assuming he is predicted to stay) and Alice (who stayed) in Table 2 would be for dimension d1, where Alice received a gift and Charlie did not, making her case better for staying.

Definition 5 (Best precedent). Let c = (F(c), outcome(c)) and f = (F(f), outcome(f)) be two cases, where c ∈ CB and f ∉ CB. c is a best precedent for f iff:
• outcome(c) = outcome(f), and
• there is no c′ ∈ CB such that outcome(c′) = outcome(c) and D(c′, f) ⊂ D(c, f).

Definition 5 defines a best precedent to cite. Multiple cases can meet these criteria. A lower number of best precedents is preferable, both for computational reasons and because a larger number of possible citations would make any single explanation somewhat arbitrary. This is why Prakken & Ratsma evaluated AF-CBA in part on the average number of best precedents found for three different data sets [4].

The opponent can reply to the initial citation by playing a distinguishing move or by citing a counterexample. The proponent can reply in turn with similar distinguishing moves. The distinguishing moves are Worse(c, x), stating that the focus case is worse than the precedent c for outcome(c) on some dimensions x; Compensates(c, x, y), stating that the dimensions x on which the focus case is not at least as good as the precedent c for outcome(c) are compensated by dimensions y on which the focus case is better for outcome(c) than c; and Transformed(c, c′), stating that the initial citation of a most similar case for outcome(f) can be transformed by the distinguishing moves into a case c′ for which D(c′, f) = ∅ and which can therefore attack the counterexample. For the sake of brevity, we refer to [4] for the formal motivations of these moves and for the need to allow the Compensates move to be empty in order to state that the differences with the focus case do not matter.

Returning to our example, Figure 1 presents the resulting explanation as an argument game, which can be read as follows. P1: Alice stayed and her case is similar to Charlie's. O1: Charlie's scores for d1 and d3 make him worse for staying than Alice. P2: Charlie's score for d2 compensates for O1. O2: Bob churned and his case is similar to Charlie's. P3: Charlie's score for d4 makes him worse for churning than Bob. O3: Charlie's score for d1 compensates for P3. P4: Charlie's score for d2 compensates for O3. After this, the opponent has run out of possible moves to make and the proponent wins. The similarity to Alice's case has held up and acts as an explanation for the prediction that Charlie will stay as well.

Figure 1: A fictional example of an explanation (dialogue between proponent and opponent).
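Continuing the sketch above (again illustrative only, with names of our own choosing), Definitions 4 and 5 could be read as follows. On the Table 2 example, the computed differences between Alice's case and Charlie's are exactly the dimensions d1 and d3 cited in move O1.

```python
def differences(c, f_facts):
    """Definition 4: value assignments of precedent c that the focus facts do not match
    or exceed for outcome(c). Both clauses of the definition reduce to this check,
    because <=_s and <=_s-bar are each other's inverses."""
    G, s = c
    return frozenset((d, G[d]) for d in G if not at_least_as_good(G[d], f_facts[d], d, s))

def best_precedents(cb, f_facts, s):
    """Definition 5: precedents with outcome s whose difference set with the focus case
    is not a strict superset of that of another precedent with the same outcome."""
    same = [c for c in cb if c[1] == s]
    diffs = [differences(c, f_facts) for c in same]
    return [c for c, d in zip(same, diffs) if not any(d2 < d for d2 in diffs)]

# Charlie is predicted to stay (outcome 0); Alice is the only candidate precedent.
print(differences(alice, charlie))      # the 'gift' and 'website' assignments, i.e. d1 and d3 (move O1)
print(best_precedents(cb, charlie, 0))  # [alice]
```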
Formalising this brings us to the following definition (after [4]) of the AF-CBA framework:

Definition 6 (Case-based argumentation framework). Given a finite case base CB, a focus case f ∉ CB, and definitions of compensation dc, an abstract argumentation framework AAF is a pair ⟨A, attack⟩, where:
• A = CB ∪ M, with
M = {Worse(c, x) | c ∈ CB, x ≠ ∅ and x = {(d, v) ∈ F(f) | v(d, f) <_outcome(f) v(d, c)}}
∪ {Compensates(c, y, x) | c ∈ CB, y ⊆ {(d, v) ∈ F(f) | v(d, c) <_outcome(f) v(d, f)}, x = {(d, v) ∈ F(f) | v(d, f) <_outcome(f) v(d, c)} and y compensates x according to dc}
∪ {Transformed(c, c′) | c ∈ CB and c can be transformed into c′ and D(c′, f) = ∅};
• A attacks B iff:
– A, B ∈ CB and outcome(A) ≠ outcome(B) and D(B, f) ⊄ D(A, f);
– B ∈ CB with outcome(B) = outcome(f) and A is of the form Worse(B, x);
– B is of the form Worse(c, x) and A is of the form Compensates(c, y, x);
– B ∈ CB and outcome(B) ≠ outcome(f) and A is of the form Transformed(c, c′).

In summary, AF-CBA provides justifications for individual binary classifications predicted by an ML model by presenting a winning strategy for a grounded argument game in favour of the predicted class label. This winning strategy represents a dialogue between a proponent and an opponent on the basis of citations from the labelled training set (the case base) and shows how the opponent runs out of moves and the proponent thus wins the argument.

3. CB Inconsistency

As we argued in Section 1, CB consistency is not always a safe assumption to make. Explanations containing inconsistent forcings essentially explain that a decision cannot be justified without acknowledging the inconsistency of the CB, which weakens the value of those explanations. The larger the number of inconsistent forcings (N_inc), the larger the number of explanations in which this problem occurs.

Instead of mitigating the problem through case deletion [4], we explicitly take inconsistencies into account. Informally, one might say that when there is consistency, a precedential case has a strong backing when cited and should indeed immediately force the outcome; if there is inconsistency, it has less backing and thus should not. We therefore introduce the concept of 'authoritativeness', by which we mean that, given any case c ∈ CB, the authoritativeness α(c) numerically expresses (normalised between 0 and 1) the degree to which the rest of the CB supports the citing of c for outcome(c). We subsequently use α(c) as an additional criterion in the selection of best precedents. The intuition behind authoritativeness is that whereas the a fortiori rule applied to a consistent CB can be expressed as the phrase 'cases like this always receive outcome o', our idea of authoritativeness changes this phrase to 'cases like this usually receive outcome o', where 'usually' has to be quantified in some manner that expresses the inconsistency of the CB with regard to the focus case. Since α(c) is a number, we can have a total ordering ≤ on the authoritativeness of cases. Table 3 is another instance of our Churn example. Depending on how one chooses to define α(c), c1 and c2 should arguably receive a higher value for α(c) than c3, due to c4 having the opposite outcome.

Table 3: Example of a CB with two identical cases that are consistent with each other and two identical cases which contradict each other.

Customer   d1↓   d2↓   d3↓   d4↑   outcome
c1          1     1     0     0     s
c2          1     1     0     0     s
c3          1     1     5     0     s
c4          1     1     5     0     s̄

First of all, the definition of best precedent has to be modified to reflect the additional criterion of maximising authoritativeness:
Definition 7 (Best authoritative precedent). Let CB be a case base and let c = (F(c), outcome(c)) and f = (F(f), outcome(f)) be two cases, where c ∈ CB and f ∉ CB. c is a best precedent for f iff:
• outcome(c) = outcome(f), and
• there is no c′ ∈ CB such that outcome(c′) = outcome(c) while D(c′, f) ⊂ D(c, f) and α(c′) ≥ α(c).

In order to quantify authoritativeness, we require expressions of agreement and disagreement between a precedent and the rest of the CB:

Definition 8 (Agreement). Let CB be a case base. Given c ∈ CB, the agreement n_a(c) is defined as:

n_a(c) = |{c′ ∈ CB | outcome(c′) = outcome(c) and D(c, c′) = ∅}|

Definition 9 (Disagreement). Let CB be a case base. Given c ∈ CB, the disagreement n_d(c) is defined as:

n_d(c) = |{c′ ∈ CB | outcome(c′) ≠ outcome(c) and D(c, c′) = ∅}|

We understand n_a(c) as the number of cases which have the same outcome as the precedent case and are at least as good for that outcome as c (thereby lending support to c). Similarly, n_d(c) is the number of cases which have the opposite outcome yet are at least as good for outcome(c). The agreement n_a(c) has at least a value of 1, due to c itself being a member of the CB. The disagreement n_d(c) can have a value of 0.

Exactly how the level of agreement relates to authoritativeness is not self-evident, as various expressions may have equal merit. For example, given a case c ∈ CB, we could express the authoritativeness α(c) as the relative number of cases which lend further support to c (1). In Table 3, c3 is supported by (other than itself) c1 and c2, but opposed by c4. So in that situation, α(c3) = 3/(3 + 1) = 0.75.

α(c) = n_a(c) / (n_a(c) + n_d(c))    (1)

However, this overlooks any intuitive understanding of authoritativeness which stems from the absolute number of cases that can act as precedents (2). Intuitively, obscure cases are less authoritative than common ones. In Table 4, c1 is supported by two other cases (again, other than itself), namely c2 and c3, while c5 is supported by c1 through c4. We divide by |CB| to normalise the expression between 0 and 1. So, for example, α(c1) = 3/(3 + 0) = 1 according to (1), but α(c1) = 3/7 ≈ 0.429 according to (2).

α(c) = n_a(c) / |CB|    (2)

Table 4: Example of an inconsistent CB showcasing different levels of support.

Customer   d1↓   d2↓   d3↓   d4↑   outcome
c1          1     1     0     0     s
c2          1     1     0     0     s
c3          1     1     0     0     s
c4          1     1     2     0     s
c5          1     1     2     0     s
c6          1     1     2     0     s̄
c7          1     1    15     0     s

Both (1) and (2) would appear to have some intuitive merit. Using a combination of the two seems even more intuitive. One option (3) is to take the product of (1) and (2), essentially using (1) as a weight factor for (2).

α(c) = [n_a(c) / (n_a(c) + n_d(c))] · [n_a(c) / |CB|]    (3)

Alternatively, (1) and (2) can be combined as a harmonic mean (4). This introduces a parameter β, the relative importance of one expression over the other. The added advantage of this is that (1) could be considered twice as important as (2), for instance. At a value of β = 1, the two are equally important.

α(c) = (1 + β²) · [ (n_a(c) / (n_a(c) + n_d(c))) · (n_a(c) / |CB|) ] / [ β² · (n_a(c) / (n_a(c) + n_d(c))) + (n_a(c) / |CB|) ]    (4)

It is difficult to say how desirable each expression is. In the next section, we attempt to answer this question through experimentation.
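Building on the earlier sketches, the fragment below shows one way Definitions 7–9 and expressions (1)–(4) could be computed. It is illustrative only: the function names are ours, and placing β² on the term for (1) in the denominator of (4) follows the usual F_β convention, which is an assumption on our part.

```python
def agreement(cb, c):
    """Definition 8: cases with the same outcome as c that are at least as good for it (includes c)."""
    return sum(1 for c2 in cb if c2[1] == c[1] and not differences(c, c2[0]))

def disagreement(cb, c):
    """Definition 9: cases at least as good as c for outcome(c) but decided the other way."""
    return sum(1 for c2 in cb if c2[1] != c[1] and not differences(c, c2[0]))

def authoritativeness(cb, c, variant="harmonic", beta=1.0):
    na, nd = agreement(cb, c), disagreement(cb, c)
    rel = na / (na + nd)      # expression (1); na >= 1 whenever c is in cb
    absolute = na / len(cb)   # expression (2)
    if variant == "relative":
        return rel
    if variant == "absolute":
        return absolute
    if variant == "product":  # expression (3)
        return rel * absolute
    # expression (4): weighted harmonic mean of (1) and (2), F_beta-style
    return (1 + beta ** 2) * rel * absolute / (beta ** 2 * rel + absolute)

def best_authoritative_precedents(cb, f_facts, s, variant="harmonic", beta=1.0):
    """Definition 7: a strictly smaller difference set only disqualifies a precedent
    if the competing precedent is also at least as authoritative."""
    same = [c for c in cb if c[1] == s]
    diffs = [differences(c, f_facts) for c in same]
    alphas = [authoritativeness(cb, c, variant, beta) for c in same]
    return [c for i, c in enumerate(same)
            if not any(diffs[j] < diffs[i] and alphas[j] >= alphas[i]
                       for j in range(len(same)))]

# Table 4, reading the abstract outcome s as 'churn' (1) purely for illustration:
base = {"gift": 1, "present": 1, "high_cost": 0}
cb4 = [({**base, "website": w}, out)
       for w, out in [(0, 1), (0, 1), (0, 1), (2, 1), (2, 1), (2, 0), (15, 1)]]
c1 = cb4[0]
print(authoritativeness(cb4, c1, "relative"))  # 1.0, as in the running example
print(authoritativeness(cb4, c1, "absolute"))  # 0.428..., i.e. 3/7
```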
However, two observations can be made here regarding the four expressions. One is that α(c) = 1 implies that the CB is consistent with regard to c, but only in the case of (1) is this value obtained without the whole CB being in agreement with c.

Proposition 1. Let CB be a case base and let α(c) = 1 for a case c ∈ CB. Then CB must be consistent with regard to c.

Proof. Recall that CB is consistent with regard to c if there exists no other case c′ ∈ CB with outcome(c) ≠ outcome(c′) and F(c) ≤_outcome(c) F(c′). Recall also that this can be expressed as n_d(c) = 0, and that n_a(c) ≥ 1 for any c ∈ CB since c is in agreement with itself. Suppose that α(c) = 1 according to (1). Then n_a(c) = n_a(c) + n_d(c) and it must follow that n_d(c) = 0. Suppose now that α(c) = 1 according to (2). Then n_a(c) = |CB|. Since no case can count towards both n_a(c) and n_d(c), n_d(c) = 0. Suppose now that α(c) = 1 according to (3). Since (3) is the product of (1) and (2), both expressions must have a value of 1 and therefore n_a(c) = |CB| and n_d(c) = 0. Suppose now that α(c) = 1 according to (4). Since (4) is the (weighted) harmonic mean of (1) and (2), both expressions must have a value of 1 and therefore n_a(c) = |CB| and n_d(c) = 0.

The other observation is that α(c) = 0 is not obtainable for any of the expressions for authoritativeness.

Proposition 2. Let CB be a case base. Then α(c) > 0 for any c ∈ CB.

Proof. Recall that n_a(c) is the cardinality of the set of cases c′ ∈ CB for which the condition holds that outcome(c) = outcome(c′) and D(c, c′) = ∅. Suppose that α(c) = 0. Then n_a(c) = 0 when evaluating α(c) according to (1), (2), (3) or (4). Since the condition for the cases counting towards n_a(c) holds when c′ = c, n_a(c) can only be 0 when c ∉ CB.

These observations do not affect our evaluation in the next section and do not form a limitation of our current approach, but they do raise the following question regarding the intuitive understanding of authoritativeness: should the minimum value of 0 and the maximum value of 1 for α(c) be of significance? If so, this would steer our choice of an expression for authoritativeness. We return to this point in Section 5.

4. Evaluation

Using authoritativeness as a criterion when selecting a best precedent is intended to improve the ability of AF-CBA to generate useful justifications in light of CB inconsistency. Since AF-CBA generates justifications for the same outcome as the ML model predicts, we cannot use fidelity (the agreement between an XAI approach and the ML model it explains) to assess the efficacy of our modification. Evaluation of our approach therefore requires more investigative experimentation and interpretation of the results. To this end, we follow a similar strategy to Prakken & Ratsma [4]. We also rely on the same data sets as they do, namely Graduate Admission [12], Telco Customer Churn [10] and Mushroom [13].
As an expression of how inconsistent each data set is, we determine the minimum number of case deletions required to make each CB consistent. The result is 26 (3.20%), 647 (9.20%) and 16 (0.32%) for the Admission, Churn and Mushroom data sets, respectively. The tendencies of all dimensions are determined using the Pearson correlation coefficient.

Prakken & Ratsma [4] attempt to gain insights into the feasibility of AF-CBA in terms of the justifications themselves and in the treatment of inconsistencies. As they explain, desirable characteristics for AF-CBA include fewer best precedents (reducing the solution space for citing a precedent; see Section 2). This is one of the metrics on which we compare our four alternative formulations of authoritativeness to the base method. We treat each case in the CB as a focus case and compute the number of best precedents for that case given the rest of the CB, reporting the mean number (μ). Changes in μ would depend on the average distribution of inconsistent cases per focus case. There is no well-motivated cut-off point for when these numbers become too high, but it is worthwhile to consider whether μ increases by orders of magnitude and whether there are surprising differences between the alternative formulations of authoritativeness.

We also report the number of inconsistent forcing relations (N_inc) for each experiment. As described in Section 3, this is the number of forcing relations between two cases that contradict each other on the outcome. N_inc = 0 for a consistent data set, and our intention is to achieve this without having to take a consistent subset of the data. We therefore expect N_inc to drop to very low numbers (if not zero) for all experiments where we make use of authoritativeness.

We present the results of these experiments in Table 5 (the code for these experiments is available at https://github.com/JGTP/CBA-precedent.git).

Table 5: The results of our evaluation experiments for three different data sets and four different expressions of authoritativeness, in addition to the base method where authoritativeness is not taken into account.

             Base             Relative (1)     Absolute (2)     Product (3)      Harmonic (β = 1) (4)
Admission    μ = 105.67       μ = 112.1        μ = 105.95       μ = 106.0        μ = 105.97
             N_inc = 496      N_inc = 0        N_inc = 0        N_inc = 0        N_inc = 0
Churn        μ = 82.15        μ = 148.81       μ = 94.68        μ = 94.76        μ = 94.75
             N_inc = 38012    N_inc = 2        N_inc = 42       N_inc = 0        N_inc = 0
Mushroom     μ = 70.25        μ = 72.37        μ = 84.66        μ = 86.75        μ = 84.83
             N_inc = 620      N_inc = 0        N_inc = 0        N_inc = 0        N_inc = 0

A qualitative assessment of these results suggests that inconsistent forcing is indeed largely avoided by taking the authoritativeness of precedents into account, without a costly impact on the best precedent distributions. The relative version of authoritativeness (1) raises μ the most for two of the three data sets, and especially for the most inconsistent set (Churn), which suggests that this particular version of authoritativeness can complicate explanations slightly for inconsistent CBs. Neither relative authoritativeness (1) nor, especially, absolute authoritativeness (2) reduces N_inc completely to zero for the Churn data set. The product (3) and harmonic (4) versions of authoritativeness therefore appear to be the more desirable expressions. The results do not suggest any meaningful differences between (3) and (4).
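For concreteness, the two reported quantities could be computed along the following lines, reusing the helpers from the earlier sketches. This is only our reading of the procedure; the numbers in Table 5 come from the authors' own implementation at the repository linked above.

```python
def mean_num_best_precedents(cb, selector=best_authoritative_precedents, **kwargs):
    """mu: treat each case as the focus case and average the number of best precedents
    it receives from the rest of the CB (leave-one-out). For the base method, pass
    best_precedents from the earlier sketch as the selector."""
    counts = [len(selector(cb[:i] + cb[i + 1:], facts, outcome, **kwargs))
              for i, (facts, outcome) in enumerate(cb)]
    return sum(counts) / len(counts)

def num_inconsistent_forcings(cb):
    """N_inc: forcing relations between cases that contradict each other on the outcome,
    counted here as ordered pairs."""
    return sum(1 for F1, out1 in cb for F2, out2 in cb
               if out1 != out2 and fs_at_least_as_good(F1, F2, out1))
```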
5. Discussion and Future Work

Post hoc analyses often constitute classifiers themselves, although evidently worse than the actual models (or they would be used as the models instead). This is not the case with AF-CBA. One can still hold the view that a simpler but more transparent model is preferable to a post hoc analysis. However, it is our experience that this is often unfeasible, as there exist many problems for which the only satisfactory solutions are too opaque, especially for people who are not researchers or data scientists. It is our belief that this warrants the use of post hoc analyses in many situations.

Our modification of AF-CBA relies on the intuition that one precedent can be more authoritative than another. We demonstrate its consequences in numerical terms (μ and N_inc), given that lower numbers indicate better explanations (as was argued in the original paper [4]). However, it could be argued that these metrics are but proxies for the capacity to justify ML predictions in an intuitive fashion. Testing this would require a usability study to evaluate the explanatory power and interpretability of various explanations. A variety of alternative modifications and additional metrics could then be compared to study their efficacy in a real-world setting.

None of our expressions for authoritativeness would ever reach a value of 0 for any case in the CB. This seems intuitive, since any case should have at least some authoritativeness simply due to its being a precedent. A value of α(c) = 1 is only realistic when using our relative expression of authoritativeness (1). This would only be a problem if values obtained with different expressions of authoritativeness had to be compared to each other, which is not the aim of our method. If multiple explanations are ever to be compared as part of some overarching approach, these (and possibly other) characteristics of alternative authoritativeness expressions would have to be taken into account.

Additional modifications to AF-CBA could include other criteria for ranking precedents, incorporating complex arguments in the explanations (AF-CBA is qualified as a 'top-level' model due to the possibility of providing it with a set of definitions as to why specific downplaying moves can be played), or accounting for dimensions which are highly dependent. Another possibility is an alteration that allows dimensions to have a more complex effect on predictions than the tendencies used in this paper. There exist binary classification tasks for which this would be desirable. For example, a dimension such as blood pressure could be a predictor for illness both at very low and very high values, with a value in the intermediate range being a predictor for the patient not being ill. We intend to include this in our future work.

Conclusion

In this paper, we have presented an extension of an earlier top-level model (AF-CBA) for case-based argumentation used to provide post hoc justifications for opaque machine learning predictions. We have modified its definition of best precedent to include a quantified expression of how authoritative that precedent is, thereby affecting which cases are likely to be cited. This is not strictly in conflict with the a fortiori assumption underpinning the approach. Instead, it recognises the limitations of that assumption in light of the inconsistency that can be expected from real-world case bases. We have experimented with multiple versions of this expression to study which appears to be the most fruitful regarding the handling of inconsistency without adversely affecting the explanations. Our evaluation suggests that our two somewhat more elaborate expressions of authoritativeness are more suitable. Future work will be aimed at other modifications and at usability studies.

Acknowledgements

The authors would like to thank the anonymous reviewers for their feedback and suggestions.

References

[1] Z. Lipton, The mythos of model interpretability, Communications of the ACM 61 (2016) 96–100. doi:10.1145/3233231.
[2] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Computing Surveys 51 (2018) 93:1–93:42. doi:10.1145/3236009.
[3] T. Miller, Explanation in artificial intelligence: insights from the social sciences, Artificial Intelligence 267 (2019) 1–38. doi:10.1016/j.artint.2018.07.007.
[4] H. Prakken, R. Ratsma, A top-level model of case-based argumentation for explanation: formalisation and experiments, Argument & Computation (2021) 1–36 (preprint). doi:10.3233/AAC-210009.
[5] J. Horty, Rules and reasons in the theory of precedent, Legal Theory 17 (2011) 1–34.
[6] J. Horty, Reasoning with dimensions and magnitudes, Artificial Intelligence and Law 27 (2019) 309–345. doi:10.1007/s10506-019-09245-0.
[7] V. Aleven, Teaching case-based argumentation through a model and examples, Ph.D. thesis, University of Pittsburgh, Pittsburgh, 1997.
[8] C. G. Northcutt, A. Athalye, J. Mueller, Pervasive label errors in test sets destabilize machine learning benchmarks, arXiv:2103.14749 [cs, stat] (2021).
[9] B. Frenay, M. Verleysen, Classification in the presence of label noise: a survey, IEEE Transactions on Neural Networks and Learning Systems 25 (2014) 845–869. doi:10.1109/TNNLS.2013.2292894.
[10] IBM, Telco Customer Churn, 2018.
[11] S. Modgil, M. Caminada, Proof theories and algorithms for abstract argumentation frameworks, in: G. Simari, I. Rahwan (Eds.), Argumentation in Artificial Intelligence, Springer US, Boston, MA, 2009, pp. 105–129. doi:10.1007/978-0-387-98197-0_6.
[12] M. Acharya, A. Armaan, A. Antony, A comparison of regression models for prediction of graduate admissions, in: 2019 International Conference on Computational Intelligence in Data Science (ICCIDS), 2019, pp. 1–5.
[13] D. Wagner, D. Heider, G. Hattab, Mushroom data creation, curation, and simulation to support classification tasks, Scientific Reports 11 (2021). doi:10.1038/s41598-021-87602-3.