Justifications Derived from Inconsistent Case Bases
Using Authoritativeness
Joeri G. T. Peters1,2 , Floris J. Bex1,3 and Henry Prakken1,4
1 Department of Information and Computing Sciences, Utrecht University, the Netherlands
2 National Police Lab AI, Netherlands National Police, the Netherlands
3 Tilburg Institute for Law, Technology and Society, Tilburg University, the Netherlands
4 Faculty of Law, University of Groningen, the Netherlands


Abstract
Post hoc analyses are used to provide interpretable explanations for machine learning predictions made by an opaque model. We modify a top-level model (AF-CBA) that uses case-based argumentation as such a post hoc analysis. AF-CBA justifies model predictions on the basis of an argument graph constructed using precedents from a case base. The effectiveness of this approach is limited when faced with inconsistent case bases, which are frequently encountered in practice. Reducing an inconsistent case base to a consistent subset is possible but undesirable. By altering the approach’s definition of best precedent to include an additional criterion based on an expression of authoritativeness, we allow AF-CBA to handle inconsistent case bases. We experiment with four different expressions of authoritativeness using three different data sets in order to evaluate their effect on the explanations generated in terms of the average number of precedents and the number of inconsistent a fortiori forcing relations.

Keywords
Justifications, Inconsistent case bases, Authoritativeness




1. Introduction
Both machine learning (ML) and rule-based classification approaches involve a trade-off between
accuracy and transparency, specifically the ability of end-users to understand decisions (class
predictions) [1]. Deep neural networks in particular tend to produce predictions with a high
degree of accuracy at the cost of transparency due to their technical complexity. However, the
perceived complexity may vary according to a person’s level of understanding, so even much
simpler approaches might be thought of as relatively opaque by some people. Another reason
for poor transparency can be proprietary protection of the approach, which can render even a
relatively simple approach opaque. Regardless of the underlying reason, the term ‘black box’ is
often used to refer to a particularly opaque approach [1, 2]. A black box model is more difficult
to trust. It is harder to see its shortcomings, including biases and ethical concerns, which is why
Explainable Artificial Intelligence (XAI) is aimed at increasing the transparency of black box
models [3]. In the case of a binary classification problem in ML, this entails that we can explain
why one class label was predicted by the model and not the other in a particular instance.

1st International Workshop on Argumentation for eXplainable AI (ArgXAI, co-located with COMMA ’22), September 12,
2022, Cardiff, UK
j.g.t.peters@uu.nl (J. G. T. Peters); f.j.bex@uu.nl (F. J. Bex); h.prakken@uu.nl (H. Prakken)





   Methods for explaining ML decisions vary in a number of respects. A distinction can be made
between methods that generate local explanations (explaining individual instances) and those
that generate global explanations (explaining a whole model). Some methods have access to the
learnt model, while others are model agnostic. We use the term ‘justifications’ for explanations
generated without model access to signify that such explanations do not explain exactly how
a decision was reached, but instead they explain the assumptions under which the model’s
decision can be justified. Justifications are thus not intended as separate predictions. This is
related to the notion of post hoc analysis, which implies that an explanation is produced after the
fact [1]. In this paper, we are concerned with local justifications produced by a model-agnostic,
post hoc analysis for binary classification.
   One way to justify a classifier’s prediction is to show a case that is most similar to the case
whose class is being predicted (the focus case). Pointing out such a similar case constitutes an argument that
requires no knowledge of ML to understand. To this end, Prakken & Ratsma [4] draw on AI &
law research to propose a top-level model using case-based argumentation (CBA) to explain
black-box predictions based on Horty’s model of a fortiori reasoning [5, 6] and inspired by
CATO [7], hereafter referred to as ‘A Fortiori Case-Based Argumentation’ (AF-CBA). As ML
classification is a supervised approach, there exists a training set used to train a classifier. AF-
CBA requires that this set be accessible. AF-CBA produces a human-interpretable justification of
that classifier’s binary prediction for a focus case by treating this training set as a case base (CB)
and comparing precedents and their outcomes from that CB to the focus case. The underlying
a fortiori assumption of AF-CBA is that the focus case should have the same outcome as a
precedent case if the differences between these cases only serve to add further support for that
same outcome [4].
   When asked to provide a justification, AF-CBA constructs an argument graph through a
grounded argument game consisting of a fixed set of allowed moves. A proponent defends why
the focus case should receive the same outcome as a best precedent (a most similar case) and the
opponent argues against this. In doing so, they cite examples and counterexamples from the CB.
There are distinguishing moves that set cases apart and moves to downplay these differences.
When a precedent case has no relevant differences with the focus case, deciding for the focus
case is said to be ‘forced’. The effectiveness of AF-CBA hinges in part on the distance measure
between cases and any feature selection technique used to promote an interpretable argument
graph [4].
   CBs may contain inconsistencies. Because training sets are used as CBs, they constitute
annotated data, i.e. data instances labelled by a person or process with the intention of allowing
the ML model to learn to perform the same classification task. Annotators (people who label data
to this end) produce a labelled data set specifically for the purpose of training a model, but may
not necessarily be fully consistent when doing so [8]. Multiple annotators might disagree or an
annotator can make an occasional mistake, thus leading to an inconsistent case base. Labels may
also be produced by decision makers—people who produce labels as part of their role in some
decision process, such as judges who decide on court cases, with their verdict being the label
that is stored in a body of case law. This can also lead to contradictory classifications, as case law
can contain conflicting opinions and interpretations. Finally, the feature vector itself may be a
subset of all relevant details, thereby potentially lacking necessary data to discriminate between
seemingly similar cases [9]. These sources of noise make the labelling seem inconsistent, since





identical feature vectors might receive conflicting labels. Under the a fortiori assumption, this
notion of inconsistency becomes even broader: a case which is at least as good as another yet
receives the opposite outcome is a source of inconsistency. For these reasons, CB consistency is
generally not a safe assumption in practice.
   AF-CBA does not strictly require that the CB be consistent, but inconsistencies are often
due to exceptional cases (with a surprising outcome) and these can be problematic for the
explanation due to the focus case being forced for both outcomes. In experiments by Prakken &
Ratsma [4], significant portions of a CB had to be ignored (by removing a minimal number of
cases when instantiating the CB) in order to make them consistent—namely 0.32%, 11.35% and
3.20% for three different inconsistent data sets. We would preferably use the whole training set
as a CB, without having to take this consistent subset to circumvent the problem. The problem
is exacerbated by feature selection techniques, which would otherwise benefit the simplicity
of AF-CBA’s explanations. In conclusion, CB consistency forms a problematic constraint for
AF-CBA.
   In this paper, we present a modification of AF-CBA that takes into account the degree (which
we call ‘authoritativeness’) to which the CB is consistent with regard to a best precedent. This
measure is used to prevent inconsistent forcing by modifying the selection of best precedents to
cite, as it makes intuitive sense to cite cases with the highest authoritativeness. We investigate
the desirability of this modification through exploratory experiments with several possible
alternatives of quantifying authoritativeness, demonstrating it to have a beneficial effect on
AF-CBA without adversely affecting its explanations. The rest of this paper is structured as
follows. We describe AF-CBA and its background in Section 2. We consider how to address the
problem of inconsistency in Section 3. We subsequently experiment with our proposed solution
in Section 4. We discuss the results and future work in Section 5.


2. Case-Based Argumentation
In this section, we present the CBA framework by Prakken & Ratsma (with some differences in
notation) for explanations with dimensions [4]. As our running example, we make use of the
Telco Customer Churn data set [10], which describes the customers of a telecommunications
provider and whether or not they churned (switched providers). Table 1 describes the dimensions
(‘features’ in ML) used. The optional superscript arrow reflects the tendency of a dimension,
i.e. whether a higher value promotes a result of 1 for the class label. Here, only the dimension
of ℎ𝑖𝑔ℎ 𝑐𝑜𝑠𝑡 makes it likelier for a customer to churn; the other three dimensions make it less
likely for a customer to do so.

Table 1
The dimensions used in the Churn example.
    Dimension     Name        Description
    𝑑↓1           Gift        Whether the customer has received a gift from the provider
    𝑑↓2           Present     Whether the customer was present during the last organised event
    𝑑↓3           Website     The number of times the customer logged into their profile
    𝑑↑4           High cost   Whether the customer is in a high-cost category






Table 2
A fictional example based on the Churn data set with a CB consisting of only two cases and a new
(focus) case.
                         Customer          𝑑↓1   𝑑↓2       𝑑↓3   𝑑↑4   Label (churn)
                         Alice             1     0         5     0     0
                         Bob               1     1         3     1     1
                         Charlie (focus)   0     1         3     0     ?

   We take Table 2 as our example CB. Let us presume that Alice and Bob are previous customers
and Charlie is a new customer whose predicted outcome we want to justify (the focus case).
Formally, we denote this as follows. Let 𝑜 and 𝑜′ be the two possible outcomes of a case in the
CB. The variables 𝑠 and ¯𝑠 denote the two sides, meaning that 𝑠 = 𝑜 if ¯𝑠 = 𝑜′ and vice versa. A
dimension is defined as a tuple 𝑑 = (𝑉, ≤𝑜 , ≤𝑜′ ), with value set 𝑉 and two partial orderings on
𝑉 , ≤𝑜 and ≤𝑜′ , such that 𝑣 ≤𝑜 𝑣 ′ iff 𝑣 ′ ≤𝑜′ 𝑣 for 𝑣, 𝑣 ′ ∈ 𝑉 . A value assignment is a pair (𝑑, 𝑣).
We denote the value 𝑥 of dimension 𝑑 as 𝑣(𝑑, 𝑐) = 𝑥 for case 𝑐 ∈ 𝐶𝐵. Value assignments to all
dimensions 𝑑 ∈ 𝐷 (where 𝐷 is nonempty) constitute a fact situation 𝐹 . A case is defined as
𝑐 = (𝐹, 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐)) for such a fact situation and an 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐) ∈ {𝑜, 𝑜′ }. In this context, a
case base 𝐶𝐵 specifically refers to the set of cases with value assignments for 𝐷. We denote
the fact situation of a case 𝑐 as 𝐹 (𝑐). In the rest of this paper, we assume that any two fact
situations assign values to the same set 𝐷.
   Say we have an ML model that predicts Charlie will stay. An explanation of this outcome could
be that it is forced by another case. We model Horty’s [5] a fortiori reasoning using Definitions
1 and 2, meaning that the outcome of a focus case is forced if there is a precedent with the same
outcome where all their differences make the focus case even stronger for that outcome.

Definition 1 (Preference relation for fact situations). Given two fact situations 𝐹 and 𝐹 ′ ,
𝐹 ≤𝑠 𝐹 ′ iff 𝑣 ≤𝑠 𝑣 ′ for all (𝑑, 𝑣) ∈ 𝐹 and (𝑑, 𝑣 ′ ) ∈ 𝐹 ′ .

Definition 2 (Precedential constraint). Given case base 𝐶𝐵 and fact situation 𝐹 , deciding 𝐹 for
𝑠 is forced iff CB contains a case 𝑐 = (𝐹 ′ , 𝑠) such that 𝐹 ′ ≤𝑠 𝐹 .

   A fact situation could be forced for both 𝑠 and ¯𝑠, which brings us to the following definition
of CB consistency:

Definition 3 (Case base consistency). A case base 𝐶𝐵 is consistent iff it does not contain two
cases 𝑐 = (𝐹, 𝑠) and 𝑐′ = (𝐹 ′ , ¯𝑠) such that 𝐹 ≤𝑠 𝐹 ′ . Otherwise it is inconsistent.
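
   To make Definitions 1–3 concrete, the following Python sketch implements them under an
assumed toy encoding: a case is a (facts, outcome) pair, facts map dimension names to numbers,
outcomes are 0 and 1, and a tendency of +1 (the ↑ dimensions) or −1 (the ↓ dimensions) records
whether higher values promote outcome 1. All names are illustrative and not taken from any
existing implementation.

```python
# Assumed toy encoding of cases and dimensions (cf. Tables 1 and 2).
TENDENCY = {"gift": -1, "present": -1, "website": -1, "high_cost": +1}

def leq(va, vb, dim, side):
    """va <=_side vb: vb is at least as good as va for `side` on `dim`.
    The orderings for the two sides are each other's inverse."""
    sign = 1 if side == 1 else -1
    return sign * TENDENCY[dim] * va <= sign * TENDENCY[dim] * vb

def fact_leq(fa, fb, side):
    """Definition 1: F_a <=_side F_b, checked dimension-wise."""
    return all(leq(fa[d], fb[d], d, side) for d in fa)

def is_forced(cb, facts, side):
    """Definition 2: some case (F', side) in the CB has F' <=_side facts."""
    return any(out == side and fact_leq(f, facts, side) for f, out in cb)

def is_consistent(cb):
    """Definition 3: no case is at least as good for its own outcome as
    some opposite-outcome case."""
    return not any(o1 != o2 and fact_leq(f1, f2, o1)
                   for f1, o1 in cb for f2, o2 in cb)

cb = [({"gift": 1, "present": 0, "website": 5, "high_cost": 0}, 0),  # Alice
      ({"gift": 1, "present": 1, "website": 3, "high_cost": 1}, 1)]  # Bob
charlie = {"gift": 0, "present": 1, "website": 3, "high_cost": 0}
print(is_forced(cb, charlie, 0), is_consistent(cb))  # False True
```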

   An explanation takes the form of an argument game for grounded semantics [11] played
between a proponent and opponent of an outcome, in which they take turns to attack the other’s
last argument. Since neither of the cases in Table 2 forces an outcome for the focus case, the
proponent and opponent argue about the outcome. An argument is justified if the
proponent has a winning strategy, meaning the opponent runs out of moves. The proponent
starts by citing a best precedent. This is a case which has the outcome for which the proponent
is arguing and has a minimal subset of relevant differences with the focus case. Determining the
relevant differences between two cases is defined according to Definition 4 (as presented in [4]).






Definition 4 (Differences between cases). Let 𝑐 = (𝐹 (𝑐), 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐)) and
𝑓 = (𝐹 (𝑓 ), 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑓 )) be two cases. The set 𝐷(𝑐, 𝑓 ) of differences between 𝑐 and 𝑓 is:
   1. 𝐷(𝑐, 𝑓 ) = {(𝑑, 𝑣) ∈ 𝐹 (𝑐) | 𝑣(𝑑, 𝑐) ≰𝑠 𝑣(𝑑, 𝑓 )} if 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐) = 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑓 ) = 𝑠.
   2. 𝐷(𝑐, 𝑓 ) = {(𝑑, 𝑣) ∈ 𝐹 (𝑐) | 𝑣(𝑑, 𝑐) ≱¯𝑠 𝑣(𝑑, 𝑓 )} if 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐) ̸= 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑓 ) and
      𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐) = 𝑠.

   So one relevant difference between Charlie (assuming he is predicted to stay) and Alice (who
stayed) in Table 2 would be for dimension 𝑑1 , where Alice received a gift and Charlie did not,
making her case better for staying.
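
   Continuing the toy sketch above, Definition 4 can be written as a single function; for brevity
it returns dimension names rather than (𝑑, 𝑣) pairs.

```python
def differences(c, f):
    """Definition 4: the relevant differences D(c, f) between precedent c
    and focus case f, as a set of dimension names."""
    (fc, out_c), (ff, out_f) = c, f
    if out_c == out_f:
        # case 1: dimensions where the focus case is not at least as good
        # as the precedent for their shared outcome
        return {d for d in fc if not leq(fc[d], ff[d], d, out_c)}
    # case 2 (opposite outcomes): dimensions where the precedent is not at
    # least as good as the focus case for the focus case's outcome
    return {d for d in fc if not leq(ff[d], fc[d], d, out_f)}

# With the toy CB above and Charlie predicted to stay (outcome 0):
# differences(cb[0], (charlie, 0)) == {"gift", "website"}, i.e. d1 and d3.
```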

Definition 5 (Best precedent). Let 𝑐 = (𝐹 (𝑐), 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐)) and
𝑓 = (𝐹 (𝑓 ), 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑓 )) be two cases, where 𝑐 ∈ 𝐶𝐵 and 𝑓 ∉ 𝐶𝐵. 𝑐 is a best precedent for 𝑓
iff:

    • 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐) = 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑓 ) and
    • there is no 𝑐′ ∈ 𝐶𝐵 such that 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐′ ) = 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐) and 𝐷(𝑐′ , 𝑓 ) ⊂ 𝐷(𝑐, 𝑓 ).

   Definition 5 defines a best precedent to cite. Multiple cases can meet these criteria. A lower
number of best precedents is preferable, both for computational reasons and because a larger
number of possible citations would make any single explanation somewhat arbitrary. This is
why Prakken & Ratsma evaluated AF-CBA in part on the average number of
best precedents found for three different data sets [4].
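
   In the running sketch, the best precedents of Definition 5 can be computed by filtering out
any same-outcome case whose difference set strictly contains that of another:

```python
def best_precedents(cb, f):
    """Definition 5: same-outcome cases whose difference set with the
    focus case is minimal with respect to strict subset inclusion."""
    same = [c for c in cb if c[1] == f[1]]
    diffs = [differences(c, f) for c in same]
    return [c for c, dc in zip(same, diffs)
            if not any(d2 < dc for d2 in diffs)]  # set `<` is strict subset
```

On the CB of Table 2 with Charlie predicted to stay, this returns only Alice’s case.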
   The opponent can reply to the initial citation by playing a distinguishing move or by citing
a counterexample. The proponent can reply in turn with similar distinguishing moves. The
distinguishing moves are 𝑊 𝑜𝑟𝑠𝑒(𝑐, 𝑥) — the focus case is on some dimensions 𝑥 worse than the
precedent 𝑐 for 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐) — , 𝐶𝑜𝑚𝑝𝑒𝑛𝑠𝑎𝑡𝑒𝑠(𝑐, 𝑥, 𝑦) — the dimensions 𝑥 on which the focus
case is not at least as good as the precedent 𝑐 for 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐) are compensated by dimensions
𝑦 on which the focus case is better for 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐) than 𝑐 — and 𝑇 𝑟𝑎𝑛𝑠𝑓 𝑜𝑟𝑚𝑒𝑑(𝑐, 𝑐′ ) — the
initial citation of a most similar case for 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑓 ) can be transformed by the distinguishing
moves into a case for which 𝐷(𝑐, 𝑓 ) = ∅ and which can therefore attack the counterexample.
For brevity, we refer to [4] for the formal motivations of these moves and for why the
𝐶𝑜𝑚𝑝𝑒𝑛𝑠𝑎𝑡𝑒𝑠 move must be allowed to be empty in order to state that the differences with
the focus case do not matter.
   Returning to our example, Figure 1 presents the resulting explanation as an argument game,
which can be read as follows. P1: Alice stayed and her case is similar to Charlie’s. O1: Charlie’s
scores for 𝑑↓1 and 𝑑↓3 make him worse for staying than Alice. P2: Charlie’s score for 𝑑↓2
compensates for O1. O2: Bob churned and his case is similar to Charlie’s. P3: Charlie’s score
for 𝑑↑4 makes him worse for churning than Bob. O3: Charlie’s score for 𝑑↓1 compensates for P3.
P4: Charlie’s score for 𝑑↓2 compensates for O3. After this, the opponent has run out of possible
moves to make and the proponent wins. The similarity to Alice’s case has held up and acts as an
explanation for the prediction that Charlie will stay as well.
   Formalising this brings us to the following definition (after [4]) for the AF-CBA framework:








Figure 1: A fictional example of an explanation (dialogue between proponent and opponent).


Definition 6 (Case-based argumentation framework). Given a finite case base 𝐶𝐵, a focus case
𝑓 ∉ 𝐶𝐵, and definitions of compensation 𝑑𝑐, an abstract argumentation framework AAF is a pair
⟨𝒜, 𝑎𝑡𝑡𝑎𝑐𝑘⟩, where:

    • 𝒜 = 𝐶𝐵 ∪ 𝑀 ,
      with 𝑀 = {𝑊 𝑜𝑟𝑠𝑒(𝑐, 𝑥) | 𝑐 ∈ 𝐶𝐵, 𝑥 ̸= ∅ and
      𝑥 = {(𝑑, 𝑣) ∈ 𝐹 (𝑓 ) | 𝑣(𝑑, 𝑓 ) <𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑓 ) 𝑣(𝑑, 𝑐)}} ∪
      {𝐶𝑜𝑚𝑝𝑒𝑛𝑠𝑎𝑡𝑒𝑠(𝑐, 𝑦, 𝑥) | 𝑐 ∈ 𝐶𝐵, 𝑦 ⊆ {(𝑑, 𝑣) ∈ 𝐹 (𝑓 ) | 𝑣(𝑑, 𝑐) <𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑓 ) 𝑣(𝑑, 𝑓 )},
      𝑥 = {(𝑑, 𝑣) ∈ 𝐹 (𝑓 ) | 𝑣(𝑑, 𝑓 ) <𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑓 ) 𝑣(𝑑, 𝑐)} and 𝑦 compensates 𝑥 according to
      𝑑𝑐} ∪
      {𝑇 𝑟𝑎𝑛𝑠𝑓 𝑜𝑟𝑚𝑒𝑑(𝑐, 𝑐′ ) | 𝑐 ∈ 𝐶𝐵 and 𝑐 can be transformed into 𝑐′ and 𝐷(𝑐′ , 𝑓 ) = ∅}
    • 𝐴 attacks 𝐵 iff:
         – 𝐴, 𝐵 ∈ 𝐶𝐵 and 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝐴) ̸= 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝐵) and 𝐷(𝐵, 𝑓 ) ̸⊂ 𝐷(𝐴, 𝑓 );
         – 𝐵 ∈ 𝐶𝐵 with 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝐵) = 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑓 ) and 𝐴 is of the form 𝑊 𝑜𝑟𝑠𝑒(𝐵, 𝑥);
         – 𝐵 is of the form 𝑊 𝑜𝑟𝑠𝑒(𝑐, 𝑥) and 𝐴 is of the form 𝐶𝑜𝑚𝑝𝑒𝑛𝑠𝑎𝑡𝑒𝑠(𝑐, 𝑦, 𝑥);
         – 𝐵 ∈ 𝐶𝐵 and 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝐵) ̸= 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑓 ) and 𝐴 is of the form
           𝑇 𝑟𝑎𝑛𝑠𝑓 𝑜𝑟𝑚𝑒𝑑(𝑐, 𝑐′ ).
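
   As a partial illustration of Definition 6, the sketch below generates only the 𝑊 𝑜𝑟𝑠𝑒(𝑐, 𝑥)
moves against citations with the focus case’s outcome, again in the toy encoding used earlier;
the 𝐶𝑜𝑚𝑝𝑒𝑛𝑠𝑎𝑡𝑒𝑠 and 𝑇 𝑟𝑎𝑛𝑠𝑓 𝑜𝑟𝑚𝑒𝑑 moves additionally require the compensation definitions
𝑑𝑐 and are omitted.

```python
def strictly_less(va, vb, dim, side):
    """va <_side vb, assuming a total order of values per dimension."""
    return leq(va, vb, dim, side) and not leq(vb, va, dim, side)

def worse_moves(cb, f):
    """The Worse(c, x) moves of Definition 6 against same-outcome
    citations: x collects the dimensions on which the focus case is
    strictly worse than the precedent c, and must be nonempty."""
    facts_f, out_f = f
    moves = []
    for c in cb:
        facts_c, out_c = c
        if out_c != out_f:
            continue  # Worse moves attack same-outcome citations
        x = frozenset(d for d in facts_f
                      if strictly_less(facts_f[d], facts_c[d], d, out_f))
        if x:  # Definition 6 requires x to be nonempty
            moves.append(("Worse", c, x))
    return moves
```

On the running example, the only generated move is Worse(Alice, {gift, website}), which
corresponds to O1 in Figure 1.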

  In summary, AF-CBA provides justifications for individual binary classifications predicted by
a ML model by presenting a winning strategy for a grounded argument game in favour of the
predicted class label. This winning strategy represents a dialogue between a proponent and
opponent on the basis of citations from the labelled training set (the case base) and shows how
the opponent runs out of moves and the proponent thus wins the argument.


3. CB Inconsistency
As we argued in Section 1, CB consistency is not always a safe assumption to make. Explanations
containing inconsistent forcings essentially explain that a decision cannot be justified without
acknowledging the inconsistency of the CB, which weakens the value of those explanations.






The larger the number of inconsistent forcings (𝑁𝑖𝑛𝑐 ), the larger the number of explanations
where this problem occurs.
   Instead of mitigating the problem through case deletion [4], we explicitly take inconsistencies
into account. Informally, one might say that when there is consistency, a precedential case
has a strong backing when cited and should indeed immediately force the outcome; if there is
inconsistency, it has less backing and thus should not. We therefore introduce the concept of
‘authoritativeness’, by which we mean that, given any case 𝑐 ∈ 𝐶𝐵, the authoritativeness 𝛼(𝑐)
numerically expresses (normalised between 0 and 1) the degree to which the rest of the CB
supports the citing of 𝑐 for 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐). We subsequently use 𝛼(𝑐) as an additional criterion in
the selection of best precedents. The intuition behind authoritativeness is that whereas the a
fortiori rule applied to a consistent CB can be expressed as the phrase ‘cases like this always
receive outcome 𝑜,’ our idea of authoritativeness changes this phrase to ‘cases like this usually
receive outcome 𝑜’—where ‘usually’ has to be quantified in some manner which expresses the
inconsistency of the CB with regard to the focus case. Since 𝛼(𝑐) is a number, we can have a
total ordering ≤ on the authoritativeness of cases.
   Table 3 is another instance of our Churn example. Depending on how one chooses to define
𝛼(𝑐), 𝑐1 and 𝑐2 should arguably receive a higher value for 𝛼(𝑐) than 𝑐3 due to 𝑐4 having the
opposite outcome.

Table 3
Example of a CB with two identical cases that are consistent with each other and two identical cases
which contradict each other.
                             Customer 𝑑↓1 𝑑↓2 𝑑↓3 𝑑↑4 𝑜𝑢𝑡𝑐𝑜𝑚𝑒
                             𝑐1          1    1     0    0     𝑠
                             𝑐2          1    1     0    0     𝑠
                             𝑐3          1    1     5    0     𝑠
                             𝑐4          1    1     5    0    ¯𝑠

   First of all, the definition of best precedent has to be modified to reflect the additional criterion
of maximising the authoritativeness:

Definition 7. (Best authoritative precedent) Let 𝐶𝐵 be a case base and let
𝑐 = (𝐹 (𝑐), 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐)) and 𝑓 = (𝐹 (𝑓 ), 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑓 )) be two cases, where 𝑐 ∈ 𝐶𝐵 and
𝑓 ∉ 𝐶𝐵. 𝑐 is a best authoritative precedent for 𝑓 iff:

    • 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐) = 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑓 ),
    • there is no 𝑐′ ∈ 𝐶𝐵 such that 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐′ ) = 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐) while 𝐷(𝑐′ , 𝑓 ) ⊂ 𝐷(𝑐, 𝑓 ) and
      𝛼(𝑐′ ) ≥ 𝛼(𝑐).
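
   In the running sketch, Definition 7 only changes the domination test of Definition 5: a case
is now also kept if every case with strictly fewer relevant differences is strictly less
authoritative. Here 𝛼 is passed in as a function, one of the candidate expressions defined below.

```python
def best_authoritative_precedents(cb, f, alpha):
    """Definition 7: c is only dominated by a same-outcome case c2 with
    D(c2, f) a strict subset of D(c, f) AND alpha(c2) >= alpha(c)."""
    same = [c for c in cb if c[1] == f[1]]
    return [c for c in same
            if not any(differences(c2, f) < differences(c, f)
                       and alpha(c2, cb) >= alpha(c, cb)
                       for c2 in same)]
```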

  In order to quantify authoritativeness, we require expressions of agreement and disagreement
between a precedent and the rest of the CB:

Definition 8. (Agreement) Let 𝐶𝐵 be a case base. Given 𝑐 ∈ 𝐶𝐵, the agreement 𝑛𝑎 (𝑐) is defined
as:
𝑛𝑎 (𝑐) = | {𝑐′ ∈ 𝐶𝐵 | 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐′ ) = 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐) and 𝐷(𝑐, 𝑐′ ) = ∅} |






Definition 9. (Disagreement) Let 𝐶𝐵 be a case base. Given 𝑐 ∈ 𝐶𝐵, the disagreement 𝑛𝑑 (𝑐) is
defined as:
𝑛𝑑 (𝑐) = | {𝑐′ ∈ 𝐶𝐵 | 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐′ ) ̸= 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐) and 𝐷(𝑐, 𝑐′ ) = ∅} |

  We understand 𝑛𝑎 (𝑐) as the number of cases which have the same outcome as the precedent
case and are at least as good for that outcome as 𝑐 (thereby lending support to 𝑐). Similarly,
𝑛𝑑 (𝑐) is the number of cases which have the opposite outcome yet are at least as good for
𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐). The agreement 𝑛𝑎 (𝑐) has at least a value of 1 due to 𝑐 itself being a member of the
CB. The disagreement 𝑛𝑑 (𝑐) can have a value of 0.
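
   In the running sketch, both counts reduce to testing 𝐷(𝑐, 𝑐′ ) = ∅ against every case in the CB:

```python
def agreement(c, cb):
    """Definition 8: same-outcome cases at least as good for outcome(c)
    as c itself; includes c, so the count is at least 1."""
    return sum(1 for c2 in cb if c2[1] == c[1] and not differences(c, c2))

def disagreement(c, cb):
    """Definition 9: opposite-outcome cases that are nevertheless at
    least as good for outcome(c); may be 0."""
    return sum(1 for c2 in cb if c2[1] != c[1] and not differences(c, c2))
```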
  Exactly how the level of agreement relates to authoritativeness is not self-evident, as various
expressions may have equal merit. For example, given a case 𝑐 ∈ 𝐶𝐵, we could express the
authoritativeness 𝛼(𝑐) as the relative number of cases which lend further support to 𝑐 (1). In
Table 3, 𝑐3 is supported by (other than itself) 𝑐1 and 𝑐2 , but opposed by 𝑐4 . So in that situation,
𝛼(𝑐3 ) = 3/(3 + 1) = 0.75.

$$\alpha(c) = \frac{n_a(c)}{n_a(c) + n_d(c)} \qquad (1)$$
   However, this overlooks any intuitive understanding of authoritativeness which stems from
the absolute number of cases that can act as precedents (2). Intuitively, obscure cases are less
authoritative than common ones. In Table 4, 𝑐1 is supported by two other cases (again, other
than itself), namely 𝑐2 and 𝑐3 , while 𝑐5 is supported by 𝑐1 through 𝑐4 . We divide by | 𝐶𝐵 | to
normalise the expression between 0 and 1. So for example 𝛼(𝑐1 ) = 3/(3 + 0) = 1 according to
(1) but 𝛼(𝑐1 ) = 3/7 ≈ 0.429 according to (2).

$$\alpha(c) = \frac{n_a(c)}{|CB|} \qquad (2)$$


Table 4
Example of an inconsistent CB showcasing different levels of support.
                             Customer      𝑑↓1   𝑑↓2        𝑑↓3   𝑑↑4    𝑜𝑢𝑡𝑐𝑜𝑚𝑒
                             𝑐1            1     1          0     0      𝑠
                             𝑐2            1     1          0     0      𝑠
                             𝑐3            1     1          0     0      𝑠
                             𝑐4            1     1          2     0      𝑠
                             𝑐5            1     1          2     0      𝑠
                             𝑐6            1     1          2     0     ¯𝑠
                             𝑐7            1     1          15    0      𝑠

   Both (1) and (2) appear to have some intuitive merit, which suggests combining the two. One
option (3) is to take the product of (1) and (2), essentially using (1) as a weight factor for (2).

$$\alpha(c) = \frac{n_a(c)}{n_a(c) + n_d(c)} \cdot \frac{n_a(c)}{|CB|} \qquad (3)$$






   Alternatively, (1) and (2) can be combined as a weighted harmonic mean (4). This introduces a
parameter 𝛽 expressing the relative importance of one expression over the other. The added
advantage is that (1) could, for instance, be considered twice as important as (2). At a value of
𝛽 = 1, the two are equally important and (4) reduces to the ordinary harmonic mean of (1) and (2).

$$\alpha(c) = (1 + \beta^2) \cdot \frac{\dfrac{n_a(c)}{n_a(c) + n_d(c)} \cdot \dfrac{n_a(c)}{|CB|}}{\beta^2 \cdot \dfrac{n_a(c)}{n_a(c) + n_d(c)} + \dfrac{n_a(c)}{|CB|}} \qquad (4)$$
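
   The four candidate expressions translate directly into the running sketch; note that the 𝛽²
weighting in the last variant follows the usual F-measure convention.

```python
def alpha_relative(c, cb):                    # expression (1)
    na, nd = agreement(c, cb), disagreement(c, cb)
    return na / (na + nd)

def alpha_absolute(c, cb):                    # expression (2)
    return agreement(c, cb) / len(cb)

def alpha_product(c, cb):                     # expression (3)
    return alpha_relative(c, cb) * alpha_absolute(c, cb)

def alpha_harmonic(c, cb, beta=1.0):          # expression (4)
    r, a = alpha_relative(c, cb), alpha_absolute(c, cb)
    return (1 + beta ** 2) * r * a / (beta ** 2 * r + a)
```

On the CB of Table 3, alpha_relative for 𝑐3 evaluates to 3/(3 + 1) = 0.75, reproducing the
worked example above.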

   Which expression is most desirable is difficult to say a priori. In the next section, we attempt
to answer this question through experimentation. However, two observations can be made here regarding
the four expressions. One is that 𝛼(𝑐) = 1 implies that the CB is consistent with regard to 𝑐,
but only in case of (1) is this value obtained without the whole CB being in agreement with 𝑐.

Proposition 1. Let 𝐶𝐵 be a case base and let 𝛼(𝑐) = 1 for a case 𝑐 ∈ 𝐶𝐵. Then 𝐶𝐵 must be
consistent with regard to 𝑐.

Proof. Recall that 𝐶𝐵 is consistent with regard to 𝑐 if there exists no other case 𝑐′ ∈ 𝐶𝐵 with
𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐) ̸= 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐′ ) and 𝐹 (𝑐) ≤𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐) 𝐹 (𝑐′ ). Recall also that this can be expressed
as 𝑛𝑑 (𝑐) = 0 and that 𝑛𝑎 (𝑐) ≥ 1 for any 𝑐 ∈ 𝐶𝐵 since 𝑐 is in agreement with itself. Suppose
that 𝛼(𝑐) = 1 according to (1). Then 𝑛𝑎 (𝑐) = 𝑛𝑎 (𝑐) + 𝑛𝑑 (𝑐) and it must follow that 𝑛𝑑 (𝑐) = 0.
Suppose now that 𝛼(𝑐) = 1 according to (2). Then 𝑛𝑎 (𝑐) =| 𝐶𝐵 |. Since 𝑐 cannot count towards
both 𝑛𝑎 (𝑐) and 𝑛𝑑 (𝑐), 𝑛𝑑 (𝑐) = 0. Suppose now that 𝛼(𝑐) = 1 according to (3). Since (3) is the
product of (1) and (2), both expressions must have a value of 1 and therefore 𝑛𝑎 (𝑐) = | 𝐶𝐵 |
and 𝑛𝑑 (𝑐) = 0. Suppose now that 𝛼(𝑐) = 1 according to (4). Since (4) is the harmonic mean
of (1) and (2), both expressions must have a value of 1 and therefore 𝑛𝑎 (𝑐) = | 𝐶𝐵 | and
𝑛𝑑 (𝑐) = 0.

  The other observation is that 𝛼(𝑐) = 0 is not obtainable for any of the expressions for
authoritativeness.

Proposition 2. Let 𝐶𝐵 be a case base. Then 𝛼(𝑐) > 0 for any 𝑐 ∈ 𝐶𝐵.

Proof. Recall that 𝑛𝑎 (𝑐) is the cardinality of the set of cases for which the condition holds that
𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐) = 𝑜𝑢𝑡𝑐𝑜𝑚𝑒(𝑐′ ) and 𝐷(𝑐, 𝑐′ ) = ∅ for 𝑐′ ∈ 𝐶𝐵. Suppose that 𝛼(𝑐) = 0. Then
𝑛𝑎 (𝑐) = 0 when evaluating 𝛼(𝑐) according to (1), (2), (3) or (4). Since the condition for the cases
counting towards 𝑛𝑎 (𝑐) holds when 𝑐 = 𝑐′ , 𝑛𝑎 (𝑐) can only be 0 when 𝑐 ∉ 𝐶𝐵.

   These observations do not affect our evaluation in the next section and do not form a limitation
of our current approach, but they do result in the following question regarding the intuitive
understanding of authoritativeness: should the minimum value of 0 and maximum value of 1 for
𝛼(𝑐) be of significance? If so, this would steer our choice for an expression for authoritativeness.
We return to this point in Section 5.






Table 5
The results of our evaluation experiments for three different data sets and four different expressions of
authoritativeness in addition to the base method where authoritativeness is not taken into account.
    Data set     Metric   Base       Relative (1)   Absolute (2)   Product (3)   Harmonic (𝛽 = 1) (4)
    Admission    𝜇        105.67     112.1          105.95         106.0         105.97
                 𝑁𝑖𝑛𝑐     496        0              0              0             0
    Churn        𝜇        82.15      148.81         94.68          94.76         94.75
                 𝑁𝑖𝑛𝑐     38012      2              42             0             0
    Mushroom     𝜇        70.25      72.37          84.66          86.75         84.83
                 𝑁𝑖𝑛𝑐     620        0              0              0             0


4. Evaluation
Using authoritativeness as a criterion when selecting a best precedent is intended to improve
the ability of AF-CBA to generate useful justifications in light of CB inconsistency. Since
AF-CBA generates justifications for the same outcome as the ML model predicts, we cannot use
fidelity (the agreement between an XAI approach and the ML model it explains) to assess the
efficacy of our modification. Evaluating our approach therefore requires more exploratory
experimentation and interpretation of the results.
   To this end, we follow a similar strategy to Prakken & Ratsma [4]. We also rely on the
same data sets as they do, namely Graduate Admission [12], Telco Customer Churn [10]
and Mushroom [13]. As an expression of how inconsistent each data set is, we determine the
minimum number of case deletions required to make each CB consistent. The result is 26 (3.20%),
647 (9.20%) and 16 (0.32%) for the Admission, Churn and Mushroom data set, respectively. The
tendencies of all dimensions are determined using the Pearson correlation coefficient. Prakken
& Ratsma [4] attempt to gain insights into the feasibility of AF-CBA in terms of the justifications
themselves and in the treatment of inconsistencies. As they explain, desirable characteristics for
AF-CBA include fewer best precedents (reducing the solution space for citing a precedent, see
Section 2). This is one of the metrics on which we compare our four alternative formulations of
authoritativeness to the base method. We treat each case in the CB as a focus case and compute
the number of best precedents for that case given the rest of the CB, reporting the mean number
(𝜇). Changes in 𝜇 would depend on the average distribution of inconsistent cases per focus
case. There is no well-motivated cut-off point for when these numbers become too high, but it
is worthwhile to consider whether 𝜇 increases by orders of magnitude and whether there are
surprising differences between the alternative formulations of authoritativeness.
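
   The tendency determination mentioned above can be read as taking the sign of each dimension’s
Pearson correlation with the class label; the following sketch makes that assumption explicit
(the names X, y and dims are illustrative, and how zero correlations are handled is not specified
in the source).

```python
import numpy as np

def tendencies(X, y, dims):
    """Assign each dimension an upward (+1) or downward (-1) tendency
    from the sign of its Pearson correlation with the 0/1 labels; X is
    the cases-by-dimensions value matrix, dims the dimension names."""
    return {d: (+1 if np.corrcoef(X[:, j], y)[0, 1] > 0 else -1)
            for j, d in enumerate(dims)}
```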
   We also report the number of inconsistent forcing relations (𝑁𝑖𝑛𝑐 ) given each experiment. As
described in Section 3, this is the number of forcing relations between two cases that contradict
each other on the outcome. 𝑁𝑖𝑛𝑐 = 0 for a consistent data set and our intention is to achieve
this without having to take a consistent subset of the data. We therefore expect 𝑁𝑖𝑛𝑐 to drop to
very low numbers (if not zero) for all experiments where we make use of authoritativeness.
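
   Both metrics can be computed with the running sketch in a leave-one-out fashion; whether
𝑁𝑖𝑛𝑐 counts ordered or unordered pairs is an assumption here (ordered pairs are used).

```python
def evaluate(cb, alpha=None):
    """Report mu, the mean number of best precedents when each case in
    turn is the focus case, and N_inc, the number of inconsistent
    forcing relations in the CB."""
    counts = []
    for i, f in enumerate(cb):
        rest = cb[:i] + cb[i + 1:]  # the rest of the CB
        counts.append(len(best_authoritative_precedents(rest, f, alpha)
                          if alpha else best_precedents(rest, f)))
    mu = sum(counts) / len(counts)
    n_inc = sum(1 for f1, o1 in cb for f2, o2 in cb
                if o1 != o2 and fact_leq(f1, f2, o1))
    return mu, n_inc
```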
   We present the results of these experiments in Table 5 (code available at
https://github.com/JGTP/CBA-precedent.git). A qualitative assessment of these results suggests
that inconsistent forcing is indeed largely avoided by taking the authoritativeness of precedents
into account, without having a costly impact on the best precedent
distributions. The relative version of authoritativeness (1) raises 𝜇 the most for two of the
three data sets and especially for the most inconsistent set (Churn), which suggests that this
particular version of authoritativeness can complicate explanations slightly with inconsistent
CBs. Relative authoritativeness (1) and especially absolute authoritativeness (2) do not reduce
𝑁𝑖𝑛𝑐 completely to zero for the Churn data set. The product (3) and harmonic (4) versions of
authoritativeness therefore appear to be the more desirable expressions. The results do not
suggest any meaningful differences between (3) and (4).


5. Discussion and Future Work
Post hoc analyses often constitute classifiers themselves, although evidently worse than the
actual models (or they would be used as the models instead). This is not the case with AF-CBA.
One can still hold the view that a simpler, more transparent model is preferable to a post
hoc analysis. However, it is our experience that this is often infeasible, as there exist many
problems for which the only satisfactory solutions are too opaque—especially for people who
are not researchers or data scientists. It is our belief that this warrants the use of post hoc
analyses in many situations.
   Our modification of AF-CBA relies on the intuition that one precedent can be more authorita-
tive than another. We demonstrate its consequences in numerical terms (𝜇 and 𝑁𝑖𝑛𝑐 ), given that
lower numbers indicate better explanations (as was argued in the original paper [4]). However,
it could be argued that these metrics are but proxies for the capacity to justify ML predictions
in an intuitive fashion. Testing this would require a usability study to evaluate the explanatory
power and interpretability of various explanations. A variety of alternative modifications and
additional metrics could then be compared to study their efficacy in a real-world setting.
   None of our expressions for authoritativeness would ever reach a value of 0 for any case in the
CB. This seems intuitive, since any case should have at least some authoritativeness simply due
to its being a precedent. A value of 𝛼(𝑐) = 1 is only realistic when using our relative expression
of authoritativeness (1). This would only be a problem if values due to different expressions of
authoritativeness would have to be compared to each other, which is not the aim of our method.
If multiple explanations are ever to be compared as part of some overarching approach, these
(and possibly other) characteristics of alternative authoritativeness expressions would have to
be taken into account.
   Additional modifications to AF-CBA could include other criteria for ranking precedents,
incorporating complex arguments in the explanations (AF-CBA is qualified as a ‘top-level’ model
due to the possibility of providing it with a set of definitions as to why specific downplaying
moves can be played) or accounting for dimensions which are highly dependent. Another
possibility is an alteration that allows dimensions to have a more complex effect on predictions
than the tendencies used in this paper. There exist binary classification tasks for which this
would be desirable. For example, a dimension such as blood pressure could be a predictor for
illness both at very low and very high values, with a value in the intermediate range being a
predictor for the patient not being ill. We intend to include this in our future work.






Conclusion
In this paper, we have presented an extension of an earlier top-level model (AF-CBA) for
case-based argumentation used to provide post hoc justifications for opaque machine learning
predictions. We have modified its definition of best precedent to include a quantified expression
of how authoritative that precedent is, thereby affecting which cases are likely to be cited. This
is not strictly in conflict with the a fortiori assumption underpinning the approach. Instead, it
recognises the limitations of that assumption in light of the inconsistency that can be expected
from real-world case bases. We have experimented with multiple versions of this expression to
study which appears to be the most fruitful regarding the handling of inconsistency without
adversely affecting the explanations. Our evaluation suggests our two somewhat more
elaborate expressions of authoritativeness are more suitable. Future work is to be aimed at other
modifications and usability studies.

Acknowledgements
The authors would like to thank the anonymous reviewers for their feedback and suggestions.


References
 [1] Z. Lipton, The mythos of model interpretability, Communications of the ACM 61 (2016)
     96–100. doi:10.1145/3233231.
 [2] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of
     methods for explaining black box models, ACM Computing Surveys 51 (2018) 93:1–93:42.
     doi:10.1145/3236009.
 [3] T. Miller, Explanation in artificial intelligence: insights from the social sciences, Artificial
     Intelligence 267 (2019) 1–38. doi:10.1016/j.artint.2018.07.007.
 [4] H. Prakken, R. Ratsma, A top-level model of case-based argumentation for explanation:
     formalisation and experiments, Argument & Computation (2021) 1–36. doi:10.3233/AAC-210009.
 [5] J. Horty, Rules and reasons in the theory of precedent, Legal Theory 17 (2011) 1–34.
 [6] J. Horty, Reasoning with dimensions and magnitudes, Artificial Intelligence and Law 27
     (2019) 309–345. doi:10.1007/s10506-019-09245-0.
 [7] V. Aleven, Teaching case-based argumentation through a model and examples, Ph.D. thesis,
     University of Pittsburgh, Pittsburgh, 1997.
 [8] C. G. Northcutt, A. Athalye, J. Mueller, Pervasive label errors in test sets destabilize
     machine learning benchmarks, arXiv:2103.14749 (2021).
 [9] B. Frénay, M. Verleysen, Classification in the presence of label noise: a survey, IEEE
     Transactions on Neural Networks and Learning Systems 25 (2014) 845–869.
     doi:10.1109/TNNLS.2013.2292894.
[10] IBM, Telco Customer Churn, 2018.
[11] S. Modgil, M. Caminada, Proof theories and algorithms for abstract argumentation
     frameworks, in: G. Simari, I. Rahwan (Eds.), Argumentation in Artificial Intelligence,
     Springer US, Boston, MA, 2009, pp. 105–129. doi:10.1007/978-0-387-98197-0_6.





[12] M. Acharya, A. Armaan, A. Antony, A comparison of regression models for prediction of
     graduate admissions, in: 2019 International Conference on Computational Intelligence in
     Data Science (ICCIDS), 2019, pp. 1–5.
[13] D. Wagner, D. Heider, G. Hattab, Mushroom data creation, curation, and simulation to
     support classification tasks, Scientific Reports 11 (2021). doi:10.1038/s41598-021-87602-3.



