A Global Model-Agnostic XAI method for the
Automatic Formation of an Abstract Argumentation
Framework and its Objective Evaluation
Giulia Vilone1, Luca Longo1
1 The Artificial Intelligence and Cognitive Load Research Lab,
The Applied Intelligence Research Center, School of Computer Science,
Technological University Dublin, Dublin, Ireland


Abstract
Explainable Artificial Intelligence (XAI) aims to train data-driven, machine learning (ML) models that possess both high predictive accuracy and a high degree of explainability for humans. Comprehending and explaining the inferences of a model can be seen as a defeasible reasoning process, which is expected to be non-monotonic: a conclusion, linked to a set of premises, can be withdrawn when new information becomes available. Computational argumentation, a paradigm within Artificial Intelligence (AI), focuses on modelling defeasible reasoning. This research study explored a new way to automatically form an argument-based representation of the inference process of a data-driven ML model to enhance its explainability, by employing principles and techniques from computational argumentation, including weighted attacks within its argumentation process. An experiment was conducted on five datasets to test, in an objective manner, whether the explanations of the proposed XAI method are more comprehensible than decision trees, which are considered naturally transparent. Findings demonstrate that the argument-based method can usually represent the logic of the model with fewer rules than a decision tree, but further work is required to achieve the same performance on other characteristics, such as fidelity to the model.

Keywords
Explainable artificial intelligence, Argumentation, Non-monotonic reasoning, Method evaluation, Metrics of explainability

1st International Workshop on Argumentation for eXplainable AI (ArgXAI, co-located with COMMA '22), September 12, 2022, Cardiff, UK
✉ giulia.vilone@tudublin.ie (G. Vilone)
ORCID: 0000-0002-4401-5664 (G. Vilone); 0000-0002-2718-5426 (L. Longo)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




1. Introduction
XAI, a sub-field of AI, aims to develop a unified approach to learning data-driven models that are
both highly accurate in their predictions and explainable to experts and laypeople. The explosion
in the quantity of available data and the success of ML, especially Deep Learning, have led to the
development of new models with outstanding predictive performances. However, most of these
models have complex, non-linear structures that are hard to understand and explain. Researchers
have proposed numerous XAI methods generating explanations in different formats (numerical,
rules, textual, visual or mixed) [1, 2]. The XAI methods returning rule-based explanations extract
a set of rules mimicking the inferential process of a complex ML model [3]. However, these
methods do not necessarily capture and describe the actual inferential process. They just report
the relationships between inputs and outputs as learned by the model without verifying if they
are consistent with the background knowledge of the application field or are instead based on
spurious correlations of the data. Understanding the inferential process of a model should be
seen as a non-monotonic reasoning process [4]. This requires a mechanism replicating the way
humans reason, to support them in comprehending the inferential process learnt by a model.
Argumentation is a multidisciplinary subfield of AI that studies how arguments can be presented,
supported or discarded in a defeasible reasoning process. It also investigates formal approaches
to evaluate the validity of the conclusions reached at the end of the reasoning process [5, 6].
Argumentation Theory (AT) provides the basis for implementing these processes computationally
[6] and is inspired by how humans reason. This research experiment shows that AT can be a
viable solution for building novel global model-agnostic XAI methods generating argument-based
explanations. The quality of these explanations was preliminarily tested via an objective study
based on eight quantitative metrics that assess distinct aspects of rule-based explanations, thus
providing vital insights into the inferential process of an ML model [7], and compared to another
rule-extraction XAI method generating Decision Trees (DTs), which are considered naturally
transparent [8, 3].
   The remainder of this manuscript is organised as follows. Section 2 summarises the strategies
used by scholars to generate rule-based explanations of ML models and how to assess the quality
of these explanations. Section 3 describes the design of a primary research experiment. Section
4 discusses the findings of this experiment and its limitations. Lastly, Section 5 highlights the
contribution to the existing body of knowledge and suggests future directions.


2. Related work
Rule-based explanations are a structured but still intuitive format for reporting information to
humans in a compact way. They represent the logic of a ML model as a ruleset that can be
easily read, interpreted and visualised. Therefore, scholars consider rulesets and DTs as naturally
transparent and intelligible [8, 3]. However, current rule-extraction XAI methods merely produce
a ruleset mimicking the inferential process of an underlying complex model. The rules can
also be in conflict with the expert domain knowledge, thus perplexing the users of such models.
It must be remembered that such rules aim at faithfully representing the relationships, captured by
the model during its training process, between the independent variables of the input data and
its target variable. Thus, this conflict can be an essential signal of an issue occurring during
training. Similarly, the XAI methods do not provide any tool to handle potential inconsistencies
among the extracted rules, should they arise. Thus, these rules are not suitable to support a richer
reasoning process [9]. AT provides formal approaches to model non-monotonic logic and assess
the validity of the conclusions reached by a set of arguments to be considered as acceptable
[5, 6]. Non-monotonic logic consists of a family of formal frameworks devised to capture and
represent defeasible inferences. In formal logic, a defeasible concept consists of a set of pieces
of information or arguments that can be rebutted by additional information or arguments [10].
Generally, arguments are designed by domain experts to create a knowledge-base in single or
multi-agent environments [11]. In a single-agent environment, arguments are constructed by
an autonomous reasoner, thus conflictual information tends to be minimal. In a multi-agent
environment, multiple reasoners participate in argument construction, so more conflicts among
them usually arise, enabling in practice non-monotonic reasoning [12]. Defeasible argumentation
supplies a sound formalisation for reasoning with uncertain and incomplete information from a
defeasible knowledge-base [13]. The process of defeasible argumentation frequently requires the
recursive analysis of conflicting arguments in a dialectical setting to determine which arguments
should be accepted or discarded [14]. Abstract AT (AAT) is the dominant paradigm, whereby
arguments are abstractly considered in a dialogical structure. Formal semantics are habitually
adopted to identify conflict-free sets of arguments that can subsequently support decision-making,
explanations and justification [14, 6]. Existing AAT-based frameworks have common features
[13, 15, 16]:
    • a defeasible knowledge-base in the form of interactive arguments, usually formalised with
      a first-order logical language;
    • a set of attacks that are modelled whenever two arguments are in conflict;
    • a semantics, which consists of a mechanism for conflict resolution; it implements non-
      monotonicity in practice and provides a dialectical status to the arguments.
   The integration between AT and ML is still a young field. Minimal work exists on automatically
mining arguments and attacks from data-driven ML models, and on how the interpretation of these
models can be augmented via argumentation to, in turn, improve their explainability [17, 13, 15].
In relation to this, the first issue is the automatic extraction of rules and their conflicts from
these models. The second issue is their automatic integration into an argumentation framework
that can serve as a mechanism for interpreting and explaining the inferential process of such
models without any explicit human declarative knowledge. A two-step approach for AT-ML
integration was proposed in [18]. In the first step, rules are extracted from a given dataset with
the Apriori algorithm for mining association rules. In the second step, the rules are fed into
structured argumentation approaches, such as ASPIC+ [19]. Using their argumentative inferential
procedures, new observations are classified by constructing arguments on top of these rules and
determining their justification status. Another study exploits argumentative graphs to depict the
structure of argument-based frameworks [20]. Arguments are the nodes connected by directed
edges representing attacks. The status of the arguments is provided by a label (accepted or
rejected) and is determined by using argumentation semantics [21].


3. Design
The informal research hypothesis of this study is that a ruleset extracted by an XAI method from
data-driven ML models supports the automatic formation of an argumentation framework. The
expectation is that this framework possesses a higher degree of explainability when compared
to other formats of explanations considered naturally interpretable and transparent in Computer
Science, like a DT. The difference in the degree of explainability of the two methods was tested
in an objective and quantitative manner with eight metrics that measure different aspects of a
ruleset, such as the number and length of its rules. The research hypothesis was tested by carrying
out a set of phases described in the following paragraphs and depicted in the diagram of Fig. 1.







Figure 1: High-level representation of the process to build the envisioned argument-based XAI method.


3.1. Phase 1: Dataset preparation
The first step was to select a few training datasets containing multi-dimensional data built by
domain experts, so they cannot contain data produced by an algorithm. The datasets must
not present issues that can impede the successful training of a model, such as the curse of
dimensionality or a significant portion of missing data. The labelled target variable, represented
by block YT in Fig. 1, must be categorical, ideally with more than two target classes, whereas the
independent features should be a mix of continuous and categorical predictors. In this study, the
experiment was carried out on five public datasets downloaded from Kaggle or the UCI Machine
Learning Repository (see Tab. 1). The Adult database, based on the 1994 US Census, was designed
to train ML models to predict whether a person earns over $50K per year. Avila contains
data about 800 images of a Latin copy of the Bible, called the Avila Bible, produced during
the 12th century by 12 Italian and Spanish copyists who were identified through a palaeographic
analysis of the manuscript. The model must associate each image with the copyist who drew
it. The Credit Card Default dataset was created to train ML models that predict whether Taiwanese
clients will fail to repay their credit card debts. The Hotel Bookings dataset includes booking
information for a city hotel and a resort hotel such as the booking date, length of stay, and the
number of adult and child guests, among other things. The target variable represents the final
status of the reservation, whether it was cancelled, checked-out or the client did not show up.
Online Shopper Intention records thousands of sessions on e-commerce websites. The negative
target class represents customers who did not buy anything, whilst the positive class represents
sessions that ended with a purchase.
   The datasets were preprocessed to avoid data-related issues in the model’s training process.
None of the selected datasets have missing data, so no action was required. However, the input
feature “fnlwgt” of the Adult dataset, a statistical weight measuring how many US citizens
are represented by each subject, and the Client ID of the Credit Card Default dataset had to be
discarded because they are not discriminative attributes. All the data in the independent features
were scaled into the range [0, 1], as features with very large values might dominate over the
others in the training process of the model. Then, a correlation analysis was performed on each
dataset to detect pairs of highly correlated features and discard one of the two to reduce the risk
of multicollinearity. There is no consensus on the thresholds between strong, moderate and weak
correlations. In this study, the absolute Spearman's rank correlation coefficients were grouped into
three segments: values in the range (0, 0.33) were considered weak, (0.33, 0.66) moderate, and
(0.66, 1) strong correlations. A best subset selection analysis was carried out to choose which
variable of a strongly correlated pair had to be discarded [22]. A linear regression model was built
over each combination of the independent features excluding one variable from each strongly
correlated pair. These models were then sorted in descending order according to their R2 values
and the first one was selected. The best subset selection approach was chosen for its simplicity
and because it requires little computational time and resources. Some of the chosen datasets are
unbalanced, meaning that one specific class contains more instances than the others. This disparity
can lead some learning algorithms to classify all the instances into the majority class and ignore
the minority ones. To avoid this, each dataset was split into training and validation subsets with
the stratified five-fold cross-validation technique to ensure that each class was represented with
the same proportion as in the original dataset. Furthermore, the Synthetic Minority Over-Sampling
Technique (SMOTE) [23] was applied to the training datasets to up-sample the minority classes.
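To make this pipeline concrete, the following is a minimal sketch of the preprocessing steps in Python, assuming scikit-learn and the imbalanced-learn implementation of SMOTE; the function name is illustrative and the best subset selection step is only indicated by a comment, so this is a sketch under stated assumptions rather than the authors' code.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE

def preprocess(X: pd.DataFrame, y: pd.Series):
    # Scale every independent feature into [0, 1].
    X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

    # Flag pairs with a strong absolute Spearman correlation (> 0.66).
    corr = X.corr(method="spearman").abs()
    strong_pairs = [(a, b) for i, a in enumerate(X.columns)
                    for b in X.columns[i + 1:] if corr.loc[a, b] > 0.66]
    # One feature of each strong pair would then be dropped via best
    # subset selection on linear regression R2 (omitted for brevity).

    # Stratified five-fold split; SMOTE is applied to training folds only.
    folds = []
    for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                              random_state=0).split(X, y):
        X_tr, y_tr = SMOTE(random_state=0).fit_resample(X.iloc[train_idx],
                                                        y.iloc[train_idx])
        folds.append((X_tr, y_tr, X.iloc[val_idx], y.iloc[val_idx]))
    return folds, strong_pairs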

Table 1
Properties of the five datasets selected for the experiment.

  Dataset                    Total instances   No. of input features   No. of continuous (categorical) features   No. of classes
  Adult                               48,842                      14                                       6 (8)                2
  Avila                               20,867                      10                                      10 (0)               12
  Credit Card Default                 30,000                      23                                      20 (3)                2
  Hotel Bookings                     119,385                      23                                      16 (7)                3
  Online Shopper Intention            12,330                      17                                      14 (3)                2






3.2. Phase 2: Model training
A feed-forward neural network with two fully-connected hidden layers was trained on each
dataset to fit YT. The block YM in Fig. 1 represents the predictions obtained from the trained
model (represented by block f(x)) over the evaluation dataset (test data), whose original labelled
target variable is depicted by block YE. YE is compared with YM to assess the model's prediction
accuracy. The number of hidden nodes and the values of the other hyperparameters of the model,
reported in Tab. 2, were determined with a grid search to reach the highest feasible prediction
accuracy. To avoid overfitting, the training process was stopped early when the validation accuracy
did not improve for five epochs in a row. The networks were trained five times over the five training
subsets extracted from the datasets with the five-fold cross-validation technique. The models
with the highest validation accuracy were chosen. Lastly, the irrelevant input features were
pruned by recursively removing one feature at a time, retraining the selected model and checking
whether its prediction accuracy decreased. If it did not, the feature was permanently removed.
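A minimal sketch of this training step, assuming a TensorFlow/Keras implementation; the layer sizes and hyperparameters shown are those reported for the Adult dataset in Tab. 2, while the softmax output layer, loss function and epoch cap are assumptions not stated in the text.

import tensorflow as tf

def build_and_train(X_train, y_train, X_val, y_val, n_classes):
    # Two fully-connected hidden layers, as described above.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="tanh",
                              kernel_initializer="random_uniform"),
        tf.keras.layers.Dense(16, activation="tanh",
                              kernel_initializer="random_uniform"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # Early stopping: halt when validation accuracy stalls for 5 epochs.
    stop = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                            patience=5,
                                            restore_best_weights=True)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              batch_size=128, epochs=200, callbacks=[stop], verbose=0)
    return model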

Table 2
Optimal hyperparameters of the neural networks obtained through a grid search procedure, grouped by
dataset, and their resulting accuracies.

  Model parameter         Adult       Avila       Credit Card Default   Hotel Bookings   Online Shop. Intention
  Optimizer               Adam        RMSprop     Adamax                SGD              SGD
  Weight initialisation   Uniform     He-Unif.    Normal                Lecun-Unif.      He-Unif.
  Activation function     Tanh        Relu        Softplus              Softplus         Softmax
  Dropout rate            0%          0%          10%                   0%               0%
  Batch size              128         16          16                    8                8
  Hidden neurons          16          32          32                    24               8
  Accuracy (validation)   83% (79%)   98% (91%)   68% (79%)             65% (59%)        84% (87%)



3.3. Phase 3: Formation of the explainable argumentation framework
The trained models were translated into an explainable argument-based representation which can
be easily embedded into an online interactive platform where the argumentation framework is
represented as a graph (an example can be found in [4], page 10). The process of argumentation
towards the achievement of a justifiable conclusion, as emerged from theoretical works of AT, can
be broken down into five layers [6], as depicted in Fig. 2 and detailed in the following subsections.

   Layer 1: definition of the internal structure of arguments. In standard logic, an argument
consists of a set of premises leading to a conclusion, or more formally:

Definition 3.1 (Argument). An argument Ar is a tentative inference → that links one or more
premises Pi to a conclusion C and can be written as Ar : P1 , . . . , Pn → C.








Figure 2: Five layers upon which argumentation systems are generally built, retrieved from [6].


    In this study, an argument corresponds to an IF-THEN rule, thus the premises and conclusion
of an argument correspond to the rule’s antecedents and conclusion. The ML models and the
evaluation datasets were fed into a bespoke rule-extraction method that generates a set of IF-
THEN rules by using a two-step algorithm. First, each dataset was divided into groups according
to the target class as predicted by the model. In other words, all the instances assigned by the
model to the same class were grouped together. Second, the Ordering Points To Identify the
Clustering Structure (OPTICS) [24] algorithm was exploited to further split the groups into
clusters that coincide with areas of the input space having a high density of samples. Then, each
cluster was translated into a rule by finding, for each relevant feature, the minimum and maximum
values that include all the samples in the cluster. These ranges determine the rule’s antecedents,
whereas the conclusion corresponds to the predicted class of the cluster’s samples. A typical rule
is:
                   IF m1 ≤ X1 ≤ M1 AND ... AND mN ≤ XN ≤ MN THEN Class X                     (1)
where Xi, i = 1, ..., N, are the N independent relevant features, and mi and Mi are the minimum
and maximum values w.r.t. the i-th independent feature of the samples included in the cluster.
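To illustrate this two-step algorithm, the following is a minimal sketch in Python, assuming scikit-learn's OPTICS implementation; the Rule container and the min_samples value are illustrative assumptions, not the authors' implementation.

from dataclasses import dataclass
import numpy as np
from sklearn.cluster import OPTICS

@dataclass
class Rule:
    lower: np.ndarray   # m_i: per-feature minima (antecedents)
    upper: np.ndarray   # M_i: per-feature maxima (antecedents)
    conclusion: int     # predicted class of the cluster's samples

def extract_rules(X: np.ndarray, y_model: np.ndarray) -> list[Rule]:
    rules = []
    # Step 1: group the instances by the class predicted by the model.
    for cls in np.unique(y_model):
        X_cls = X[y_model == cls]
        # Step 2: split each group into dense clusters with OPTICS.
        labels = OPTICS(min_samples=5).fit(X_cls).labels_
        for cluster in np.unique(labels):
            if cluster == -1:        # skip points labelled as noise
                continue
            members = X_cls[labels == cluster]
            # A rule's antecedents are the per-feature [min, max] ranges
            # enclosing all the samples of the cluster (Eq. 1).
            rules.append(Rule(members.min(axis=0), members.max(axis=0), cls))
    return rules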

   Layer 2: definition of the attacks between arguments. The inconsistencies between the formed
arguments were modelled via the notion of attack. Generally, attacks are binary relations between
two conflicting arguments. They can be of different kinds [6], but only the following two types
were considered in this study.
Definition 3.2 (Rebutting attack). Given two distinct arguments A, B ∈ AR, where AR represents
the set of all the arguments, with A : P1, ..., Pn → C1 and B : P1, ..., Pm → C2, A rebuts B,
denoted (A, B), if C1 logically contradicts C2. A rebutting attack is symmetrical, so if (A, B)
holds, then (B, A) holds.
Definition 3.3 (Undercutting attack). Given an argument A ∈ AR that challenges some or all of
the premises used to construct another argument B ∈ AR, A undercuts B and is denoted as (A, B)
when A claims there is a special case that does not allow the application of the inference rule
(→) of argument B.
   Attacks are usually specified by domain experts, but in this study they are automatically
extracted by identifying conflicting rules. Two rules are conflictual if they overlap and reach
different conclusions. Two rules overlap if their covers intersect. The cover of a rule corresponds
to the set of input instances whose attribute values satisfy the rule's antecedents [25].
As depicted in Fig. 3, two rules can be 1) fully overlapping, with one rule including the second
one (part a), 2) partially overlapping (part b) or 3) sharing the same cover (part c). The first case
could be seen as an undercutting attack because the internal rule represents an exception of the
external one. The remaining two cases could be equivalent to a rebutting attack as two rules start
from the same premises, at least in part, but reach different conclusions.




  (a) Undercutting attack       (b) Rebutting attack         (c) Rebutting attack
Figure 3: Relative positions of two conflicting rules that can be a) fully overlapping, with one rule
including the other, b) partially overlapping or c) covering the same area of the input space (retrieved
from [4]).
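Following the mapping of Fig. 3, the sketch below shows how such conflicts could be detected and typed automatically, assuming the interval-based Rule container from the Layer 1 sketch; the function names are illustrative.

import numpy as np

def covers_intersect(a, b) -> bool:
    # Two hyper-rectangles intersect iff their ranges overlap on every feature.
    return bool(np.all((a.lower <= b.upper) & (b.lower <= a.upper)))

def contains(a, b) -> bool:
    # Rule a fully contains rule b on every feature.
    return bool(np.all((a.lower <= b.lower) & (a.upper >= b.upper)))

def attack_type(a, b):
    # No conflict without overlapping covers and different conclusions.
    if a.conclusion == b.conclusion or not covers_intersect(a, b):
        return None
    if contains(a, b) or contains(b, a):
        return "undercut"   # the inner rule is an exception to the outer (Fig. 3a)
    return "rebut"          # partial or identical overlap (Fig. 3b, c)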


   Layer 3: evaluation and definition of valid attacks. Once arguments and attacks are embodied in
a dialogical structure (the formalised knowledge-base), a fundamental characteristic of argument-
based systems is their ability to determine the success of an attack. Different approaches can be
found in the literature to decide if an attack is successful, thus valid, including a) binary attacks,
b) strengths of arguments, and c) strengths of attacks [6]. In this study, a weighted notion of
attack is considered; weights represent the strength of the attacks. There are various ways to
compute these weights [26]. Here, they are computed as the percentage of instances belonging to
the intersection of the covers of two conflictual rules that are assigned by the model to the same
target class of the conclusion of the attacking rule:
             w(A,B) = |{x ∈ cover(A) ∩ cover(B) : f(x) = CA}| / |{x ∈ cover(A) ∩ cover(B)}|          (2)
where x represents an input instance of the training dataset, CA is the conclusion of the attacking
rule (argument) A, and | • | is the cardinality function. For example, two conflicting rules have
respectively target classes Q and S as conclusion and there are in their cover intersection 20
instances classified by the model in class Q and 30 in class S. In this case, the attack from the
second rule with conclusion S is stronger than the attack from the first rule and has a weight
equal to 30/50. The weight of the reciprocal attack is 20/50. It might happen that the difference in the
number of instances per class is small, like 20 versus 21. In this case, is it fair to say that the rule
with conclusion S is actually stronger than the other rule? As a consequence, the concept of
inconsistency budget [26] was used to set a threshold on the fraction of supporting instances
of the attacking rules. In this study, it was set equal to 0.55, meaning that an attack must be
supported by at least 55% of the samples in the cover intersection. Future work will involve a
study to fine-tune it. It is important to underline that all the arguments formed in layer one have
the same importance and the notion of weight of argument is not used in this study. Not all the
arguments are activated by each training instance since not all their premises might be satisfied.
The activated portion of the knowledge-base is considered for the next computations.
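A minimal sketch of the weighted attacks of Eq. 2 and the inconsistency-budget filter, again assuming the Rule container sketched for Layer 1; cover_mask and the pairwise loop are illustrative choices, not the authors' code.

import numpy as np

def cover_mask(rule, X):
    # Instances whose attribute values satisfy all the rule's antecedents.
    return np.all((X >= rule.lower) & (X <= rule.upper), axis=1)

def attack_weight(attacker, attacked, X, y_model) -> float:
    inter = cover_mask(attacker, X) & cover_mask(attacked, X)
    if inter.sum() == 0:
        return 0.0
    # Share of instances in the cover intersection that the model assigns
    # to the conclusion of the attacking rule (Eq. 2).
    return float((y_model[inter] == attacker.conclusion).mean())

def valid_attacks(rules, X, y_model, budget=0.55):
    attacks = []
    for i, a in enumerate(rules):
        for j, b in enumerate(rules):
            if i != j and a.conclusion != b.conclusion:
                w = attack_weight(a, b, X, y_model)
                if w >= budget:   # keep only sufficiently supported attacks
                    attacks.append((i, j, w))
    return attacks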

   Layer 4: definition of the dialectal status of arguments. Dung-style acceptability semantics
investigate the inconsistencies that might emerge from the interaction of arguments [14]. Given a
set of arguments where some attack others, it must be decided which arguments can be accepted.
In Dung’s theory, the internal structure of arguments is not considered. This leads to an abstract
argumentation framework (AAF) which is a finite set of arguments and attacks. In Dung’s terms,
usually, an argument defeats another argument if and only if it represents a reason against the
second argument. It is also essential to assess whether the defeaters are defeated themselves to
determine the acceptability status of an argument. This is known as acceptability semantics:
given an AAF, it specifies zero or more conflict-free sets of acceptable arguments. However,
other semantics have been proposed in the literature, not necessarily based on the notion of
acceptability, such as the ranking-based categoriser semantics, introduced in [27] and employed in
this experiment, which consists of a recursive function that rank-orders a set of arguments from
the most to the least acceptable. The rank of an argument is inversely proportional to the number
of its attackers and the rank of the attacking arguments. This semantics deems acceptable the
argument(s) with the lowest number of attacks and/or attacks coming from the weakest arguments.
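A minimal sketch of the categoriser ranking [27], computed here by fixed-point iteration over the attack graph; treating the validated attacks as unweighted edges is an assumption based on the description above, since the text uses weights only to validate attacks.

def categoriser(n_args: int, attacks: list[tuple[int, int]],
                iters: int = 100) -> list[float]:
    # Index the attackers of each argument from the (source, target) pairs.
    attackers = {i: [] for i in range(n_args)}
    for src, tgt in attacks:
        attackers[tgt].append(src)
    c = [1.0] * n_args   # unattacked arguments keep strength 1.0
    for _ in range(iters):
        # c(a) = 1 / (1 + sum of the strengths of the attackers of a)
        c = [1.0 / (1.0 + sum(c[b] for b in attackers[a]))
             for a in range(n_args)]
    return c             # higher strength means higher rank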

   Layer 5: Accrual of acceptable arguments. The previous layer produces a rank of activated
arguments, and a final conclusion should be brought forward as the most rational conclusion
associable to a single input instance. The highest-ranked argument is selected as the most
representative, and its conclusion is deemed the most rational. In the case of ties (multiple
arguments with the highest rank), these are grouped into sets according to the conclusion they
support. The set with the highest cardinality is deemed the most representative of an input record
of the dataset, and the conclusion supported by its argument(s) is deemed the most rational. In
the case of ties with respect to cardinality, the input case is treated as undecided, as not enough
information is available to associate a possible conclusion.
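A minimal sketch of this accrual step; the argument strengths are assumed to come from the categoriser ranking of layer 4, and the "undecided" label is merely illustrative.

from collections import Counter

def final_conclusion(activated, strengths, conclusions):
    # activated: indices of the arguments whose premises the instance satisfies.
    if not activated:
        return "undecided"   # no information to associate a conclusion
    top = max(strengths[i] for i in activated)
    best = [i for i in activated if strengths[i] == top]
    # Group the tied top-ranked arguments by the conclusion they support.
    votes = Counter(conclusions[i] for i in best).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return "undecided"   # tie in cardinality: leave the case undecided
    return votes[0][0]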

3.4. Phase 4: Objective evaluation analysis
The evaluation of the degree of explainability of the two XAI methods, the argument-based one
developed in this study and the DT created with the C4.5 learning algorithm, followed the same
process proposed in [7]. Eight metrics were selected to assess, objectively and quantitatively, the
degree of explainability of their rulesets (see Tab. 3). The objectivity is achieved by excluding
any human intervention in this evaluation process. Two metrics, number of rules and average rule
length, measure the syntactic simplicity of the rules and should be minimised as short rulesets are
deemed more interpretable [25]. Fraction of classes and fraction of overlap capture the clarity
and coherence of the extracted rules. Whilst the fraction of overlap should be minimised to avoid
conflicts between the rules, the fraction of classes should be maximised to guarantee that all the
target classes, even the minor ones, are considered. A ruleset must also be complete, correct,
faithful to the model's predictions, and robust to small perturbations of the inputs. To ensure that
the C4.5 algorithm returned the most compact and accurate DT, a grid search was carried out on
the following hyperparameters: 1) the criterion function to measure the quality of a split (Gini,
Entropy, Log-Loss), 2) the maximum depth of the DT (from 6 to 48), and the minimum number
of instances required to 3) split an internal node (from 2 to 16) and 4) be at a leaf node (from 1 to
8).

Table 3
Objective metrics to assess the explainability of rulesets.

  Factor                Definition                                                               Formula
  Completeness          Ratio of input instances covered by the rules (c) over the               c / N
                        total number of input instances (N)
  Correctness           Ratio of input instances correctly classified by the rules (r)           r / N
                        over the total number of input instances
  Fidelity              Ratio of input instances on which the predictions of the model           f / N
                        and of the rules agree (f) over the total number of instances
  Robustness            The persistence of a method to withstand small perturbations             (Σ_{n=1..N} f(x_n) − f(x_n + δ)) / N
                        of the input (δ) that do not change the prediction of the
                        model (f(x_n))
  Number of rules       The cardinality of the ruleset (A) generated by the two XAI              |A|
                        methods under analysis
  Average rule length   The average number of antecedents, connected with the AND                (Σ_{i=1..R} a_i) / R
                        operator, of the rules; a_i is the number of antecedents of the
                        i-th rule and R = |A| is the number of rules
  Fraction of classes   Fraction of the output class labels in the data predicted by at          (1/|C|) Σ_{c′∈C} 1(∃ r = (s, c) ∈ R : c = c′)
                        least one rule in a ruleset R; a rule r is a tuple (s, c), where
                        s is the set of antecedents, c is a class label and |C| is the
                        number of class labels
  Fraction overlap      The extent of overlap between every pair of rules; given two             (2 / (R(R−1))) Σ_{r_i, r_j, i<j} overlap(r_i, r_j) / N
                        rules r_i and r_j, overlap is the set of instances that satisfy
                        the conditions of both rules
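As an illustration, the following is a minimal sketch of how the first three metrics of Tab. 3 could be computed, assuming per-instance ruleset predictions and the cover_mask helper sketched in Section 3.3; this is not the authors' evaluation code.

import numpy as np

def cover_mask(rule, X):
    # Repeated from the Layer 3 sketch: instances satisfying all antecedents.
    return np.all((X >= rule.lower) & (X <= rule.upper), axis=1)

def completeness(rules, X):
    covered = np.zeros(len(X), dtype=bool)
    for r in rules:
        covered |= cover_mask(r, X)
    return covered.mean()                       # c / N

def correctness(rule_preds, y_true):
    # rule_preds: per-instance conclusion of the ruleset (None if uncovered).
    hits = sum(p == t for p, t in zip(rule_preds, y_true) if p is not None)
    return hits / len(y_true)                   # r / N

def fidelity(rule_preds, y_model):
    agree = sum(p == m for p, m in zip(rule_preds, y_model) if p is not None)
    return agree / len(y_model)                 # f / N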




4. Results and discussion
The values of the metrics calculated over the two rulesets extracted with the C4.5 learning
algorithm and the proposed argument-based XAI method are summarised in Tab. 4. Both methods
generate complete rulesets, meaning that they cover the entire input space and all the output
classes. The only exception occurs in the Avila dataset where the ruleset of the argument-based
method does not consider one of the 12 output classes. This is due to the presence of several
attacks, some of which have high weights, towards the rules having this class in their conclusions.
Modifying the inconsistency budget might fix this issue. The C4.5 method scores higher in terms
of correctness, fidelity and robustness throughout the five datasets. It can also be considered
the most coherent method as its rulesets reach completeness without overlapping areas. On the
other hand, it generated rulesets that contain more and longer rules than the argument-based
method, with only one exception, the Online Shopper Intention dataset, where the argument-based
method extracted more rules than the C4.5. However, these rules contained fewer antecedents, on
average. In three of the datasets (Adult, Avila, and Hotel Bookings), the C4.5 returns thousands
of rules whereas the argument-based method never exceeds 500 rules. Such large numbers of
rules would hinder the explainability of these rulesets, as humans struggle with reading and
retaining such a big amount of information. Overall, the argument-based XAI method generates
simpler rulesets that are potentially more comprehensible than the C4.5 DTs, but a way to
fine-tune the inconsistency budget is needed to reach the optimal argumentation framework.

Table 4
Quantitative measures of the degree of explainability of the rulesets automatically generated by a novel
argument-based XAI method and the C4.5 decision tree learning algorithm over five datasets.

  Metric                Adult    Avila    Credit Card Default   Hotel Bookings   Online Shopper Intention
                                     Argument-based XAI method
  Completeness          1.0      1.0      1.0                   1.0              1.0
  Correctness           0.7      0.55     0.65                  0.64             0.89
  Fidelity              0.81     0.52     0.8                   0.52             1.0
  Robustness            0.04     0.01     0.16                  0.13             0.55
  Number of rules       294      139      491                   151              108
  Average rule length   11.8     8.99     7.0                   32.27            2.0
  Fraction of overlap   0.9      0.64     0.87                  0.99             0.15
  Fraction of classes   1.0      0.92     1.0                   1.0              1.0
                                          C4.5 decision tree
  Completeness          1.0      1.0      1.0                   1.0              1.0
  Correctness           0.81     0.12     0.6                   0.71             0.89
  Fidelity              0.99     0.59     0.99                  0.97             1.0
  Robustness            0.99     0.60     0.95                  0.96             0.98
  Number of rules       4064     1614     686                   6041             52
  Average rule length   13.61    11.41    12.1                  15.8             6.25
  Fraction of overlap   0.0      0.0      0.0                   0.0              0.0
  Fraction of classes   1.0      1.0      1.0                   1.0              1.0




5. Conclusions
This study presented a novel XAI method to form an argumentation framework with weighted
attacks representing the inferential process of complex data-driven ML models. These models
were trained on five datasets with features manually engineered by humans. Eight quantitative
and objective metrics were used to assess the degree of explainability of the rulesets extracted
by the proposed XAI method and a DT, used as a baseline. The results suggested the presence
of a trade-off between completeness, number of rules and average rule length, which measure
the syntactic simplicity of the rulesets, and the other five metrics. The C4.5 algorithm usually
generates larger rulesets, but it is more correct and faithful to the model than the argument-based
method. In conclusion, the proposed XAI method returns rulesets that are complete, simpler and
smaller in terms of rule cardinality and length, thus more comprehensible. However, they are not
as faithful to the model, correct and robust as the C4.5 DTs. Future work will extend this research
study by training deeper neural networks, employing datasets with additional types of input data,
like texts and images, fine-tuning the inconsistency budget between weighted attacks to obtain
the optimal set of arguments and attacks, and using semantics designed for handling weighted
argumentation frameworks. The evaluation of the argumentation frameworks will include a
human-centred study, as done in [4], to compare the outcome of the objective metrics with users’
perception of their explainability.


References
 [1] L. Longo, R. Goebel, F. Lecue, P. Kieseberg, A. Holzinger, Explainable artificial intelligence:
     Concepts, applications, research challenges and visions, in: International Cross-Domain
     Conference for Machine Learning and Knowledge Extraction, Springer, 2020, pp. 1–16.
 [2] G. Vilone, L. Longo, Classification of explainable artificial intelligence methods through
     their output formats, Machine Learning and Knowledge Extraction 3 (2021) 615–661.
 [3] F. K. Došilović, M. Brčić, N. Hlupić, Explainable artificial intelligence: A survey, in: 41st
     International Convention on Information and Communication Technology, Electronics and
     Microelectronics (MIPRO), IEEE, 2018, pp. 0210–0215.
 [4] G. Vilone, L. Longo, A novel human-centred evaluation approach and an argument-based
     method for explainable artificial intelligence, in: IFIP International Conference on Artificial
     Intelligence Applications and Innovations, Springer, 2022, pp. 447–460.
 [5] D. Bryant, P. Krause, A review of current defeasible reasoning implementations, The
     Knowledge Engineering Review 23 (2008) 227–260.
 [6] L. Longo, Argumentation for knowledge representation, conflict resolution, defeasible
     inference and its integration with machine learning, in: Machine Learning for Health
     Informatics, Springer, 2016, pp. 183–208.
 [7] G. Vilone, L. Longo, A quantitative evaluation of global, rule-based explanations of
     post-hoc, model agnostic methods, Frontiers in artificial intelligence 4 (2021).
 [8] H. K. Dam, T. Tran, A. Ghose, Explainable software analytics, in: Proceedings of the
     40th International Conference on Software Engineering: New Ideas and Emerging Results,
     ACM, 2018, pp. 53–56.
 [9] Z. C. Lipton, The mythos of model interpretability, Commun. ACM 61 (2018) 36–43.
[10] L. Longo, Formalising human mental workload as a defeasible computational concept, The
     University of Dublin, Trinity College, 2014.
[11] L. Rizzo, L. Longo, An empirical evaluation of the inferential capacity of defeasible
     argumentation, non-monotonic fuzzy reasoning and expert systems, Expert Systems with
     Applications 147 (2020) 113220.
[12] L. Longo, L. Rizzo, P. Dondio, Examining the modelling capabilities of defeasible argu-
     mentation and non-monotonic fuzzy reasoning, Knowledge-Based Systems 211 (2021)
     106514.
[13] S. A. Gómez, C. I. Chesnevar, Integrating defeasible argumentation and machine learning
     techniques: A preliminary report, in: Procs. of the V Workshop of Researchers in Comp.
     Science, 2003, pp. 320–324.
[14] P. M. Dung, On the acceptability of arguments and its fundamental role in nonmonotonic
     reasoning, logic programming and n-person games, Artificial intelligence 77 (1995) 321–
     357.





[15] S. Modgil, F. Toni, F. Bex, I. Bratko, C. I. Chesnevar, W. Dvořák, M. A. Falappa, X. Fan,
     S. A. Gaggl, A. J. García, et al., The added value of argumentation, in: Agreement
     technologies, Springer, 2013, pp. 357–403.
[16] S. A. Gómez, C. I. Chesnevar, Integrating defeasible argumentation with fuzzy art neural
     networks for pattern classification, Journal of Computer Science & Technology 4 (2004)
     45–51.
[17] O. Cocarascu, F. Toni, Argumentation for machine learning: A survey., in: COMMA, 2016,
     pp. 219–230.
[18] M. Thimm, K. Kersting, Towards argumentation-based classification, in: Logical Founda-
     tions of Uncertainty and Machine Learning, IJCAI Workshop, volume 17, 2017.
[19] S. Modgil, H. Prakken, The aspic+ framework for structured argumentation: a tutorial,
     Argument & Computation 5 (2014) 31–62.
[20] R. Riveret, G. Governatori, On learning attacks in probabilistic abstract argumentation, in:
     Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent
     Systems, 2016, pp. 653–661.
[21] P. Baroni, M. Caminada, M. Giacomin, An introduction to argumentation semantics, The
     knowledge engineering review 26 (2011) 365–410.
[22] R. R. Hocking, R. Leslie, Selection of the best subset in regression analysis, Technometrics
     9 (1967) 531–540.
[23] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: synthetic minority
     over-sampling technique, Journal of artificial intelligence research 16 (2002) 321–357.
[24] H.-P. Kriegel, P. Kröger, J. Sander, A. Zimek, Density-based clustering, Wiley interdisci-
     plinary reviews: data mining and knowledge discovery 1 (2011) 231–240.
[25] H. Lakkaraju, S. H. Bach, J. Leskovec, Interpretable decision sets: A joint framework
     for description and prediction, in: Proceedings of the 22nd ACM SIGKDD international
     conference on knowledge discovery and data mining, ACM, 2016, pp. 1675–1684.
[26] P. E. Dunne, A. Hunter, P. McBurney, S. Parsons, M. Wooldridge, Weighted argument
     systems: Basic definitions, algorithms, and complexity results, Artificial Intelligence 175
     (2011) 457–486.
[27] P. Besnard, A. Hunter, A logic-based theory of deductive arguments, Artificial Intelligence
     128 (2001) 203–235.



