Argumentation-based Explainable Machine Learning (ArgEML): a Real-life Use Case on Gynecological Cancer

Nicoletta Prentzas¹, Athena Gavrielidou, Marios Neophytou and Antonis Kakas
University of Cyprus, 1 Panepistimiou Avenue, Nicosia, 2109, Cyprus
¹ Corresponding author. E-mail: nicolep@ucy.ac.cy.

Abstract
This paper studies the application of a general methodology for the synthesis of Learning with Explainable Argumentation (ArgEML) to a particular real-life learning problem, with the aim of validating the approach and providing feedback for its further development. The problem is that of learning to prognose, from a real-life image of a gynecological tumor, whether the tumor is benign or malignant. This dataset has already been analyzed and studied using various methods. Our goal is to synthesize and integrate these lower-level statistical and sub-symbolic methods with a symbolic and explainable layer of argumentation. The purpose is not so much to improve on the accuracy of these previous efforts but rather to validate the argumentation approach to ML and to possibly learn from this example how to further automate the search for learning argumentation theories from real-life data. The application of the ArgEML approach was carried out in a semi-automated manner using the Gorgias argumentation framework and the Gorgias Cloud system. We show how, using the natural explanations for the predictions (definite or plausible) of the learned argumentation theory, we can separate the problem space into groups, showing in each such group the basic argumentative tension between arguments for and against the alternatives.

Keywords
Argumentation, Explainable Machine Learning, Explainable AI

ArgML'22: 1st International Workshop on Argumentation & Machine Learning, September 13, 2022, Cardiff, Wales, UK. © 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction
Argumentation is a naturally suitable target language for the representation of Machine Learning (ML) models. It offers flexible coverage and prediction notions that are appropriate in the context of learning, where the data from which we are learning may be incomplete, may appear to be inconsistent, or may simply be inadequate to reveal the full process or theory generating the data. This suitability of argumentation as an umbrella framework in which learning can occur has been highlighted recently in [1, 2], where the emphasis is shifted away from achieving optimal predictive accuracy towards satisfactory or confident accuracy together with the recognition of difficult dilemma cases or sub-domains of the problem where a definite prediction cannot be safely taken. Rather, in these cases, the learned theory provides explanations that support the possible alternatives, thus helping a subsequent process that is to utilize the learned theory to take a more informed decision. Explanations not only give enhanced meaning to the learned theory but can also be used during the learning process to guide it, e.g. by focusing on the more relevant features for cases that are ambiguous under the current state of the learned theory.
In this paper we present an Argumentation-based Explainable Machine Learning (ArgEML) framework and its application to real-life imaging data on Gynecological Cancer. The ArgEML approach relies on a strong coupling of Learning with Reasoning within a framework of structured argumentation. In this work, we will be using the Gorgias argumentation framework [3], but most of the conceptual elements of the approach can be applied using other structured argumentation frameworks. We show how the ArgEML approach can help us understand the learning problem space by partitioning this into sub-spaces, each of which is classified by its own argumentation framework and argumentative explanations for the prediction.
Our work follows the same motivation as that of several other studies in the literature that explore how to integrate machine learning and argumentative reasoning. A review of these studies up to 2020 can be found in [1], while [4–8] and references therein reflect more recent efforts in this area. All these aim to exploit the flexibility of argumentation and its natural connection to explanation in order to enhance the expressibility and interpretability of a learned function.
The rest of the paper is organized as follows. In Section 2 we provide background information about (1) the real-life imaging dataset we will be learning from and (2) the Gorgias argumentation framework we will be using. In Sections 3 and 4 we present the general elements of the ArgEML approach and its application to the real-life dataset. Then in Section 5 we present an analysis of the problem space based on the explanations for prediction that can be drawn from the learned argumentation theory, and how this can help in understanding the structure of the problem space in terms of its possible subclasses. Finally, Section 6 concludes and discusses future work.

2. Background Information
We briefly describe the dataset for endometrial cancer detection taken from [9]. Then in Section 2.2 we review the basic concepts and terminology of the Gorgias argumentation framework that are relevant for the learning process that we will be using in this paper.

2.1. From Imaging Data to Prognosis
In previous work a hysteroscopy Computer-Aided Diagnostic system (CADs) was developed for the early detection of endometrial cancer [9–11]. Regions of Interest (ROIs) were extracted from hysteroscopic images of (1) patients with postmenopausal uterine bleeding and/or suspected endometrial lesions and (2) patients with normal endometrium. The ROIs were equally distributed among normal and abnormal cases. The CADs supported texture feature extraction from the ROIs in different color systems. A total of 26 texture features were extracted from each color component, using three texture feature algorithms: (i) Statistical Features (SF), (ii) Spatial Gray Level Dependence Matrices (SGLDM), and (iii) Gray Level Difference Statistics (GLDS).
Our work builds on a combination of SF+SGLDM+GLDS features from the endometrial cancer detection dataset², as shown in Table 1. The dataset consists of 445 records: 209 (47%) correspond to normal cases (benign) and 236 (53%) to abnormal cases (malignant). A tumor is classified as 0 (Malignant) or 1 (Benign).

Table 1: Dataset Features
Algorithm | Texture Feature | Feature Name | Feature Code
SGLDM (Spatial gray-level dependence matrices) | Homogeneity | sgldm_homog | Feature_0
SGLDM | Entropy | sgldm_entr | Feature_1
SF (Statistical features) | Energy | fos_ener | Feature_2
SF | Entropy | fos_ent | Feature_3
GLDS (Gray-level difference statistics) | Homogeneity | gldm_hom | Feature_4
GLDS | Contrast | gldm_con | Feature_5
GLDS | Energy | gldm_eng | Feature_6
GLDS | Entropy | gldm_ent | Feature_7
GLDS | Mean | gldm_mean | Feature_8

² The dataset is available upon request from the authors.
2.2. Gorgias Argumentation Framework
Gorgias³ is a structured argumentation framework where arguments are constructed using a basic (content-independent) scheme of argument rules. Two types of argument rules are constructed within a Gorgias argumentation theory: object-level arguments and priority arguments expressing a preference, or relative strength, between other arguments. The dialectic argumentation process of Gorgias to determine the acceptability (admissibility) of an argument supporting a desirable claim or conclusion typically occurs between composite arguments, where priority arguments are included alongside object-level arguments in order to strengthen (against counter-arguments) the arguments currently committed to.
In general, argument rules are named associations between a set of premises and a claim or position that these premises support via the argument rule. They have the general form of "Argument_Name: Premises ► Claim", where Premises is a set of literal (i.e. positive or negative atomic statement) conditions and Claim is a single literal. They can be chained together to form a support for a desired claim. In their concrete form within the Gorgias system, argument rules are expressed using the syntax of Extended Logic Programming, where an argument rule has the following parametric syntactic form⁴:

rule(Argument_Name, Claim, Defeasible_Premises) :- Non_Defeasible_Premises.     (1)

Argument_Name can be any Prolog term with which we parametrically name arguments expressed by this rule. Claim is a positive or negative atomic formula (negation in the Gorgias system is written by wrapping the positive atom with "neg(.)"). Defeasible_Premises and Non_Defeasible_Premises are conjunctions of positive or negative atomic formulae: the former are executed under Gorgias while the latter directly under Prolog. In the context of learning, the non-defeasible conditions of argument rules are built from the concrete information that we have on the features of our dataset cases. The defeasible conditions offer the opportunity to use conditions for which we do not have complete information or even to invent new conditional predicates (we will not be concerned with the latter in this paper).
Example: The Gorgias code below shows two object-level argument rules (i.e. r1(), r2()) for and against buying an object, with priority argument rules (i.e. pr1(), pr2()) between the object-level rules depending on whether we are low on funds.

rule(r1(X), buy(X), []) :- need(X).
rule(r2(X), neg(buy(X)), []) :- urgency(X, no).
rule(pr1(X), prefer(r1(X), r2(X)), []).
rule(pr2(X), prefer(r2(X), r1(X)), []) :- level_of_funds(low).

The combination of object-level arguments together with the contextual priority arguments results in a theory that captures the policy of "Normally, we buy something that we need even if this is not urgently needed. But when we are low on funds we may not buy something for which there is no urgency.". In a learning context we would have an underlying process that generates, according to this policy, data points by observing whether an object is bought or not in different scenarios described by the three features need(.), urgency(.,.) and level_of_funds(.). The task is then to learn or reconstruct the above Gorgias theory (or an equivalent form of it).
The coverage and prediction notions for the argumentation-based approach to learning will be built using the standard argumentation reasoning within a structured argumentation framework like that of Gorgias.
This depends on the central notion of an acceptable coalition of arguments, which in the case of the Gorgias framework relates to a (minimal) composite argument that is admissible. As in the standard definition of admissibility [12], a composite argument is admissible iff it is conflict-free and it attacks back all other composite arguments that attack it.

³ The Gorgias argumentation framework was introduced in [13] and extended in [14]. The GORGIAS system was developed in 2003 and has since been used by several research groups for a variety of real-life applications [3]. Today it is publicly available through Gorgias Cloud at https://aiasvm1.amcl.tuc.gr:8087/.
⁴ In this paper, we will be using the cumbersome internal code syntax of the Gorgias system to present examples. This will help the interested reader to reproduce the learned results and/or apply the learning process to their own learning problems using the open Gorgias Cloud system.

We can then define plausible and definite conclusions or predictions according to whether there exists an admissible composite argument that supports the conclusion of interest, in which case we say the conclusion is plausible or possible. If in addition there exists no admissible composite argument that supports any other conclusion that is in conflict with the conclusion of interest, then we say that this is a definite conclusion. Note that it is possible for a conclusion and some other conflicting conclusion to both be plausible conclusions from the same argumentation theory, in which case we say the theory is (locally) ambiguous and the conclusion forms a dilemma within the theory.
The above definition of admissibility of composite arguments hinges on the definition of attacks between composite arguments. Informally, a composite argument, D1, attacks another one, D2, iff they are in conflict and the arguments in D1 are rendered, by the priority arguments that it contains, at least as strong as the arguments contained in D2. The exact technical details of this central notion can be found in the associated references [13, 14]. What is important to note is that attacks can occur at two levels: (1) the object level, based on a conflict between statements in the application language, or (2) a (hierarchy of) priority level(s), where the conflict between the two composite arguments refers to a preference between two arguments at a lower level. Accordingly, to build an admissible composite argument we consider attacks at the object level and then include priority arguments to strengthen its object rules against the attacking ones.
To illustrate this, consider in the above example an object, obj1, for which need(obj1), urgency(obj1,no) and level_of_funds(low) all hold true, and let us ask the Gorgias query of buy(obj1). This is supported by the simple argument arg1 = [r1(obj1)], but this is not admissible as it is attacked by arg2 = [r2(obj1), pr2(obj1)], which arg1 does not attack back. To do so we can extend arg1 to form the composite argument arg1' = [r1(obj1), pr1(obj1)]. Both arg1' and arg2 are then admissible, indicating that the case of obj1 is a dilemma of the theory, having reasons both to buy it and not to buy it. The ambiguity in the policy represented by this theory, "But when we are low on funds we may not buy something for which there is no urgency.", is reflected by the existence of such dilemma cases where the theory cannot make a definite prediction. Indeed, in a learning context the data produced by this policy will contain the ambiguity, and it is thus natural for a theory learned from this data to reflect this ambiguity as a reasoned dilemma rather than insist on making a definite prediction for these cases.
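To make the Gorgias syntax of this example concrete, the policy and the scenario for obj1 can be written as a small theory file in the same rule/3 notation used throughout this paper and queried for both alternatives. The sketch below is ours: the facts for obj1 simply encode the scenario described above, while the prove/2 query predicate (returning an admissible composite argument) is our assumption about the interface of the Gorgias system; the exact call may differ in Gorgias Cloud.

:- dynamic need/1, urgency/2, level_of_funds/1.

% Scenario for obj1: needed, not urgent, and we are low on funds.
need(obj1).
urgency(obj1, no).
level_of_funds(low).

% Object-level arguments for and against buying (r1, r2 above).
rule(r1(X), buy(X), []) :- need(X).
rule(r2(X), neg(buy(X)), []) :- urgency(X, no).

% Priority arguments (pr1, pr2 above).
rule(pr1(X), prefer(r1(X), r2(X)), []).
rule(pr2(X), prefer(r2(X), r1(X)), []) :- level_of_funds(low).

% Assumed query interface of the Gorgias system:
%   ?- prove([buy(obj1)], Delta1).       % expected to succeed with Delta1 containing r1(obj1) and pr1(obj1), i.e. arg1'
%   ?- prove([neg(buy(obj1))], Delta2).  % expected to succeed with Delta2 containing r2(obj1) and pr2(obj1), i.e. arg2
% Both queries succeeding is the dilemma for obj1 discussed above.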
3. ArgEML Framework and Methodology
The argumentation-based framework for Explainable Machine Learning (ArgEML) is based on a novel approach to ML that integrates sub-symbolic methods with logical methods of argumentation to provide explainable solutions to learning problems. The goal is to learn argumentation theories from data, using statistical learning techniques to uncover significant features for developing argumentation theories, and to represent knowledge as contextual hierarchies within a preference-based structured argumentation framework. In the following subsections we present a conceptual description of the ArgEML approach and a high-level description of its learning process.

3.1. ArgEML approach (conceptual description)
Our ArgEML approach is based on acknowledging the predictive accuracy difficulties in real-life learning problems and the importance of explanations as a means of understanding the reasoning behind a prediction and providing the domain expert with a tool to take more informed decisions. The approach views the notion of prediction from a different perspective than that of a traditional ML model, by relaxing the requirement of accuracy and introducing the notions of definite prediction and ambiguity. In this perspective, if we cannot uniquely predict, but can focus the prediction and give justifications for the alternatives, we have a valuable output of learning. Utilizing argumentation as a framework for explainable decision making, we aim at learning contextual hierarchies, starting from general and simple statements to more specific ones, and structuring these using priorities between them. The learning process is not driven only by strict accuracy but searches for solutions that are sufficiently good in terms of accuracy, compensating with the high level of explainability of the learned theory. This concept of a sufficiently good but explainable solution motivates a set of metrics that govern the learning process. These are defined and explained in Section 3.1.1.
The ArgEML method consists of a high-level iterative learning process that follows a set of semi-automated steps as presented in Figure 1. The first step initiates the learning process by (1) deciding the language of the problem and (2) defining the basic contexts of the problem domain in terms of object-level arguments. The iterative process starts from an interim evaluation of the initial theory and repeats steps (3) mitigate errors and/or (4) reduce dilemmas until the evaluation results in no further improvement of the learned theory or exit criteria are met. The ArgEML methodology steps are further explained in Section 3.2.

Figure 1: ArgEML conceptual description.

3.1.1. Learning metrics
Learning metrics are defined in terms of the number of observations or data points N in the dataset D that we are learning from, using the equations in Table 2.

Table 2: Learning Metrics - Equations
Metric | Equation
Coverage | Coverage(D, arg_i) ← Objs_i / N     (2)
Total Coverage | TotalCoverage(D, arg_theory) ← ⋃_{i=1..m} Coverage(D, arg_i)     (3)
Definite Accuracy | Accuracy(D, arg_theory) ← Objs_acc / N     (4)
Definite Errors | Errors(D, arg_theory) ← Objs_err / N     (5)
Ambiguity | Ambiguity(D, arg_theory) ← Objs_amb / N     (6)
• Coverage. The coverage of an argument arg_i: Premises_i ► Claim is given by the number of observations Objs_i in a dataset D for which Premises_i is true, as a fraction of N (equation (2) in Table 2). The total coverage metric for an argumentation theory arg_theory with m arguments is defined as in equation (3) in Table 2.
• Definite Prediction. This metric is related to the predictive accuracy that we normally have in an ML model, but in the ArgEML approach it only applies to the observations for which the theory provides a definite prediction (see Table 2).
  o Accuracy or Definite Accuracy: defined as the percentage of observations Objs_acc in a dataset D for which an argumentation theory arg_theory provides a definite prediction and the prediction matches the actual target value (equation (4) in Table 2).
  o Errors or Definite Errors: defined as the percentage of observations Objs_err in a dataset D for which an argumentation theory arg_theory provides a definite prediction but the prediction does not match the target value (equation (5) in Table 2).
• Ambiguity. Ambiguity measures the percentage of observations Objs_amb in a dataset D for which an argumentation theory arg_theory provides only plausible (conflicting) predictions (equation (6) in Table 2).
• Compactness. This metric relates to the explanation complexity and aims to capture a form of simplicity. It can be defined in a number of ways: in relation to the argumentation theory, suggesting a compact (small) number of arguments, or, in relation to an individual argument, indicating low complexity of its premises (small number of conditions). Compact Coverage is one of the major metrics of the ArgEML approach: it combines the metric of total coverage and the notion of compactness, suggesting a compact argumentation theory with high total coverage.
Given this set of metrics, a solution (theory) can be evaluated using a combination of properties, not simply based on optimal prediction. Hence, a solution can be "sufficiently good" if it provides compact coverage and acceptable levels of definite accuracy (or definite errors) and ambiguity with (useful) justifications (explanations), depending on how hard the problem is.
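As an illustration of how these metrics relate to the argumentation semantics of Section 2.2, the following Prolog sketch (ours, not part of the ArgEML implementation) computes definite accuracy, definite errors and ambiguity over a set of labelled cases for a two-class problem such as benign/1 versus malignant/1, once a learned Gorgias theory is loaded. The case/2 facts and the prove/2 query interface of the Gorgias system are assumptions made for illustration; total coverage can be computed analogously from the premises of the individual arguments.

:- use_module(library(lists)).       % member/2
:- use_module(library(aggregate)).   % aggregate_all/3 (SWI-Prolog)

% case(Id, TrueClass) facts with TrueClass in {benign, malignant} are assumed given.

% A class is plausible for a case if some admissible composite argument supports it.
plausible(Id, benign)    :- prove([benign(Id)], _), !.
plausible(Id, malignant) :- prove([malignant(Id)], _), !.

% A case is ambiguous if both classes are plausible, definite(C) if only C is,
% and uncovered if the theory supports neither class.
status(Id, ambiguous)   :- plausible(Id, benign), plausible(Id, malignant), !.
status(Id, definite(C)) :- plausible(Id, C), !.
status(_,  uncovered).

% Definite accuracy, definite errors and ambiguity as fractions of the N cases,
% in the sense of equations (4)-(6) of Table 2.
metrics(Acc, Err, Amb) :-
    findall(True-Status, (case(Id, True), status(Id, Status)), Results),
    length(Results, N), N > 0,
    aggregate_all(count, member(T-definite(T), Results), NAcc),
    aggregate_all(count, (member(T2-definite(C2), Results), C2 \== T2), NErr),
    aggregate_all(count, member(_-ambiguous, Results), NAmb),
    Acc is NAcc / N,
    Err is NErr / N,
    Amb is NAmb / N.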
3.2. Integrated learning process - Methodology
Starting from a state of absolute ambiguity, the objective is to learn an argumentation theory that covers all or most observations in a given dataset, eliminates ambiguity, and improves the accuracy of definite predictions by mitigating the errors. A high-level overview of the methodology is illustrated in Table 3. The first step (step 1) aims at selecting the language (features) with which to develop the theory. The second step (step 2) concerns the selection of a compact set of arguments to describe the basic contexts of the problem domain. Then, the learning process repeats steps 3 and 4, generating different versions of the argumentation theory, until an exit criterion is met or learning shows no further improvement. Exit criteria can be defined using, e.g., thresholds for the metrics of definite errors (Err_Thold) (or definite accuracy) and ambiguity (Amb_Thold).

Table 3: ArgEML Methodology Overview
Learning Step | Goal
Step 1: Decide the language of the learning problem. | Feature selection
Step 2: Select the basic contexts of the problem domain. | Compact coverage
Repeat (Steps 3 & 4) until Goal is reached or learning has no further improvement:
Step 3: Mitigate the error of individual arguments. | Errors ≤ Err_Thold
Step 4: Reduce dilemmas between pairs of arguments in conflict. | Ambiguity ≤ Amb_Thold
Evaluation: Select "sufficiently good" argumentation theory. | Explainable Model

We now briefly describe these steps in operational terms.
Initialize theory:
• Step 1: Decide the language of the problem. This step is similar to the data processing step in a machine learning pipeline. It mostly involves independent statistical analysis of the feature set to separate out a set of significant features. Examples include filter methods that select features based on their correlation to the output (target variable). More information on these methods can be found in [15].
• Step 2: Select the basic contexts of the problem domain. In Step 2 we initialize the argumentation theory by building a compact set of object-level arguments (general scenarios) that achieve a high total coverage of the data (Compact Coverage). We can use a combination of learning operators working directly on the significant features set, or use a surrogate sub-symbolic machine learning algorithm amenable to rule extraction. For example, we can train a Random Forest or XGBoost model and use a rule-extraction method (e.g. Interpreting Tree Ensembles with inTrees [16]) to construct object-level arguments that form the basic contexts of the argumentation theory.
Iterative Learning Process: The process starts with an interim evaluation of the initial theory and repeats steps 3 and 4 based on the exit criteria. The generic form of the argument rules added by these two steps is sketched after this list.
• Step 3: Mitigate the error of individual arguments. Individual object-level arguments will erroneously support the target conclusion for a number of cases. To mitigate this error, we construct a defeat argument against such an argument which, together with a (possibly conditional) priority argument, will remove a significant number of these erroneous predictions. Step 3 is executed as long as the condition Errors > Err_Thold holds. At the end of each execution we generate a new version of the theory and we repeat the iterative learning process (steps 3 & 4).
• Step 4: Reduce dilemmas between pairs of arguments in conflict. In Step 4 we identify the pairs of object-level arguments (and local defeat arguments, if any) that are in conflict, and construct conditional priority arguments to resolve the conflict either way. Step 4 is executed as long as the condition Ambiguity > Amb_Thold holds. At the end of each execution we generate a new version of the theory and we repeat the iterative learning process (steps 3 & 4).
• Evaluation step: select a "sufficiently good" argumentation theory. This step carries out a global evaluation, in terms of some overall information gain, of the results of the previous local steps in the current theory. Using this we can compare different versions of the argumentation theory and select a sufficiently good improvement of the current theory, or terminate. For example, information gain can be calculated using some adopted notion of entropy (as in Decision Trees) based on the values of the new metrics of compact coverage, definite errors and ambiguity. We can use definite errors or definite accuracy interchangeably. While these metrics-based evaluation approaches, also known as objective approaches, are the ones mainly used today, human-centered evaluation is of equal importance, with studies suggesting a more active role of the end user in the process [17, 18].
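To make the two learning operators concrete before their application in Section 4, the following schematic sketch (ours, not taken from the learned theory) shows, in the Gorgias rule/3 syntax of Section 2.2, what one execution of Step 3 and one execution of Step 4 adds. All premise predicates (premises_i/1, premises_j/1, learned_conditions/1, special_context/1) and the claim predicate claim/1 are hypothetical placeholders for the feature conditions and target classes of a concrete learning problem.

:- dynamic premises_i/1, premises_j/1, learned_conditions/1, special_context/1.

% Existing object-level arguments for a claim and its complement (built in Step 2).
rule(ri(X), claim(X), []) :- premises_i(X).
rule(rj(X), neg(claim(X)), []) :- premises_j(X).

% Step 3: mitigate the error of ri with a defeat argument on newly learned
% conditions, plus a local priority that makes the defeat stronger than ri.
rule(rib(X), neg(claim(X)), []) :- learned_conditions(X).
rule(pri(X), prefer(rib(X), ri(X)), []).

% Step 4: resolve the dilemma between ri and rj with a general preference, a
% contextual exception, and a higher-order preference for the exception.
rule(prg(X), prefer(ri(X), rj(X)), []).
rule(prs(X), prefer(rj(X), ri(X)), []) :- special_context(X).
rule(cg(X), prefer(prs(X), prg(X)), []).

Section 4 instantiates exactly this pattern: r8b and its priority over r8 for Step 3, and pr12, pr13 and c6 for Step 4.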
4. ArgEML applied to Cancer Prognosis
In this section, we illustrate the (semi-automated) application of the ArgEML methodology on the dataset described in Section 2 for the classification of hysteroscopy images and endometrial cancer detection. At the beginning of the process Err_Thold is set to 20% and Amb_Thold to 30%.
• Step 1: Decide the language of the problem. We used the set of features from [9] shown in Table 1. The dataset of 445 observations was divided into training and test sets with 400 (90%) and 45 (10%) observations, respectively. While techniques like cross-validation are usually employed at this step, we simplified this process to focus on the validation of the ArgEML approach.
• Step 2: Select the basic contexts of the problem domain. We followed the rule-extraction method: we trained a Random Forest model using the training set and extracted a number of decision rules from the model. Then we selected a compact list of these rules, covering most of the observations in the training set, to create the basic object-level arguments of the theory. This gave us an initial version of the theory with a small number of low-complexity arguments, as shown in Table 4⁵, and a total coverage of 99.75%. At this point we noticed that each data point is covered (roughly) twice by this initial theory and hence its predictive accuracy as a whole is low.

Table 4: Object-level arguments.
Argument | Premises | Claim | C | A | E
r4(X) | gldm_mean > 1.65 AND sgldm_entr > 5.25 | benign(X) | 48% | 79% | 21%
r6(X) | gldm_con > 4.89 AND fos_ener ≤ 0.06 | benign(X) | 50% | 78% | 22%
r8(X) | gldm_con ≤ 5.03 AND gldm_ent > 1.30 | malignant(X) | 50% | 72% | 28%
r10(X) | gldm_mean ≤ 1.65 AND sgldm_homog > 0.45 | malignant(X) | 50% | 72% | 28%
C: Coverage. A: Accuracy. E: Error.

⁵ The numerical conditions in these argument rules can be discretized, e.g. into low, medium and high, to help with the readability of the explanations generated from these. This matter is beyond the scope of this paper.

• Step 3: Mitigate the error of individual arguments. The object-level arguments selected in Step 2 were further analyzed using the properties of Coverage, Accuracy and Error, as shown in Table 4. For each argument in the list (r4, r6, r8, r10) we isolate the observations in the training set that the argument covers and try to learn a new set of conditions (premises) to construct a defeat argument. For example, for the argument r8, we examined the 201 (50%) observations from the training set that it covers, using a feature frequency distribution operator, looking for new conditions to support the contradictory conclusion of benign(X). We learned the defeat argument r8b, defined as follows:

rule(r8b(X), benign(X), []) :- gldm_hom > 0.50 AND fos_ener ≤ 0.05.

In the context of mitigating errors, defeat arguments are created together with the corresponding priority arguments to ensure local correction of the error. Therefore, for the arguments r8, r8b we added the priority argument pr5:

rule(pr5(X), prefer(r8b(X), r8(X)), []).

Furthermore, to avoid side effects of defeat arguments on other object-level arguments, we can add further priority rules that make these weaker than other conflicting arguments. For argument r8b we have therefore added:

rule(pr7(X), prefer(r10(X), r8b(X)), []).

The revised properties of Accuracy and Error for the initial object-level arguments are shown in Table 5. Step 3 improved the quality of the object-level arguments by reducing their Errors and satisfying the threshold of 20%.

Table 5: Object-level argument properties revised, after execution of Step 3.
Argument | Claim | C | A | E
r4(X) | benign(X) | 48% | 83% | 17%
r6(X) | benign(X) | 50% | 82% | 18%
r8(X) | malignant(X) | 50% | 82% | 18%
r10(X) | malignant(X) | 50% | 80% | 20%

• Step 4: Reduce dilemmas between pairs of arguments in conflict. During this step we examined all pairs of contradictory object-level arguments created in Step 2. This examination resulted in the following list of {(argument pair = number of dilemmas)}: {pair(r4(X), r8(X)) = 5, pair(r4(X), r10(X)) = 0, pair(r6(X), r8(X)) = 7, pair(r6(X), r10(X)) = 8}. If a pair of arguments was in conflict, then we tried to eliminate the dilemma using priority arguments, making object-level arguments stronger under a particular set of conditions. For each pair of contradictory object-level arguments we isolate the observations in the training set that both arguments cover, and try to find new conditions, using a frequency distribution operator, to construct priority arguments in favor of each contradictory conclusion. For example, for the pair of arguments r6(X), r10(X), we see that the majority of these dilemma cases belong to the class of benign. Therefore, we added a general priority argument to express this preference:

rule(pr12(X), prefer(r6(X), r10(X)), []).

Secondly, we searched for a condition or a set of conditions under which argument r10 is stronger than r6, and constructed the preference argument pr13:

rule(pr13(X), prefer(r10(X), r6(X)), []) :- sgldm_homog > 0.454 AND sgldm_homog < 0.46.

together with the higher-order preference of this specific preference over pr12:

rule(c6(X), prefer(pr13(X), pr12(X)), []).

At the end of Step 4 all dilemmas between the basic object-level arguments (r4, r6, r8, r10) were resolved, while other dilemmas, between pairs of defeat arguments and object-level arguments, may still remain. The resulting argumentation theory is provided as a Gorgias file in the Appendix. This theory was considered "sufficiently good" on the training set. It was then evaluated on the test set with similar results, as shown in Table 6.

Table 6: Argumentation theory assessment on the Training and Test sets.
Metric | Training set Assessment | Test set Assessment
Compact coverage | Acceptable (TC: 99.75%) | Acceptable (TC: 100%)*
Definite accuracy | 72% | 71%
Definite errors | 18% | 18%
Ambiguity | 10% | 11%
*TC: Total Coverage.

5. Explainable Analysis of the Problem Space
Using argumentation as the coverage notion for ML naturally affords the provision of explanations alongside the prediction of the learned output structure. Predicting the label of a case is carried out via the existence of an acceptable set of arguments that supports the prediction. The acceptability of this set of arguments can then be unraveled to produce an explanation that contains information both at the level of the basic attributive support of the prediction claim and at the level of the relative strength of the claim in contrast to other possible alternative claims.
For the case of the Gorgias framework, this process of extracting natural explanations is facilitated by the form of the composite admissible arguments that are constructed as Internal Explanations by the Gorgias system and returned along with its answer to a query. Let us illustrate this kind of application-level explanation, generated automatically in Gorgias Cloud, by supposing that we have the Gorgias internal explanation [pr4(101), r4(101), r6(101)] for predicting that case 101 is benign. From this, we can generate the explanation illustrated in Table 7.

Table 7: Application level explanation.
The statement "benign(101)" is supported by {gldm_mean(101) > 1.65 & sgldm_entr(101) > 5.25} (a). This reason is strengthened against the reason of {gldm_mean > 1.66 & gldm_hom ≤ 0.45} (b) supporting "malignant(101)" by {gldm_con(101) > 4.89 & fos_ener(101) ≤ 0.06} (c).
(a) Premises of r4. (b) Premises of r4b. (c) Premises of r6.

We can see that this contains an attributive part, giving the basic reasons on which the prediction is supported (or else answering "why this prediction"), as well as a contrastive part, which gives additional reasons that strengthen the basic reason against reasons supporting the opposite prediction (or else answering "why not a different prediction"). Such explanations provide a high level of interpretability of the learned theory that facilitates its evaluation through experts, who would be able to judge the prognosis results based not merely on the final result but on their accompanying explanations and, in fact, provide useful feedback at the level of the explanation. We can then improve the learned model through a new learning phase from such new data cases which are further annotated by the argumentative explanation that supports their labels (cf. the learning method of [19]).
Furthermore, and perhaps more importantly, the Gorgias internal explanations can help us analyze the problem space and understand how it can be structured into different sub-parts. We can use these internal explanations of composite arguments to partition the problem space into (equivalence) groups, where each group is characterized by a unique type or pattern of explanation. In our prognosis application, we have found that the training data space is partitioned into a set of groups as shown in Table 8. In Groups 1-4, the prediction of the learned argumentation theory is definite, whereas in Groups 5 and 6 the learned theory is in a dilemma, i.e. it returns admissible arguments supporting either of the two possible outcomes of the prediction. We can use this partitioning to grade our confidence in the prediction of the theory depending on the group into which a new case may fall. For example, we might be more confident in a prediction that falls in Group 3 than in predictions that fall in Groups 1 or 4.
As mentioned above, each group is defined by the unique pattern of the Gorgias internal explanation returned for all members of the group. From this we can extract two relevant pieces of information that describe the group: (1) the sub-space of features that concerns this group and (2) the arguments in the learned theory that are active in this sub-space, as well as the active attacks between them. Combining these two pieces of information, we can understand how the learned theory captures the decision problem for each group by constructing the argumentation framework pertaining to each group.

Table 8: Subgroups of Data identified by Gorgias Argumentative Explanations.
Group ID | Gorgias Explanation(s) | Number of Cases | Accuracy / Dilemma
1 | E1 = [pr7(_), r10(_), r8(_)] | 16 | 71%
2 | E2 = [r8(_)] | 142 | 78%
3 | E3 = [pr4(_), r4(_), r6(_)] | 170 | 83%
4 | E4 = [r4(_)] | 14 | 69%
5 | E51 = [pr15(_), pr8(_), r10b(_), r8b(_)], E52 = [pr19(_), pr7(_), r10(_), r8(_)] | 28 | Dilemma
6 | E61 = [pr16(_), pr4(_), r4(_), r6(_)], E62 = [pr14(_), pr3(_), r4b(_), r6b(_)] | 14 | Dilemma
Others | ---- | 16 | ----
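Such a partitioning can be computed directly from the learned theory. The sketch below (ours, not part of the paper's implementation) groups cases by the pattern of rule names appearing in the admissible composite arguments (the Gorgias internal explanations) returned for each supported class; the case_id/1 facts enumerating the training cases and the prove/2 query interface are assumptions made for illustration. Dilemma cases naturally contribute one pattern per supported class, as in Groups 5 and 6 of Table 8.

:- use_module(library(lists)).   % member/2
:- use_module(library(pairs)).   % group_pairs_by_key/2 (SWI-Prolog)

% The explanation pattern of a case for a class is the (sorted) list of rule names
% in an admissible composite argument supporting that class.
explanation_pattern(Id, Class, Pattern) :-
    member(Class, [benign, malignant]),
    Claim =.. [Class, Id],
    prove([Claim], Delta),                                           % assumed Gorgias query
    findall(Name, (member(Arg, Delta), functor(Arg, Name, _)), Names),
    msort(Names, Pattern).                                           % e.g. [pr4, r4, r6] for Group 3

% Partition the cases into groups keyed by (Class, Pattern), in the sense of Table 8.
explanation_groups(Groups) :-
    findall((Class-Pattern)-Id,
            ( case_id(Id), explanation_pattern(Id, Class, Pattern) ),
            Pairs),
    keysort(Pairs, Sorted),
    group_pairs_by_key(Sorted, Groups).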
Let us present this for Group 3, whose internal Gorgias explanation is the composite argument E3 = [pr4(.), r4(.), r6(.)]. From this, we can recognize that the active arguments involved are A4 = [r4(.)], A6 = [r6(.), pr4(.): r6(.) > r4b(.)] and B4b = [r4b(.), pr3(.): r4b(.) > r4(.)], together with the following attacks between these, as shown in Figure 2 (left).

Figure 2: Argumentation Frameworks for Group 3 (left) and Group 6 (right).

Given this argumentation framework we see that the only admissible subsets are {A6} and {A4, A6} (the latter being E3), and hence in this group we have a definite prediction of benign. Note that although the prediction within this sub-part of the problem can be supported simply by the argument A6, this actually forms another sub-part of the problem, a small sub-group in "Others" of Table 8. Here in Group 3, we see that the role of A6 is different, namely it comes to the defense of A4 against its defeater attack of B4b. The two arguments A4 and A6, supporting the same conclusion of benign, aggregate together to give a more informative explanation (see above).
Similarly, the argumentation framework corresponding to Group 6 is shown in Figure 2 (right). This has two admissible subsets of composite arguments, D1 = {A4, A6} and D2 = {B4b, B6b}, supporting opposite predictions, indicating that this sub-part of the problem is identified by the theory as a "difficult case". The learned theory, though, is not agnostic: it provides a contrastive explanation for each possible prediction.

6. Conclusions and Future Work
We have presented an integrated approach of Machine Learning with Argumentation and shown how this has been applied to a real-life problem of learning from images of endometrial cancer. The same method has been applied to other medical imaging data, e.g. brain images for Alzheimer's disease [19], and more recently to images relating to multiple sclerosis. We have shown how the explainability of such an argumentation-based approach to ML can help us understand and structure the learning problem space into meaningful sub-spaces.
The proposed ArgEML learning process can be executed in different modes, from semi-automated and hybrid with the help of external statistical and other ML modules (as followed in this paper) to a fully automated process starting from the data and carrying out iteratively the learning operator steps. In particular, the learning operators of mitigation of errors and resolution of dilemmas can be automated with various parameters, depending on the features of the learning problem at hand. The long-term goal of our work is to automate this process of learning, starting from the data and reaching the final argumentation theory. While argumentation provides a natural link to explanations, a major challenge in fully automating the learning process is to consider how these explanations can meet the various qualities required of explanations, as well as how to involve the domain expert in the evaluation process, particularly in the context of Human-centric AI. The quality of explanations needs to drive the learning process as much as the prediction accuracy.

7. Acknowledgements
Part of this work was undertaken under the University of Cyprus internal project, Integrated Explainable AI (IXAI) for Medical Decision Support, ARGEML 8037P-22046. This study is also partly funded by the project 'Atherorisk' "Identification of unstable carotid plaques associated with symptoms using ultrasonic image analysis and plaque motion analysis", code: Excellence/0421/0292, funded by the Research and Innovation Foundation, the Republic of Cyprus.

8. References
[1] Kakas A, Michael L. Abduction and Argumentation for Explainable Machine Learning: A Position Survey. arXiv preprint abs/2010.12896, http://arxiv.org/abs/2010.12896 (2020).
[2] Prentzas N, Nicolaides A, Kyriacou E, et al. Integrating machine learning with symbolic reasoning to build an explainable ai model for stroke prediction. In: Proceedings - 2019 IEEE 19th International Conference on Bioinformatics and Bioengineering, BIBE 2019. Institute of Electrical and Electronics Engineers Inc., 2019, pp. 817–821.
[3] Kakas AC, Moraitis P, Spanoudakis NI. GORGIAS: Applying argumentation. Argument Comput 2019; 10: 55–81.
[4] Albini E, Lertvittayakumjorn P, Rago A, et al. DAX: Deep Argumentative eXplanation for Neural Networks. 2020.
[5] Rosenfeld A. Better metrics for evaluating explainable artificial intelligence. In: Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS. 2021, pp. 45–50.
[6] Wardeh M, Coenen F, Capon TB. PISA: A framework for multiagent classification using argumentation. Data Knowl Eng 2012; 75: 34–57.
[7] Bench-Capon T. Using Issues to Explain Legal Decisions, http://arxiv.org/abs/2106.14688 (2021).
[8] Prakken H, Ratsma R. A top-level model of case-based argumentation for explanation: Formalisation and experiments. Argument Comput 2022; 13: 159–194.
[9] Neofytou MS, Tanos V, Constantinou I, et al. Computer-aided diagnosis in hysteroscopic imaging. IEEE J Biomed Heal Informatics 2015; 19: 1129–1136.
[10] Neofytou MS, Tanos V, Pattichis MS, et al. A standardised protocol for texture feature analysis of endoscopic images in gynaecological cancer. Biomed Eng Online; 6. Epub ahead of print 2007. DOI: 10.1186/1475-925X-6-44.
[11] Neofytou MS, Pattichis MS, Pattichis CS, et al. Texture-based classification of hysteroscopy images of the endometrium. In: Annual International Conference of the IEEE Engineering in Medicine and Biology - Proceedings. 2006, pp. 3005–3008.
[12] Dung PM. On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games. Artif Intell 1995; 77: 321–357.
[13] Kakas A, Mancarella P, Dung PM. The Acceptability Semantics for Logic Programs. In: Logic Programming. The MIT Press. Epub ahead of print 2019. DOI: 10.7551/mitpress/4316.003.0051.
[14] Kakas A, Moraïtis P. Argumentation Based Decision Making for Autonomous Agents. In: Proceedings of the International Conference on Autonomous Agents. 2003, pp. 883–890.
[15] Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng 2014; 40: 16–28.
[16] Deng H. Interpreting tree ensembles with inTrees. Int J Data Sci Anal 2019; 7: 277–287.
[17] Zhou J, Gandomi AH, Chen F, et al. Evaluating the quality of machine learning explanations: A survey on methods and metrics. Electronics (Switzerland) 2021; 10: 1–19.
[18] Bruckert S, Finzel B, Schmid U. The Next Generation of Medical Decision Support: A Roadmap Toward Transparent Expert Companions. Front Artif Intell; 3. Epub ahead of print 2020. DOI: 10.3389/frai.2020.507973.
[19] Achilleos KG, Leandrou S, Prentzas N, et al. Extracting Explainable Assessments of Alzheimer's disease via Machine Learning on brain MRI imaging data. In: Proceedings - IEEE 20th International Conference on Bioinformatics and Bioengineering, BIBE 2020. 2020, pp. 1036–1041.

Appendix
:- dynamic feature0/2, feature1/2, feature2/2, feature3/2, feature4/2, feature5/2, feature6/2, feature7/2, feature8/2.

complement(malignant(Tumor), benign(Tumor)).
complement(benign(Tumor), malignant(Tumor)).

rule(r4(Tumor), benign(Tumor), []) :- feature8(Tumor, Value), Value > 1.65, feature1(Tumor, Value2), Value2 > 5.25.
rule(r4b(Tumor), malignant(Tumor), []) :- feature8(Tumor, Value), Value > 1.66, feature4(Tumor, Value2), Value2 =< 0.45.
rule(pr3(Tumor), prefer(r4b(Tumor), r4(Tumor)), []).
rule(r8(Tumor), malignant(Tumor), []) :- feature5(Tumor, Value), Value =< 5.03, feature7(Tumor, Value2), Value2 > 1.30.
rule(r8b(Tumor), benign(Tumor), []) :- feature4(Tumor, Value), Value > 0.50, feature2(Tumor, Value2), Value2 =< 0.05.
rule(pr5(Tumor), prefer(r8b(Tumor), r8(Tumor)), []).
rule(pr1(Tumor), prefer(r4(Tumor), r8(Tumor)), []).
rule(pr2(Tumor), prefer(r8(Tumor), r4(Tumor)), []) :- feature0(Tumor, Value), Value > 0.445, feature4(Tumor, Value2), Value2 > 0.445.
rule(c1(Tumor), prefer(pr2(Tumor), pr1(Tumor)), []).
rule(r6(Tumor), benign(Tumor), []) :- feature5(Tumor, Value), Value > 4.89, feature2(Tumor, Value2), Value2 =< 0.06.
rule(r6b(Tumor), malignant(Tumor), []) :- feature7(Tumor, Value), Value > 1.67, feature1(Tumor, Value2), Value2 =< 5.93, feature6(Tumor, Value3), Value3 =< 0.19.
rule(pr14(Tumor), prefer(r6b(Tumor), r6(Tumor)), []).
rule(pr4(Tumor), prefer(r6(Tumor), r4b(Tumor)), []).
rule(pr16(Tumor), prefer(r4(Tumor), r6b(Tumor)), []).
rule(r10(Tumor), malignant(Tumor), []) :- feature8(Tumor, Value), Value =< 1.65, feature0(Tumor, Value2), Value2 > 0.45.
rule(r10b(Tumor), benign(Tumor), []) :- feature4(Tumor, Value), Value > 0.50, feature0(Tumor, Value2), Value2 > 0.50, feature3(Tumor, Value3), Value3 > 3.31.
rule(pr15(Tumor), prefer(r10b(Tumor), r10(Tumor)), []).
rule(pr12(Tumor), prefer(r6(Tumor), r10(Tumor)), []).
rule(pr13(Tumor), prefer(r10(Tumor), r6(Tumor)), []) :- feature0(Tumor, Value), Value > 0.454, feature0(Tumor, Value2), Value2 < 0.46.
rule(c6(Tumor), prefer(pr13(Tumor), pr12(Tumor)), []).
rule(pr19(Tumor), prefer(r8(Tumor), r10b(Tumor)), []).
rule(pr7(Tumor), prefer(r10(Tumor), r8b(Tumor)), []).
rule(pr40(Tumor), prefer(r6(Tumor), r8(Tumor)), []).
rule(pr41(Tumor), prefer(r8(Tumor), r6(Tumor)), []) :- feature0(Tumor, Value), Value > 0.45.
rule(c40(Tumor), prefer(pr41(Tumor), pr40(Tumor)), []).

⁶ Predicates feature0, feature1, ..., feature8 correspond to the feature names shown in Table 1.
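As a usage sketch (ours, and not part of the paper), the Appendix theory can be exercised on a hypothetical test case by adding invented feature facts and querying both classes; whichever classes admit an admissible composite argument determine whether the prediction is definite or a dilemma. The case identifier t1 and all feature values below are invented for illustration, and the prove/2 query predicate is our assumption about the Gorgias system interface.

% Hypothetical case t1: invented values chosen so that only r4, r6 and r4b fire,
% i.e. the configuration analysed for Group 3 in Section 5.
feature8(t1, 1.70).   % gldm_mean   > 1.66  (r4 and r4b fire; r10 does not)
feature1(t1, 5.30).   % sgldm_entr  > 5.25  (r4 fires)
feature5(t1, 5.10).   % gldm_con    > 4.89  (r6 fires) and > 5.03 (r8 does not)
feature2(t1, 0.05).   % fos_ener    =< 0.06 (r6 fires)
feature4(t1, 0.40).   % gldm_hom    =< 0.45 (r4b fires; r8b and r10b do not)
feature0(t1, 0.40).   % sgldm_homog =< 0.45 (pr2, pr13 and pr41 do not apply)
% feature3, feature6 and feature7 are left undefined for t1; with the dynamic
% declarations above the corresponding conditions simply fail (so r6b does not fire).

% Assumed queries; following the Group 3 analysis, benign should be a definite prediction:
%   ?- prove([benign(t1)], Delta).      % expected to succeed, with r4, r6 and pr4 in Delta
%   ?- prove([malignant(t1)], Delta).   % expected to fail: r4b is defeated by r6 together with pr4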