=Paper=
{{Paper
|id=Vol-3086/paper9
|storemode=property
|title=Towards Argumentative Decision Graphs: Learning Argumentation Graphs from Data
|pdfUrl=https://ceur-ws.org/Vol-3086/paper9.pdf
|volume=Vol-3086
|authors=Pierpaolo Dondio
|dblpUrl=https://dblp.org/rec/conf/aiia/Dondio21
}}
==Towards Argumentative Decision Graphs: Learning Argumentation Graphs from Data==
Pierpaolo Dondio, School of Computer Science, Technological University Dublin, Grangegorman Campus, D07 EWV4, Dublin, Ireland

Abstract. In this paper we present a novel data-mining model called argumentative decision graphs (ADGs). An ADG is a special argumentation framework in which arguments have a rule-based structure and an attack relation is defined among them. ADGs are graph-like models learnt from data in a supervised way that can be used for classification tasks. As in a decision tree, given a set of input features, an ADG returns a value for the target variable. Unlike a decision tree, the output of an ADG can also be an undecided status, occurring when the model does not have enough reasons to predict a value for the target variable. This is due to the use of argumentation semantics to identify which arguments of an ADG are accepted and, consequently, to make a prediction about the target variable. Unlike Bayesian networks, ADGs are not required to be acyclic and can have any topology. Advantages of ADGs are the possibility of using different semantics to make predictions, the ability to deal with incomplete input data, and the generation of compact explanations. We evaluate a preliminary greedy algorithm that learns an ADG from data using public datasets, and we compare our results with decision trees in terms of balanced accuracy and size of the model. Our results provide evidence to further progress our research.

Keywords: Argumentation, Data Mining, Decision Tree, Graph models, Explainability

===1. Introduction===

In this paper we describe a novel data-mining model called Argumentative Decision Graphs (ADGs). An ADG is a supervised data-mining model that learns the relationship between a target variable and a large enough set of examples. In this first paper we present a preliminary algorithm to learn ADGs from data for binary classification.
An argumentative decision graph is an extension of Dung's abstract argumentation graphs [1]. As in Dung's framework, each node of the graph represents an argument, the links represent an attack relation among the arguments, and no support relation is defined. However, unlike Dung's abstract framework, arguments have a rule-based structure with a premise (called the support) and a conclusion. Some arguments have a conclusion that can be used to predict the value of the target variable, while others are not used to predict the target variable directly but rather interact with the predictive arguments.

(AI3 2021, 5th Workshop on Advances in Argumentation in Artificial Intelligence. pierpaolo.dondio@tudublin.ie (P. Dondio), ORCID 0000-0001-7874-8762. © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings, ISSN 1613-0073, CEUR-WS.org.)

The main reason to introduce ADGs is the willingness to bring the advantages of symbolic non-monotonic reasoning into data-mining models. An ADG is an extension of an abstract argumentation framework, a symbolic reasoning system able to represent partial and conflicting knowledge and to formalize a large set of non-monotonic semantics. For this reason, an ADG can deal with partial information and make predictions when some of the inputs are missing. The predictions of an ADG are made by applying an argumentation semantics. We first recall that an argumentation semantics is a set of postulates used to identify the sets of acceptable arguments, called extensions. In the labelling approach [2] adopted in this paper, the effect of an argumentation semantics is to assign a label in, out or undec to each argument, meaning that the argument is respectively accepted, rejected or deemed undecided.
The undec label represents a situation in which the semantics has no reason to definitely accept or reject an argument. A first advantage of using argumentation semantics is that the various semantics offer a rich toolset for modelling different ways of making decisions. Grounded semantics, for instance, represents a skeptical acceptance strategy, while preferred semantics is a so-called credulous semantics with less conservative conditions for accepting an argument. A second advantage is that an ADG can output an undecided status: it can either make a prediction or abstain from making one. This happens when the input data create a conflict that cannot be resolved by the rules of the semantics, so that there is no ground to take a decision. The undecided status provides an ADG with a way of quantifying the uncertainty generated by conflicting information. Third, the use of an attack relation between arguments, coupled with an argumentation semantics, generates compact and understandable explanations. In a distinctively non-monotonic fashion, a single argument can invalidate many others and directly or indirectly support a specific conclusion. Even in a large argumentation graph, the arguments responsible for a conclusion can be a small subset, which makes explanations more compact and understandable. Data-mining models similar to ADGs include explainable models such as decision trees [3] and graph-like structures learnt from data such as Bayesian networks [4]. In this paper we provide an introduction to ADGs and underline the differences between ADGs and similar models. In the second part of the paper we also provide a first greedy algorithm to learn an ADG from data, and we evaluate its performance against decision trees on benchmark classification datasets. The paper is organized as follows.
In Section 2 we recall the basics of abstract argumentation semantics; in Section 3 we introduce our ADG model; in Section 4 we critically compare ADGs to decision trees. Section 5 describes an algorithm to learn ADGs from data, which is evaluated in Section 6, while related works are described in Section 7.

===2. Abstract Argumentation Semantics===

Definition 2.1. An argumentation framework AF is a pair ⟨Ar, →⟩, where Ar is a non-empty finite set whose elements are called arguments and → ⊆ Ar × Ar is a binary relation, called the attack relation. If (a, b) ∈ → we say that a attacks b. An argument a is initial if it is not attacked by any argument, including itself.

An argumentation framework AF = ⟨Ar, →⟩ identifies a directed graph. We define the restriction of an argumentation framework to a set of nodes S as the framework AF↓S corresponding to the vertex-induced subgraph of AF identified by S:

Definition 2.2. Given AF = ⟨Ar, →⟩, the restriction of AF to a set of nodes S ⊆ Ar is the argumentation framework AF↓S = ⟨S, →_S⟩ where →_S = → ∩ (S × S).

An abstract argumentation semantics identifies sets of arguments that can survive the conflicts encoded by the attack relation →. Dung's semantics require a group of acceptable arguments to be conflict-free (an argument and its attackers cannot be accepted at the same time) and admissible (the set of arguments defends itself from external attacks).

Definition 2.3. A set Args ⊆ Ar is conflict-free iff ∀a, b ∈ Args, (a, b) ∉ →.

Definition 2.4. A set Args ⊆ Ar defends an argument a ∈ Ar iff ∀b ∈ Ar such that (b, a) ∈ →, ∃c ∈ Args such that (c, b) ∈ →. The set of arguments defended by Args is denoted F(Args). A conflict-free set Args is admissible if Args ⊆ F(Args) and it is complete if Args = F(Args).

We follow the labelling approach of [2], where a semantics assigns to each argument a label in, out or undec.

Definition 2.5. Let AF = ⟨Ar, →⟩.
A labelling is a total function L : Ar → {in, out, undec}. We write in(L) for {a ∈ Ar | L(a) = in}, out(L) for {a ∈ Ar | L(a) = out}, and undec(L) for {a ∈ Ar | L(a) = undec}.

Definition 2.6. Let AF = ⟨Ar, →⟩. A complete labelling is a labelling such that for every a ∈ Ar it holds that: 1. if a is labelled in then all its attackers are labelled out; 2. if a is labelled out then it has at least one attacker that is labelled in; 3. if a is labelled undec then it has at least one attacker labelled undec and no attacker labelled in.

Definition 2.7. Given AF = ⟨Ar, →⟩, L is the grounded labelling iff L is a complete labelling where undec(L) is maximal (w.r.t. set inclusion) among all complete labellings of AF. L is a preferred labelling iff L is a complete labelling where in(L) is maximal (w.r.t. set inclusion) among all complete labellings of AF. A stable labelling is a complete labelling with undec(L) = ∅.

The grounded semantics, first introduced by Pollock [5], is a skeptical semantics that can be computed in polynomial time by accepting the initial arguments and then any argument defended, directly or indirectly, by the initial arguments. In this first paper, the grounded semantics is the only semantics used by an ADG.

===3. Argumentative Decision Graphs===

In this section we introduce the notion of argumentative decision graphs. We work with a dataset composed of a set of features, each taking values in a finite set. One of the features represents the target variable of the classification task. Informally, an argumentative decision graph is an extension of Dung's abstract argumentation graph [1] where arguments have a rule-based structure with a premise (also called the support) and a conclusion. The premise of each argument consists of a feature and an associated value, while the conclusion is a value for the target variable.
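The argument structure just introduced, a support predicate plus an optional conclusion, can be encoded directly. A minimal sketch in Python; the class and method names are ours, not from the paper:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Argument:
    """An ADG argument: a support predicate f(v), encoded as the pair
    (feature, value), and a conclusion that is either a value of the
    target variable or None for a non-predictive argument."""
    support: Tuple[str, str]
    conclusion: Optional[str] = None

    def verified_by(self, instance: dict) -> bool:
        # the support f(v) is verified iff the instance assigns v to f
        feature, value = self.support
        return instance.get(feature) == value

a1 = Argument(("k", "yes"))            # non-predictive: <k(yes), {}>
a2 = Argument(("w", "yes"), "surf")    # predictive:     <w(yes), a(surf)>
print(a2.verified_by({"s": "yes", "w": "yes", "k": "no"}))  # True
```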
Some arguments might have an empty conclusion, meaning that their role is not to predict the target variable directly, but rather to interact with other predictive arguments.

Formally, we consider a dataset D represented by an n × m matrix-like data structure, where each row is called an instance of the dataset and each column is called a feature. An instance can be represented by a tuple t = ⟨v_1, …, v_m⟩, where v_i is the value associated with the i-th feature f_i. We consider predicates of the form f(v), with the meaning "the feature f has the value v". Given a tuple t = ⟨v_1, …, v_i, …, v_m⟩, the predicate f_i(v) is verified by t iff v_i = v (the value of the i-th component of the tuple t is equal to v), and not verified otherwise. One of the features is called the target variable and is denoted by y; the predicates involving the target variable therefore have the form y(v). The set of all the predicates is P_f ∪ P_y, where P_y is the set of predicates regarding the target variable y and P_f is the set of predicates regarding the other features {f_1, …, f_m}.

An ADG is an argumentation framework where each argument a has the structure a = ⟨s, c⟩, where s ∈ P_f and c is either empty or a target-variable predicate y(v). An argument with a non-empty conclusion is called a predictive argument. An attack relation is defined over the arguments. The definition is therefore the following:

Definition 3.1. An argumentative decision graph (ADG) is an argumentation framework AF = ⟨Ar, →⟩ where each a ∈ Ar has the form a = ⟨s, c⟩ with s ∈ P_f, c ∈ P_y ∪ {∅}, and → ⊆ Ar × Ar.

Example 1. Let us consider the dataset D_1 in Table 1, describing Paul's activities on Sunday.
The dataset has three Boolean features: s (whether the weather is sunny), w (whether the weather is windy) and k (whether Paul has a sore knee), and a target variable a (activity), which takes the two values {surf, fish}. Using the features of D_1, the following two ADGs are given:

* ADG_1 = ⟨{a_1, a_2, a_3}, {(a_1, a_2), (a_2, a_3), (a_3, a_2)}⟩
* ADG_2 = ⟨{a_1, a_2, a_3, a_4}, {(a_1, a_2), (a_2, a_3), (a_3, a_2)}⟩

where a_1 = ⟨k(yes), ∅⟩, a_2 = ⟨w(yes), a(surf)⟩, a_3 = ⟨s(yes), a(fish)⟩, a_4 = ⟨w(no), a(fish)⟩.

ADG_1 says that Paul goes surfing if the weather is windy (argument a_2) and fishing if the weather is sunny (a_3), but he cannot do both activities (a_2 and a_3 mutually attack each other), and he does not go surfing if he has a sore knee (a_1 attacks a_2). ADG_2 adds the information that Paul goes fishing if the weather is not windy (a_4).

We can represent an ADG with a directed graph where each node is labelled with the argument it represents. ADG_1 and ADG_2 are shown in Figure 1; for readability, we also wrote the description of each argument.

Table 1: The dataset D_1 consists of 10 tuples and four features. The values of the target variable activity predicted by ADG_1 and ADG_2 are also reported.

 ID | Sunny (s) | Windy (w) | Sore Knee (k) | Activity (a) | f_ADG1 | f_ADG2
  1 |    Yes    |    Yes    |      Yes      |     Fish     |  Fish  |  Fish
  2 |    Yes    |    Yes    |      No       |     Surf     |  und   |  und
  3 |    Yes    |    No     |      No       |     Fish     |  Fish  |  Fish
  4 |    Yes    |    No     |      No       |     Fish     |  Fish  |  Fish
  5 |    No     |    Yes    |      Yes      |     Fish     |  unk   |  unk
  6 |    No     |    No     |      Yes      |     Fish     |  unk   |  Fish
  7 |    No     |    Yes    |      No       |     Surf     |  Surf  |  Surf
  8 |    No     |    No     |      No       |     Fish     |  unk   |  Fish
  9 |    No     |    Yes    |      Yes      |     Surf     |  unk   |  unk
 10 |    No     |    Yes    |      No       |     Surf     |  Surf  |  Surf

Figure 1: Two ADGs.

===3.1. Well-formed ADGs===

In order to be well-formed, the attack relation of an ADG has to satisfy some consistency constraints. The first constraint is that there is no attack between two arguments whose supports contain the same feature.
This is because the two arguments are mutually exclusive: only one of the two supports can be verified by an input tuple. The second constraint is that there is no attack between two arguments whose conclusions are the same and not empty. Third, if two arguments a and b have mutually exclusive conclusions and use different features in their supports, there must be an attack between them: either a attacks b, b attacks a, or they symmetrically attack each other. This constraint is introduced to guarantee that the predictions of an ADG are conflict-free, i.e., two arguments with mutually exclusive conclusions cannot be accepted at the same time. Formally, the three constraints are as follows. We recall that y denotes the target variable and f_i the i-th feature.

Definition 3.2. Given an ADG = ⟨Ar, →⟩, the ADG is well-formed iff ∀a_1, a_2 ∈ Ar with a_1 = ⟨s_1, c_1⟩, a_2 = ⟨s_2, c_2⟩ it holds that:
1. if s_1 = f_i(v_1) and s_2 = f_i(v_2) then (a_1, a_2) ∉ → ∧ (a_2, a_1) ∉ →
2. if c_1 = c_2 ≠ ∅ then (a_1, a_2) ∉ → ∧ (a_2, a_1) ∉ →
3. if a_1 = ⟨f_1(v_x), y(v_1)⟩, a_2 = ⟨f_2(v_y), y(v_2)⟩ and v_1 ≠ v_2 ∧ f_1 ≠ f_2 then (a_1, a_2) ∈ → ∨ (a_2, a_1) ∈ →.

===3.2. Making Predictions using an ADG===

We consider a dataset D and an ADG built from D. The target variable is y and V is the set of values that the target variable can take. An ADG identifies a function f_ADG : D → V ∪ {und, unk} that, given an input tuple t ∈ D, returns either a value for the target variable or one of two additional outputs: undecided (und) or unknown (unk). The computation of f_ADG is described in Algorithm 1. Given an input instance t, we first consider the set S of all the arguments whose support is verified by t. Then we apply the grounded semantics to the abstract argumentation framework restricted to the verified arguments. We recall that, since the ADG is well-formed, all the accepted arguments have the same conclusion.
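The three constraints of Definition 3.2 are mechanical and can be checked programmatically. A sketch with an encoding of our own, ((feature, value), conclusion-or-None), verified on ADG_1 of Example 1:

```python
def well_formed(args, attacks):
    """Check the three constraints of Definition 3.2 on a list of
    arguments and a set of (attacker, attacked) pairs."""
    for a in args:
        for b in args:
            if a == b:
                continue
            (fa, _), ca = a
            (fb, _), cb = b
            # constraint 1: same feature in the supports, no attack allowed
            if fa == fb and (a, b) in attacks:
                return False
            # constraint 2: same non-empty conclusion, no attack allowed
            if ca is not None and ca == cb and (a, b) in attacks:
                return False
            # constraint 3: conflicting conclusions on different features
            # require an attack in at least one direction
            if (fa != fb and ca is not None and cb is not None and ca != cb
                    and (a, b) not in attacks and (b, a) not in attacks):
                return False
    return True

# ADG_1 from Example 1
a1 = (("k", "yes"), None)
a2 = (("w", "yes"), "surf")
a3 = (("s", "yes"), "fish")
adg1_attacks = {(a1, a2), (a2, a3), (a3, a2)}
print(well_formed([a1, a2, a3], adg1_attacks))  # True
```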
We then consider the following situations:
1. If there is at least one accepted predictive argument, the value predicted by the ADG is the value of the conclusion of any of the accepted predictive arguments.
2. If the extension is empty but there is at least one predictive argument that is undecided, the predicted value of the ADG is the status undecided: there are arguments that could be used to predict the target variable, but they are part of conflicts that cannot be resolved.
3. In all the other cases, when there are neither accepted nor undecided predictive arguments, the value returned by the ADG is the status unknown.

Algorithm 1: The function f_ADG to make predictions given a tuple t ∈ D. The dataset D is described by the features f_1, …, f_m and target variable y taking values in the set V.

 Inputs: ADG = ⟨Ar, →⟩, a tuple t ∈ D
 Output: a predicted value in the set V ∪ {und, unk}
 S ← verified(Ar, t)
 L ← Grounded(ADG↓S)
 if in(L) ≠ ∅ then
     if ∃a ∈ in(L) | a = ⟨s, y(v)⟩ then
         return v
 else
     if undec(L) ≠ ∅ ∧ ∃a ∈ undec(L) | a = ⟨s, y(v)⟩ then return und
 return unk

Example 2. Let us consider again ADG_1 and ADG_2 and the dataset D_1 in Table 1. Let us consider the first tuple t_1 = ⟨yes, yes, yes, fish⟩ and compute f_ADG1(t_1). All three arguments a_1, a_2, a_3 are verified by t_1, and therefore the argumentation graph to be considered is equivalent to the full one. The grounded semantics returns two accepted arguments {a_1, a_3}, since a_2 is defeated by a_1. Since a_3 = ⟨s(yes), a(fish)⟩, then f_ADG1(t_1) = fish. If we consider the tuple t_2 = ⟨yes, yes, no, surf⟩ (the weather is sunny and windy and Paul has no sore knee), only a_2 and a_3 are verified by t_2, and the resulting argumentation framework is a pair of mutually attacking arguments labelled undec by the grounded semantics; therefore f_ADG1(t_2) = und.
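Algorithm 1 can be sketched in a few lines of Python. The grounded labelling is computed with the standard iterative procedure (label in every argument whose attackers are all out, out every argument with an in attacker, undec the rest); the encoding ((feature, value), conclusion-or-None) and all names are ours. The sketch reproduces f_ADG1(t_1) and f_ADG1(t_2) of Example 2:

```python
def grounded(args, attacks):
    """Grounded labelling via the standard fixpoint procedure."""
    label = {}
    changed = True
    while changed:
        changed = False
        for a in args:
            if a in label:
                continue
            attackers = [b for b in args if (b, a) in attacks]
            if all(label.get(b) == "out" for b in attackers):
                label[a] = "in"       # all attackers rejected (or none)
                changed = True
            elif any(label.get(b) == "in" for b in attackers):
                label[a] = "out"      # defeated by an accepted argument
                changed = True
    return {a: label.get(a, "undec") for a in args}

def predict(adg_args, attacks, t):
    """Sketch of Algorithm 1: restrict to the verified arguments, apply
    the grounded labelling, return a value, 'und' or 'unk'."""
    verified = [a for a in adg_args if t.get(a[0][0]) == a[0][1]]
    lab = grounded(verified, attacks)
    accepted = [a for a in verified if lab[a] == "in" and a[1] is not None]
    if accepted:
        return accepted[0][1]         # well-formedness: one conclusion
    if any(lab[a] == "undec" and a[1] is not None for a in verified):
        return "und"
    return "unk"

# ADG_1 of Example 1
a1 = (("k", "yes"), None)
a2 = (("w", "yes"), "surf")
a3 = (("s", "yes"), "fish")
args, attacks = [a1, a2, a3], {(a1, a2), (a2, a3), (a3, a2)}
print(predict(args, attacks, {"s": "yes", "w": "yes", "k": "yes"}))  # fish
print(predict(args, attacks, {"s": "yes", "w": "yes", "k": "no"}))   # und
```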
Regarding tuple t_6 = ⟨no, no, yes, fish⟩ (the weather is neither sunny nor windy and Paul has a sore knee), only argument a_1 is verified, and it is accepted by the grounded semantics. However, this argument does not predict the target variable and, since there are no undecided arguments, nothing can be said about the target variable: f_ADG1(t_6) = unk.

By comparing the predictions of an ADG with the actual values of the target variable, the usual performance metrics derived from the confusion matrix, such as accuracy, precision, recall and F1-score, can be computed. The confusion matrix is a joint-frequency table of the predicted versus the actual values. In binary classification the confusion matrix is a 2 × 2 table; however, since an ADG can also return the two special values undecided and unknown, the confusion matrix has two additional rows. Figure 2 shows the confusion matrices for ADG_1 and ADG_2 computed on the dataset D_1.

Figure 2: Confusion matrices for ADG_1 and ADG_2 for the dataset D_1 of Table 1.

We note how both ADG_1 and ADG_2 have perfect accuracy when a prediction is generated (this is measured by a decision accuracy equal to 1); however, ADG_1 is in the unknown status in 4 out of 10 cases and undecided in one case, and therefore its overall accuracy is 0.5. ADG_2 is a better predictive model, since it makes predictions in more cases than ADG_1 without adding any prediction error. This is reflected by its higher F1-score.

===4. A critical comparison with Decision Trees===

In this section we outline some interesting differences and similarities between ADGs and decision trees. We consider again a dataset D with a set of features {f_1, …, f_m, y}, where y is the target variable. A decision tree (DT) [3] is a tree structure where each node is associated with a test on a feature (also referred to as a variable in DT terminology), and each terminal node has an associated value of the target variable.
Given an input tuple, a decision tree returns a predicted value for the target variable (the decision). To take a decision, the tree is visited starting from the root node. At each node the associated test is evaluated, and its result determines which child node is visited next, until a terminal node is reached and a prediction is made. Figure 3 shows two decision trees for the dataset D_1 in Table 1. T_1 has three internal nodes and four terminal nodes. T_2 has two internal and three terminal nodes, and it does not use all the features (for instance, windy weather is not used).

Decision trees can be learnt from data, and many efficient algorithms have been proposed, such as ID3, C4.5 and CART [6]. The fundamental idea is similar: a tree is learnt in a top-down, recursive, divide-and-conquer fashion. Starting from the root node, the algorithm selects the best variable with which to split the tree into two or more sub-trees. The best variable is the one that splits the tree into branches such that the target variable becomes easier to predict in each branch. Decision-tree algorithms use different measures for the concept of "easier to predict", such as the Gini index, the information gain (defined as the reduction in the entropy of the target variable conditioned on knowing the value of the variable tested for the split) and the gain ratio.

We now discuss some key differences and similarities between ADGs and decision trees (DTs).

Figure 3: Two decision trees for the dataset D_1 of Table 1.

Entry point. A decision tree has a single entry point, the root of the tree, and a specific order in which nodes are tested to generate a prediction. On the contrary, an ADG has no entry point, and arguments can be evaluated in any order. The consequence is that, in case of partial information (that is, when the value of one or more features is missing or unknown), an ADG has a higher chance of returning a prediction.
For instance, if the information about the feature sore knee is not available, T_1 and T_2 cannot return a prediction, while ADG_1 and ADG_2 could.

Adding new knowledge. Suppose that new knowledge needs to be added to the dataset in the form of a new feature/variable. An argumentation graph is easier to expand: to add a new argument, it is enough to define the set of attacks involving the new node, and the previous structure of the graph is unchanged. On the other hand, adding a variable to a decision tree without changing its previous structure can be done only by adding new terminal nodes testing the new variable, or by using the new variable as the new root. Retraining the tree would in general generate a very different tree, in which the new variable might appear in different parts of the structure. It can be observed that, also for an ADG, finding the optimal graph might require modifying the previous nodes and links. However, thanks to its less constrained topology, an ADG has more options for accommodating a new variable without changing the existing graph.

Representation of knowledge. The topology of a DT and that of an ADG are certainly different. Despite the differences, however, both models represent knowledge as a set of logical rules in disjunctive normal form. A tree is an acyclic graph where every path is disjoint from the others. Every path represents a distinct rule to reach a conclusion and, once a path is taken, it is not possible to move back to another path. Multiple paths are usually present to reach the same conclusion. Therefore a DT represents knowledge as a set of mutually exclusive rules, where each split represents a specialization of the support of a rule. On the contrary, an ADG represents knowledge as a set of default rules, potentially forming an inconsistent (conflicting) set and subject to exceptions.
Every unidirectional attack provides an exception to a rule (i.e., to an argument), further specializing the conditions under which the rule can be used. Depending on the context and the data, the conflict-based representation of the ADG could generate more understandable and compact models than the mutually exclusive rules generated by the DT, or vice versa.

Explanations. Decision trees are regarded as one of the most explainable data-mining models. The tree structure, if not too complex, can be easily understood by a human. Each branch of the tree represents a specific input case, and each branch identifies a logical rule consisting of the conjunction of all the tests used to reach the terminal node from the root. For instance, sore knee and not windy weather is an explanation for Paul going fishing in the tree T_2. The length of an explanation is the length of the path from the root node to the terminal node, and the average length of all the paths, often weighted by the number of tuples covered by each path, is a measure of the average length of the explanations generated by the decision tree. Regarding an ADG, the structure can also be understood by a human if the argumentation graph is small, although it could be argued that the free topology of an argumentation graph makes it harder to understand. However, even when the graph is large, the application of the chosen semantics reduces the number of arguments that are necessary to explain a decision. In general, the arguments defeated by an accepted argument are irrelevant to the decision and, if we are using the grounded semantics and the set of accepted arguments is not empty, the undecided arguments are also irrelevant. Moreover, we do not need all the accepted arguments, since the acceptability of some arguments might depend on the acceptability of others. The explanations could therefore be compact even for a large graph.
The analysis of explanations is an important issue that will be covered in future works.

Variable importance. In a decision tree, the importance of a variable is measured by the information gain, that is, the reduction in the conditional entropy of the target variable. In other words, a variable is important if, by knowing its value, the target variable can be predicted with more certainty. In an ADG, various measures of the importance of a variable could be defined. For instance, the importance of an argument could be proportional to the number of input tuples for which the argument is necessary to obtain a prediction. Note how, since an argument refers to a specific feature-value pair, an ADG has a more fine-grained notion of variable importance; an aggregated value for each feature can then be derived.

Semantics. Another difference between DTs and ADGs is the way their outputs are produced. An ADG is evaluated using argumentation semantics to solve the conflicts encoded in the graph. Given an input tuple, an ADG is evaluated by means of an argumentation semantics that labels the arguments as accepted, rejected or undecided. The undecided label provides an ADG with a built-in way of quantifying the uncertainty deriving from conflicting information. Moreover, semantics provide a way of representing different strategies for accepting arguments. While the grounded semantics is the most skeptical semantics, maximizing the set of undecided arguments, semantics such as the preferred are credulous semantics maximizing the set of accepted arguments. Semantics other than the grounded are often multi-status, meaning that they generate multiple sets of accepted arguments, all consistent with the semantics used. Multi-status semantics could be used to present multiple consistent scenarios to the decision maker, each with valid reasons to be accepted.

Expressiveness. Regarding the expressiveness of the two models, they are equivalent.
Given an ADG, it is possible to define a DT computing exactly the same function: if we compute the output of an ADG for all the input tuples, we can build an exhaustive DT where each path represents an input tuple and the value predicted by the terminal node is the ADG output. Given a DT, an equivalent ADG can be built by following these rules:
1. For each directed link l from a node n to a terminal node q of the DT, create a predictive argument ⟨f_n(v_l), y_q⟩, where f_n is the variable tested at node n, v_l is the value of f_n identifying the link l connecting n to the terminal node q, and y_q is the value of the target variable predicted at node q.
2. For each link l from a node n to a non-terminal node q, create an argument ⟨f_n(v_l), ∅⟩, where f_n is the variable tested at n and v_l is the value of the feature identifying the link l.
3. After having generated the arguments of the ADG using rules 1 and 2, for each argument a associated with a link l_a from node n to q, add to the attack relation → the pairs (a, b), where b is any argument associated with a link l_b lying on a directed path P connecting n to a terminal node with l_a ∉ P, unless the two arguments use the same feature or predict the same value for the target variable.

The idea is that terminal nodes provide the arguments that predict the target variable, while non-terminal nodes provide the non-predictive arguments. Regarding the attack relation, an argument generated by a link l from node n to q attacks all the arguments generated by links that are mutually exclusive with l and belong to the sub-tree rooted at node q.

Figure 4: The ADG equivalent to T_1.

As an example, let us consider the decision tree T_1 and build the corresponding ADG. According to rule 1, the ADG has the following predictive arguments: a_1 = ⟨w(yes), surf⟩, a_2 = ⟨w(no), fish⟩, a_3 = ⟨s(yes), fish⟩, a_4 = ⟨s(no), surf⟩.
According to rule 2, it has the following two non-predictive arguments: a_5 = ⟨k(no), ∅⟩, a_6 = ⟨k(yes), ∅⟩. Regarding the attack relation, according to rule 3 the only attacks are: a_5 attacks a_3 and a_4, while a_6 attacks a_1 and a_2. Argument a_5 does not attack a_6 and vice versa, since they use the same feature and are therefore mutually exclusive. The resulting ADG is shown in Figure 4. We conclude by observing that, although the two models are equivalent, this does not guarantee that the two algorithms will learn similar models from data. In fact, the way the relationship between input tuples and target variable is learnt, and the way knowledge is represented, imply that the two learnt models will in general be quite different.

===5. An algorithm to learn ADGs from data===

In this section we present a first simple algorithm to learn an ADG from data. The algorithm is a preliminary attempt and can be improved in multiple ways; however, it represents a starting point to study and test the potential of ADGs.

Algorithm 2: The function BuildADG to learn an ADG from a dataset D. The dataset D is described by the features f_1, …, f_m and target variable y. The target variable is binary and takes the values y+, y−. F_i represents the set of all possible values for the feature f_i.
The function eval returns the selected performance indicator for a given ADG.

 Inputs: dataset D, a performance threshold Δ
 Output: an ADG
 Function BuildADG(D, Δ):
     perf ← 0 ; Arg ← ∅
     for each distinct pair (f_i, v_i) do
         Arg ← Arg ∪ {⟨f_i(v_i), y+⟩, ⟨f_i(v_i), y−⟩, ⟨f_i(v_i), ∅⟩}
     ADG_best ← ⟨∅, ∅⟩
     while Arg ≠ ∅ do
         ADG_old ← ADG_best
         for a ∈ Arg do
             ADG_new ← AddArg(a, ADG_best)
             perf_new ← eval(ADG_new)
             if perf_new > perf + Δ then
                 perf ← perf_new
                 ADG_best ← ADG_new
                 a_best ← a
         if ADG_best == ADG_old then
             return ADG_best
         Arg.remove(a_best)
     return ADG_best

 Function AddArg(a = ⟨f_a(v_a), y_a⟩, ADG = ⟨Args, →⟩):
     for b = ⟨f_b(v_b), y_b⟩ ∈ Args do
         ADG′ ← ⟨Args ∪ {a}, → ∪ {(a, b)}⟩
         ADG″ ← ⟨Args ∪ {a}, → ∪ {(b, a)}⟩
         ADG‴ ← ⟨Args ∪ {a}, → ∪ {(a, b), (b, a)}⟩
         if f_a ≠ f_b ∧ y_a ≠ y_b ∧ y_a ≠ ∅ ∧ y_b ≠ ∅ then
             ADG ← G ∈ {ADG′, ADG″, ADG‴} where eval(G) is maximal
         else if f_a ≠ f_b then
             ADG ← G ∈ {ADG, ADG′, ADG″, ADG‴} where eval(G) is maximal
     return ADG

The algorithm, called BuildADG, builds an ADG incrementally by adding one argument at a time and by expanding the attack relation so as to maximize a performance indicator, such as the overall accuracy of the ADG. BuildADG is shown in Algorithm 2. We consider a dataset D with features {f_1, …, f_m, y}, where y is the target variable; each feature f_i takes values from its corresponding set F_i, while the target variable is binary and takes the two values y+ and y−. The algorithm has two inputs: the dataset D and a tuning parameter Δ. The algorithm starts by identifying the candidate arguments that will be used to build the ADG: for each feature-value pair ⟨f_i, v⟩ three arguments are added, one predicting y+, one predicting y−, and the neutral argument with empty conclusion ⟨f_i(v), ∅⟩.
At each iteration, the algorithm adds to the ADG the candidate argument a that increases the performance of the resulting ADG by the largest margin, but only if the addition of a improves the performance by at least Δ over the previous ADG. Argument a is then removed from the list of candidates, and the procedure is repeated until all the arguments have been tested or the performance of the ADG can no longer be increased by Δ. The function AddArg is responsible for adding a new argument a to the ADG. For each argument b already in the ADG, we need to decide how the new argument a interacts with b. The resulting ADG must be well-formed to avoid logical inconsistencies. If arguments a and b are mutually exclusive, there is no need to add an attack link between them. The same holds when a and b predict the same value of the target variable y. If both a and b are predictive but predict different values of y, and their supports are not mutually exclusive, an attack must be present to preserve conflict-freeness. There are three possibilities: a attacks b, b attacks a, or they attack each other mutually. One of these three attacks must be present to keep the ADG well-formed, and the one generating the best ADG is kept. If either a or b is non-predictive, there is also a fourth possibility: no attack relation between the two arguments (line 30).

6. Evaluation

In this section we provide a first evaluation of the BuildADG algorithm and compare its results to a C4.5 decision tree. The evaluation compares the total accuracy of the two models on three benchmark datasets. An evaluation considering other performance indicators, or the explainability of the models, is left for future work. The datasets used for the evaluation are well-known, publicly available datasets [7] for binary classification where all the features are categorical.
We used Δ = 0 as the parameter value, meaning that any improvement to the ADG is retained. Table 2 shows the results for the three datasets. Performance was computed by randomly dividing each dataset into a training and a testing set using an 80%-20% split ratio. We report the accuracy, the balanced accuracy and the size of both the decision tree and the ADG, measured by the number of nodes and links of the graph. Table 3 shows the most important features considered by the decision tree and by the ADG. Information gain is used to rank features in the decision tree, while the proportion of times an argument is necessary to make a prediction is used to rank features in the ADG.

Table 2
Results obtained by the BuildADG and the C4.5 algorithms. The size column reports the number of nodes/links.

                                      Decision Tree                        ADG
Dataset     Records  Features   Acc   CI 95%        B. Acc  Size     Acc   B. Acc  Size
Bank        600      11         0.82  [0.74, 0.89]  0.82    28/27    0.75  0.75    6/7
US Census   32561    14         0.84  [0.83, 0.85]  0.73    14/13    0.83  0.75    21/78
Car Price   1782     6          0.94  [0.91, 0.96]  0.92    20/19    0.88  0.88    8/11

In both the bank and car price datasets, the accuracy of the decision tree was statistically higher than the ADG accuracy. The gap was smaller for the car price dataset, where the ADG registered a balanced accuracy of 0.88. For the US census dataset we obtained positive results: the accuracy of the ADG was within the 95% confidence interval of the decision tree accuracy and only marginally lower (83.4% versus 84.2%), and the balanced accuracy of the ADG was higher than that of the decision tree. Regarding the size of the models learnt, we measured it by the number of nodes and links (note that the number of links in a tree is n − 1, where n is the number of nodes). The ADG graph was smaller for the car price and bank datasets, but more complex for the US census dataset. Table 3 shows the most important variables (ranked by importance) for the decision tree and for the ADG.
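For completeness, the two reported metrics can be computed as follows. This is a small self-contained sketch with our own helper names; the paper does not state how the 95% confidence interval was obtained, so the usual normal approximation for a proportion is assumed here.

```python
import math

def balanced_accuracy(y_true, y_pred, classes=("y+", "y-")):
    """Mean of the per-class recalls: robust to class imbalance."""
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def accuracy_ci95(acc, n):
    """Normal-approximation 95% confidence interval for an accuracy
    estimated on n test instances (z = 1.96)."""
    half = 1.96 * math.sqrt(acc * (1.0 - acc) / n)
    return (max(0.0, acc - half), min(1.0, acc + half))

# Example: one y+ instance misclassified out of four
y_true = ["y+", "y+", "y-", "y-"]
y_pred = ["y+", "y-", "y-", "y-"]
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

In the example, plain accuracy is also 0.75, but on an imbalanced test set the two metrics diverge, which is why both are reported in Table 2.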
There is a good degree of overlap for all three datasets, meaning that the two algorithms substantially agreed on the most important variables.

Discussion. The results obtained are promising, but they also expose some of the weaknesses of the preliminary algorithm proposed. Indeed, the algorithm is naive in several respects. For instance, the addition of a new argument is accepted if the absolute number of correctly classified instances increases by at least Δ, even if the accuracy (i.e. the percentage of correct predictions) could decrease, showing that the algorithm has a bias in favour of increasing the coverage of the instances rather than maximizing accuracy. The algorithm also needs a pruning procedure, similar to the ones used for decision trees. The result on the US census dataset shows that an ADG can become very complex, harming the understandability of the outputs and also increasing the chance of overfitting. A pruning strategy should reduce the complexity of the model while keeping its accuracy high. One idea is to introduce a regularization parameter similar to the one used in the cost-based pruning mechanism of a decision tree: a parameter that penalizes complex ADGs, forcing the algorithm to find a trade-off between complexity and performance. Finally, our preliminary algorithm is not computationally efficient, and it can be optimized in multiple ways, including an approximation using Monte Carlo simulation.

Table 3
Most important variables for decision tree and ADG

Dataset     Decision Tree                         ADG                                  ∩
Bank        age, children, income                 children, income, married            2/3
US Census   relation, marital status, cap gain,   cap gain, degree, ed num,            5/7
            degree, ed num, gender, occupation    hours, relations, age, gender
Car Price   persons, safety, price, maintenance   persons, safety, maintenance,        3/4
                                                  lug boot

7. Related Works

Recently, there has been an increasing number of studies mixing argumentation theory and machine learning methods.
However, very few studies directly address the problem of learning an argumentation graph from data. The large majority of applications are in the field of argumentation mining, where arguments are extracted automatically from text using NLP techniques, either in a data-driven fashion or in a mixed approach where an explicit structure of arguments has to be matched against the text [8],[9],[10]. In both cases, learning an argumentation graph from data is not part of the problem. In [11] the authors learnt the strengths of a probabilistic argumentation framework using Bayesian inference, while in [12] the authors proposed an algorithm to learn a probabilistic argumentation graph from a given set of extensions. A different line of work adopts machine learning techniques to predict the acceptability of arguments under a given semantics. In [13] this problem was tackled by modelling it as a multinomial classification task and using convolutional neural networks to learn the acceptability of arguments. The work by Craandijk and Bex [14] had a similar aim: the authors proposed to use special neural networks, called argumentation graph neural networks (AGNN), to learn a binary classification model predicting whether an argument is accepted or rejected. The difference from our approach is that in those works an argumentation framework already exists, and the problem is to learn the output of a semantics applied to it. By contrast, our problem is to learn such an argumentation framework from data. Moreover, our aim is to learn a Dung-like argumentation framework that can generate understandable justifications, while the aim of both [13] and [14] is not the intelligibility of the model, but rather to train a black-box deep neural network to compute a semantics accurately.
In [15] the authors proposed to use genetic algorithms to learn a gradual argumentation graph, viewed as an instance of a sparse multi-layer neural network. To obtain an interpretable model, the authors proposed a fitness function balancing the sparseness and the accuracy of the classifier. The paper presents experimental results on standard benchmark datasets from the UCI machine learning repository [7]; the results showed an accuracy comparable to decision trees across the three datasets evaluated. In [16] the authors proposed a method to equip autonomous agents with the ability to argue and explain their decisions. Similarly to our approach, arguments and attack relations between arguments are built from a set of training examples, and the generation of arguments is also based on all the feature-value pairs found in the training dataset. Two types of attacks are defined: symmetrical rebuttal attacks and unidirectional undercutting attacks. Unlike in our work, these attacks are identified explicitly from the structure of the arguments rather than being decided on the basis of how well they fit the given dataset.

8. Conclusions

In this paper we presented a novel data-mining model called argumentative decision graphs (ADG). An ADG is a special argumentation framework where arguments have a rule-based structure and an attack relation is defined among arguments. ADGs are learnt from data in a supervised way and can be used for classification tasks. We have discussed the main differences and similarities with related models such as decision trees, showing a translation between the two formalisms. Unlike decision trees, the output of an ADG can also be an undecided status, occurring when the graph does not have enough reasons to predict a value for the target variable. This is due to the use of argumentation semantics to identify the arguments of an ADG that are accepted and, consequently, to make a prediction on the target variable.
We evaluated a preliminary greedy algorithm to learn an ADG from data using benchmark datasets and compared our results with the C4.5 decision tree algorithm. Our results showed that ADGs had an accuracy lower than or comparable to decision trees, a generally less complex model, and a good agreement with decision trees on the importance of the variables. The algorithm presented is naive in some of its assumptions and can be improved in many respects, including how arguments interact, the way the impact of an argument is evaluated, and how to reduce its computational time. Overall, we believe we have provided enough evidence to justify further research into ADGs.

References

[1] P. M. Dung, On the acceptability of arguments and its fundamental role in nonmonotonic reasoning, logic programming and n-person games, Artificial Intelligence 77 (1995) 321–357.
[2] M. W. Caminada, D. M. Gabbay, A logical account of formal argumentation, Studia Logica 93 (2009) 109–145.
[3] J. R. Quinlan, Induction of decision trees, Machine Learning 1 (1986) 81–106.
[4] F. V. Jensen, T. D. Nielsen, Bayesian networks and decision graphs, volume 2, Springer, 2007.
[5] J. L. Pollock, Cognitive carpentry: a blueprint for how to build a person, MIT Press, 1995.
[6] B. Hssina, A. Merbouha, H. Ezzikouri, M. Erritali, A comparative study of decision tree ID3 and C4.5, International Journal of Advanced Computer Science and Applications 4 (2014) 13–19.
[7] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.
[8] O. Cocarascu, A. Stylianou, K. Čyras, F. Toni, Data-empowered argumentation for dialectically explainable predictions, in: ECAI 2020, IOS Press, 2020, pp. 2449–2456.
[9] O. Cocarascu, F. Toni, Detecting deceptive reviews using argumentation, in: Proceedings of the 1st International Workshop on AI for Privacy and Security, 2016, pp. 1–8.
[10] M. Lippi, P.
Torroni, Argument mining: A machine learning perspective, in: International Workshop on Theory and Applications of Formal Argumentation, Springer, 2015, pp. 163–176.
[11] K. Noor, A. Hunter, A Bayesian probabilistic argumentation framework for learning from online reviews, in: 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), IEEE, 2020, pp. 742–747.
[12] R. Riveret, G. Governatori, On learning attacks in probabilistic abstract argumentation, in: Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, 2016, pp. 653–661.
[13] I. Kuhlmann, M. Thimm, Using graph convolutional networks for approximate reasoning with abstract argumentation frameworks: A feasibility study, in: International Conference on Scalable Uncertainty Management, Springer, 2019, pp. 24–37.
[14] D. Craandijk, F. Bex, Deep learning for abstract argumentation semantics, arXiv preprint arXiv:2007.07629 (2020).
[15] J. Spieler, N. Potyka, S. Staab, Learning gradual argumentation frameworks using genetic algorithms, arXiv preprint arXiv:2106.13585 (2021).
[16] L. Amgoud, M. Serrurier, Agents that argue and explain classifications, Autonomous Agents and Multi-Agent Systems 16 (2008) 187–209.