<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Foundations for Solving Classification Problems with Quantitative Abstract Argumentation</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Stuttgart</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Our understanding of an argument in this work follows Dung's notion of
abstract argumentation: "an argument is an abstract entity whose role is solely
determined by its relations to other arguments" [7]. In particular, we abstract
from the content of an argument and are only interested in its acceptability
dependent on the acceptability of its attackers and supporters. We consider bipolar
argumentation graphs (BAGs) of the form (A, Att, Sup), where A is a finite set of
arguments, Att ⊆ A × A is the attack relation and Sup ⊆ A × A is the support
relation. With a slight abuse of notation, we let Att(A) = {B ∈ A | (B, A) ∈ Att}
denote the attackers of A and let Sup(A) denote its supporters. Graphically, we
denote attack relations by solid and support relations by dashed edges. Figure
1 illustrates the definition. The BAG models part of a decision problem from
[17], where we want to decide whether to buy new or to sell existing stocks of a
company. A1 corresponds to the statement of an expert that recommends selling.
A2 and A3 correspond to statements by experts who contradict A1's premises
and recommend buying. Since we do not want to accept both the selling and the
buying decision, the corresponding decision arguments attack each other.</p>
      <p>We will not talk about the classical interpretation of bipolar argumentation
frameworks and move directly to quantitative approaches. Our focus is on two
bipolar approaches that seem interesting for classification tasks. Both share a
BAG as a common data structure. They enhance the BAG in different ways in
order to derive numerical degrees of acceptance.
Probabilistic epistemic argumentation [11] builds on basic probability theory.
Probabilities are assigned to arguments by means of probability functions P :
2^A → [0, 1] such that Σ_{w ∈ 2^A} P(w) = 1. Each w ∈ 2^A can be seen as one possible
world, in which the arguments in w are accepted and the remaining ones are
rejected. The probability of an argument A ∈ A under P is then defined by
adding the probabilities of all worlds in which A is accepted, that is, P(A) =
Σ_{w ∈ 2^A, A ∈ w} P(w). In order to restrict the set of all probability functions to those
that respect the BAG, different constraints have been introduced [11]. We give
a few examples here:</p>
      <p>COH: P is called coherent if for all A, B ∈ A with (A, B) ∈ Att, we have
P(B) ≤ 1 − P(A).</p>
      <p>SFOU: P is called semi-founded if P(A) ≥ 0.5 for all A ∈ A with Att(A) = ∅.</p>
      <p>SOPT: P is called semi-optimistic if P (A)
with Att(A) 6= ;.
1 PB2Att(A) P (B) for all A 2 A
Coherence encodes one possible meaning of attack relations: if A attacks B,
then the probability of A bounds the probability of B from above (and vice
versa). The intuition behind the de nition is that we do not want to accept both
an argument and its attacker. Semi-foundedness encodes the intuition that an
argument should not be rejected if there is no reason for doing so. Formally,
this means that the degree of belief in this argument should not be lower than
0:5. Semi-optimism gives another lower bound for the degree of belief. If there
are no attackers, the argument must be accepted (probability 1). Attackers will
decrease the lower bound based on their own degree of belief. These constraints
have been de ned for attack-only graphs, but they can be easily extended to
bipolar graphs. For example, dual to coherence, we could de ne a lower bound
for support edges as follows.</p>
      <p>S-COH: P is called s-coherent if for all A; B 2 A with (A; B) 2 Sup, we have</p>
      <p>P (B) P (A).</p>
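      <p>To make these constraints concrete, the following sketch checks COH, SFOU and S-COH for a given assignment of probabilities to arguments. The edge-list representation of BAGs and all function names are our own illustrative choices, not from the paper.</p>

```python
# Illustrative sketch: checking the COH, SFOU and S-COH constraints for a
# given probability assignment P over the arguments of a BAG.

def attackers(A, att):
    # attackers of A: all B with an attack edge (B, A)
    return [B for (B, C) in att if C == A]

def is_coherent(P, att):
    # COH: P(B) <= 1 - P(A) for every attack (A, B)
    return all(P[B] <= 1 - P[A] for (A, B) in att)

def is_semi_founded(P, args, att):
    # SFOU: P(A) >= 0.5 whenever A has no attackers
    return all(P[A] >= 0.5 for A in args if not attackers(A, att))

def is_s_coherent(P, sup):
    # S-COH: P(B) >= P(A) for every support (A, B)
    return all(P[B] >= P[A] for (A, B) in sup)

# Toy BAG: A1 attacks A2, A3 supports A2
args = ["A1", "A2", "A3"]
att = [("A1", "A2")]
sup = [("A3", "A2")]
P = {"A1": 0.5, "A2": 0.5, "A3": 0.5}

print(is_coherent(P, att), is_semi_founded(P, args, att), is_s_coherent(P, sup))
```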
      <p>The previous examples are all special cases of linear atomic constraints [19] that
are generally written in the normalized form
Σ_{i=1}^{n} c_i π(A_i) ≤ c_0,
where A_i ∈ A, c_i ∈ R and π is a syntactic symbol for the probability of an
argument. A probability function P satisfies such a linear atomic constraint iff
Σ_{i=1}^{n} c_i P(A_i) ≤ c_0. Note that all constraints above can be brought into this
form. For example, the Coherence constraint can be rewritten as 1 · π(A) + 1 ·
π(B) ≤ 1 and the S-Coherence constraint as 1 · π(A) + (−1) · π(B) ≤ 0. In
general, we can also consider more general epistemic constraints that allow
nonlinear combinations of probabilities and probabilities of complex formulas over
arguments [10]. However, in order to keep things simple, we do not discuss the
most general form. For future reference, we define a P-BAG as follows.
Definition 1. A P-BAG is a tuple (A, Att, Sup, C), where (A, Att, Sup) is a
BAG and C is a set of epistemic constraints over the BAG.</p>
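      <p>As a small illustration of the normal form, a linear atomic constraint can be represented as a coefficient map together with the bound c_0. The encoding below is a hypothetical sketch; the representation is our own choice.</p>

```python
# Sketch: a linear atomic constraint sum_i c_i * pi(A_i) <= c_0 encoded as
# (coeffs, c0), where coeffs maps arguments to their coefficients c_i.

def satisfies(P, constraint):
    # check sum_i c_i * P(A_i) <= c_0 (with a small numerical tolerance)
    coeffs, c0 = constraint
    return sum(c * P[A] for A, c in coeffs.items()) <= c0 + 1e-9

# COH for an attack (A, B):   1*pi(A) + 1*pi(B) <= 1
coh = ({"A": 1, "B": 1}, 1)
# S-COH for a support (A, B): 1*pi(A) + (-1)*pi(B) <= 0
s_coh = ({"A": 1, "B": -1}, 0)

P = {"A": 0.3, "B": 0.6}
print(satisfies(P, coh), satisfies(P, s_coh))  # True True
```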
      <p>Given a P-BAG (A, Att, Sup, C), we are interested in solving the following
entailment problem: compute tight upper and lower bounds on the probability of an
argument A ∈ A among all probability distributions that satisfy the constraints
in C. If we restrict to linear atomic constraints, this problem can be solved in
polynomial time [19]. To illustrate the entailment problem, consider again the
BAG in Figure 1. We consider a new constraint that we call Balance.</p>
      <p>BAL: P is called balanced if
P(A) = 1/2 + (Σ_{B ∈ Sup(A)} P(B) − Σ_{B ∈ Att(A)} P(B)) / (2 · max{|Sup(A)|, |Att(A)|})
for all A ∈ A.</p>
      <p>To improve readability, we do not present Balance in the normal form of
linear atomic constraints. Based on our previous discussion, the reader hopefully
recognizes that Balance can indeed be written in this form. Note, in particular,
that an equality can be represented by two inequalities (x = y iff x ≤ y and
y ≤ x). Intuitively, Balance enforces probability 0.5 if attackers and supporters
are equally strong and moves the probability towards 0 (1) if the attackers are
stronger (weaker) than the supporters. The entailment results are shown in
Figure 2 on the left. In general, we may get interval probabilities rather than point
probabilities. To illustrate this, we can replace BAL with the constraints SFOU,
COH and S-COH that we discussed before. The entailed probabilities are shown
in Figure 2 on the right.</p>
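      <p>Because linear atomic constraints only mention marginal probabilities, the entailment problem can be sketched as a small linear program over the marginals: any point of the constrained box [0, 1]^n is realizable by some probability function (e.g. a product distribution), so tight bounds are obtained by minimizing and maximizing one coordinate. The snippet below is an illustrative sketch, not the Probabble implementation, and assumes SciPy is available; it encodes COH for an attack (A0, A1) together with SFOU for A0.</p>

```python
from scipy.optimize import linprog

# Constraints in the normalized form sum_i c_i * pi(A_i) <= c_0, written as
# rows of A_ub with right-hand side b_ub.
# COH for attack (A0, A1): pi(A0) + pi(A1) <= 1
# SFOU for unattacked A0:  pi(A0) >= 0.5, i.e. -pi(A0) <= -0.5
A_ub = [[1, 1], [-1, 0]]
b_ub = [1, -0.5]
bounds = [(0, 1), (0, 1)]

def entailed_interval(i, n=2):
    # tight lower/upper bound on pi(A_i) over all feasible marginal vectors
    lo = linprog([1 if j == i else 0 for j in range(n)],
                 A_ub=A_ub, b_ub=b_ub, bounds=bounds).fun
    hi = -linprog([-1 if j == i else 0 for j in range(n)],
                  A_ub=A_ub, b_ub=b_ub, bounds=bounds).fun
    return lo, hi

print(entailed_interval(1))  # tight bounds on pi(A1)
```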
      <p>
        A polynomial-time implementation for solving the entailment problem for
linear atomic constraints can be found in the Java library Probabble1.
Gradual argumentation frameworks evaluate arguments by numerical strength
values. We will focus on bipolar frameworks that use the interval [0, 1] to give a
uniform presentation. A discussion of more general frameworks can be found in
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The basic building block is again a BAG. Now, additionally, every argument
has an initial weight that can be seen as an a priori belief in the argument when
ignoring all the others. For future reference, we define a G-BAG as follows.
Definition 2. A G-BAG is a tuple (A, Att, Sup, w), where (A, Att, Sup) is a
BAG and w : A → [0, 1] is a weight function.
      </p>
      <p>The main computational problem is again to assign acceptance values to
arguments. In this context, they are often called strength values. Strength values are
often computed by applying an update function to the G-BAG repeatedly until
the values converge. The update function often has a simple modular structure
consisting of an aggregation and an influence function [14]. The aggregation
function takes the strength values of all parents and combines them into a
numerical value. Intuitively, an attacker should decrease the value based on its
strength, whereas a supporter should increase the value based on its strength.</p>
      <sec id="sec-1-1">
        <title>1 https://sourceforge.net/projects/probabble/</title>
        <p>
          The influence function then adapts the initial weight based on the aggregate.
For example, the aggregation function of the DF-QuAD algorithm uses
multiplication [23]. While it has some nice properties, its particular definition has the
disadvantage that the aggregate is necessarily (close to) 0 if both an attacker and
a supporter have strength (close to) 1. The Euler-based semantics introduced in
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], uses addition instead of multiplication to overcome this problem. However,
its influence function is not symmetric about 0, which causes a counterintuitive
imbalance between attacks and supports. The quadratic-energy model from [16]
uses a symmetric influence function in order to overcome this problem. Its
aggregation function is defined by
α(A) = Σ_{B ∈ Sup(A)} s(B) − Σ_{B ∈ Att(A)} s(B),
where s : A → [0, 1] assigns the current strength value to every argument. Its
influence function is defined by
ι(A) = w(A) − w(A) · h(−α(A)) + (1 − w(A)) · h(α(A)),
where h(x) = max{0, x}² / (1 + max{0, x}²) squashes its input to the interval [0, 1]. Intuitively, a
positive aggregate will move the weight towards 1, while a negative aggregate will
move the weight towards 0. The strength values are initialized with the initial
weights. Then the aggregation and influence function are applied repeatedly until
the values converge.</p>
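        <p>A minimal sketch of the iterative update under the quadratic-energy semantics as reconstructed above; the edge-list representation, function names and the fixed iteration count are our own illustrative choices.</p>

```python
# Sketch of the quadratic-energy update:
#   aggregation: alpha(A) = sum of supporter strengths - sum of attacker strengths
#   influence:   s(A) = w(A) - w(A)*h(-alpha) + (1 - w(A))*h(alpha)
# with h(x) = max(0, x)**2 / (1 + max(0, x)**2).

def h(x):
    m = max(0.0, x)
    return m * m / (1.0 + m * m)

def update(weights, att, sup, iterations=100):
    s = dict(weights)  # strengths are initialized with the initial weights
    for _ in range(iterations):
        alpha = {A: sum(s[B] for (B, C) in sup if C == A)
                    - sum(s[B] for (B, C) in att if C == A)
                 for A in weights}
        s = {A: weights[A] - weights[A] * h(-alpha[A])
                + (1 - weights[A]) * h(alpha[A])
             for A in weights}
    return s

# Toy G-BAG: B attacks A, C supports A, all initial weights 0.5
s = update({"A": 0.5, "B": 0.5, "C": 0.5},
           att=[("B", "A")], sup=[("C", "A")])
print(round(s["A"], 3))  # attacker and supporter balance out: 0.5
```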
        <p>Figure 3 shows, at the top, possible initial weights for the example BAG on
the left and the resulting final strength values on the right. At the bottom of
Figure 3, the left graph shows how the final strength values evolve over time.
For acyclic graphs, they can be computed in linear time by a single forward
pass [16]. For cyclic BAGs, the convergence theory is actually not complete.
[14] gave some examples for cyclic G-BAGs where the strength values under
different approaches start cycling and do not converge. In the known cases,
these convergence problems can be overcome by continuizing the discrete update
approach [16] as illustrated in the right graph at the bottom in Figure 3. In all
cases where convergence guarantees are known, it is actually guaranteed that
the discrete and continuized algorithm will converge to the same strength values
[18]. Empirically, the continuous algorithm still converges in subquadratic time
[16]. The plots of the evolution of strength values may also add transparency
and explainability to gradual argumentation frameworks.</p>
        <p>Implementations for computing strength values with different gradual
argumentation approaches and plotting the evolution of their strength values can be
found in the Java library Attractor2.</p>
        <p>We will now look at how classification problems can be solved by means of
argumentation frameworks. The goal of classification is to map inputs x to outputs y.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2 https://sourceforge.net/projects/attractorproject/</title>
        <p>We think of the inputs as feature tuples x = (x_1, …, x_k), where the i-th value is
taken from some domain D_i. The output y is taken from a finite set of class labels
L. A classification problem P = ((D_1, …, D_k), L, E) consists of the domains, the
class labels and a set of training examples E = {(x_i, y_i) | 1 ≤ i ≤ N}.</p>
        <p>By a numerical classifier, we mean a function c : ×_{i=1}^{k} D_i × L → R that
assigns to every pair (x, y) a numerical value. An important special case is a
probabilistic classifier p : ×_{i=1}^{k} D_i × L → [0, 1] where Σ_{j=1}^{|L|} p(x, y_j) = 1. Then
p(x, y) ∈ [0, 1] can be understood as the confidence of the classifier that an
example with features x belongs to the class y. Let us note that every numerical
classifier c can be turned into a probabilistic classifier p_c by normalizing the label
outputs by a softmax function. That is, p_c(x, y) = exp(c(x, y)) / Σ_{j=1}^{|L|} exp(c(x, y_j)).</p>
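        <p>The softmax normalization can be sketched in a few lines; the function name and the dictionary representation of the scores are our own illustrative choices.</p>

```python
import math

# Sketch: turning a numerical classifier into a probabilistic one via softmax,
# p_c(x, y) = exp(c(x, y)) / sum_j exp(c(x, y_j)).
def softmax_normalize(scores):
    # scores: dict label -> c(x, y); shift by the maximum for numerical stability
    m = max(scores.values())
    exps = {y: math.exp(v - m) for y, v in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

p = softmax_normalize({"buy": 2.0, "sell": 0.0})
print(round(p["buy"], 3))  # ~0.881, and the confidences sum to 1
```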
        <p>Figure 4 shows the high-level architecture of our classifiers. The input is first
encoded in a BAG that models the classification problem. If the resulting model
is not a probabilistic classifier, the acceptance degrees of arguments
corresponding to the labels are then normalized by a softmax function. We will discuss
these steps in more detail in the following sections.
In order to solve a classification problem P = ((D_1, …, D_k), L, E) with
argumentation technology, we first transform the domains and class labels into
arguments. A categorical feature with domain D = {d_1, …, d_l} can be
transformed into l arguments A_{D,1}, …, A_{D,l}. Intuitively, we can identify the value
d_i with accepting A_{D,i} and rejecting the remaining arguments A_{D,j}, j ≠ i, of
the feature. A continuous feature with domain D ⊆ R can be discretized by
partitioning D into l intervals that can then be treated like discrete features. We
denote the resulting input arguments by A_in.</p>
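        <p>The input transformation can be sketched as follows; the argument naming scheme and helper functions are hypothetical choices of ours, not from the paper.</p>

```python
import bisect

# Sketch of the input transformation: a categorical feature yields one input
# argument per value; a continuous feature is discretized into intervals.

def encode_categorical(value, domain):
    # one argument per domain value; accepted (1.0) iff it matches the input
    return {f"A_{d}": 1.0 if d == value else 0.0 for d in domain}

def encode_continuous(value, boundaries):
    # boundaries [b1, ..., b_{l-1}] split the domain into l intervals (bins)
    idx = bisect.bisect_right(boundaries, value)
    return {f"A_bin{i}": 1.0 if i == idx else 0.0
            for i in range(len(boundaries) + 1)}

print(encode_categorical("edu2", ["edu1", "edu2", "edu3"]))
print(encode_continuous(35, [20, 60]))  # age 35 falls into the middle bin
```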
        <p>We proceed analogously for the class labels. For multiclass classification
problems, we create one argument for every label analogous to discrete features.
However, if the classification problem is binary, L = {c_0, c_1}, we create a single
output argument. We denote the resulting output argument(s) by A_out.</p>
        <p>To illustrate the idea, we consider a small toy classification problem inspired
by the Census dataset from the UCI machine learning repository3. We consider
three features that correspond to age (continuous), education (3 categories) and
work class (3 categories) and two class labels that correspond to an 'average or
below' or 'above average' salary. Figure 5 shows one potential BAG built up from
the corresponding input (bottom) and output (top) arguments. For simplicity, we
chose a coarse discretization into three bins. Of course, determining the number
and boundaries of the bins can also be made part of the learning process with
the usual advantages (flexibility) and disadvantages (learning complexity).
The most straightforward way to build a classifier is to take the input arguments
and output arguments and to connect them via edges in such a way that a good
classification performance on the examples is obtained.</p>
        <p>Definition 3 (Naive Classification BAG). A Naive Classification BAG for
a classification problem P = ((D_1, …, D_k), L, E) is a BAG (A, Att, Sup) such
that A = A_in ∪ A_out is the set of input and output arguments for P, Att, Sup ⊆
A_in × A_out and Att ∩ Sup = ∅.</p>
        <p>Figure 5 shows one possible naive classification BAG for our census example.
While a naive classification BAG is easy to interpret, it may not have sufficient
degrees of freedom to capture more complicated relationships. In order</p>
      </sec>
      <sec id="sec-1-3">
        <title>3 https://archive.ics.uci.edu/ml/datasets/Census+Income</title>
        <p>to overcome the problem, we propose adding hidden layers of arguments
between the input and the output layer inspired by the architecture of feedforward
neural networks [8]. Similar to the intuition of deep feedforward neural
networks, the hope is that deeper layers will form more sophisticated patterns from
the patterns detected in earlier layers. Interpretability will probably suffer with
increasing depth. However, due to the simple mechanics of the introduced
argumentation approaches, a deep abstract argumentation classifier may still be
easier to interpret than 'black box approaches' like neural networks or support
vector machines. Figure 6 shows a possible deep classification BAG for our census
example. The meaning of hidden arguments in early layers can still be intuitively
explained. For example, H_{1,1} can be roughly interpreted as saying that an above
average salary is unlikely in the age group from 20 to 60 and if the education
category is 1, unless the person is in working class 3.</p>
        <p>Definition 4 (Deep Classification BAG). A Deep Classification BAG with
k layers for a classification problem P = ((D_1, …, D_k), L, E) is a BAG (A, Att, Sup)
such that A = ∪_{i=0}^{k+1} A⟨i⟩ consists of the input arguments A⟨0⟩ = A_in and
output arguments A⟨k+1⟩ = A_out for P and additional layers of arguments
A⟨i⟩ such that A⟨i⟩ ∩ A⟨j⟩ = ∅ for i ≠ j. Furthermore, Att ∩ Sup = ∅ and
Att, Sup ⊆ ∪_{i=0}^{k} ∪_{j=i+1}^{k+1} A⟨i⟩ × A⟨j⟩, that is, edges can only be directed towards
deeper layers.</p>
        <p>In order to classify an example with a classification BAG, we have to specify
how to generate an output label from the input features. We will discuss some
ideas for P-BAGs and G-BAGs in the next sections.</p>
        <sec id="sec-1-3-1">
          <title>Classification P-BAGs</title>
          <p>In order to solve the classification problem with a P-BAG, we have to add a set
of constraints to our classification BAG. We divide the constraints into
Classification Constraints: encode the meaning of support and attack edges.
Instance Constraints: encode the input features of an example.
The classification constraints are example independent, whereas the instance
constraints change for every example (they correspond to the input
transformation in Figure 4). We start by discussing the instance constraints. The idea is
simple: for every feature, we constrain the probability of the input argument
that corresponds to the input value to 1 and the probability of the remaining
input arguments for this feature to 0. For example, if D_i = {d_{i,1}, …, d_{i,l}} and
x_i = d_{i,2}, we add the constraints π(A_{i,2}) = 1 and π(A_{i,j}) = 0 for j ≠ 2.</p>
          <p>There are various ways to define the classification constraints. Defining them
could even be made part of the learning process. However, this would complicate
the learning problem further. We therefore discuss some concrete options. Our
initial proposal is a variation of the Balance constraint that we discussed before.
We slightly generalize it by adding weights to edges. We will consider these
weights as parameters that have to be learnt. For every argument A ∈ A \ A_in
that is not an input argument, we introduce one constraint of the form
π(A) = 1/2 + (Σ_{B ∈ Sup(A)} δ_{B,A} π(B) − Σ_{B ∈ Att(A)} δ_{B,A} π(B)) / 2,   (1)
where we demand 0 &lt; δ_{B,A}, Σ_{B ∈ Sup(A)} δ_{B,A} = 1 and Σ_{B ∈ Att(A)} δ_{B,A} = 1.
Let us note that, due to the simple structure of the BAG, the probabilities of
the arguments can be computed in a single forward pass in linear time. The
probabilities of the input arguments can be set immediately according to the
instance constraints. We then go to the next layer. Since edges can only be
directed towards deeper layers, all probabilities in this layer are determined by
the probabilities of the previous layer. We can continue in this way until we set
the probabilities in the output layer.</p>
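          <p>The forward pass can be sketched as follows, assuming the weighted Balance constraint takes the form π(A) = 1/2 + (Σ_sup δ·π(B) − Σ_att δ·π(B))/2 as reconstructed above; the layered-list representation and names are our own illustrative choices.</p>

```python
# Hedged sketch of the linear-time forward pass for a layered P-BAG with
# weighted Balance constraints. layers[0] holds the input arguments, whose
# probabilities are fixed by the instance constraints.

def forward_pass(layers, inputs, att, sup):
    # att/sup: dicts mapping (B, A) edges to their weights delta_{B,A}
    pi = dict(inputs)
    for layer in layers[1:]:
        for A in layer:
            s = sum(d * pi[B] for (B, C), d in sup.items() if C == A)
            a = sum(d * pi[B] for (B, C), d in att.items() if C == A)
            pi[A] = 0.5 + (s - a) / 2.0
    return pi

layers = [["X1", "X2"], ["Y"]]
inputs = {"X1": 1.0, "X2": 0.0}
sup = {("X1", "Y"): 1.0}   # X1 supports Y with weight 1
att = {("X2", "Y"): 1.0}   # X2 attacks Y with weight 1
print(forward_pass(layers, inputs, att, sup)["Y"])  # 0.5 + (1 - 0)/2 = 1.0
```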
          <p>However, we note that the resulting classifier is necessarily a linear classifier.
By induction, we can see that the probability at every argument is just a linear
combination of the inputs: for the first layer, this is obvious. For the subsequent
layers, it follows by induction, since a linear combination of linear combinations
is again a linear combination. Therefore, the output probability is a linear
function of the inputs. As a consequence, the classifier is only able to learn linearly
separable functions. For example, it is not possible to learn to classify the XOR
(exclusive or) function correctly when using the linear balance constraints in (1).</p>
          <p>Fortunately, if we keep the structure sufficiently simple, we can still deal with
non-linear constraints efficiently by performing a single forward pass through the
BAG as before. We therefore propose the following non-linear variant of Balance:
π(A) = β_A + (1 − β_A) · max{0, Σ_{B ∈ Sup(A)} δ_{B,A} π(B) − Σ_{B ∈ Att(A)} δ_{B,A} π(B)}
− β_A · max{0, Σ_{B ∈ Att(A)} δ_{B,A} π(B) − Σ_{B ∈ Sup(A)} δ_{B,A} π(B)},
where again 0 &lt; δ_{B,A}, Σ_{B ∈ Sup(A)} δ_{B,A} = 1, Σ_{B ∈ Att(A)} δ_{B,A} = 1 and
furthermore 0 ≤ β_A ≤ 1. The new parameter β_A replaces 0.5 and allows learning a
bias for the probability of this argument. The first term moves the probability
towards 1 if the supporters are stronger, the second term moves the probability
towards 0 if the attackers are stronger. To illustrate that this is sufficient to
capture non-linear relationships, Figure 7 shows, on the left, a P-BAG for the
XOR function. The graph on the right illustrates the classification process for a
positive example. The output is 0.75 (the positive label is accepted). By going
backward through the graph, the result can be explained. Y is accepted because
its supporter H_{1,1} is accepted. H_{1,1}, in turn, is accepted because X1 = 0 and
X2 ≠ 0 (the inputs differ). When using the BAG in Figure 8 with edge weights
1, the XOR function will actually be perfectly reproduced. That is, in the
table in Figure 7, we will have 1 instead of 0.75 and 0 instead of 0.25. However,
the P-BAG in Figure 7 also classifies every example correctly (when the decision
threshold for a positive example is 0.5) and is easier to interpret.</p>
          <p>For G-BAGs, we have to decide how to define the initial weights and which
update function we use. For the update function, we propose a variant of the
quadratic energy model that we explained before. Our variant again adds edge
weights that are supposed to be learnt during the training process. The
aggregation function is defined by
α(A) = Σ_{B ∈ Sup(A)} δ_{B,A} s(B) − Σ_{B ∈ Att(A)} δ_{B,A} s(B),
where s : A → [0, 1] assigns the current strength value to every argument as
before. The influence function is defined by
ι(A) = w(A) − w(A) · h(−α(A)) + (1 − w(A)) · h(α(A))
with h(x) = max{0, x}² / (1 + max{0, x}²). We demand 0 &lt; δ_{B,A}.</p>
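          <p>To illustrate the non-linear Balance constraint concretely, the following sketch evaluates it by a forward pass on a small hand-built BAG that separates XOR. This is our own illustrative construction (with bias values chosen by hand), not the exact BAG from Figure 7 or 8.</p>

```python
# Sketch of the non-linear Balance constraint
#   pi(A) = beta + (1 - beta)*max(0, S - T) - beta*max(0, T - S),
# where S and T are the weighted supporter and attacker sums.

def nl_balance(beta, sup_sum, att_sum):
    return (beta + (1 - beta) * max(0.0, sup_sum - att_sum)
                 - beta * max(0.0, att_sum - sup_sum))

def xor_bag(x1, x2):
    # hidden detectors with bias 0: H10 fires on (1,0), H01 fires on (0,1)
    h10 = nl_balance(0.0, x1, x2)   # supported by X1, attacked by X2
    h01 = nl_balance(0.0, x2, x1)   # supported by X2, attacked by X1
    # output supported by both detectors with weights 1/2, bias 1/3:
    # above 0.5 iff the inputs differ
    return nl_balance(1.0 / 3, 0.5 * h10 + 0.5 * h01, 0.0)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xor_bag(x1, x2), 3))
```

With decision threshold 0.5, the sketch classifies all four XOR examples correctly (outputs 2/3 for differing inputs and 1/3 for equal ones).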
          <p>The weights of the input arguments can again be set based on the input. Since
they do not have any ingoing edges, this weight will also be their final strength.
For setting the weights, we proceed similarly as before. For every feature, we set
the weight of the argument corresponding to the input value to 1 and the weight
of the remaining input arguments for this feature to 0.</p>
          <p>Due to the acyclicity of the classification BAG, the final strength values can
again be computed by a single forward pass through the graph in linear time.
The result is guaranteed to be equal to the result of the iterative and continuous
update approach [16]. Figure 8 shows a G-BAG for the XOR function. The more
compact BAG in Figure 7 could also be used for a G-BAG for XOR. In particular,
as we increase the parameter values at the edges from 1 towards infinity, the outputs for
positive (negative) examples will move to 1 (0). We will come back to the BAG in
Figure 8 later because it illustrates an idea to classify arbitrary discrete functions
with classification BAGs.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Learning</title>
      <p>We finally discuss some ideas for learning classification BAGs from data. We
divide the learning problem into parameter and structure learning. For parameter
learning, we suppose that the classification BAG is already given and we only
have to learn the weights. For structure learning, both the classification BAG
(hidden arguments, edges) and the parameters have to be learnt.</p>
      <sec id="sec-2-1">
        <title>Parameter Learning</title>
        <p>One common way to learn the parameters of a model is to minimize a loss
function that measures the discrepancy between the desired label and the output of
the classifier. Recall from Figure 4 that our classifiers are probabilistic
classifiers (if not by default, then due to the softmax normalization). A standard loss
function for such classifiers is the logistic loss (a.k.a. cross-entropy loss). Given a
classifier c with parameter vector θ and examples E = {(x_i, y_i) | 1 ≤ i ≤ N},
the logistic loss of the current parameters is L(θ) = −(1/N) Σ_{i=1}^{N} log c_θ(x_i, y_i).
Intuitively, it simply takes the negative logarithm of the classifier's confidence for
the right label for every example. If the confidence for the right label is 1 (and,
thus, the confidence for the wrong labels is 0), the loss is 0. As the confidence
goes to 0, the loss goes to infinity. If we do not use the softmax normalization,
we have to be careful that the output for the desired label is non-zero, but this
can be guaranteed by some modifications of the actual implementation.</p>
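        <p>A minimal sketch of the logistic loss, applied directly to the classifier's confidences for the correct label of each example (the representation is our own illustrative choice):</p>

```python
import math

# Sketch of the logistic loss L(theta) = -(1/N) * sum_i log c_theta(x_i, y_i).
def logistic_loss(confidences):
    # confidences: c_theta(x_i, y_i) for the right label of every example
    return sum(-math.log(c) for c in confidences) / len(confidences)

print(logistic_loss([1.0, 1.0]))   # perfect confidence gives loss 0.0
print(logistic_loss([0.5, 0.25]))  # lower confidence gives a higher loss
```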
        <p>The loss is often minimized by using optimized variants of gradient descent.
In general, the loss function for classification BAGs may be non-differentiable.
However, due to the simple structure of our proposed models, we can guarantee
differentiability in several cases. The key observation is that the degrees of
acceptance can be computed by a simple forward pass for both the P-BAGs and
G-BAGs that we discussed. Therefore, the acceptance degree at the label
arguments is just a composed function of the inputs. If the involved functions are
differentiable, the loss functions are differentiable and the gradient can be
computed by automatic differentiation as implemented in libraries like TensorFlow.
For our proposed P-BAGs, the loss function is non-differentiable at some points
due to the use of the maximum. This may actually not cause any problems, but
we could also make the loss function differentiable by replacing the maximum in
our constraints with a squared maximum. For our G-BAGs, the loss function is
already differentiable.</p>
        <p>In general, we cannot apply gradient methods naively because our parameter
ranges are constrained. However, the constraints are rather simple and can be
dealt with by doing the following after every update step:
1. If a parameter exceeds a threshold b, set the parameter to (θ_0 + b)/2, where θ_0
is the previous value of the parameter.
2. Normalize parameters that have to be normalized (attack and support
parameters for P-BAGs).</p>
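        <p>The two projection steps above can be sketched as follows, assuming an upper threshold b per parameter; the function names and the dictionary representation of parameter groups are our own illustrative choices.</p>

```python
# Sketch of the projection applied after every gradient update step:
# 1. values that exceed the threshold b fall back to the midpoint between
#    the previous value and b; 2. normalization groups are rescaled to sum to 1.

def project(old, new, b=1.0):
    return {k: (old[k] + b) / 2 if new[k] > b else new[k] for k in new}

def normalize(group):
    z = sum(group.values())
    return {k: v / z for k, v in group.items()}

old = {"d1": 0.8, "d2": 0.6}
new = {"d1": 1.4, "d2": 0.6}   # d1 overshot the threshold b = 1
p = project(old, new)
print(round(p["d1"], 3))       # midpoint (0.8 + 1)/2 = 0.9
print(round(sum(normalize(p).values()), 9))  # weights again sum to 1
```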
        <p>For the classification BAGs that we introduced here, we can actually get rid of the
non-negativity constraint by considering edges with negative weights as attacks
and edges with positive weights as supports. This is also beneficial because it
allows learning the nature of an edge (attack or support) from data. However,
for other instantiations, like a G-BAG with product for aggregation, this may
not be possible as easily. Furthermore, for more general classification BAGs, it
may not be possible to obtain a differentiable loss function. We are planning to
look at two strategies for these cases:
Heuristic Gradients: compute a heuristic gradient of the loss function by
replacing the partial derivative for the i-th parameter with the approximation
(L(θ') − L(θ))/ε, where θ' is obtained from θ by increasing the i-th parameter
by a small constant ε &gt; 0.</p>
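        <p>The heuristic gradient is a standard finite-difference approximation; a minimal sketch (with an illustrative quadratic loss whose true gradient is known):</p>

```python
# Sketch of the heuristic gradient: approximate the i-th partial derivative
# by (L(theta') - L(theta)) / eps, where theta' increases the i-th parameter
# of theta by a small eps > 0.

def heuristic_gradient(loss, theta, eps=1e-6):
    base = loss(theta)
    grad = []
    for i in range(len(theta)):
        shifted = list(theta)
        shifted[i] += eps
        grad.append((loss(shifted) - base) / eps)
    return grad

# example: L(theta) = (theta_0 - 1)^2 + theta_1^2 has gradient (2t0 - 2, 2t1)
g = heuristic_gradient(lambda t: (t[0] - 1) ** 2 + t[1] ** 2, [0.0, 2.0])
print([round(x, 3) for x in g])  # close to [-2.0, 4.0]
```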
        <p>
          Meta-heuristics: apply meta-heuristics for solving numerical optimization
problems that do not require gradient information, like Differential Evolution or
Particle Swarm Optimization [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
In principle, we could just build a huge classifier BAG and apply a parameter
learning algorithm in order to solve a classification problem. Similar to the idea
of the approximation theorems for neural networks [8], a classification BAG
with a single hidden layer can learn to approximate arbitrary discrete functions.
The intuitive explanation is that we can just generate one hidden argument
for every possible input (or region of input values) that is supported by the
input arguments that agree with this input and attacked by the remaining input
arguments. This argument then supports the desired label and attacks all other
labels. The classification BAG for XOR in Figure 8 illustrates the idea (we
eliminated some redundancy, though). However, of course, this is not a very
meaningful model and will probably overfit the noisy dataset after training. It
will also be very difficult to interpret such a model.
        </p>
        <p>Since interpretability is our main motivation, we want to learn a sparse
classification BAG. The situation is similar for Bayesian networks, where one usually
wants to learn a compact network with as few spurious edges as possible [13].
Bayesian networks have some local decomposition properties that we cannot
exploit. However, there are some general learning ideas that we can immediately
apply. One way to learn the structure of Bayesian networks is by means of a
local search procedure that starts from some random graph and then repeatedly
1. creates a neighborhood of the graph by means of search operators,
2. evaluates (a subset of) the graphs in the neighborhood by minimizing the
loss function for this graph,
3. picks the best neighbor for the next iteration
until the graph cannot be improved anymore [13]. Step 2 can be easier for
Bayesian networks because often closed-form solutions for the optimal
parameters exist. We will usually have to perform a parameter search instead and may
end up with parameters that are only locally optimal. An interesting question is
therefore whether we can set up classification BAGs such that we obtain closed-form
solutions or, at least, loss minimization problems with a unique minimum. For
example, for P-BAGs with the linear constraints from (1), this may be possible.</p>
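        <p>The local search loop sketched in the three steps above can be written generically; neighbor generation and loss evaluation are stubs standing in for the search operators and the parameter learning discussed in the text, and the toy check at the end is purely illustrative.</p>

```python
# Skeleton of the greedy local search for structure learning: repeatedly
# generate neighbors, evaluate them, and move to the best one until no
# neighbor improves on the current graph.

def local_search(initial_graph, neighbors, evaluate, max_iters=100):
    current = initial_graph
    current_loss = evaluate(current)
    for _ in range(max_iters):
        candidates = neighbors(current)   # step 1: apply search operators
        if not candidates:
            break
        best = min(candidates, key=evaluate)  # step 2: evaluate neighbors
        best_loss = evaluate(best)
        if best_loss >= current_loss:
            break  # no improving neighbor: local optimum reached
        current, current_loss = best, best_loss  # step 3: move to best neighbor
    return current, current_loss

# toy check: "graphs" are integers, neighbors are +-1, loss is distance to 7
g, l = local_search(0, lambda g: [g - 1, g + 1], lambda g: abs(g - 7))
print(g, l)  # converges to 7 with loss 0
```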
        <p>Common search operators for Bayesian networks are adding, deleting and
reversing edges. Reversing edges currently does not play a role for us, but may be
interesting in order to obtain more expressive classifiers. Turning an attack edge
into a support edge may be another useful operator. We also want to add hidden
arguments. Since just adding arguments without any edges cannot change the
classification outcome, operators may introduce new arguments between existing
connections and summarize these connections.</p>
        <p>
          The search can be guided by local search heuristics like simulated annealing
or beam search [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In order to simplify the search process, to avoid overfitting
and to improve interpretability, it is reasonable to restrict the possible
structures. In addition to the forward structure that we assumed throughout this
paper, it also seems reasonable to restrict the number of layers, the number
of arguments per layer and the number of ingoing and outgoing edges. We are
currently working on an implementation to evaluate different strategies.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>There has been growing interest in combining argumentation technology and
machine learning in recent years. [25] proposed some ideas for solving classification
problems by means of structured argumentation. As opposed to abstract
argumentation, structured argumentation explicitly takes the structure of
arguments, like their premises and conclusions, into account. The idea in [25] is to
apply a rule mining algorithm first in order to learn structured arguments. A
structured argumentation solver can then be applied in order to derive a label
for given inputs and to explain the outcome. While this is a very interesting idea
for explainable classification, it relies very much on the success of the underlying
rule mining algorithm. In particular, it is currently unclear how to train such a
model in an end-to-end fashion such that the rule mining process is guided by
the classification outcome of the reasoner.</p>
      <p>
        Gradual argumentation frameworks have been combined successfully with
machine learning methods in order to add explainability to problems like
product recommendation [22] or review aggregation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, the
argumentation framework and the machine learning method are again not trained in an
end-to-end fashion. Instead, a machine learning method is applied rst and the
argumentation framework is applied on top of the result.
      </p>
      <p>[9] recently proposed some ideas for learning epistemic constraints for
PBAGs from data. The constraints are more general than what we considered here
and have the form of rules. The premise of these rules consists of a conjunction
of atomic probability statements and the conclusion is an atomic probability
statement as well. Using a constraint learning algorithm like that may be an
alternative way to learn the structure of a classification BAG without repeatedly
calling a parameter learning algorithm.</p>
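      <p>To make the shape of such constraints concrete, the following is a minimal sketch of how a rule of this form could be represented and checked. The encoding and the reading "if P(A) &gt; 0.5 and P(B) &gt; 0.5 then P(C) &gt; 0.5" are illustrative assumptions, not the exact formalism of [9].</p>

```python
# Hypothetical encoding of an epistemic constraint: a conjunction of
# atomic probability statements as premise, one such statement as conclusion.
rule = {"premise": [("A", ">", 0.5), ("B", ">", 0.5)],
        "conclusion": ("C", ">", 0.5)}

def holds(stmt, p):
    """Check one atomic probability statement under assignment p."""
    arg, op, threshold = stmt
    return p[arg] > threshold if op == ">" else p[arg] <= threshold

def satisfied(rule, p):
    """A rule is satisfied if its conclusion holds whenever
    all premise statements hold under assignment p."""
    if all(holds(s, p) for s in rule["premise"]):
        return holds(rule["conclusion"], p)
    return True

satisfied(rule, {"A": 0.8, "B": 0.7, "C": 0.9})  # premise and conclusion hold
```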
      <p>There has also been some work on learning classical argumentation
frameworks from sets of accepted arguments [24, 12]. However, the motivation is very
different from ours. While argumentation usually asks, given an
argumentation graph, which arguments can be accepted, the authors of these works
ask, given the arguments accepted by users, what is the underlying argumentation
graph? Therefore, these approaches do not allow for a distinction between
input and output arguments, and the investigated frameworks are restricted to
attack relations only. In particular, the training procedure is currently based on
argumentation rationales rather than on a classification outcome.</p>
      <p>
        In principle, we could also instantiate classification BAGs with classical
bipolar frameworks [
        <xref ref-type="bibr" rid="ref2">2, 15, 20</xref>
        ]. The degree of acceptance of an argument can then be
defined as the relative frequency of labellings that accept the argument, where
we restrict ourselves to labellings that accept the input arguments. The resulting
loss function will be non-differentiable, however, and computing all labellings
can be very expensive. It is interesting to note, though, that the relative
frequencies can often be computed by encoding the argumentation problems as a
Markov network [21]. This relationship may also be helpful for applying learning
algorithms for Markov networks to learn classical classification BAGs and
other probabilistic classification BAGs discussed in [21].
      </p>
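      <p>The relative-frequency semantics can be illustrated with a brute-force sketch. For simplicity the sketch uses stable extensions of the attack relation only (ignoring supports), and the toy graph is hypothetical, loosely modelled on the stock example from the introduction.</p>

```python
from itertools import combinations

def stable_extensions(args, att):
    """Enumerate stable extensions: conflict-free sets that attack
    every argument outside the set (supports are ignored here)."""
    exts = []
    for r in range(len(args) + 1):
        for subset in combinations(args, r):
            s = set(subset)
            conflict_free = not any((a, b) in att for a in s for b in s)
            attacks_rest = all(any((a, b) in att for a in s) for b in args - s)
            if conflict_free and attacks_rest:
                exts.append(s)
    return exts

# Hypothetical toy graph: A1 is attacked by the input arguments A2 and A3.
args = {"A1", "A2", "A3"}
att = {("A2", "A1"), ("A3", "A1")}

# Restrict to extensions that accept the input arguments, then read off
# each argument's degree of acceptance as a relative frequency.
inputs = {"A2", "A3"}
exts = [e for e in stable_extensions(args, att) if inputs <= e]
degree = {a: sum(a in e for e in exts) / len(exts) for a in args}
```

Even on this three-argument graph the enumeration visits all 2^3 subsets, which illustrates why computing all labellings can be very expensive in general.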
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We presented some conceptual ideas for solving classification problems by means
of abstract argumentation technology. One important difference from previous
combinations of argumentation and machine learning is that our framework can be
trained in an end-to-end fashion.</p>
      <p>Our classification BAGs are structurally similar to feedforward neural
networks and Bayesian networks, and we took a lot of inspiration from these fields.
Naive classification BAGs are to deep classification BAGs as logistic regression
is to deep multilayer perceptrons and as a naive Bayes classifier is to complex
Bayesian networks. As we argued before, from a classification perspective, deep
classification BAGs can be seen as universal function approximators that can
theoretically approximate arbitrary discrete functions.</p>
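      <p>The structural analogy to feedforward networks can be made concrete with one simple member of the modular-semantics family (cf. [14, 18]): a logistic influence function where supporters raise and attackers lower an argument's degree of acceptance. The names and the aggregation below are illustrative assumptions, not necessarily the exact semantics used in this paper.</p>

```python
import math

def acceptance(base, att, sup, scores):
    """Logistic influence function: the base score is shifted up by the
    degrees of supporters and down by the degrees of attackers.
    One simple modular semantics; others use different aggregations."""
    energy = sum(scores[s] for s in sup) - sum(scores[a] for a in att)
    return 1 / (1 + math.exp(-(base + energy)))

# Hypothetical two-layer classification BAG: input arguments x1 and x2
# feed an output argument "buy" (x1 supports it, x2 attacks it).
scores = {"x1": 1.0, "x2": 0.0}  # fixed degrees of the input arguments
buy = acceptance(base=0.0, att=["x2"], sup=["x1"], scores=scores)
```

Evaluating the arguments layer by layer in this way is exactly a forward pass, which is what makes end-to-end gradient-based training possible when the influence function is differentiable.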
      <p>Our hope is that classification BAGs can offer better transparency and
explainability than other numerical classifiers. For example, for deep neural
networks, transparency is often lost due to the deep and dense structure of the
network. Bayesian networks look very intuitive, but are frequently misinterpreted.
For example, practitioners often assume that edges encode causal relationships,
while the actual theory only assumes that missing edges encode
independencies [13]. Classification BAGs are less susceptible to misinterpretation because
the meaning of attack and support edges is very intuitive. The calculations of
degrees of acceptance also seem easier to grasp than the marginal probability
computations that underlie Bayesian networks.</p>
      <p>Of course, the applicability of classification BAGs will depend on the
availability of reliable learning algorithms, and an experimental evaluation is necessary
in order to evaluate their performance. We are currently implementing different
methods in a Python library and will start an experimental evaluation soon.</p>
      <p>7. Dung, P.M.: On the acceptability of arguments and its fundamental role in
nonmonotonic reasoning, logic programming and n-person games. Artificial Intelligence
77(2), 321-357 (1995)
8. Goodfellow, I., Bengio, Y., Courville, A.: Deep learning. MIT press (2016)
9. Hunter, A.: Learning constraints for the epistemic graphs approach to
argumentation. In: Computational Models of Argument (COMMA). p. to appear. IOS Press
(2020)
10. Hunter, A., Polberg, S., Thimm, M.: Epistemic graphs for representing and
reasoning with positive and negative influences of arguments. Artificial Intelligence
281, 103236 (2020). https://doi.org/10.1016/j.artint.2020.103236
11. Hunter, A., Thimm, M.: On partial information and contradictions in probabilistic
abstract argumentation. In: International Conference on Principles of Knowledge
Representation and Reasoning (KR). pp. 53-62. AAAI Press (2016)
12. Kido, H., Okamoto, K.: A Bayesian approach to argument-based reasoning for
attack estimation. In: Sierra, C. (ed.) International Joint Conference on Artificial
Intelligence (IJCAI). pp. 249-255 (2017)
13. Koller, D., Friedman, N.: Probabilistic graphical models: principles and
techniques. MIT press (2009)
14. Mossakowski, T., Neuhaus, F.: Modular semantics and characteristics for bipolar
weighted argumentation graphs. arXiv preprint arXiv:1807.06685 (2018)
15. Oren, N., Norman, T.J.: Semantics for evidence-based argumentation. In:
Computational Models of Argument (COMMA). vol. 172, pp. 276-284. IOS Press (2008)
16. Potyka, N.: Continuous dynamical systems for weighted bipolar argumentation. In:
International Conference on Principles of Knowledge Representation and
Reasoning (KR). pp. 148-157 (2018)
17. Potyka, N.: A tutorial for weighted bipolar argumentation with continuous
dynamical systems and the Java library Attractor. International Workshop on
Non-Monotonic Reasoning (NMR) (2018)
18. Potyka, N.: Extending modular semantics for bipolar weighted argumentation. In:
International Conference on Autonomous Agents and MultiAgent Systems
(AAMAS). pp. 1722-1730 (2019)
19. Potyka, N.: A polynomial-time fragment of epistemic probabilistic
argumentation. International Journal of Approximate Reasoning 115, 265-289 (2019).
https://doi.org/10.1016/j.ijar.2019.10.005
20. Potyka, N.: Abstract argumentation with Markov networks. In: European
Conference on Artificial Intelligence (ECAI). p. to appear (2020)
21. Potyka, N.: Bipolar abstract argumentation with dual attacks and supports. In:
International Conference on Principles of Knowledge Representation and Reasoning
(KR). p. to appear (2020)
22. Rago, A., Cocarascu, O., Toni, F.: Argumentation-based recommendations:
Fantastic explanations and how to find them. In: International Joint Conference on
Artificial Intelligence (IJCAI). pp. 1949-1955 (2018)
23. Rago, A., Toni, F., Aurisicchio, M., Baroni, P.: Discontinuity-free decision support
with quantitative argumentation debates. In: International Conference on
Principles of Knowledge Representation and Reasoning (KR). pp. 63-73 (2016)
24. Riveret, R., Governatori, G.: On learning attacks in probabilistic abstract
argumentation. In: Jonker, C.M., Marsella, S., Thangarajah, J., Tuyls, K. (eds.)
Proceedings of AAMAS'16. pp. 653-661. IFAAMAS (2016)
25. Thimm, M., Kersting, K.: Towards argumentation-based classification. In: Logical
Foundations of Uncertainty and Machine Learning Workshop. vol. 17 (2017)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amgoud</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ben-Naim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Evaluation of arguments in weighted bipolar graphs</article-title>
          .
          <source>In: European Conference on Symbolic and Quantitative Approaches to Reasoning and Uncertainty (ECSQARU)</source>
          . pp.
          <volume>25</volume>
          -
          <fpage>35</fpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Amgoud</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cayrol</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lagasquie-Schiex</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          :
          <article-title>On the bipolarity in argumentation frameworks</article-title>
          .
          <source>In: International Workshop on Non-Monotonic Reasoning (NMR)</source>
          .
          <source>vol. 4</source>
          , pp.
          <volume>1</volume>
          -
          <issue>9</issue>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Baroni</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rago</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toni</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>How many properties do we need for gradual argumentation</article-title>
          ?
          <source>In: AAAI Conference on Artificial Intelligence (AAAI)</source>
          . pp.
          <volume>1736</volume>
          -
          <fpage>1743</fpage>
          .
          <string-name>
            <surname>AAAI</surname>
          </string-name>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Baroni</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romano</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toni</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aurisicchio</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bertanza</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Automatic evaluation of design alternatives with quantitative argumentation</article-title>
          .
          <source>Argument &amp; Computation</source>
          <volume>6</volume>
          (
          <issue>1</issue>
          ),
          <volume>24</volume>
          -
          <fpage>49</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cocarascu</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rago</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toni</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Extracting dialogical explanations for review aggregations with argumentative dialogical agents</article-title>
          .
          <source>In: International Conference on Autonomous Agents and MultiAgent Systems (AAMAS)</source>
          . pp.
          <volume>1261</volume>
          -
          <issue>1269</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>K.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Swamy</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Search and optimization by metaheuristics</article-title>
          .
          <source>Techniques and Algorithms</source>
          Inspired by Nature; Birkhauser: Basel, Switzerland (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>