Discovering the Rationale of Decisions: Experiments on Aligning Learning and Reasoning

Cor Steging, Bernoulli Institute of Mathematics, Computer Science and Artificial Intelligence, University of Groningen

Silja Renooij, Department of Information and Computing Sciences, Utrecht University

Bart Verheij, Bernoulli Institute of Mathematics, Computer Science and Artificial Intelligence, University of Groningen

Keywords: Responsible AI, Explainable AI, Machine Learning

In AI and law, systems that are designed for decision support should be explainable when pursuing justice. In order for these systems to be fair and responsible, they should make correct decisions and make them using a sound and transparent rationale. In this paper, we introduce a knowledge-driven method for model-agnostic rationale evaluation using dedicated test cases, similar to unit-testing in professional software development. We apply this new method in a set of machine learning experiments aimed at extracting known knowledge structures from artificial datasets from fictional and non-fictional legal settings. We show that our method allows us to analyze the rationale of black box machine learning systems by assessing which rationale elements are learned or not. Furthermore, we show that the rationale can be adjusted using tailor-made training data based on the results of the rationale evaluation.

Introduction

In AI and Law, explainability is a key requirement in system design, due to the need for the justification of decisions. For machine-supported decisions, this is nowadays encoded in the GDPR's right to explanation. Four types of explanations are distinguished [Miller, 2019] and have been applied in AI and Law [Atkinson et al., 2020b]. Contrastive explanations show why a decision is made and others are not. Examples include HYPO's counterexamples and hypothetical situations [Rissland and Ashley, 1987, Ashley, 1990] and argument diagrams [Verheij, 2003a]. In selective explanations, the focus is on the most salient elements needed, for instance by the use of the critical questions of argumentation schemes [Atkinson et al., 2020a, Verheij, 2003b]. Probabilistic explanations are grounded in statistical correlations, and are less applicable in law with its focus on specific circumstances. An example is the explanation of evidential Bayesian networks [Vlek et al., 2016] in terms of scenarios and the evidence for and against them. Lastly, social explanations emphasise the transfer of knowledge between individuals, as in models of the dialogue between parties and in courts, specifying shared and unshared commitments [Hage et al., 1993, Gordon, 1995, Atkinson et al., 2020a].

This requirement of explainability is problematic for the application of central machine learning techniques in law. Neural networks, for example, are known to perform well, but behave like a black box algorithm. Hence, explanation techniques have been developed to 'open the black box' (cf. LIME [Ribeiro et al., 2016], SHAP [Lundberg and Lee, 2017]). Even in the domain of vision (where the successes of neural networks are especially significant), the necessity of such methods is underpinned by studies regarding adversarial attacks that show that slight perturbations of images, invisible to the human observer, can radically change the outcome of a classifier [Goodfellow et al., 2015].

In this paper, we expand upon the method introduced in [Steging et al., 2021], where we investigate black box machine learning methods with a focus on proper explainability, and not only in terms of accuracy as in the standard machine learning protocol. We are in particular interested in the discovery of the rationale underlying decisions, where the rationale is the knowledge structure that can justify a decision, such as the rule applied. We aim to measure the quality of rationale discovery, with an eye on the possibility of improving rationale discovery.

To measure and possibly improve rationale discovery, we create dedicated test datasets, on which a machine learning system can only perform well if it has learned a particular component of the knowledge structure that defined the data. The idea is similar to how unit testing works in professional software development: we define a set of cases, targeting a specific component, in which we know what the answer should be, and compare that to the output that the system gives.

To be able to focus on what is methodologically feasible, we do not use natural language corpora (as for instance in argument mining [Mochales Palau and Moens, 2009, Wyner et al., 2010], conceptual retrieval [Grabmair et al., 2015] or case prediction [Ashley, 2019, Medvedeva et al., 2019, Brüninghaus and Ashley, 2003]). Instead, we work with datasets of artificial decisions with a known underlying generating rationale.

Our work builds on a study investigating whether neural networks are able to tackle open texture problems [Bench-Capon, 1993]. The study used a fictional legal domain (also investigated in [Wardeh et al., 2009, Možina et al., 2005]), in which the eligibility for a welfare benefit for elderly citizens is determined based on six conditions. Artificial datasets were generated specifying personal information of elderly citizens with their eligibility for the welfare benefit. Multilayer perceptrons were trained and tested on these datasets, and managed to perform with high accuracy scores (above 98%). It was shown, however, that the neural networks were unable to properly learn two of the six conditions. By making adjustments to the training dataset, the neural networks were able to learn conditions more adequately, while maintaining similar accuracy scores. But even after adjustment, the conditions that defined the data were not learned fully correctly. Other earlier discussions of neural networks in law are [Philipps and Sartor, 1999, Hunter, 1999, Stranieri et al., 1999].

In the upcoming sections, the study by Bench-Capon [1993] will first be replicated as closely as possible, using modern, widely-used neural network methods, in order to check whether the claims made in 1993 still hold today. Then a simplified version of the welfare benefit domain is examined to see how well the networks are able to extract a simplified rationale. Lastly, we study a real legal setting, namely Dutch tort law. That domain uses only Boolean variables, but allows for exceptions to underlying rules.

Domains and Datasets

For each of the three domains considered in this paper, this section describes the underlying knowledge structure using logic, from which we will generate datasets to train a series of neural networks. These networks will subsequently be analysed using a method we propose for assessing the quality of their rationale discovery. To this end we need two types of datasets for the purpose of testing. The first are standard test sets sampled from the complete domain to evaluate the accuracy of the networks. The second type is a dedicated test set designed to target a specific aspect of the domain knowledge. This section describes all datasets we use.

Domains

Welfare benefit domain

This fictional domain, introduced in [Bench-Capon, 1993], concerns the eligibility of a person for a welfare benefit to cover the expenses for visiting their spouse in the hospital, and can be formalised as follows:

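\begin{align*}
\mathit{Eligible}(x) &\iff C_1(x) \land C_2(x) \land C_3(x) \land C_4(x) \land C_5(x) \land C_6(x) \\
C_1(x) &\iff (\mathit{Gender}(x) = \mathit{female} \land \mathit{Age}(x) \geq 60) \lor (\mathit{Gender}(x) = \mathit{male} \land \mathit{Age}(x) \geq 65) \\
C_2(x) &\iff |\mathit{Con}_1(x), \mathit{Con}_2(x), \mathit{Con}_3(x), \mathit{Con}_4(x), \mathit{Con}_5(x)| \geq 4 \\
C_3(x) &\iff \mathit{Spouse}(x) \\
C_4(x) &\iff \neg \mathit{Absent}(x) \\
C_5(x) &\iff \neg\, \mathit{Resources}(x) \geq 3000 \\
C_6(x) &\iff (\mathit{Type}(x) = \mathit{in} \land \mathit{Distance}(x) < 50) \lor (\mathit{Type}(x) = \mathit{out} \land \mathit{Distance}(x) \geq 50)
\end{align*}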

That is, a person is eligible iff he/she is of pensionable age (60 for a woman, 65 for a man), paid four out of the last five contributions $\mathit{Con}_i$, is the patient's spouse, is not absent from the UK, has capital resources not amounting to more than £3,000, and lives at a distance of less than 50 miles from the hospital if the relative is an in-patient, or beyond that for an out-patient.
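As an illustration, the formalisation can be written down as a small executable predicate. This is only a sketch: the `Applicant` class and its field names are hypothetical, chosen to mirror the 12 features, and $C_5$ is read as "not (resources at least 3000)", following the formula rather than the prose.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Applicant:
    """Hypothetical container for the 12 features of the welfare benefit domain."""
    age: int
    gender: str                                         # "female" or "male"
    contributions: Tuple[bool, bool, bool, bool, bool]  # Con_1 .. Con_5
    spouse: bool
    absent: bool
    resources: int                                      # capital resources in pounds
    patient_type: str                                   # "in" or "out"
    distance: int                                       # miles to the hospital


def conditions(p: Applicant) -> Tuple[bool, ...]:
    """Return the truth values of C_1 .. C_6 for one applicant."""
    c1 = (p.gender == "female" and p.age >= 60) or (p.gender == "male" and p.age >= 65)
    c2 = sum(p.contributions) >= 4          # at least four contributions paid
    c3 = p.spouse
    c4 = not p.absent
    c5 = not (p.resources >= 3000)          # assumed reading of the formula for C_5
    c6 = (p.patient_type == "in" and p.distance < 50) or (
        p.patient_type == "out" and p.distance >= 50
    )
    return (c1, c2, c3, c4, c5, c6)


def eligible(p: Applicant) -> bool:
    """Eligible iff all six conditions hold."""
    return all(conditions(p))
```

Returning the six truth values separately, rather than only the final label, makes it easy to see which condition an ineligible case fails on.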

The six independent conditions for eligibility are defined in terms of 12 variables, which are the features of the generated datasets. These features and their possible values are shown in Table 1. In addition to these 12 features, the datasets will contain 52 noise features unrelated to eligibility, just as in the original experiment, giving a total of 64 features plus an eligibility label for each instance. All datasets are valid in the sense that the given eligibility labels follow from evaluating the 6 conditions above.

Simplified domain

Experiments with different models for the welfare domain all concluded that it was not possible to extract all six conditions for eligibility [Bench-Capon, 1993, Wardeh et al., 2009, Možina et al., 2005]. The complexity of the original problem, with 6 different conditions and 64 features, complicates a proper analysis of the networks' rationale, since each condition and feature could potentially influence it. To facilitate this analysis, we simplified the original problem in two ways. First, the 52 noise variables, which did not seem to affect the performance of the networks [Bench-Capon, 1993], are removed. Second, we define eligibility solely by the age-gender ($C_1$) and patient-distance ($C_6$) conditions that were examined in the original experiment to justify the rationale of the network:

\[
\mathit{Eligible}(x) \iff C_1(x) \land C_6(x)
\]

Eligibility is thus determined through a combination of an XOR-like function ($C_6$) and a nuanced threshold function ($C_1$).

Tort law domain

Our third domain concerns Dutch tort law: articles 6:162 and 6:163 of the Dutch Civil Code, which describe when a wrongful act is committed and resulting damages must be repaired. This 'duty to repair' (dut) can be formalised as follows:

\begin{align*}
\mathit{dut}(x) &\iff c_1(x) \land c_2(x) \land c_3(x) \land c_4(x) \land c_5(x) \\
c_1(x) &\iff \mathit{cau}(x) \\
c_2(x) &\iff \mathit{ico}(x) \lor \mathit{ila}(x) \lor \mathit{ift}(x) \\
c_3(x) &\iff \mathit{vun}(x) \lor (\mathit{vst}(x) \land \neg \mathit{jus}(x)) \lor (\mathit{vrt}(x) \land \neg \mathit{jus}(x)) \\
c_4(x) &\iff \mathit{dmg}(x) \\
c_5(x) &\iff \neg(\mathit{vst}(x) \land \neg \mathit{prp}(x))
\end{align*}

where the elementary propositions are provided alongside an argumentative model of the law in Figure 1 [Verheij, 2017], and conditions $c_2$ and $c_3$ capture the legal notions of unlawfulness (unl) and imputability (imp), respectively.

Fig. 1: Arguments and attacks (A) and their elementary propositions (B) in Dutch tort law [Verheij, 2017].

Compared to the fictional welfare domain, the Dutch tort law domain is captured in 5 conditions for duty to repair (dut), based upon 10 Boolean features. Each condition is a disjunction of one or more features, possibly with exceptions. The feature capturing a violation of a statutory duty (vst) is present in both conditions $c_3$ and $c_5$, rendering these conditions dependent.

Datasets

For each experiment, we generate datasets of different types, for different purposes. For most types of datasets, the generating process is at least partly stochastic and repeated for every repetition of an experiment. Using the same type of dataset, for example in training and testing a neural network, therefore does not mean using the exact same dataset. Table 2 shows an overview of the domains and their datasets, and illustrates their differences and similarities.

Welfare benefit datasets

Within this domain, four types of datasets are generated, following the original study [Bench-Capon, 1993] as closely as possible: type A, type B, Age-Gender and Patient-Distance datasets. Each dataset contains the 12 features as defined in Table 1, as well as 52 noise features with integer values ranging from 0 to 100. The original study used training sets with 2,400 instances, which is quite small by today's standards [Atkinson et al., 2020b]. To make sure conclusions are not the result of using too little data, we will also include training sets with more data (50,000 instances).

Type A datasets are generated with either 2,400 instances or 50,000 instances. Exactly half of the instances are eligible, creating a balanced label distribution, as is common practice in machine learning problems. For the eligible instances, feature values are generated (randomly where possible) such that they satisfy the conditions $C_1$ to $C_6$. For each condition, one sixth of the ineligible instances is designed to fail on that specific condition; where possible the values of the features involved are generated randomly such that the condition fails. All remaining features in these instances are generated randomly across their full range of values (see Table 1); as a result, it is possible for ineligible instances to fail on multiple conditions, and some conditions will fail more often than others.

In the original 1993 study it was argued that it was too easy to achieve high accuracy scores with networks trained and tested on type A datasets, which contained an average of 4.1 conditions that were not satisfied for ineligible cases. Using only 4 out of 6 conditions was shown to be sufficient for classifying 98.95% of the instances correctly.

Type B datasets were subsequently introduced to make the problem more challenging. These datasets differ from type A datasets only in that ineligible instances fail on exactly one condition, rather than at least one condition; the other five conditions are always satisfied. Type B datasets again contain either 2,400 instances or 50,000 instances.

The original study investigated whether "an acceptable rationale can be uncovered by an examination of the net" [Bench-Capon, 1993]. To this end a set of test cases was constructed in which all conditions except one were guaranteed to be satisfied. From these it was concluded that the age-gender condition ($C_1$) and the patient-distance condition ($C_6$) are not learned by the networks. The original paper does not specify how many test cases were constructed, nor exactly how they were constructed. We will generate dedicated datasets that are tailor-made to evaluate whether the trained networks have actually uncovered these same two conditions.

Table 2: An overview of the three domains and their datasets. Datasets marked with an asterisk are used for testing purposes only. For each type of dataset, the size and label distribution are given. For each domain, the table indicates the number and type of features, the number and type of conditions to be learned, whether or not all cases are covered by the datasets (complete), and whether the domain is fictional or real.

The Age-Gender datasets are generated by sampling the age and gender features across their full range of values, this time considering only multiples of 5 for age. The values for the other features are generated such that every condition other than the age-gender condition ($C_1$) is satisfied. As a result, the eligibility of an instance in these datasets is solely determined by whether or not condition $C_1$ is satisfied. The Age-Gender sets contain 40,000 instances, with every possible combination of values for age and gender occurring 1,000 times. This gives a slightly unbalanced label distribution with 42.5% of the instances being eligible, and 57.5% ineligible. Because the dataset is only used to test the networks, rather than to train them, this is not an issue.

The Patient-Distance datasets are similarly generated by sampling the distance and patient type features across their full range of values, this time considering only multiples of 5 for distance. The eligibility of an instance in these datasets is thus determined by whether or not condition $C_6$ is satisfied. The Patient-Distance datasets also contain 40,000 instances, with every possible combination of values for patient type and distance occurring 1,000 times. In these datasets, exactly 50% of the instances are eligible.
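The way such a dedicated dataset acts as a unit test can be sketched as follows. The code is illustrative only and makes several assumptions: the age grid 5, 10, ..., 100 is a guess (it gives the 20 age values implied by 40 combinations and is consistent with the 42.5% eligible share), each combination appears once rather than 1,000 times, and `shifted_rule` is a hypothetical classifier standing in for a trained network whose other inputs are fixed to satisfy the remaining conditions.

```python
AGES = range(5, 101, 5)            # assumed grid: 20 multiples of 5
GENDERS = ("female", "male")


def c1(age, gender):
    """Age-gender condition C_1 from the welfare benefit domain."""
    return (gender == "female" and age >= 60) or (gender == "male" and age >= 65)


# One test case per (age, gender) combination. The expected label is C_1 itself,
# because all other conditions are satisfied by construction.
test_cases = [((age, gender), int(c1(age, gender))) for age in AGES for gender in GENDERS]


def accuracy(predict, cases):
    """Fraction of dedicated test cases on which `predict` matches the expected label."""
    hits = sum(predict(*features) == expected for features, expected in cases)
    return hits / len(cases)


def shifted_rule(age, gender):
    """Hypothetical classifier whose age thresholds are 15 years too low."""
    return int(age >= (45 if gender == "female" else 50))


eligible_share = sum(label for _, label in test_cases) / len(test_cases)
print(f"eligible share: {eligible_share:.3f}")
print(f"accuracy of shifted rule: {accuracy(shifted_rule, test_cases):.3f}")
```

A classifier that has learned $C_1$ exactly scores 1.0 on these cases; any shortfall can be attributed to this single condition, which is what makes the test diagnostic.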

Simplified datasets

For the simplified welfare domain, the same types of datasets are generated as above, with the same properties except that all noise features and 8 of the 12 actual features are excluded. For type B datasets, this means that ineligible instances fail on either $C_1$ or $C_6$, but not on both, while in type A datasets the ineligible instances can fail on both conditions. Moreover, in the Age-Gender dataset the patient-distance condition $C_6$ is always satisfied, and in the Patient-Distance dataset the age-gender condition $C_1$ is always satisfied. Type A and type B datasets again contain 50,000 instances each. The Age-Gender dataset now contains only two features and 4,242 instances, that is, one unique instance for every possible combination of age and gender. Likewise, the Patient-Distance dataset contains 3,234 unique instances.
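The difference between type A and type B ineligible instances in the simplified domain can be sketched with rejection sampling. This is not the generator used in the experiments: the integer ranges 0 to 100 for age and distance, the function names, and the use of rejection sampling are all assumptions made for illustration.

```python
import random

AGES = range(0, 101)        # assumed value range for age
DISTANCES = range(0, 101)   # assumed value range for distance


def c1(age, gender):
    return (gender == "female" and age >= 60) or (gender == "male" and age >= 65)


def c6(distance, patient_type):
    return (patient_type == "in" and distance < 50) or (
        patient_type == "out" and distance >= 50
    )


def sample_age_gender(rng, satisfied=None):
    """Draw (age, gender); if `satisfied` is given, redraw until C_1 has that value."""
    while True:
        age, gender = rng.choice(AGES), rng.choice(["female", "male"])
        if satisfied is None or c1(age, gender) == satisfied:
            return age, gender


def sample_patient_distance(rng, satisfied=None):
    """Draw (distance, type); if `satisfied` is given, redraw until C_6 has that value."""
    while True:
        distance, ptype = rng.choice(DISTANCES), rng.choice(["in", "out"])
        if satisfied is None or c6(distance, ptype) == satisfied:
            return distance, ptype


def ineligible_instance(rng, failing, dataset_type):
    """One ineligible instance designed to fail on `failing` ("C1" or "C6").

    Type A: the other condition's features are drawn freely, so it may fail too.
    Type B: the other condition is forced to hold, so exactly one condition fails.
    """
    other = True if dataset_type == "B" else None
    if failing == "C1":
        age, gender = sample_age_gender(rng, satisfied=False)
        distance, ptype = sample_patient_distance(rng, satisfied=other)
    else:
        age, gender = sample_age_gender(rng, satisfied=other)
        distance, ptype = sample_patient_distance(rng, satisfied=False)
    return {"age": age, "gender": gender, "distance": distance, "patient_type": ptype}
```

Eligible instances would be drawn with both conditions forced to hold, for example by calling both samplers with `satisfied=True`.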

Tort law datasets

With 10 Boolean features there are $2^{10} = 1024$ possible unique cases that can be generated from the argumentation structure of the tort law domain in Figure 1. Each case has a corresponding outcome for dut, indicating whether or not there is a duty to repair someone's damages. We will again consider four types of datasets.

The unique dataset contains these 1024 unique instances for the 10 features plus the label. In this dataset, there are 912 instances where dut is false and 112 instances where dut is true (11%).
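These counts can be checked by enumerating all cases. The sketch below encodes the five tort law conditions directly from the formalisation; the dictionary representation of a case is an assumption made for readability.

```python
from itertools import product

FEATURES = ("cau", "ico", "ila", "ift", "vun", "vst", "vrt", "jus", "dmg", "prp")


def dut(case):
    """Duty to repair, as the conjunction of conditions c_1 .. c_5."""
    c1 = case["cau"]
    c2 = case["ico"] or case["ila"] or case["ift"]
    c3 = (
        case["vun"]
        or (case["vst"] and not case["jus"])
        or (case["vrt"] and not case["jus"])
    )
    c4 = case["dmg"]
    c5 = not (case["vst"] and not case["prp"])
    return c1 and c2 and c3 and c4 and c5


# Enumerate every assignment of truth values to the 10 Boolean features.
unique_cases = [
    dict(zip(FEATURES, values))
    for values in product([False, True], repeat=len(FEATURES))
]
n_true = sum(dut(case) for case in unique_cases)
print(len(unique_cases), n_true, len(unique_cases) - n_true)  # 1024 112 912
```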

The regular type datasets are generated such that dut is true in exactly half of the instances. The sets are regular in the sense that balanced label distributions are common in machine learning problems. These regular datasets are generated by sampling uniformly from the subset of cases from the unique dataset, such that each possible case is represented equally within the 50/50 label distribution. In practice, only a subset of the possible cases is typically available and presented to a network, upon which the network will have to learn to generalize to all possible cases. In addition to generating regular type datasets with 5,000 cases, we therefore also generate smaller regular type datasets with only 500 instances; the latter contains 35.35% of the unique instances.

In the tort law domain we focus on the notions of unlawfulness ($c_2$) and imputability ($c_3$) to assess whether the networks are able to discover conditions in the data. For each of the two conditions, we again create a dedicated dataset.

The Unlawfulness dataset is the subset of the unique dataset in which the features for the unlawfulness condition $c_2$ can take on any of their values, while the other features have values that are guaranteed to satisfy the remaining conditions. Whether or not there is a duty to repair is therefore solely determined by whether or not condition $c_2$ is satisfied. All combinations of values of the other features are considered. The Unlawfulness dataset therefore consists of 168 unique instances, of which 66.66% have a positive dut value.

The Imputability dataset is a similar subset of the unique dataset, but now the features for the imputability condition ($c_3$) can take on any value, provided that the value of vst is such that condition $c_5$ is satisfied. The value of dut(x) now completely depends on whether or not condition $c_3$ evaluates to true. Due to the interdependency of conditions $c_3$ and $c_5$, the Imputability dataset only has 128 unique instances, with 87.5% of them having a positive dut value.

Experimental setup and results

In this section we describe and motivate the experiments we performed and report on their results.

Experiments

In order to demonstrate our method for assessing and improving rationale discovery of models learned from data, we first need such models. Though our method is model agnostic, we choose to use neural networks, like in [Bench-Capon, 1993]. We assume that assessing and improving rationale discovery is relevant only for models that are considered to be a good match with the data they were learned from. Our first step, after training the above mentioned neural networks, is therefore to evaluate their performance on typical test sets in terms of the standard accuracy measure. Subsequently we will evaluate the performance of the networks on the dedicated, knowledge-driven test sets that were specifically designed for assessing the networks' quality of rationale discovery.

Neural network architectures

In the original experiments in 1993, three multilayer perceptrons were used with one, two and three hidden layers, respectively [Bench-Capon, 1993]. These networks were created using the Aspirin software [Leighton and Wieland, 1994], but the exact details regarding the networks and their parameters (e.g. the learning rate, activation function, gradient descent method) were left out of the original publication. The networks all had 64 input nodes (one for each feature in the datasets), a varying number of nodes in the hidden layers, and one output node that determines the eligibility.

In this paper we will use a similar set-up and network architecture for all three domains. The output is always a single node, representing either eligibility or duty to repair, depending on the domain under consideration. The number of input nodes corresponds to the number of features and is therefore dependent on the domain (see Table 2). More specifically, the welfare benefit domain will have 64 input nodes, the simplified domain will have 4, and the tort law domain will have 10 input nodes. The node configuration (i.e. number of nodes per layer) of each network is as follows, where input represents the number of nodes in the input layer:

- One hidden layer network: input-12-1
- Two hidden layer network: input-24-6-1
- Three hidden layer network: input-24-10-3-1

In the replication of the 1993 experiment, the MLPClassifier of the scikit-learn package is used [Pedregosa et al., 2011]. The networks use the sigmoid function as their activation function, which was the most common activation function when the original study was done. The networks use the Adam stochastic gradient-based optimizer [Kingma and Ba, 2015], with a constant learning rate of 0.001. A total of 50,000 training iterations are used with a batch size of 50. Recall that the focus of this study is not on creating the best possible classifier, but on demonstrating our method of assessing rationale discovery.
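A configuration along these lines could look as follows in scikit-learn. This is a sketch rather than the exact experimental code: mapping "50,000 training iterations" onto `max_iter` is an assumption (for the Adam solver scikit-learn interprets `max_iter` as a number of epochs), and `HIDDEN_LAYERS` and `make_network` are hypothetical names.

```python
# Hidden layer sizes per architecture; the input and output layers are
# inferred by scikit-learn from the training data.
HIDDEN_LAYERS = {1: (12,), 2: (24, 6), 3: (24, 10, 3)}


def make_network(n_hidden_layers):
    """Build an untrained MLP with the settings described above."""
    from sklearn.neural_network import MLPClassifier  # requires scikit-learn

    return MLPClassifier(
        hidden_layer_sizes=HIDDEN_LAYERS[n_hidden_layers],
        activation="logistic",        # the sigmoid activation function
        solver="adam",
        learning_rate_init=0.001,
        batch_size=50,
        max_iter=50_000,              # assumption: "training iterations" = max_iter
    )
```

Calling `make_network(3).fit(X_train, y_train)` would then train the three hidden layer network, where `X_train` and `y_train` are placeholder names for a generated training set.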

Training and performance testing

The three types of neural networks will be trained and tested on a combination of different datasets, from each of the three domains. A complete overview of the datasets used in the experiments is shown in Table 3. This table shows the datasets that the networks will train on, and the datasets that the networks will be tested with. For each domain, every combination of training dataset and testing dataset is evaluated in terms of the accuracy of the resulting network on the test data. Because some of the datasets are stochastic (each generated dataset is slightly different), the whole process of data generation, training and testing is repeated 50 times. The mean classification accuracies along with their standard deviations will be reported.
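The repetition protocol can be sketched generically. Here `run_once` is a hypothetical callback that would generate fresh data, train a network and return its test accuracy; `dummy_run` is only a placeholder that returns an arbitrary number so the sketch executes, and the use of the sample standard deviation is an assumption.

```python
import random
import statistics


def repeat_experiment(run_once, n_runs=50, seed=0):
    """Call run_once(rng) n_runs times and summarise the returned accuracies."""
    rng = random.Random(seed)
    scores = [run_once(rng) for _ in range(n_runs)]
    # Sample standard deviation; the population variant would be statistics.pstdev.
    return statistics.mean(scores), statistics.stdev(scores)


def dummy_run(rng):
    """Placeholder for 'generate data, train, test'; the value is arbitrary."""
    return rng.uniform(0.0, 1.0)


mean_acc, std_acc = repeat_experiment(dummy_run)
print(f"{mean_acc:.3f} +/- {std_acc:.3f}")
```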

To assess the rationale discovery capabilities of all the trained networks, we study their performance on the dedicated test sets for the age-gender, patient-distance, unlawfulness and imputability conditions. Performance will be measured both quantitatively, using standard accuracy, and qualitatively by a more detailed comparison of actual and expected outcomes.

Results

We will first report the accuracy scores for all combinations of training and testing datasets in the different domains and subsequently focus on rationale discovery. Results will be discussed in detail in Section 4.

Accuracy

Tables 5-8 show the mean classification accuracies over 50 runs, together with their standard deviations, for the different combinations of training and testing sets in the three domains. These tables include the quantitatively measured performance on the various dedicated test sets. Tables 5 and 6 present the results for the replication experiment with the original sizes of the type A and type B datasets and for the replication experiment with more data, respectively. The accuracies as reported in the original paper [Bench-Capon, 1993] are provided in Table 4. Results for the simplified welfare benefit domain are shown in Table 7, and for the tort law domain in Table 8.

Rationale discovery

Each dedicated test dataset is designed to measure how well a model has learned a specific condition from the domain. Since performance on these test sets in terms of accuracy is comparable for the different neural network architectures used, we present the results for the qualitative evaluation of their rationale discovery capabilities for what is in theory the most sophisticated one: the models with 3 hidden layers.

In the welfare benefit domain, the Age-Gender datasets are used to measure how well condition $C_1$ is learned. In addition to measuring accuracy on these dedicated datasets, we can plot the actual output of the neural network, which should be 1 (eligible) for an individual of pensionable age and 0 otherwise, against age, for both values of gender. Such plots, showing the mean output of the networks over 50 runs, are shown in Figure 2 for both the welfare benefit domain and its simplified version, for networks trained on each of the training sets under consideration.

Similarly, the Patient-Distance datasets are used to evaluate how well condition $C_6$ is learned, which can be assessed by plotting the networks' output against the distance to hospital, for both in-patients and out-patients. The plots showing the mean network output over 50 runs for the welfare benefit domain and its simplified version, and for networks trained on each of the training sets under consideration, are shown in Figure 3.

In the tort law domain, we can similarly evaluate how well conditions $c_2$ (unlawfulness) and $c_3$ (imputability) are learned. For these conditions, the network should output 1 in cases of the Unlawfulness dataset where the case is unlawful ($c_2$), or in the Imputability dataset where the case can be imputed to a person ($c_3$); otherwise the output should be 0. Since the tort law domain only contains Boolean features, the outputs of the networks are presented in tables rather than plots. The mean output over 50 runs for the two training sets on the Unlawfulness and Imputability datasets is presented in Table 9.

Discussion

In this section we discuss in detail the results we found and the conclusions we can draw from them. We separately focus on standard classification accuracy and on rationale discovery capabilities. We conclude by introducing the approach we took as a general knowledge-driven method for model-agnostic rationale evaluation.

Standard Accuracy

Standard accuracy is measured to see whether the learned models are able to solve the classification problem, regardless of whether or not they discovered the rationale underlying the data.

Welfare benefit

The accuracies obtained in the replication experiment (Table 5) differ from those in the original study (Table 4), but show similar trends. Originally, networks trained on a type A training set performed well on type A test sets (around 99%), but much worse on the type B test set (around 70-76%). When trained on a type B training set, the accuracies on test set A in the original study stayed the same, with accuracies on test set B increasing to around 98%. In the replication experiment, training on a type B training set slightly decreases accuracies on type A test sets, while the accuracy on type B test sets increases less substantially than in the original experiment. In both cases, changing the distribution of the training data from type A to type B served to increase performance on test sets of the latter type while hardly affecting performance on test sets of the former type. Since type B datasets exploit some knowledge of the domain (i.e. benefit is typically denied due to failure on only a single condition), this suggests that overall performance can be improved using tailor-made training sets.

Using more data, we find higher accuracies (Table 6). This is not surprising, as more training data generally leads to better performance. Still, for networks trained on type A datasets, accuracies on type B test sets (below 85%) are much lower than on type A test sets (around 99.8%). Training the networks on type B training sets with more data shows significantly better results than with fewer datapoints. Accuracies on type A test sets are then still around 99%, whereas the accuracies on type B test sets are around 98%. The above observations therefore still hold, even with more data and modern machine learning methods.

Simplified

In the simplified domain, high accuracies are found across all datasets, averaging out at around 99% (Table 7). Accuracies on type B test sets are only slightly lower than on type A test sets, unlike in the other welfare domain experiments. The networks do seem to perform slightly better on both types of test sets when trained on a type B training set as compared to a type A training set. However, this difference is much smaller than in the regular welfare benefit domain. This can be explained by the fact that cases in type A datasets can now fail on at most 2 conditions, rather than the original 6, which is only 1 more than the single failed condition in the type B datasets.

Tort law

In the tort law domain we find accuracies of 100% or near 100% for networks trained on all instances (see Table 8). When presented with all unique instances, the networks with one and two hidden layers are able to perfectly predict the outcome under Dutch tort law, and the network with three hidden layers can create a very close approximation. Presenting a neural network with all available cases is in practice often infeasible. If it is possible, then a simple lookup table rather than a neural network would most likely suffice. For this reason, we also trained the networks on a subset of only around 35% of the unique instances (see Table 8). As expected, the accuracies of the networks on the general test sets drop, but only slightly (to 98-99%). Even on the unique test set, accuracies remain around 96%. This suggests that it is possible to approximate tort law with a small subset of the unique cases.

Rationale Discovery

Looking at the performance of the networks on the dedicated test sets partially exposes the rationale captured by the network. We designed these test sets such that each one targets a single condition from the domain. In addition to considering the accuracy on these dedicated test sets, we qualitatively evaluate the rationale discovery capabilities of the networks by comparing their outputs with the outputs we would ideally expect for the different domains.

Welfare benefit

In the welfare benefit domain, the Age-Gender dataset is used to measure how well condition $C_1$ is learned, that is, whether the networks output 1 if the individual is of pensionable age (male and aged 65 or over, or female and aged 60 or over), and output 0 otherwise. Plotting the age of the individuals from the Age-Gender dataset against the output of the network, for each gender, should ideally result in the graph on the left side of Figure 4. Here the output of the network jumps instantly from 0 to 1 at the age of 60 for women and 65 for men.

Similarly, for cases from the Patient-Distance dataset the networks should only output 1 (eligible) if the relative is an in-patient and the distance to the hospital is less than 50 miles, or if the relative is an out-patient and the distance to the hospital is 50 miles or more (condition $C_6$). Plotting the distance against the output of the network for both types of patients would ideally result in the graph shown on the right in Figure 4.
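The ideal age-gender curve, and a simple way of reading a turning point off an output curve, can be sketched as follows. The 0.5 cut-off used to define the turning point is an assumption, as is the age grid, and the smooth curve is a hypothetical stand-in for a network's mean output rather than a measured result.

```python
import math


def ideal_output(age, gender):
    """Expected output on the Age-Gender test cases: 1 iff C_1 holds."""
    pensionable = (gender == "female" and age >= 60) or (gender == "male" and age >= 65)
    return 1.0 if pensionable else 0.0


def turning_point(xs, outputs, threshold=0.5):
    """Smallest x whose output reaches `threshold` (an assumed cut-off), else None.

    Assumes xs is sorted in ascending order and the curve rises only once.
    """
    for x, y in zip(xs, outputs):
        if y >= threshold:
            return x
    return None


ages = list(range(0, 101, 5))
ideal_female = [ideal_output(a, "female") for a in ages]
ideal_male = [ideal_output(a, "male") for a in ages]

# Hypothetical smooth curve centred at 45, standing in for a network's mean output.
hypothetical = [1.0 / (1.0 + math.exp(-(a - 45) / 5.0)) for a in ages]

print(turning_point(ages, ideal_female))   # 60
print(turning_point(ages, ideal_male))     # 65
print(turning_point(ages, hypothetical))   # 45
```

Comparing the turning point of a learned curve with that of the ideal step function gives a compact summary of how far a learned threshold deviates from the rule that generated the data.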

In our replication experiment the output graphs show a pattern similar to that of the original 1993 experiment. In the latter (not shown), the networks trained on a type A training set do not show the expected pattern for either condition; in fact, for the Patient-Distance test cases a 1 is always returned. Training on a type B training set improved the results, but the turning point at which the networks output 1 is off. For the Age-Gender dataset it occurs at 45 for women, rather than 60, and at 50 for men, instead of 65. For the Patient-Distance dataset the transition was too gradual, and took place at 40 rather than 50 miles. In our replication experiment, training on a type B dataset seems to have a significant impact on the way the rationale of the networks is formed, as networks are able to internalize conditions C 1 and C 6 better when trained on a type B dataset. This is furthermore reflected in the accuracies on the Age-Gender and Patient-Distance datasets shown in Table 5, which increase by roughly 30 percentage points when training on a type B dataset. Accuracies on the Age-Gender and Patient-Distance datasets were not reported in the original study.

Repeating the replication experiment with more data indeed increases the performance of the networks significantly, but we still find that networks trained on type B datasets perform better than networks trained on type A datasets. Output patterns more closely resemble the ideal ones after training on type B datasets (see Figure 2(B) versus (E) and Figure 3(B) versus (E)) and accuracies also increase (see Table 6). Interestingly, the turning point for condition C 1 does occur at the right place when training on more type B training data: at 60 for females and 65 for males (see Figure 2(E)).
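One way to make the notion of a turning point concrete is to read it off a curve of mean network outputs. The sketch below is an illustration under assumptions: the function name is hypothetical, the 0.5 threshold is an arbitrary choice, and the example curve is made up rather than taken from the experiments.

```python
def turning_point(xs, mean_outputs, threshold=0.5):
    """Return the smallest x whose mean output reaches the threshold, or None."""
    for x, y in sorted(zip(xs, mean_outputs)):
        if y >= threshold:
            return x
    return None


# Made-up step curve that switches from 0 to 1 at age 45.
ages = list(range(101))
toy_outputs = [0.0 if age < 45 else 1.0 for age in ages]
toy_turning_point = turning_point(ages, toy_outputs)
```

For a curve that is expected to fall rather than rise, such as the in-patient curve on the Patient-Distance dataset, the same function could be applied to `1 - output`.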

Simplified The simplified domain consists of only the two conditions C 1 and C 6, without any other conditions or noise variables. In this less complex version of the domain, overall performance is much higher, and conditions C 1 and C 6 are learned quite successfully. Figures 2(C) and (F), and 3(C) and (F), respectively, are very close to the ideal output graphs, with turning points in the correct places. This is also reflected in near-perfect accuracy scores on the Age-Gender and Patient-Distance datasets in Table 7. As argued before, the difference between type A and type B datasets is much smaller than in the original domain, hence the results found for these two datasets are now quite similar.

Tort law Recall that in the tort law domain, on the Imputability dataset, networks should output 1 if the case can be imputed to the person, and 0 otherwise; on the Unlawfulness dataset, the networks should output 1 if the case is unlawful, and 0 otherwise. Table 9 shows how well the networks were able to internalize the notions of unlawfulness and imputability. When trained on all instances, the mean output of the networks is 0 if a case is not unlawful, and 1 if it is, which is exactly what it should be. Networks trained on all instances attain a perfect score on the Imputability dataset as well. This can also be seen in Table 8, where the networks score 100% (or, with three hidden layers, near 100%) accuracy on the Unlawfulness and Imputability datasets after training on all instances.

With less data, however, accuracies drop to around 92-95% for the Unlawfulness dataset and 91-94% for the Imputability dataset. This accuracy may still seem high, but we should take into account the label distributions (66.67-33.33% and 87.5-12.5%, respectively). Table 9 shows that networks still perform perfectly on cases in which the unlawfulness and imputability conditions evaluate to true. When the conditions are false, however, mistakes are made. The average output of networks on the Unlawfulness dataset increases to 0.018, where it should be 0, meaning that some lawful cases are classified as unlawful. In the Imputability dataset, the mean output increases more drastically, to 0.875, when imputability is false, meaning that in 87.5% of the instances in which the case cannot be imputed to a person, the networks incorrectly decided that it can. Thus, despite their high accuracy on the general test set, the networks largely ignored the concept of imputability.
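The contrast between a high accuracy and a poor mean output on the false cases can be illustrated by grouping outputs by the truth value of the condition, in the manner of Table 9. The following is a minimal sketch on made-up data: the function name is hypothetical, and the toy dataset assumes, purely for illustration, that the cases in which the condition is true form the larger class (87.5%).

```python
from collections import defaultdict


def mean_output_by_condition(condition_values, outputs):
    """Average the network outputs separately for false and true condition values."""
    grouped = defaultdict(list)
    for value, output in zip(condition_values, outputs):
        grouped[bool(value)].append(output)
    return {value: sum(vals) / len(vals) for value, vals in grouped.items()}


# Made-up toy data (not the paper's dataset): 8 cases where the condition is
# false and 56 where it is true. The toy classifier outputs 1 for all but one case.
condition = [False] * 8 + [True] * 56
outputs = [1, 1, 1, 1, 1, 1, 1, 0] + [1] * 56

per_condition = mean_output_by_condition(condition, outputs)
overall_accuracy = sum(int(o == int(c)) for o, c in zip(outputs, condition)) / len(outputs)
```

In this toy setting the mean output on the false cases is 0.875 while the overall accuracy is still 57/64, or about 89%, which shows how an imbalanced label distribution can hide a condition that is almost entirely ignored.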

A Method for Rationale Evaluation

Although our experiments and discussion focused on specific example domains and neural networks, our approach for rationale evaluation can be seen as a general method independent of the machine learning algorithm applied. This paper therefore proposes a knowledge-driven method for model-agnostic rationale evaluation, consisting of three distinct steps:

1. Measure the accuracy of a trained system, and proceed if the accuracy is sufficiently high;
2. Design dedicated test sets for rationale evaluation targeting selected rationale elements based on expert knowledge of the domain;
3. Evaluate the rationale through the performance of the trained system on these dedicated test sets.
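The three steps above can be sketched as a small model-agnostic routine that only requires a prediction function. This is an illustrative sketch under assumptions, not the authors' implementation: all names, the 0.95 accuracy threshold, the feature encoding (age, gender with 0 for female and 1 for male) and the toy classifier are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Case = Sequence[int]
Dataset = Tuple[List[Case], List[int]]  # (feature vectors, expected labels)


def accuracy(predict: Callable[[Case], int], dataset: Dataset) -> float:
    """Fraction of cases for which the prediction equals the expected label."""
    cases, labels = dataset
    correct = sum(int(predict(case) == label) for case, label in zip(cases, labels))
    return correct / len(labels)


def evaluate_rationale(
    predict: Callable[[Case], int],
    general_test: Dataset,
    dedicated_tests: Dict[str, Dataset],
    min_accuracy: float = 0.95,  # assumed threshold; domain dependent
) -> Optional[Dict[str, float]]:
    """Step 1: check general accuracy. Step 3: score each dedicated test set."""
    if accuracy(predict, general_test) < min_accuracy:
        return None  # general performance too low to proceed
    return {name: accuracy(predict, ds) for name, ds in dedicated_tests.items()}


# Step 2 (hypothetical example): a dedicated Age-Gender test set built from
# domain knowledge, covering every age 0-100 for both genders.
pension_age = {0: 60, 1: 65}  # 0 = female, 1 = male (assumed encoding)
ag_cases = [(age, gender) for gender in (0, 1) for age in range(101)]
ag_labels = [int(age >= pension_age[gender]) for age, gender in ag_cases]

# Toy classifier that ignores gender and always uses 60 as the threshold.
def toy_predict(case: Case) -> int:
    return int(case[0] >= 60)

general = ([(30, 0), (70, 0), (30, 1), (70, 1)], [0, 1, 0, 1])
report = evaluate_rationale(toy_predict, general, {"Age-Gender": (ag_cases, ag_labels)})
```

Here the toy classifier is perfect on the small general test set but misclassifies men aged 60 to 64 on the dedicated set, which is the kind of discrepancy the method is meant to expose.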

The first step is based on the assumption that efforts for assessing and possibly improving the rationale discovery capabilities of a learned model are only taken if the general performance of the model is already considered good enough. Here we assume performance is measured using accuracy, but other measures can be employed as well and the threshold of what is considered good enough may vary per domain and application. The second step in our method depends on domain knowledge. Hence the method effectively is a quantitative human-in-the-loop solution for rationale evaluation.

In the third step, performance is evaluated again, now considering not only accuracy but also how the model output compares with the expected output on the dedicated test sets. Our examples have shown that the latter depend on the type of features involved.

Subsequently, the information gained by using this rationale evaluation method can be used to improve the rationale of the system by adjusting the training data accordingly, imposing sound rationale discovery.

The method does not currently specify how the dedicated test sets are constructed. We aim to further operationalize the rationale evaluation method by using information about the knowledge in the domain, and the distribution of examples, for instance building on Bayesian networks.

Conclusion

The work in this paper was inspired by Bench-Capon's 1993 paper that investigated whether neural networks are able to tackle open texture problems. The conclusions were that neural networks can perform very well on such problems in terms of accuracy, even if some conditions from the domain are not learned [Bench-Capon, 1993].

In this paper we first replicated the original experiments as closely as possible to verify that we can reproduce the results from the 1993 paper. In addition, we repeated the experiments with larger training datasets to ensure that the original conclusions about conditions that were not learned are not due to a lack of data. The idea of constructing test cases to test specific conditions inspired us to propose a method for assessing rationale discovery capabilities by designing dedicated test datasets and evaluating performance on these knowledge-driven test sets, combining quantitative and qualitative evaluation elements in a hybrid way. Type B datasets served to complicate the problem in the original study, but also demonstrate that training can be improved using knowledge-driven, tailor-made training sets.

We investigated three legal domains, in which neural networks were trained on labelled cases and tasked with predicting unlabelled cases. We started off with an artificial domain from the literature, followed by a simplified adaptation of that domain. Lastly, we investigated a real-life domain as well. The results indicate that the networks are able to achieve high accuracies in each of the three domains. The networks are therefore able to make the right decisions in most cases, with accuracies averaging around 99% on type A or regular test sets. This is how machine learning problems are usually evaluated. Using our approach of rationale evaluation, however, we show that the networks do not necessarily learn the conditions, despite their high accuracy scores. Performance on the dedicated test sets (type B, Age-Gender and Patient-Distance) shows that the networks are unable to learn the conditions C 1 and C 6. This was suggested in the original experiment [Bench-Capon, 1993] and it holds true in the replication study with modern, commonly used machine learning techniques and more data. By adjusting the distribution of the training data based on expert domain knowledge (training on a type B dataset) these accuracies increase. Simplifying the domain shows that systems are able to learn the conditions C 1 and C 6, though still not perfectly. Even in the real-life tort law domain, with a non-fictional knowledge structure and different characteristics, a similar pattern can be observed. The networks failed to learn the independent condition that defines imputability, despite their high accuracies on the general test set.

This study therefore reaffirms the conclusions from previous work, while simultaneously introducing a model-agnostic method for assessing rationale discovery capabilities of machine learned black box models, using dedicated test datasets designed with expert knowledge of the domain. In future research, we aim to further detail and extend our method such that by employing it, the soundness of the rationale becomes tangible, and its quality can be asserted. Ultimately, based on this evaluation, the training data of the black-box systems can be altered to improve their rationale. Further expanding upon this design method will bring us closer to AI that is both explainable and responsible.

Fig. 2: For all training sets from the (simplified) welfare domain: mean network output vs age on Age-Gender test set when trained on type A training sets (A-C) and on type B training sets (D-F).

Fig. 3: For all training sets from the (simplified) welfare domain: network output vs distance on Patient-Distance test set when trained on type A training sets (A-C) and on type B training sets (D-F).

Fig. 4: An idealistic expectation of the outputs of a network on the Age-Gender dataset versus the age for both genders (left) and on the Patient-Distance dataset versus the distance for both patient types (right).

Table 1: Features in the welfare benefit domain.

Feature                      Values
Age                          0-100 (all integers)
Gender                       male or female
Con1, ..., Con5              true or false
Spouse                       true or false
Absent                       true or false
Resources                    0-10,000 (all integers)
Type (Patient type)          in or out
Distance (to the hospital)   0-100 (all integers)

Table 3: Experimental setup. For each domain, every listed training set is used in combination with all listed test sets.

Domain                       Train on                     Test on
Welfare benefit              Type A (2,400 instances)     Type A (2,400 instances)
                             Type B (2,400 instances)     Type B (2,400 instances)
                             Type A (50,000 instances)    Age-Gender
                             Type B (50,000 instances)    Patient-Distance
Simplified welfare benefit   Type A                       Type A
                             Type B                       Type B
                                                          Age-Gender
                                                          Patient-Distance
Tort law                     Regular (5,000 instances)    Regular (5,000 instances)
                             Regular (500 instances)      Unique
                                                          Unlawfulness
                                                          Imputability

Table 4: The accuracies obtained by the neural networks in the original study [Bench-Capon, 1993].

                   Trained on training set A     Trained on training set B
                   Test set A    Test set B      Test set A     Test set B
1 hidden layer     99.25         72.25           99.25          97.91
2 hidden layers    98.90         76.67           99.00          98.08
3 hidden layers    98.75         74.33           unconverged    unconverged

Table 5: The accuracies obtained by the neural networks in the replication study.

Trained on training set A
                   Test set A    Test set B    Age-Gender    Patient-Distance
1 hidden layer     98.97±0.19    72.39±1.66    52.14±4.01    50.05±0.09
2 hidden layers    98.87±0.21    72.56±1.83    53.19±4.56    50.06±0.14
3 hidden layers    98.92±0.23    70.97±1.74    50.45±3.23    50.03±0.07

Trained on training set B
                   Test set A    Test set B    Age-Gender    Patient-Distance
1 hidden layer     96.13±0.66    90.51±1.25    86.4±1.33     85.77±5.21
2 hidden layers    95.5±0.87     89.4±1.5      85.62±1.21    83.09±7.22
3 hidden layers    93.95±9.03    86.34±7.71    83.81±7.29    74.57±12.8

Table 6: The accuracies obtained by the neural networks in the replication study with more training data.

Trained on training set A
                   Test set A    Test set B    Age-Gender    Patient-Distance
1 hidden layer     99.8±0.03     80.98±1.47    60.22±3.87    64.44±2.87
2 hidden layers    99.79±0.04    83.49±1.86    65.04±5.12    66.5±3.24
3 hidden layers    99.78±0.05    82.89±2.05    64.45±5.6     64.2±3.1

Trained on training set B
                   Test set A    Test set B    Age-Gender    Patient-Distance
1 hidden layer     99.64±0.17    98.53±0.15    98.51±0.47    97.17±0.46
2 hidden layers    99.28±0.36    98.06±0.35    97.75±0.9     96.53±0.3101
3 hidden layers    98.95±0.55    97.5±0.4      96.72±1.0     95.73±0.86

Table 7: The accuracies obtained by the neural networks in the simplified welfare domain.

Trained on training set A
                   Test set A    Test set B    Age-Gender    Patient-Distance
1 hidden layer     99.12±0.06    98.2±0.15     99.68±0.1     97.75±0.05
2 hidden layers    99.61±0.14    99.2±0.28     99.88±0.1     98.14±0.53
3 hidden layers    99.48±0.27    99.01±0.47    99.7±0.46     98.06±0.61

Trained on training set B
                   Test set A    Test set B    Age-Gender    Patient-Distance
1 hidden layer     99.46±0.06    99.04±0.12    99.67±0.06    98.04±0.31
2 hidden layers    99.77±0.14    99.6±0.25     99.78±0.18    99.09±0.71
3 hidden layers    99.51±0.78    99.38±0.37    99.63±0.47    98.8±0.66

Table 8: The accuracies obtained by the neural networks in the tort law domain.

Trained on all instances
                   General       Unique        Unlawfulness    Imputability
1 hidden layer     100±0         100±0         100±0           100±0
2 hidden layers    100±0         100±0         100±0           100±0
3 hidden layers    99.86±0.37    99.76±0.66    99.67±1.83      99.5±1.56

Trained on smaller dataset
                   General       Unique        Unlawfulness    Imputability
1 hidden layer     98.45±0.5     97.24±0.89    92.8±3.47       91.22±4.04
2 hidden layers    99.03±0.44    98.27±0.78    95.71±3.1       94.38±3.84
3 hidden layers    98.23±0.72    96.83±1.28    92.96±5.33      91.45±3.51

Table 9: Mean network output on the Unlawfulness and Imputability datasets versus the logical evaluation of the unlawfulness resp. imputability conditions.

                 Trained on all instances    Trained on smaller dataset
Unlawfulness     Output                      Output
False            0                           0.018
True             1                           1
Imputability     Output                      Output
False            0                           0.875
True             1                           1

The Jupyter notebooks used for data generation can be found in a Github repository: https://github.com/CorSteging/DiscoveringTheRationaleOfDecisions

References

K. D. Ashley. Modeling Legal Arguments: Reasoning with Cases and Hypotheticals. The MIT Press, Cambridge (Massachusetts), 1990.
K. D. Ashley. A brief history of the changing roles of case prediction in AI and law. Law in Context, 36(1), 2019.
K. Atkinson, T. Bench-Capon, F. Bex, T. F. Gordon, H. Prakken, G. Sartor, and B. Verheij. In memoriam Douglas N. Walton: the influence of Doug Walton on AI and law. Artificial Intelligence and Law, 2020a.
K. Atkinson, T. Bench-Capon, and D. Bollegala. Explanation in AI and law: Past, present and future. Artificial Intelligence, 289:103387, 2020b.
T. Bench-Capon. Neural networks and open texture. In Proceedings of the 4th International Conference on Artificial Intelligence and Law, ICAIL '93. ACM, New York, 1993.
S. Brüninghaus and K. D. Ashley. Predicting outcomes of case based legal arguments. In Proceedings of the 9th International Conference on Artificial Intelligence and Law (ICAIL 2003). ACM, New York, 2003.
I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In Proceedings of International Conference on Learning Representations, 2015.
T. F. Gordon. The Pleadings Game: An Artificial Intelligence Model of Procedural Justice. Kluwer, Dordrecht, 1995.
M. Grabmair, K. D. Ashley, R. Chen, P. Sureshkumar, C. Wang, E. Nyberg, and V. R. Walker. Introducing LUIMA: an experiment in legal conceptual retrieval of vaccine injury decisions using a UIMA type system and tools. In Proceedings of the 15th International Conference on Artificial Intelligence and Law. ACM, New York, 2015.
J. C. Hage, R. Leenes, and A. R. Lodder. Hard cases: a procedural approach. Artificial Intelligence and Law, 2(2), 1993.
D. Hunter. Out of their minds: Legal theory in neural networks. Artificial Intelligence and Law, 7(2), 1999.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of 3rd International Conference on Learning Representations, 2015.
R. R. Leighton and A. P. Wieland. The Aspirin/Migraines Software Package. Springer, New York, Boston, MA, 1994.
S. M. Lundberg and S. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30. Curran Associates, Inc., 2017.
M. Medvedeva, M. Vols, and M. Wieling. Using machine learning to predict decisions of the European Court of Human Rights. Artificial Intelligence and Law, 2019.
T. Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267, 2019. ISSN 0004-3702.
R. Mochales Palau and M. F. Moens. Argumentation mining: the detection, classification and structure of arguments in text. In Proceedings of the 12th International Conference on Artificial Intelligence and Law (ICAIL 2009). ACM Press, New York, 2009.
M. Možina, J. Žabkar, T. Bench-Capon, and I. Bratko. Argument based machine learning applied to law. Artificial Intelligence and Law, 13(1), 2005.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2011.
L. Philipps and G. Sartor. Introduction: from legal theories to neural networks and fuzzy reasoning. Artificial Intelligence and Law, 7(2), 1999.
M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, CA, USA, 2016.
E. L. Rissland and K. D. Ashley. A case-based system for trade secrets law. In Proceedings of the 1st International Conference on Artificial Intelligence and Law, ICAIL '87. ACM, New York, NY, USA, 1987.
C. Steging, S. Renooij, and B. Verheij. Discovering the rationale of decisions: Towards a method for aligning learning and reasoning (accepted). In Proceedings of the 18th International Conference on Artificial Intelligence and Law, ICAIL '21. ACM, New York, 2021.
A. Stranieri, J. Zeleznikow, M. Gawler, and B. Lewis. A hybrid rule-neural approach for the automation of legal reasoning in the discretionary domain of family law in Australia. Artificial Intelligence and Law, 7(2-3), 1999.
B. Verheij. Artificial argument assistants for defeasible argumentation. Artificial Intelligence, 150(1-2), 2003a.
B. Verheij. Dialectical argumentation with argumentation schemes: An approach to legal logic. Artificial Intelligence and Law, 11(2-3), 2003b.
B. Verheij. Formalizing arguments, rules and cases. In Proceedings of the 16th International Conference on Artificial Intelligence and Law, ICAIL '17. ACM, New York, 2017.
C. S. Vlek, H. Prakken, S. Renooij, and B. Verheij. A method for explaining Bayesian networks for legal evidence with scenarios. Artificial Intelligence and Law, 24(3), 2016.
M. Wardeh, T. Bench-Capon, and F. Coenen. Padua: a protocol for argumentation dialogue using association rules. Artificial Intelligence and Law, 17(3), 2009.
A. Wyner, R. Mochales-Palau, M. F. Moens, and D. Milward. Approaches to text mining arguments from legal cases. In Semantic Processing of Legal Texts. Springer, Berlin, 2010.