
Taking the Law More Seriously by Investigating Design Choices in Machine Learning Prediction Research

Cor Steging

Silja Renooij

Bart Verheij

Bernoulli Institute of Mathematics, Computer Science and Artificial Intelligence, University of Groningen
Department of Information and Computing Sciences, Utrecht University

Approaches to court case prediction using machine learning differ widely, with varying levels of success and legal reasonableness. In part this is due to some aspects of law, such as justification, being inherently difficult for machine learning approaches. Another aspect is the effect of design choices and the extent to which these are legally reasonable, which has not yet been extensively studied. We create four machine learning models tasked with predicting cases from the European Court of Human Rights and we perform experiments in order to measure the role of the following four design choices and effects: the choice of performance metric; the effect of including different parts of the legal case; the effect of a more or less specialized legal focus; and the temporal effects of the available past legal decisions. Through this research, we aim to study design decisions and their limitations and how they affect the performance of machine learning models.

Keywords: court case prediction, design choices, machine learning

1. Introduction

Recently, much work has been done in the field of court case predictions. While automatically determining the outcome of court cases remains an academic exercise, the large variation in the ways that previous research has tackled the problem makes it nearly impossible to compare the approaches [1]. The law has unique characteristics, making it difficult to apply machine learning in the legal domain: machine learning is retrospective, assumes normally distributed, homogeneous data that is largely free of errors, and it often cannot explain its decision-making [2]. The law on the other hand is prospective, changes over time, contains wrong decisions, and demands arguments for the decisions made. These unique characteristics of the law are not always taken into account. To take the law more seriously, we must consider these when doing machine learning research in the field of AI & Law.

Some requirements of the law, such as justification, are inherently difficult for machine learning systems, and machine learning systems have been shown to use unsound reasoning [3]. However, despite their importance, our focus in this paper will not be on justification, responsibility or explainability. Moreover, our goal is not to create a machine learning system that obtains a better performance, or has a better alignment with legal experts [4]. Instead, we investigate the effect of specific design choices and effects in machine learning research, in order to better analyze performance and alignment with characteristics of the legal domain.

We focus on research involving cases from the European Court of Human Rights (ECHR), which has been used as a benchmark in a number of studies. ECHR data is included in the LexGLUE benchmark datasets [5], and forms the basis of the ECHR-OD repository [6]. Previous studies have applied different machine learning systems to this dataset, using various methods and achieving different levels of success [7, 8, 9, 10, 11]. To study the effects of design choices, we train four different types of machine learning models on cases from the ECHR: an SVM, a Naive Bayes (NB) classifier, a Random Forest (RF) and a BERT model. For these four models, we study the choice of performance metrics; the effect of including different parts of the legal case; the effect of a more or less specialized legal focus; and the temporal effects of the available past legal decisions.

Our first set of experiments focuses on the replication and expansion of results in the literature. We train and test our four models on two different datasets from the ECHR, using various parts of each case as input, and report both the accuracies and the Matthews Correlation Coefficient (MCC) on each task, model and dataset.

The ECHR covers a number of separate articles. Earlier work on court case prediction used either single, general models trained on all articles [8, 10, 5], or a separate, specialized classifier for each article [7, 9]. In the second set of experiments, we create both a Generalist model and an Ensemble of specialized models in order to investigate the differences in their performances.

The third and final set of experiments studies temporal effects. We investigate the effects of training models on cases from the past to predict future cases, compared to models trained on randomly split data. Furthermore, we explore the effects of training on cases from varying time windows for a model that predicts future cases.

In Section 2, we discuss relevant background information. Section 3 addresses our experimental setup and Section 4 the experiments themselves. We conclude our paper in Section 5.

Proceedings of the Sixth Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2023), June 23, 2023, Braga, Portugal.
* Corresponding author.
c.c.steging@rug.nl (C. Steging); s.renooij@uu.nl (S. Renooij); bart.verheij@rug.nl (B. Verheij)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

2. Background

The current work focuses on the cases made publicly available by the ECHR, which is an international court that deals with cases claiming violations of articles laid out by the European Convention on Human Rights. A case can pertain to multiple articles of the ECHR and multiple articles can be violated. Each case description can be divided into the following main parts:

Introduction: general information, such as title, date and details about the section of the Court.
Procedure: the course of action taken from lodging an application until the final Court judgement.
Facts: the circumstances of the case, such as the relevant background information of the applicant and other events and circumstances; and the relevant law from documents other than the ECHR.
Law: the legal arguments of the Court.
Judgement: the Court's decision.
Dissenting/Concurring opinions: judges' opinions and why they voted for or against a violation.

In court case prediction, the case text acts as the features and the judgement as the label. Three variations of the prediction task have been studied:

• In the Binary classification task (BC), there is one dataset that contains all cases. Models are tasked with predicting whether any article was violated for each case [7, 9, 8, 10].
• In the Multi-label classification task (MLC), there is one dataset that contains all cases. Models are tasked with predicting which articles were violated for each case [8, 10].
• In the Article classification task (AC), there are multiple datasets, one for each article. Models are tasked with predicting whether a specific article was violated for each case [7, 9].

The first models applied to the ECHR classification task were Support Vector Machines (SVM) [7, 9]. A later study used BERT, a state-of-the-art pre-trained transformer model. While transformers tend to outperform traditional models, BERT yielded a lower accuracy on the ECHR task [8], because the ECHR cases greatly exceed BERT's 512 token limit and had to be truncated. Chalkidis et al. therefore also introduced a hierarchical version of BERT (HIER-BERT), where the words of each fact in the case are first converted to a fact embedding using the base BERT model. This version performed significantly better on the binary classification task than their regular BERT model with truncation (F1-scores of 82.0% vs. 17.0%). By pre-training this BERT model on additional legal data, a legal-BERT was developed, specifically suited to legal texts [10] (see also [12]), which performed better on the ECHR task than the HIER-BERT model (F1-scores of 88.3.0% vs. 82.0%). It has been noted, however, that specialized transformers in the legal domain (legal-BERT) provide relatively little improvement over a standard transformer, especially when compared to the difference between regular and specialized BERT models in other fields, such as in the biomedical domain [13]. Mumford et al. took a hybrid approach to the court case prediction task, opting to combine HIER-BERT models with Abstract Dialectical Frameworks. While it is difficult to compare the performance of this hybrid model to other research, it did outperform a HIER-BERT model trained on the same subset of ECHR data. Additionally, the hybrid model is more explainable and can provide justifications for its predictions.

3. Experimental setup

Here we describe our datasets, machine learning models, preprocessing steps, and performance metrics used. All of the code used to run the experiments can be found in a public repository (https://github.com/CorSteging/InvestigatingDesignChoices).

3.1. Datasets

We train machine learning models on cases from the ECHR and use these models to predict new case decisions. We use the dataset from the ECHR Open Data project (ECHR-OD; https://echr-opendata.eu/, accessed 21 Nov. 2022) [6]. This repository contains formatted and standardized data from the ECHR that is automatically updated every month, establishing a public shared baseline for machine learning models. Each case in this dataset contains the text of the case and the outcome, i.e. which articles were considered violated. A single case can violate multiple articles.

We set up our datasets for the article classification (AC) task and the binary classification (BC) task. For the AC task, there are 9 datasets, one for each article. Each dataset contains all of the cases pertaining to that specific article, and the binary label indicates whether that specific article was violated. For the BC task, there is one dataset that contains all cases. The binary label of these cases indicates whether any article was violated. The number of cases in each dataset can be found in Table 1, alongside the percentage of cases that evaluate to a violation of their respective article. Note again that multiple articles can be considered for a single case. The sum of all datasets for each individual article in Table 1 is therefore greater than the number of cases in the 'All' dataset. The outcome in most cases is a violation. The label distribution is therefore skewed towards violation.

To train a model, we balance the dataset used such that half of the cases evaluate to violation and the other half evaluate to non-violation. To balance a dataset, we randomly remove violation cases from the dataset until their number equals the number of non-violation cases.
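The balancing step described above can be sketched as random undersampling of the violation class. This is a minimal illustration and not the code from the authors' repository: the function name `balance_by_undersampling`, the dictionary layout of a case, and the fixed seed are assumptions made for the example, and the sketch assumes that violation is the majority class.

```python
import random


def balance_by_undersampling(cases, seed=0):
    """Randomly drop violation cases until both classes have equal size.

    `cases` is a list of dicts with a boolean "violation" key
    (a hypothetical layout chosen for this sketch).
    """
    rng = random.Random(seed)
    violations = [c for c in cases if c["violation"]]
    non_violations = [c for c in cases if not c["violation"]]
    if len(violations) < len(non_violations):
        # The text only describes removing violation cases, so the
        # opposite situation is treated as out of scope here.
        raise ValueError("sketch assumes violation is the majority class")
    # Keep as many violation cases as there are non-violation cases.
    kept = rng.sample(violations, len(non_violations))
    balanced = kept + non_violations
    rng.shuffle(balanced)
    return balanced


# Toy data: 8 violation cases and 3 non-violation cases.
toy = [{"id": i, "violation": i < 8} for i in range(11)]
balanced = balance_by_undersampling(toy)
```

Undersampling discards data from the majority class, which is why the balanced training sets are considerably smaller than the full datasets.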

The version of the dataset that we use from the ECHR-OD contains 14910 cases from 1968 up to and including 2022. The distribution of the cases across the years is skewed heavily towards the more recent years, however. This is clearly visible in Figure 1, where we plot the number of cases per year. Since some of the earlier years do not contain any cases, we only include cases from 1978 until 2022 in our experiments.

3.2. Models

In our study we use four different types of models: an SVM, a Naive Bayes (NB) classifier, a Random Forest (RF) classifier and a BERT model. These are all commonly used models known for their effectiveness in text classification tasks [14]. For the SVM, we use the exact same parameters as reported in [9]. The parameters of the other models are tuned using a grid search for each experiment, where we validate the performance on an unseen part of the training set. For the SVM, NB and RF models, we use the scikit-learn library [15]. We use the BERT transformer from the open-source Hugging Face library [16] and limit the number of tokens to 512 using the default Tokenizer from that library.

3.3. Preprocessing

We train and test our four models on the ECHR-OD dataset of case texts, preprocessed to remove unnecessary information from the text and to reduce the token size for the BERT model, by applying the following heuristics:

• Change all characters to lowercase
• Remove all punctuation except for '?'
• Remove special characters, newlines and trailing white spaces
• Change "n't" to "not" ("don't" becomes "do not")
• Change all fact numbers to '>'
• Remove stop words using nltk [17]
• Remove unnecessary words that occur in every case (such as subheadings)

For the SVM, NB and RF models, the texts are then converted to n-grams and normalized using TF-IDF. The parameters for preprocessing and TF-IDF are fine-tuned using a grid search.
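A rough sketch of the text-cleaning heuristics listed above, using only the Python standard library. Several details are assumptions made for illustration rather than the authors' implementation: the small `STOP_WORDS` set stands in for the nltk stop word list, fact numbers are assumed to appear as a number followed by a period at the start of a line, and the removal of subheadings is omitted.

```python
import re
import string

# Hypothetical stand-in for the nltk stop word list used in the paper.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "was", "is"}

# All ASCII punctuation except '?' (kept per the heuristics) and '>'
# (kept because it is the placeholder for fact numbers).
_PUNCT = "".join(ch for ch in string.punctuation if ch not in "?>")
_PUNCT_TABLE = str.maketrans(_PUNCT, " " * len(_PUNCT))


def preprocess(text, stop_words=STOP_WORDS):
    """Apply the cleaning heuristics to one case text."""
    text = text.lower()
    text = text.replace("\u2019", "'")        # normalise curly apostrophes
    text = re.sub(r"n't\b", " not", text)     # "don't" -> "do not"
    # Assumed fact-number format: "12. " at the start of a line.
    text = re.sub(r"^\s*\d+\.\s", "> ", text, flags=re.MULTILINE)
    text = text.translate(_PUNCT_TABLE)       # drop remaining punctuation
    # Splitting on whitespace also removes newlines and trailing spaces.
    tokens = [t for t in text.split() if t not in stop_words]
    return " ".join(tokens)


cleaned = preprocess("12. The applicant didn't appear.\nWhy?")
```

The cleaned strings could then be passed to a TF-IDF vectorizer such as scikit-learn's `TfidfVectorizer`, whose `ngram_range` parameter controls the n-gram sizes; the specific values would come from the grid search mentioned above.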

3.4. Performance Metrics

Most models in the literature report the classification accuracy or F1-scores of their model. While this was common practice in machine learning, more recent studies have steered away from using accuracy in favor of the Matthews correlation coefficient (MCC) [18], which ranges from −1 (worst) to 1 (best). Contemporary measures like accuracy or even F1-scores have been shown to yield inflated results on binary classification tasks [19], especially on imbalanced datasets such as the ECHR cases.

The MCC, on the other hand, is only high if all four confusion matrix categories are accurate: high true positives and negatives, and low false positives and negatives. For example, a model that always predicts 'violation' will score an accuracy of 81.55% on the entire ECHR dataset, since 81.55% of the cases have a violation label. From a legal point of view, this is an extremely irresponsible and poorly designed model. However, the accuracy is high and the F1-score of this model would even be 89.83%, beating the state of the art. The MCC of such a model, however, will be 0, indicating that its predictive power is equal to random guessing. While a macro-averaged F1-score can be used for unbalanced data, it is known to be biased and does not take true negative predictions into account [19]. To take the law more seriously, we therefore choose to use the MCC to evaluate our models, even though we work with balanced datasets. We in general advocate the use of the MCC for binary classification as a best practice. We report the accuracy of our models when comparing their results with the results from the literature. In the rest of our study, we will report MCC values only.

4. Experiments

We now discuss our three sets of ECHR-OD experiments.

4.1. Experiment 1: Extended replication

To evaluate the performance of each model, we apply a 10-fold cross validation to the model for each article (the AC task), and for all articles at once (the BC task). We balance each of the datasets such that exactly half of the cases evaluate to 'violation'. We compare the performance on the ECHR-OD dataset to that of models from the literature. For comparison, we also train and test our models on the subset of data used by Medvedeva et al. (2020; see [9] for details), which contains only 3133 cases from 1968 to 2017.

Performance Results. Table 2 shows four sets of rows containing performances of models on the AC task and the BC task (All). In the first set of rows, we show results from previous research. Note that results marked with an asterisk (*) are F1-scores rather than accuracies. In the second set of rows, we show the accuracies of the 10-fold cross validation of our SVM, NB, RF and BERT models using the data from [9]. The third set of rows lists the accuracies of our four models using the ECHR-OD data, and the last set of rows contains the MCC values of our four models on the ECHR-OD dataset. Note that we decided to round off our result values to one decimal, just as in the more recent studies. The first two studies only reported accuracies without decimals. For this general performance, we used the same parts of the cases in the training data as Medvedeva et al. (2020). These are the procedure, the facts or both, depending on the article.

Discussion. When we compare the accuracies of our models, trained and tested on the Medvedeva et al. (2020) dataset, we can see that our models perform similarly to those in the literature on the AC task. Note that the accuracies of the SVM by Aletras et al. (2016) were obtained by training the SVM on parts of the case that were not available before the judgement was made, and this work should therefore be classified as outcome identification, rather than outcome prediction [1]. Performances on the BC task are lower than the ones achieved by the SVM, HIER-BERT and Legal-BERT, but higher than the one achieved by the regular BERT model. However, it is difficult to compare exact performances without using the exact same datasets. For example, BERT and HIER-BERT were trained on 7100 cases and their F1-score was calculated on predictions on unbalanced test sets (66% violation) [8], while we trained on 3133 cases and used balanced test sets. Across the four models that we use, there is no clear best model, and performance is dependent on the dataset, task and article.

Accuracies are generally slightly lower when training on the larger and more recent ECHR-OD dataset. We also see that the MCC ranks the performance of the models differently than accuracy does (except for Articles 3 and 5). The MCC more accurately depicts the actual performance of the models as it accounts for the rate of true positives, false positives, true negatives and false negatives, which leads to a more reasonable evaluation. The last two sets of rows in Table 2 therefore show that accuracies or F1-scores can show inflated results and can incorrectly suggest a high performance. Therefore, if we wish to take AI seriously in AI&Law, we should use more reliable metrics.
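To make the argument about metrics concrete, the following sketch computes accuracy, F1 and MCC from the four confusion matrix counts and applies them to the always-'violation' model discussed in Section 3.4. The 10,000-case test set is a hypothetical size chosen so that 81.55% of the cases are violations; the function name `binary_metrics` is likewise illustrative. When the MCC denominator is zero the coefficient is strictly undefined, and the sketch follows the common convention of reporting 0 in that case.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, F1 and MCC from confusion matrix counts.

    'violation' is treated as the positive class.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0 when a row or column of the matrix is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Always predicting 'violation' on 10,000 hypothetical cases,
# 81.55% of which are violations.
always_violation = binary_metrics(tp=8155, fp=1845, tn=0, fn=0)
```

Accuracy and F1 both look strong for this degenerate model, while the MCC stays at 0 because the model never produces a true negative.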

Investigating what Parts to use. The cases from the ECHR consist of 6 different parts, including the judgement. Previous studies have used different parts of the cases to train their models, with mixed results. As discussed in the background section, only the introduction, procedure and facts are known before a case is tried. Some of the earlier literature has used parts that were made available after the case has been judged, however. If our goal is to take the law seriously, we should only include parts that are available before the judgement was made. We therefore do not use the law section, the judgement and the dissenting and concurring opinions. Additionally, the introduction, which contains only general information about a case, will also not be used, as this should not have any predictive value. We therefore focus on the procedure, the facts, and a combination of both to look into which of these yields the best results. We should note that all cases in the ECHR dataset were published after the cases were tried; their texts can therefore potentially contain implicit or explicit information that was not available before the case was tried, even in the introduction, procedure and facts sections [20].

To investigate which parts of the case are useful in court case prediction, we train each of the four model types (SVM, NB, RF, BERT) on the facts, the procedure, and both the facts and the procedure. We use the ECHR-OD dataset. We report the average MCC across a 10-fold cross validation for each classifier, trained on every individual article and all articles at once. This experiment expands upon the research done by Medvedeva et al. (2020) by exhaustively reporting the performances, in terms of MCC instead of accuracy, of each combination of our four models trained on all possible parts. This comparison can be seen in Table 3, where the best results for each classifier on a given article are shown in bold.

Discussion. There is quite some variation between the MCC of models using different parts in Table 3. Determining which part to use is therefore important to obtain the highest possible performance. The facts and the combination of facts and procedure yield the best results across the combinations of part, model and article. The procedure alone ranks the worst. This means that the facts are an essential part when doing court case predictions. This is unsurprising, as this part contains all of the relevant information regarding the circumstances, background, applicant and relevant law from other documents. We see that adding the procedure can improve performance, but this is dependent on the combination of article and model. This supports the method used in [9], where different parts are used for each article. We also show that the performance is dependent on the combination of the parts used and the model used, and we base our conclusions on the MCC rather than the accuracy.

4.2. Experiment 2: Specialist vs. Generalist models

In previous research, models were either trained on each individual article (AC task) or on all articles at once (BC task); in the latter case, the model is tasked with predicting whether there has been any violation, regardless of what article was violated. The performance of our own models and models from the literature on this task can be seen in the rightmost column of Table 2. This approach can be compared to a human legal generalist, who has knowledge of all articles. Instead of just a single generalist, however, one could opt to use a team of legal specialists, where each person of the team is specialized in a different article. In this experiment, we examine these two different approaches to the BC task.

The first approach is to create a single Generalist model that is trained on all cases of the ECHR. In the second
approach, we train an Ensemble of models, wherein each model is specialized in a different article of the ECHR, akin to the team of legal specialists. Each model of this Ensemble is trained on identifying violations for just a single article, thus reducing the problem space and potentially increasing performance. Additionally, such an Ensemble would be able to tell what article was violated, thus providing explanations for its decisions. Each model of the Ensemble would, however, have less data than the Generalist model, which might decrease performance.

We perform an experiment to determine which approach yields the best performing model. We create three types of datasets:

• nine Ensemble Training Sets, one for each article, containing 90% of all of the cases that consider that specific article; the features are the facts of the case, and the label is whether or not there is a violation of the respective article in the case.
• the Generalist Training Set contains all of the cases from all nine Ensemble Training Sets; the features are again the facts, and the label is whether or not any article was violated in a case.
• the Testing Set contains the 10% of cases not used in the Ensemble Training Set and Generalist Training Set; features and label are the same as in the latter.

We generate the 90% - 10% split in training and test data randomly, preserving the 50% balance in classes. We use these three datasets to train and test a Generalist model and an Ensemble. The Generalist model is trained on the Generalist Training Set and evaluated using the Testing Set. The Ensemble consists of nine specialist models, each trained on a different Ensemble Training Set. Each of these specialist models will be tasked with predicting the labels of the cases from the Testing Set. The predictions of each specialist model will be combined in a disjunctive manner to form the final prediction of the Ensemble. In other words, the output will be violation if any specialist model predicts violation, and non-violation otherwise.

We compare the performance of the Generalist model to that of the Ensemble. The experiment is performed for every one of our four model types: the SVM, NB, RF and BERT. We also repeat every experiment 10 times for each type of model, using different cases for the training and testing sets in each iteration. We report the average MCC in Table 4. The best results are shown in bold.

Discussion. From Table 4 we see that for most Generalist models the MCC is much higher than that of the Ensemble. The Generalist models therefore outperform the Ensembles for most types of classifiers. The exception is the NB classifier, where there is little difference in MCC. This suggests that, in predicting ECHR court cases, a larger problem space combined with more training data results in better performance than a reduced problem space with less data.

Each of the specialist models in an Ensemble is trained on cases that pertain to a single article. The Testing Set used for these specialist models, however, considers all articles, most of which the individual specialist models will not have seen during training. Ideally, if a specialist model is presented with a case that considers an article that it is not trained for, it should predict 'non-violation'. However, it is not explicitly trained to give that prediction and, as a result, provides a random prediction. Given that the final prediction of the Ensembles is an OR-function, this leads to many incorrect 'violation' predictions. This can be seen in Table 5, which displays the confusion matrix of all of the predictions done by the SVM Ensemble. The Ensemble predicted 'violation' in a total of 18,739 cases, out of which 17,601 were correct, and it predicted 'non-violation' in a total of 1,761 cases, out of which only 623 were correct. This initially might seem like a great performance, and would lead to an accuracy of 89.9%. However, the Testing Set is heavily skewed towards violation (91.14% of all cases). The performance on violation cases is therefore relatively good, accurately predicting the outcome of 93.9% of all violation cases. However, the model also incorrectly assigns violation to 64.6% of all non-violation cases.

Improved Ensemble. A potential solution to this issue is to present the specialist models with cases that do not pertain to the article that they are focused on during training. For example, a specialist model trained on article 6 cases could also explicitly be trained to predict 'non-violation' for all cases that do not pertain to article 6. To create this new Improved Ensemble, we alter the training datasets as follows. We create nine Improved Ensemble Training Sets, one for each article. Just as the earlier Ensemble Training Set, each Improved Ensemble Training Set contains 90% of all of the cases that consider that specific article. The features in these datasets are the facts of the case, and the labels are whether or not there is a violation of the respective article in the case. Additionally, we add cases from other articles to this dataset, where each additional case has the 'non-violation' label. Since almost all articles contain more violation than non-violation cases (see Table 1), we add these additional cases to each Improved Ensemble Training Set until their number of violation and non-violation cases is equal. The Improved Ensemble is set up in the same way as the earlier Ensemble, but each specialist model of this Improved Ensemble is trained on the Improved Ensemble Training Sets. The results of the Improved Ensemble are shown in the rightmost column of Table 4.

Discussion. The Improved Ensemble performs better than the initial Ensemble when using the SVM, RF and BERT models. This supports our idea that the specialist models generally perform better when including additional cases from other articles with a 'non-violation' label. This informs the model to predict 'non-violation' for cases pertaining to other articles.

The NB models, however, seem to perform worse in the 'Improved Ensemble' scenario. Analysing the data shows that the specialist models in the 'Improved' NB Ensemble now predict 'violation' in 99.5% of all test cases, thus yielding a low MCC in Table 4. Our hypothesis is that the problem space might have become too large for the NB specialist models by providing them with a relatively small number of additional cases pertaining to other articles. Investigating this idea is left for future research. Note that the accuracy of this Ensemble is 91.2%, incorrectly suggesting a high performance of this irresponsible system, which further advocates for the use of the MCC as a performance metric.

Table 4 shows that the Improved SVM Ensemble performs better than the initial SVM Ensemble. The confusion matrix of this Improved SVM Ensemble is shown in Table 6. Here, we see that the performance on the non-violation cases is almost identical to that of the initial Ensemble on the same cases, as shown in Table 5. By including the additional cases, our aim was to instruct the specialist models to predict non-violation for cases that did not pertain to their specific article. However, while the Improved Ensemble does perform better, Table 6 shows us that the Improved Ensemble did not improve its performance on the non-violation cases. The difference in performance is therefore due to the Improved Ensemble's predictions on the violation cases. We see that the Improved Ensemble correctly predicts the outcome of 95% of all violation cases. This is 1.1 percentage points higher than the performance of the initial Ensemble.

By including additional cases pertaining to other articles in each of the specialist models, we are able to somewhat improve the Ensemble's performance in most cases. It should be noted here that we only include a small subset of additional cases pertaining to other articles. The number of additional cases equals the number of violation cases of a given article, minus the number of the non-violation cases of that article. This is to ensure a 50% violation rate in the training dataset of each model.
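The disjunctive combination of specialist predictions and the per-class rates quoted for the SVM Ensemble can be sketched as follows. The function names are illustrative and not taken from the authors' repository. The four confusion matrix counts are derived from the totals stated in the text for Table 5 (18,739 'violation' predictions of which 17,601 were correct; 1,761 'non-violation' predictions of which 623 were correct), which is an assumption about how those totals map onto the table.

```python
def ensemble_predict(specialist_predictions):
    """OR-combine specialists: 'violation' if any specialist says so.

    `specialist_predictions` holds one list of booleans per specialist,
    all aligned on the same test cases.
    """
    return [any(votes) for votes in zip(*specialist_predictions)]


def class_rates(tp, fp, tn, fn):
    """Share of violation cases found, and of non-violation cases
    that are wrongly labelled 'violation'."""
    return {
        "violation_recall": tp / (tp + fn),
        "false_violation_rate": fp / (fp + tn),
    }


def additional_cases_needed(n_violation, n_non_violation):
    """Number of 'non-violation' cases from other articles to add so
    that a specialist's training set reaches a 50% violation rate."""
    return max(n_violation - n_non_violation, 0)


# Three hypothetical specialists voting on four test cases.
combined = ensemble_predict([
    [False, False, True, False],
    [False, True, False, False],
    [False, False, False, False],
])

# Counts derived from the totals reported for the SVM Ensemble.
tp, fp = 17601, 18739 - 17601
tn, fn = 623, 1761 - 623
rates = class_rates(tp, fp, tn, fn)
```

Because a single positive vote decides the outcome, every specialist that guesses on an unfamiliar article raises the chance of a false 'violation', which is consistent with the high false-violation rate discussed above.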

The Ensemble could potentially be improved further by oversampling violation cases for each article and including more of these additional cases. This could potentially change the results of the ’Improved’ NB Ensemble as well, as it would then have more cases per article. However, oversampling has downsides as well, such as overfitting.

Future research could investigate this idea.

4.3. Experiment 3: Temporal effects

In the previous experiments, we have split our training and testing sets randomly, or used 10-fold cross validation to train and test our models, just as much previous research has done. By splitting the data randomly, we might select training cases that occurred more recently than our test cases. In other words, we might use future cases to predict cases from the past. When it comes to court case predictions, an argument can be made for selecting the most recent cases as the test set and using the older cases as a training set. This way, we use past cases to forecast the future. Some models in previous research were trained in this way [8], but effects of this design choice have not yet been studied. To evaluate how models perform under these different temporal circumstances, we narrow our scope and focus on article 6 of the ECHR. This article contains the most cases and has therefore been investigated in other work as well [? 11]. In this experiment, we use the facts of the cases from article 6 as the features, and whether or not article 6 was violated as the label. We train models on cases from the past and evaluate how their performance compares to models trained on randomly selected data. We use three types of datasets in this experiment, each containing only cases from article 6:

• test sets consist of cases from a single Test Year.
• the models trained on Past Cases will be trained on all cases that occurred before the Test Year.
• the models trained on Random Cases are trained on randomly selected cases that occurred either before or after the Test Year.

For each year, we therefore generate 2 types of models, one trained on Past Cases (cases from years before the year that we use to test the model) and one trained on Random Cases. Note that we ensure that the size of the Random Cases training set is the same as the size of the Past Cases training set for each respective Test Year. This way, we can disregard discrepancies in the size of the training dataset as a potential cause of differences in performance. All training sets are balanced, such that half of the cases are violation cases and the other half are non-violation cases. We use all four model types (SVM, NB, RF, BERT) and train on the facts of the ECHR-OD dataset. The results can be seen in Table 7.

Discussion. We performed the experiment using all possible Test Years from 1979 until 2022. Across all these Test Years, we see in Table 7 that the MCC of models trained on random cases is generally higher than the MCC of models trained on past cases. This would imply that learning from past cases is more difficult than learning from random cases from both past and future. The exception here is the NB classifier, which performs better when trained on past cases. In Figure 2, we plot the MCC per Test Year for each of the four model types trained on either Past Cases (blue) or Random Cases (orange). Note that the y-axis of each subplot is scaled differently. Here we see that the MCC of models trained on both random cases and past cases fluctuates a lot for earlier Test Years. This could be due to the limited number of available cases in those years that the models are used to train on (see the case distribution per year in Figure 1). The difference between the two (the shaded area in Figure 2) also decreases with more recent Test Years.

If we look only at the recent years (2000-2022) in Table 7, we get a more nuanced comparison. Table 7 still shows higher MCCs for models trained on random cases over models trained on past cases, with the exception of the NB model, but the difference between the two is much lower. The absolute mean MCC is also higher across the years. While the differences between the two approaches may be smaller, they still exist, as seen in Figure 2. Not only is it legally more reasonable to train on past cases to predict future cases, a random split of the data into a train and test set can also have an impact on the performance. If we wish to take the law more seriously in this type of research, to ensure realistic results we should train on past cases and test on future cases.

Time Window. We know that machine learning systems tend to perform better with more data, granted that the data is a proper reflection of the problem space. We also know, however, that the interpretation of the law is subject to change over time and precedent may be

be varied between 5 and 35 years in steps of 5. We test on only a single Test Year of cases. For example, training the model to predict cases from 2022 using a window of 5 years means that the model is trained on the cases from 2016 up to and including 2021.
Because we only overturned. When this happens, the older cases can be have cases from 1978 until 2022, we will test our models overruled by newer cases. Looking too far into the past on cases from 2013 to 2022. This way, we can train our may therefore not be optimal when trying to predict new model using windows of up to 35 years for each given court cases. We therefore also investigate the diference Test Year. between using all cases or only a subset of the more We create a model for each combination of type (SVM, recent cases. NB, RF and BERT), Test Year (from 2013 to 2022), and

We train our models on cases from the past but only on window of training data (5 to 35 in steps of 5). There a limited number of recent years. This window of years are therefore 4 * 10 * 7 = 280 diferent setups in this represents how far we look back into the past, and will experiment. To represent the results, we average the MCC of each of these models across the years, as to show the efects of the window size on the performance Table 8 of each of the model types. We train on the facts of Mean MCC of the window experiment. cases from Article 6 and use the ECHR-OD dataset. The results of this experiment are shown in Table 8, where the best window for each model is shown in bold. The last column in Table 8 reports the standard deviation across the diferent windows.
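The construction of the Experiment 3 training sets (Past Cases, size-matched Random Cases, and Past Cases restricted to a window of recent years) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: cases are represented as hypothetical `(year, violated)` pairs, balancing is assumed to be done by downsampling the larger class, and a window of w years is assumed to cover the years from Test Year - w up to and including Test Year - 1, which is the reading under which a 35-year window for Test Year 2013 starts in 1978. The function names `balanced_sample` and `training_sets` are invented for this sketch, and the data is synthetic.

```python
import random
from itertools import product


def balanced_sample(pool, size, rng):
    """Draw up to `size` cases from `pool`, half violation and half non-violation."""
    violations = [case for case in pool if case[1]]
    non_violations = [case for case in pool if not case[1]]
    per_class = min(size // 2, len(violations), len(non_violations))
    return rng.sample(violations, per_class) + rng.sample(non_violations, per_class)


def training_sets(cases, test_year, rng, window=None):
    """Return (past, random_, test) case lists for one Test Year.

    window=None uses every earlier year; window=w keeps only the years
    test_year - w up to and including test_year - 1 (assumed convention).
    """
    first_year = test_year - window if window is not None else float("-inf")
    past_pool = [c for c in cases if first_year <= c[0] < test_year]
    other_pool = [c for c in cases if c[0] != test_year]
    past = balanced_sample(past_pool, len(past_pool), rng)
    # Random Cases come from before or after the Test Year, matched in size to Past Cases.
    random_ = balanced_sample(other_pool, len(past), rng)
    test = [c for c in cases if c[0] == test_year]
    return past, random_, test


# One setup per model type, Test Year and window size: 4 * 10 * 7 = 280.
MODEL_TYPES = ["SVM", "NB", "RF", "BERT"]
TEST_YEARS = range(2013, 2023)
WINDOWS = range(5, 40, 5)
setups = list(product(MODEL_TYPES, TEST_YEARS, WINDOWS))

# Synthetic stand-in data: 20 (year, violated) pairs per year, 1978 to 2022.
rng = random.Random(0)
cases = [(year, rng.random() < 0.5) for year in range(1978, 2023) for _ in range(20)]

past, random_, test = training_sets(cases, 2022, rng, window=5)
print(len(setups), len(past), len(random_), len(test))
```

Because the Random Cases pool always contains the Past Cases pool, the size-matched random sample can always be drawn, so any remaining performance difference between the two training sets cannot be attributed to training set size.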

Discussion

Based on Table 8 we cannot extrapolate a clear relationship between the number of past years of cases in training and the performance of the models. The NB and BERT models perform best with 35 years' worth of cases. However, there does not seem to be a clear positive relationship between the window size and MCC, as the MCC increases and decreases slightly across the window sizes. A similar observation holds for the SVM and RF models, which do not perform much better or worse with more years' worth of cases from the past. The overall impact of the window is therefore small, as also indicated by the low standard deviations. The BERT model is impacted most by the window size, but the standard deviation in MCC is still only 2.3. In this scenario, using cases from further in the past does therefore not seem to have a significant impact on the performance of the models. This can, of course, be different if the interpretation of the law has changed significantly. If we want to take the law more seriously, we should investigate the temporal effects of the legislation and, if precedent is overturned, adjust and evaluate our training data accordingly.

5. Conclusion

The approaches to court case predictions are diverse and difficult to compare [1]. While some methods yield better results, they may also raise concerns about how reasonably they align with the characteristics of legal decision-making. For a proper analysis of court case prediction research, we should consider the unique characteristics of the law and the effects that it can have on the models.

While justification, explainability and responsibility are major issues in machine learning and law, our scope did not include these aspects and focused instead on design choices.

If we want to take the law more seriously in machine learning research, we should measure the effect of relevant design choices and effects. We therefore propose to use the Matthews Correlation Coefficient rather than the accuracy or F1-score, as the latter two metrics tend to yield inflated results and can incorrectly attribute a much higher performance to a model.

Based on our results of Experiment 1, the facts are the most important part of a case when it comes to court case predictions (see Table 3). Including the procedure of the case can increase performance, but this is dependent on the article and on the model used. For the best results, the parts used should therefore be included in the parameter optimization pipeline.

In Experiment 2, our Generalist model, trained on all articles at once, outperforms our Ensemble of specialist models each trained on a specific article. While the specialist models had a reduced problem space, more data appeared to still be more important. By including some additional cases of other articles in the training phase of each specialist model of the Ensemble, we are able to increase performance for most model types. While we only included a small sample of these additional cases, future research could investigate whether including more additional cases in the training data of specialist models of the Ensemble could increase performance further.

Experiment 3 shows that training on past cases to predict the future is more difficult than training on randomly selected instances from both the past and the future. Taking into account the effects of time may therefore have an effect on performance, especially when you consider much older cases. For that reason, randomly splitting data into a training and test set, or running a k-fold cross validation, might show unrealistic results. In these scenarios, it is more than likely that the model is predicting past cases using future cases, which is impossible in reality and does not account for the temporal aspects of the law.

We show that using only a limited number of years' worth of cases, rather than all cases, does not seem to have an impact on the performance of our models, as shown in Table 8. This suggests that the interpretation of the law, in particular regarding article 6 of the European Court of Human Rights, remained stable enough for machine learning predictions. There are, however, legal considerations that might suggest the removal of certain older cases, especially after certain landmark cases or changes in society. In those cases, we should investigate the effects that this has on the legislation and adjust our training data accordingly.

We have explored legally reasonable design choices and effects in court case predictions, and have shown their impact on performance. We conclude that taking the law more seriously in machine learning research requires that the relevant, unique characteristics of the law are taken into account. Our findings are by no means enough to address inherent limitations (in particular with respect to justification), and future research has to remain critical of the choices that are being made in order to remain legally reasonable.
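The conclusion's recommendation to report the Matthews Correlation Coefficient rather than accuracy or F1-score can be illustrated with a small, self-contained calculation. The sketch below uses a hypothetical, imbalanced test set and a hypothetical classifier that predicts "violation" for almost every case; none of these numbers come from the experiments. It applies the standard definitions of the three metrics to the same confusion counts.

```python
from math import sqrt


def confusion(y_true, y_pred):
    """Count (tp, tn, fp, fn) for binary labels, where 1 = violation."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical test set: 90 violation cases followed by 10 non-violation cases.
y_true = [1] * 90 + [0] * 10
# Hypothetical predictions: 88 of the 90 violations and only 1 of the 10
# non-violations are classified correctly.
y_pred = [1] * 88 + [0] * 2 + [1] * 9 + [0] * 1

counts = confusion(y_true, y_pred)
print(f"accuracy = {accuracy(*counts):.2f}")
print(f"F1       = {f1_score(*counts):.2f}")
print(f"MCC      = {mcc(*counts):.2f}")
```

In this constructed example, accuracy and F1 are both high because the majority class dominates, while the MCC stays close to zero because the classifier barely distinguishes the two classes.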

Acknowledgments

This research was funded by the Hybrid Intelligence Center, a 10-year programme funded by the Dutch Ministry of Education, Culture and Science through the Netherlands Organisation for Scientific Research, https://hybridintelligence-centre.nl.

References

[1] M. Medvedeva, M. Wieling, M. Vols, Rethinking the field of automatic prediction of court decisions, Artificial Intelligence and Law (2022) 1–18.
[2] T. Bench-Capon, The need for good old fashioned AI and Law, International Trends in Legal Informatics: A Festschrift for Erich Schweighofer (2020) 22–36.
[3] C. Steging, S. Renooij, B. Verheij, Discovering the rationale of decisions: towards a method for aligning learning and reasoning, in: J. Maranhão, A. Z. Wyner (Eds.), ICAIL ’21: Eighteenth International Conference for Artificial Intelligence and Law, São Paulo Brazil, June 21 - 25, 2021, ACM, 2021, pp. 235–239.
[4] T. Santosh, S. Xu, O. Ichim, M. Grabmair, Deconfounding legal judgment prediction for European Court of Human Rights cases: Towards better alignment with experts, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 1120–1138. URL: https://aclanthology.org/2022.emnlp-main.74.
[5] I. Chalkidis, A. Jana, D. Hartung, M. Bommarito, I. Androutsopoulos, D. Katz, N. Aletras, LexGLUE: A benchmark dataset for legal language understanding in English, 2021. arXiv:2110.00976.
[6] A. Quemy, R. Wrembel, ECHR-OD: On building an integrated open repository of legal documents for machine learning applications, Information Systems 106 (2022) 101822.
[7] N. Aletras, D. Tsarapatsanis, D. Preoţiuc-Pietro, V. Lampos, Predicting judicial decisions of the European Court of Human Rights: A natural language processing perspective, PeerJ Computer Science 2 (2016) e93.
[8] I. Chalkidis, I. Androutsopoulos, N. Aletras, Neural legal judgment prediction in English, 2019. arXiv:1906.02059.
[9] M. Medvedeva, M. Vols, M. Wieling, Using machine learning to predict decisions of the European Court of Human Rights, Artificial Intelligence and Law 28 (2020) 237–266.
[10] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, ArXiv preprint arXiv:2010.02559 (2020).
[11] J. Mumford, K. Atkinson, T. Bench-Capon, Reasoning with legal cases: A hybrid ADF-ML approach, in: Legal Knowledge and Information Systems, volume 362 of Frontiers in Artificial Intelligence and Applications, IOS Press, 2022, pp. 93–102.
[12] L. Zheng, N. Guha, B. Anderson, P. Henderson, D. Ho, When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings, Association for Computing Machinery, New York, NY, USA, 2021, p. 159–168.
[13] B. Clavié, M. Alphonsus, The unreasonable effectiveness of the baseline: Discussing SVMs in legal text classification, in: S. Erich (Ed.), Legal Knowledge and Information Systems - JURIX 2021: The Thirty-fourth Annual Conference, Vilnius, Lithuania, 8-10 December 2021, volume 346 of Frontiers in Artificial Intelligence and Applications, IOS Press, 2021, pp. 58–61.
[14] T. Pranckevičius, V. Marcinkevičius, Comparison of naive bayes, random forest, decision tree, support vector machines, and logistic regression classifiers for text reviews classification, Baltic Journal of Modern Computing 5 (2017) 221.
[15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[16] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, Huggingface’s transformers: State-of-the-art natural language processing, CoRR abs/1910.03771 (2019).
[17] S. Bird, E. Klein, E. Loper, Natural language processing with Python: analyzing text with the natural language toolkit, O’Reilly Media, Inc., 2009.
[18] B. Matthews, Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochimica et Biophysica Acta (BBA)-Protein Structure 405 (1975) 442–451.
[19] D. Chicco, G. Jurman, The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation, BMC Genomics 21 (2020) 1–13.
[20] M. Medvedeva, A. Üstün, X. Xu, M. Vols, M. Wieling, Automatic judgement forecasting for pending applications of the European Court of Human Rights, in: ASAIL/LegalAIIA@ICAIL, 2021, pp. 12–23.