<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Taking the Law More Seriously by Investigating Design Choices in Machine Learning Prediction Research</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cor Steging</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silja Renooij</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bart Verheij</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bernoulli Institute of Mathematics, Computer Science and Artificial Intelligence</institution>
          ,
          <institution>University of Groningen</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information and Computing Sciences, Utrecht University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Approaches to court case prediction using machine learning differ widely with varying levels of success and legal reasonableness. In part this is due to some aspects of law, such as justification, being inherently difficult for machine learning approaches. Another aspect is the effect of design choices and the extent to which these are legally reasonable, which has not yet been extensively studied. We create four machine learning models tasked with predicting cases from the European Court of Human Rights and we perform experiments in order to measure the role of the following four design choices and effects: the choice of performance metric; the effect of including different parts of the legal case; the effect of a more or less specialized legal focus; and the temporal effects of the available past legal decisions. Through this research, we aim to study design decisions and their limitations and how they affect the performance of machine learning models.</p>
      </abstract>
      <kwd-group>
        <kwd>Court case prediction</kwd>
        <kwd>design choices</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Recently, much work has been done in the field of court case prediction. While automatically determining the outcome of court cases remains an academic exercise, the large variation in the ways that previous research has tackled the problem makes it nearly impossible to compare the approaches [1]. The law has unique characteristics that make it difficult to apply machine learning in the legal domain: machine learning is retrospective, assumes normally distributed, homogeneous data that is largely free of errors, and often cannot explain its decision-making [2]. The law, on the other hand, is prospective, changes over time, contains wrong decisions, and demands arguments for the decisions made. These unique characteristics of the law are not always taken into account. To take the law more seriously, we must consider them when doing machine learning research in the field of AI &amp; Law.</p>
      <p>Some requirements of the law, such as justification, are inherently difficult for machine learning systems, and machine learning systems have been shown to use unsound reasoning [3]. However, despite their importance, our focus in this paper will not be on justification, responsibility or explainability. Moreover, our goal is not to create a machine learning system that obtains a better performance, or has a better alignment with legal experts [4]. Instead, we investigate the effect of specific design choices and effects in machine learning research, in order to better analyze performance and alignment with characteristics of the legal domain.</p>
      <p>We focus on research involving cases from the European Court of Human Rights (ECHR), which has been used as a benchmark in a number of studies. ECHR data is included in the LexGLUE benchmark datasets [5], and forms the basis of the ECHR-OD repository [6]. Previous studies have applied different machine learning systems to this dataset, using various methods and achieving different levels of success [7, 8, 9, 10, 11]. To study the effects of design choices, we train four different types of machine learning models on cases from the ECHR: an SVM, a Naive Bayes (NB) classifier, a Random Forest (RF) and a BERT model. For these four models, we study the choice of performance metrics; the effect of including different parts of the legal case; the effect of a more or less specialized legal focus; and the temporal effects of the available past legal decisions.</p>
      <p>Our first set of experiments focuses on the replication and expansion of results in the literature. We train and test our four models on two different datasets from the ECHR, using various parts of each case as input, and report both the accuracy and the Matthews correlation coefficient (MCC) for each task, model and dataset.</p>
      <p>The ECHR covers a number of separate articles. Earlier work on court case prediction used either single, general models trained on all articles [8, 10, 5], or a separate, specialized classifier for each article [7, 9]. In the second set of experiments, we create both a Generalist model and an Ensemble of specialized models in order to investigate the differences in their performances.</p>
      <p>Proceedings of the Sixth Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2023), June 23, 2023, Braga, Portugal. * Corresponding author. c.c.steging@rug.nl (C. Steging); s.renooij@uu.nl (S. Renooij); bart.verheij@rug.nl (B. Verheij). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p>The third and final set of experiments studies temporal effects. We investigate the effects of training models on cases from the past to predict future cases, compared to models trained on randomly split data. Furthermore, we explore the effects of training on cases from varying time windows for a model that predicts future cases.</p>
      <p>In Section 2, we discuss relevant background information. Section 3 addresses our experimental setup and Section 4 the experiments themselves. We conclude our paper in Section 5.</p>
    </sec>
    <sec>
      <title>2. Background</title>
      <p>The current work focuses on the cases made publicly available by the ECHR, which is an international court that deals with cases claiming violations of articles laid out by the European Convention on Human Rights. A case can pertain to multiple articles of the ECHR and multiple articles can be violated. Each case description can be divided into the following main parts:</p>
      <list list-type="simple">
        <list-item><p>Introduction: general information, such as title, date and details about the section of the Court.</p></list-item>
        <list-item><p>Procedure: the course of action taken from lodging an application until the final Court judgement.</p></list-item>
        <list-item><p>Facts: the circumstances of the case, such as the relevant background information of the applicant and other events and circumstances; and the relevant law from documents other than the ECHR.</p></list-item>
        <list-item><p>Law: the legal arguments of the Court.</p></list-item>
        <list-item><p>Judgement: the Court’s decision.</p></list-item>
        <list-item><p>Dissenting/Concurring opinions: judges’ opinions and why they voted for or against a violation.</p></list-item>
      </list>
      <p>In court case prediction, the case text acts as the features and the judgement as the label. Three variations of the prediction task have been studied:</p>
      <list list-type="bullet">
        <list-item><p>In the Binary classification task (BC), there is one dataset that contains all cases. Models are tasked with predicting whether any article was violated for each case [7, 9, 8, 10].</p></list-item>
        <list-item><p>In the Multi-label classification task (MLC), there is one dataset that contains all cases. Models are tasked with predicting which articles were violated for each case [8, 10].</p></list-item>
        <list-item><p>In the Article classification task (AC), there are multiple datasets, one for each article. Models are tasked with predicting whether a specific article was violated for each case [7, 9].</p></list-item>
      </list>
      <p>The first models applied to the ECHR classification task were Support Vector Machines (SVM) [7, 9]. A later study used BERT, a state-of-the-art pre-trained transformer model. While transformers tend to outperform traditional models, BERT yielded a lower accuracy on the ECHR task [8], because the ECHR cases greatly exceed BERT’s 512 token limit and had to be truncated. Chalkidis et al. therefore also introduced a hierarchical version of BERT (HIER-BERT), where the words of each fact in the case are first converted to a fact embedding using the base BERT model. This version performed significantly better on the binary classification task than their regular BERT model with truncation (F1-scores of 82.0% vs. 17.0%). By pre-training this BERT model on additional legal data, a legal-BERT was developed, specifically suited to legal texts [10] (see also [12]), which performed better on the ECHR task than the HIER-BERT model (F1-scores of 88.3.0% vs. 82.0%). It has been noted, however, that specialized transformers in the legal domain (legal-BERT) provide relatively little improvement over a standard transformer, especially when compared to the difference between regular and specialized BERT models in other fields, such as the biomedical domain [13]. Mumford et al. took a hybrid approach to the court case prediction task, opting to combine HIER-BERT models with Abstract Dialectical Frameworks. While it is difficult to compare the performance of this hybrid model to other research, it did outperform a HIER-BERT model trained on the same subset of ECHR data. Additionally, the hybrid model is more explainable and can provide justifications for its predictions.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Experimental setup</title>
      <p>Here we describe our datasets, machine learning models, preprocessing steps, and performance metrics used. All of the code used to run the experiments can be found in a public repository.<fn><p>https://github.com/CorSteging/InvestigatingDesignChoices</p></fn></p>
      <sec id="sec-2-5">
        <title>3.1. Datasets</title>
        <p>We train machine learning models on cases from the ECHR and use these models to predict new case decisions. We use the dataset from the ECHR Open Data project (ECHR-OD)<fn><p>https://echr-opendata.eu/. Accessed 21 Nov. 2022</p></fn> [6]. This repository contains formatted and standardized data from the ECHR that is automatically updated every month, establishing a public shared baseline for machine learning models. Each case in this dataset contains the text of the case and the outcome, i.e. which articles were considered violated. A single case can violate multiple articles.</p>
        <p>We set up our datasets for the article classification (AC) task and the binary classification (BC) task. For the AC task, there are 9 datasets, one for each article. Each dataset contains all of the cases pertaining to that specific article, and the binary label indicates whether that specific article was violated. For the BC task, there is one dataset that contains all cases. The binary label of these cases indicates whether any article was violated. The number of cases in each dataset can be found in Table 1, alongside the percentage of cases that evaluate to a violation of their respective article. Note again that multiple articles can be considered for a single case. The sum of all datasets for each individual article in Table 1 is therefore greater than the number of cases in the ’All’ dataset. The outcome in most cases is a violation. The label distribution is therefore skewed towards violation.</p>
        <p>To train a model, we balance the dataset used such that half of the cases evaluate to violation and the other half evaluate to non-violation. To balance a dataset, we randomly remove violation cases from the dataset until their number equals the number of non-violation cases.</p>
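        <p>The construction of the AC and BC labels and the balancing step can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the repository: the toy CASES list, its field names and the three helper functions are hypothetical, and the balancing assumes that violation cases form the majority class, as in the ECHR-OD data.</p>
        <preformat><![CDATA[
```python
import random

# Hypothetical toy cases: the articles each case considers and those found violated.
CASES = [
    {"id": 1, "considered": {"3", "6"}, "violated": {"3"}},
    {"id": 2, "considered": {"6"}, "violated": {"6"}},
    {"id": 3, "considered": {"6"}, "violated": set()},
    {"id": 4, "considered": {"3"}, "violated": {"3"}},
    {"id": 5, "considered": {"3", "6"}, "violated": {"3", "6"}},
    {"id": 6, "considered": {"6"}, "violated": set()},
]


def article_dataset(cases, article):
    """AC task: all cases that consider `article`, labelled 1 if it was violated."""
    return [(c["id"], int(article in c["violated"]))
            for c in cases if article in c["considered"]]


def binary_dataset(cases):
    """BC task: every case, labelled 1 if any article was violated."""
    return [(c["id"], int(bool(c["violated"]))) for c in cases]


def balance(dataset, seed=0):
    """Randomly drop violation cases until both classes are equally large.

    Assumes violations are the majority class; otherwise nothing is dropped.
    """
    rng = random.Random(seed)
    violations = [row for row in dataset if row[1] == 1]
    non_violations = [row for row in dataset if row[1] == 0]
    kept = rng.sample(violations, min(len(violations), len(non_violations)))
    balanced = kept + non_violations
    rng.shuffle(balanced)
    return balanced


bc_balanced = balance(binary_dataset(CASES))
```
]]></preformat>
        <p>In this sketch a case that considers several articles appears in several article datasets, which is why the per-article datasets together can contain more rows than the single BC dataset.</p>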
        <p>The version of the dataset that we use from the ECHR-OD contains 14910 cases from 1968 up to and including 2022. The distribution of the cases across the years is skewed heavily towards the more recent years, however. This is clearly visible in Figure 1, where we plot the number of cases per year. Since some of the earlier years do not contain any cases, we only include cases from 1978 until 2022 in our experiments.</p>
      </sec>
      <sec id="sec-2-6">
        <title>3.2. Models</title>
        <p>In our study we use four different types of models: an SVM, a Naive Bayes (NB) classifier, a Random Forest (RF) classifier and a BERT model. These are all commonly used models known for their effectiveness in text classification tasks [14]. For the SVM, we use the exact same parameters as reported in [9]. The parameters of the other models are tuned using a grid search for each experiment, where we validate the performance on an unseen part of the training set. For the SVM, NB and RF models, we use the scikit-learn library [15]. We use the BERT transformer from the open-source Hugging Face library [16] and limit the number of tokens to 512 using the default Tokenizer from that library.</p>
      </sec>
      <sec>
        <title>3.3. Preprocessing</title>
        <p>We train and test our four models on the ECHR-OD dataset of case texts, preprocessed to remove unnecessary information from the text and to reduce the token size for the BERT model, by applying the following heuristics:</p>
        <list list-type="bullet">
          <list-item><p>Change all characters to lowercase</p></list-item>
          <list-item><p>Remove all punctuation except for ’?’</p></list-item>
          <list-item><p>Remove special characters, newlines and trailing white spaces</p></list-item>
          <list-item><p>Change n’t to not ("don’t" becomes "do not")</p></list-item>
          <list-item><p>Change all fact numbers to ’&gt;’</p></list-item>
          <list-item><p>Remove stop words using nltk [17]</p></list-item>
          <list-item><p>Remove unnecessary words that occur in every case (such as subheadings)</p></list-item>
        </list>
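        <p>A rough sketch of how such heuristics could be chained is given below. It uses only the Python standard library and rests on several assumptions that are not taken from the paper: the small STOP_WORDS set stands in for the nltk stop-word list, fact numbers are assumed to look like "12." at the start of a line, and the removal of words that occur in every case is left out.</p>
        <preformat><![CDATA[
```python
import re

# Assumption: a tiny illustrative stop-word set; the paper uses the nltk list.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "was"}


def preprocess(text):
    """Apply the text-cleaning heuristics in a fixed order."""
    text = text.lower()
    # Expand the contraction: "don't" becomes "do not".
    text = re.sub(r"n['’]t\b", " not", text)
    # Assumed fact-number format: digits plus a period at the start of a line.
    text = re.sub(r"^\s*\d+\.\s+", "> ", text, flags=re.MULTILINE)
    # Drop punctuation and special characters, keeping '?' and the '>' marker.
    text = re.sub(r"[^a-z0-9?>\s]", " ", text)
    # Splitting on whitespace also removes newlines and trailing spaces.
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)


cleaned = preprocess("12. The applicant didn't appear.\nWhy?")
```
]]></preformat>
        <p>The order of the steps matters in this sketch: the contraction and fact-number rules run before punctuation is stripped, because both rely on characters that the later step removes.</p>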
        <p>For the SVM, NB and RF models, the texts are then converted to n-grams and normalized using TF-IDF. The parameters for preprocessing and TF-IDF are fine-tuned using a grid search.</p>
      </sec>
      <sec>
        <title>3.4. Performance Metrics</title>
        <p>Most models in the literature report the classification accuracy or F1-scores of their model. While this was common practice in machine learning, more recent studies have steered away from using accuracy in favor of the Matthews correlation coefficient (MCC) [18], which ranges from −1 (worst) to 1 (best). Contemporary measures like accuracy or even F1-scores have been shown to yield inflated results on binary classification tasks [19], especially on imbalanced datasets such as the ECHR cases.</p>
        <p>The MCC, on the other hand, is only high if all four confusion matrix categories are accurate: high true positives and negatives, and low false positives and negatives. For example, a model that always predicts ‘violation’ will score an accuracy of 81.55% on the entire ECHR dataset, since 81.55% of the cases have a violation label. From a legal point of view, this is an extremely irresponsible and poorly designed model. However, the accuracy is high and the F1-score of this model would even be 89.83%, beating the state of the art. The MCC of such a model, however, will be 0, indicating that its predictive power is equal to random guessing. While a macro-averaged F1-score can be used for unbalanced data, it is known to be biased and does not take true negative predictions into account [19]. To take the law more seriously, we therefore choose to use the MCC to evaluate our models, even though we work with balanced datasets. We in general advocate the use of the MCC for binary classification as a best practice. We report the accuracy of our models when comparing their results with the results from the literature. In the rest of our study, we will report MCC values only.</p>
      </sec>
    </sec>
    <sec>
      <title>4. Experiments</title>
      <p>We now discuss our three sets of ECHR-OD experiments.</p>
      <sec id="sec-2-7">
        <title>4.1. Experiment 1: Extended replication</title>
        <p>To evaluate the performance of each model, we apply a 10-fold cross validation to the model for each article (the AC task), and for all articles at once (the BC task). We balance each of the datasets such that exactly half of the cases evaluate to ’violation’. We compare the performance on the ECHR-OD dataset to that of models from the literature. For comparison, we also train and test our models on the subset of data used by Medvedeva et al. (2020; see [9] for details), which contains only 3133 cases from 1968 to 2017.</p>
        <p>Performance Results. Table 2 shows four sets of rows containing performances of models on the AC task and the BC task (All). In the first set of rows, we show results from previous research. Note that results marked with an asterisk (*) are F1-scores rather than accuracies. In the second set of rows, we show the accuracies of the 10-fold cross validation of our SVM, NB, RF and BERT models using the data from [9]. The third set of rows lists the accuracies of our four models using the ECHR-OD data, and the last set of rows contains the MCC values of our four models on the ECHR-OD dataset. Note that we decided to round our result values to one decimal, just as in the more recent studies. The first two studies only reported accuracies without decimals. For this general performance, we used the same parts of the cases in the training data as Medvedeva et al. (2020). These are the procedure, the facts or both, depending on the article.</p>
        <p>Discussion. When we compare the accuracies of our models, trained and tested on the Medvedeva et al. (2020) dataset, we can see that our models perform similarly to those in the literature on the AC task. Note that the accuracies of the SVM by Aletras et al. (2016) were obtained by training the SVM on parts of the case that were not available before the judgement was made, and this work should therefore be classified as outcome identification, rather than outcome prediction [1]. Performances on the BC task are lower than the ones achieved by the SVM, HIER-BERT and Legal-BERT, but higher than the one achieved by the regular BERT model. However, it is difficult to compare exact performances without using the exact same datasets. For example, BERT and HIER-BERT were trained on 7100 cases and their F1-score was calculated on predictions on unbalanced test sets (66% violation) [8], while we trained on 3133 cases and used balanced test sets. Across the four models that we use, there is no clear best model, and performance is dependent on the dataset, task and article.</p>
        <p>Accuracies are generally slightly lower when training on the larger and more recent ECHR-OD dataset. We also see that the MCC ranks the performance of the models differently than accuracy does (except for Articles 3 and 5). The MCC more accurately depicts the actual performance of the models as it accounts for the rate of true positives, false positives, true negatives and false negatives, which leads to a more reasonable evaluation. The last two sets of rows in Table 2 therefore show that accuracies or F1-scores can show inflated results and can incorrectly suggest a high performance. Therefore, if we wish to take AI seriously in AI &amp; Law, we should use more reliable metrics.</p>
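        <p>The difference between accuracy, F1-score and MCC can be checked for the constant ‘violation’ predictor mentioned in Section 3.4. The sketch below is only an illustration: the test set of 10,000 cases is a hypothetical size chosen to match the 81.55% violation rate, and setting the MCC to 0 when its denominator vanishes is a common convention rather than something the formula itself defines.</p>
        <preformat><![CDATA[
```python
import math


def metrics(tp, fp, tn, fn):
    """Return (accuracy, F1, MCC) from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: the MCC is taken to be 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, f1, mcc


# Hypothetical test set: 8,155 violation and 1,845 non-violation cases,
# with a model that predicts 'violation' for every case.
accuracy, f1, mcc = metrics(tp=8155, fp=1845, tn=0, fn=0)
```
]]></preformat>
        <p>With these counts the accuracy equals the violation rate and the F1-score is close to 0.9, while the MCC is 0 because the model never produces a true negative.</p>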
        <sec id="sec-2-7-1">
          <title>Investigating what Parts to use The cases from the</title>
          <p>ECHR consist of 6 diferent parts, including the
judgement. Previous studies have used diferent parts of the
cases to train their models, with mixed results. As
discussed in the background section, only the introduction,
procedure and facts are known before a case is tried.</p>
          <p>Some of the earlier literature has used parts that were
made available after the case has been judged, however.</p>
          <p>If our goal is to take the law seriously, we should only
include parts that are available before the judgement
was made. We therefore do not use the law section, the
judgement and the dissenting and concurring opinions.</p>
          <p>Additionally, the introduction, which contains only
general information about a case will also not be used, as
this should not have any predictive value. We therefore
focus on the procedure, the facts, and a combination of
both to look into which of these yields the best results.
We should note that all cases in the ECHR dataset were information regarding the circumstances, background,
published after the cases were tried;their texts can there- applicant and relevant law from other documents. We
fore potentially contain implicit or explicit information see that adding the procedure can improve performance,
that was not available before the case was tried, even in but this is dependent on the combination of article and
the introduction, procedure and facts sections [20]. model. This supports the method used in [9], where</p>
          <p>To investigate which parts of the case are useful in diferent parts are used for each article. We also show
court case prediction, we train each of the four model that the performance is dependent on the combination
types (SVM, NB, RF, BERT) on the facts, procedure and of the parts used and the model used, and we base our
both the facts and the procedure. We use the ECHR-OD conclusions on the MCC rather than the accuracy.
dataset. We report the average MCC across a 10 fold cross
validation for each classifier, trained on every individual 4.2. Experiment 2: Specialist vs.
article and all articles at once. This experiment expands
upon the research done by Medvedeva et al. (2020) by Generalist models
exhaustively reporting the performances, in terms of In previous research, models were either trained on each
MCC instead of accuracy, of each combination of our four individual article (AC task) or on all articles at once (BC
models trained on all possible parts. This comparison task); in the latter case, the model is tasked with
precan be seen in Table 3, where the best results for each dicting whether there has been any violation, regardless
classifier on a given article is shown in bold. of what article was violated. The performance of our
own models and models from the literature on this task
Discussion There is quite some variation between the can be seen in the rightmost column of Table 2. This
MCC of models using diferent parts in Table 3. Determin- approach can be compared to a human legal generalist,
ing which part to use is therefore important to obtain the who has knowledge of all articles. Instead of just a single
highest possible performance. The facts and the combi- generalist, however, one could opt to use a team of legal
nation of facts and procedure yield the best results across specialists, where each person of the team is specialized
the combinations of part, model and article. The proce- in a diferent article. In this experiment, we examine
dure alone ranks the worst. This means that the facts are these two diferent approaches to the BC task.
an essential part when doing court case predictions. This The first approach is to create a single Generalist model
is unsurprising, as this part contains all of the relevant that is trained on all cases of the ECHR. In the second</p>
        </sec>
      </sec>
      <sec id="sec-2-8">
        <title>Discussion From Table 4 we see that for most Gener</title>
        <p>alist models the MCC is much higher than that of the
We generate the 90% - 10% split in training and test data Ensemble. The Generalist models therefore outperform
randomly, preserving the 50% balance in classes. We use the Ensembles for most types of classifiers. The
excepthese three datasets to train and test a Generalist model tion is the NB classifier, where there is little diference
and an Ensemble. The Generalist model is trained on in MCC. This suggests that, in predicting ECHR court
the Generalist Training Set and evaluated using the Test- cases, a larger problem space combined with more
training Set. The Ensemble consists of nine specialist models, ing data results in better performance than a reduced
approach, we train an Ensemble of models, wherein each
model is specialized in a diferent article of the ECHR,
akin to the team of legal specialists. Each model of this
Ensemble is trained on identifying violations for just a
single article, thus reducing the problem space and
potentially increasing performance. Additionally, such an
Ensemble would be able to tell what article was violated,
thus providing explanations for its decisions. Each model
of the Ensemble would, however, have less data than the
Generalist model, which might decrease performance.</p>
        <p>We perform an experiment to determine which approach
yields the best performing model. We create three types
of datasets:
• nine Ensemble Training Sets, one for each
article, containing 90% of all of the cases that consider
that specific article; the features are the facts of
the case, and the label is whether or not there is
a violation of the respective article in the case.
• the Generalist Training Set contains all of
the cases from all nine Ensemble Training Sets;
the features are again the facts, and the label is
whether or not any article was violated in a case.
• the Testing Set contains the 10% of cases not
used in the Ensemble Training Set and Generalist
Training Set; features and label are the same as
in the latter.
each trained on a diferent Ensemble Training Set. Each
of these specialist models will be tasked with predicting
the labels of the cases from the Test Set. The predictions
of each specialist model will be combined in a disjunctive
manner to form the final prediction of the Ensemble. In
other words, the output will be violation if any specialist
model predicts violation, and non-violation otherwise.</p>
        <p>We compare the performance of the Generalist model
to that of the Ensemble. The experiment is performed
for every one of our four model types: the SVM, NB, RF
and BERT. We also repeat every experiment 10 times for
each type of model, using diferent cases for the training
and testing sets in each iteration. We report the average
MCC in Table 4. The best results are shown in bold.
that specific article. The features in these datasets are the
facts of the case, and the labels are whether or not there
is a violation of the respective article in the case.
Additionally, we add cases from other articles to this dataset,
where each additional case has the ’non-violation’
label. Since almost all articles contain more violation than
non-violation cases (see Table 1), we add these additional
cases to each Improved Ensemble Training Set until their
number of violation and non-violation cases is equal. The
Improved Ensemble is set up in the same way as the
earlier Ensemble, but each specialist model of this
Improved Ensemble is trained on the Improved Ensemble
Training Sets. The results of the Improved Ensemble are
shown in the rightmost column of Table 4.</p>
      </sec>
      <sec id="sec-2-9">
        <title>Discussion The Improved Ensemble performs better</title>
        <p>than the initial Ensemble when using the SVM, RF and
BERT models. This supports our idea that the specialist
problem space with less data. models generally perform better when including
addi</p>
        <p>Each of the specialist models in an Ensemble is trained tional cases from other articles with a ’non-violation’
on cases that pertain to a single article. The Test Set used label. This informs the model to predict ’non-violation’
for these specialist models, however, considers all articles, for cases pertaining to other articles.
most of which the individual specialist models will not The NB models, however, seem to perform worse in
have seen during training. Ideally, if a specialist models the ’Improved Ensemble’ scenario. Analysing the data
is presented with a case that considers an article that it shows that the specialist models in the ’Improved’ NB
is not trained for, it should predict ’non-violation’. How- Ensemble now predict ’violation’ in 99.5% of all test cases,
ever, it is not explicitly trained to give that prediction thus yielding a low MCC in Table 4. Our hypothesis is
and, as a result, provides a random prediction. Given that that the problem space might have become too large
the final prediction of the Ensembles is an OR-function, for the NB specialist models by providing them with a
this leads to many incorrect ’violation’ predictions. This relatively small number of additional cases pertaining
can be seen in Table 5, which displays the confusion ma- to other articles. Investigating this idea is left for future
trix of all of the predictions done by the SVM Ensemble. research. Note that the accuracy of this Ensemble is
The Ensemble predicted ’violation’ in a total of 18,739 91.2%, incorrectly suggesting a high performance of this
cases, out of which 17,601 were correct and it predicted irresponsible system, which further advocates for the use
’non-violation’ in a total of 1,761 out of which only 623 of the MCC as a performance metric.
were correct. This initially might seem like a great perfor- Table 4 shows that the Improved SVM Ensemble
performance, and would lead to an accuracy of 89.9%. However, the Test Set is heavily skewed
towards violation (91.14% of all cases). The performance on violation cases is therefore
relatively good, accurately predicting the outcome of 93.9% of all violation cases. However,
the model also incorrectly assigns violation to 64.6% of all non-violation cases.</p>
        <p>Improved Ensemble. A potential solution to this issue is to present the specialist models
with cases that do not pertain to the article that they are focused on during training. For
example, a specialist model trained on article 6 cases could also explicitly be trained to
predict ’non-violation’ for all cases that do not pertain to article 6. To create this new
Improved Ensemble, we alter the training datasets as follows. We create nine Improved Ensemble
Training Sets, one for each article. Just like the earlier Ensemble Training Set, each Improved
Ensemble Training Set contains 90% of all of the cases that consider</p>
        <p>performs better than the initial SVM Ensemble. The confusion matrix of this Improved SVM
Ensemble is shown in Table 6. Here, we see that the performance on the non-violation cases is
almost identical to that of the initial Ensemble on the same cases, as shown in Table 5. By
including the additional cases, our aim was to instruct the specialist models to predict
non-violation for cases that did not pertain to their specific article. However, while the
Improved Ensemble does perform better, Table 6 shows us that the Improved Ensemble did not
improve its performance on the non-violation cases. The difference in performance is therefore
due to the Improved Ensemble’s predictions on the violation cases. We see that the Improved
Ensemble correctly predicts the outcome of 95% of all violation cases. This is 1.1 percentage
points higher than the performance of the initial Ensemble.</p>
        <p>By including additional cases pertaining to other articles in each of the specialist
models, we are able to somewhat improve the Ensemble’s performance in most cases. It should be
noted here that we only include a small subset of additional cases pertaining to other articles.
The number of additional cases equals the number of violation cases of a given article, minus
the number of non-violation cases of that article. This is to ensure a 50% violation rate in
the training dataset of each model.</p>
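        <p>The balancing rule described above can be sketched as follows. This is a minimal
illustration of the arithmetic only: the function names and the example counts are hypothetical
and are not taken from the experiments, and the additional cases are assumed to be labelled
non-violation, as described for the Improved Ensemble.</p>
        <preformat><![CDATA[
```python
def num_additional_cases(n_violation: int, n_non_violation: int) -> int:
    """Number of other-article cases (labelled non-violation) to add.

    Follows the rule in the text: violation cases minus non-violation cases.
    Assumption: if non-violation cases already dominate, nothing is added.
    """
    return max(n_violation - n_non_violation, 0)


def violation_rate(n_violation: int, n_non_violation: int) -> float:
    """Violation rate of a specialist training set after adding the extra cases."""
    extra = num_additional_cases(n_violation, n_non_violation)
    total = n_violation + n_non_violation + extra
    return n_violation / total if total else 0.0


# Hypothetical counts for a single article (not taken from the paper).
print(num_additional_cases(800, 200))
print(violation_rate(800, 200))
```
]]></preformat>
        <p>With 800 violation and 200 non-violation cases, 600 additional cases are added, so that
violation cases make up half of the resulting training set.</p>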
        <p>The Ensemble could potentially be improved further by
oversampling violation cases for each article and
including more of these additional cases. This could potentially
change the results of the ’Improved’ NB Ensemble as well,
as it would then have more cases per article. However,
oversampling has downsides as well, such as overfitting.</p>
        <p>Future research could investigate this idea.</p>
        <p>4.3. Experiment 3: Temporal effects</p>
        <p>In the previous experiments, we have split our training and testing sets randomly, or
used 10-fold cross validation to train and test our models, just as much previous research has
done. By splitting the data randomly, we might select training cases that occurred more recently
than our test cases. In other words, we might use future cases to predict cases from the past.
When it comes to court case predictions, an argument can be made for selecting the most recent
cases as the test set and using the older cases as a training set. This way, we use past cases
to forecast the future. Some models in previous research were trained in this way [8], but the
effects of this design choice have not yet been studied. To evaluate how models perform under
these different temporal circumstances, we narrow our scope and focus on article 6 of the ECHR.
This article contains the most cases and has therefore been investigated in other work as well
[? 11]. In this experiment, we use the facts of the cases from article 6 as the features, and
whether or not article 6 was violated as the label. We train models on cases from the past and
evaluate how their performance compares to models trained on randomly selected data. We use
three types of datasets in this experiment, each containing only cases from article 6:</p>
        <list list-type="bullet">
          <list-item><p>test sets consist of cases from a single Test Year.</p></list-item>
          <list-item><p>the models trained on Past Cases will be trained on all cases that occurred before the Test Year.</p></list-item>
          <list-item><p>the models trained on Random Cases are trained on randomly selected cases that occurred either before or after the Test Year.</p></list-item>
        </list>
        <p>For each year, we therefore generate two types of models, one trained on Past Cases
(cases from years before the year that we use to test the model) and one trained on Random
Cases. Note that we ensure that the size of the Random Cases training set is the same as the
size of the Past Cases training set for each respective Test Year. This way, we can disregard
discrepancies in the size of the training dataset as a potential cause of differences in
performance. All training sets are balanced, such that half of the cases are violation cases
and the other half are non-violation cases. We use all four model types (SVM, NB, RF, BERT) and
train on the facts of the ECHR-OD dataset. The results can be seen in Table 7.</p>
        <p>Discussion. We performed the experiment using all possible Test Years from 1979 until
2022. Across all these Test Years, we see in Table 7 that the MCC of models trained on random
cases is generally higher than the MCC of models trained on past cases. This would imply that
learning from past cases is more difficult than learning from random cases from both past and
future. The exception here is the NB classifier, which performs better when trained on past
cases. In Figure 2, we plot the MCC per Test Year for each of the four model types trained on
either Past Cases (blue) or Random Cases (orange). Note that the y-axis of each subplot is
scaled differently. Here we see that the MCC of models trained on both random cases and past
cases fluctuates a lot for earlier Test Years. This could be due to the limited number of cases
available for training in those years (see the case distribution per year in Figure 1). The
difference between the two (the shaded area in Figure 2) also decreases with more recent Test
Years.</p>
        <p>If we look only at the recent years (2000-2022) in Table 7, we get a more nuanced
comparison. Table 7 still shows higher MCCs for models trained on random cases over models
trained on past cases, with the exception of the NB model, but the difference between the two
is much lower. The absolute mean MCC is also higher across the years. While the differences
between the two approaches may be smaller, they still exist, as seen in Figure 2. Not only is
it legally more reasonable to train on past cases to predict future cases, a random split of
the data into a train and test set can also have an impact on the performance. If we wish to
take the law more seriously in this type of research, to ensure realistic results we should
train on past cases and test on future cases.
</p>
        <p>Time Window. We know that machine learning systems tend to perform better with more
data, granted that the data is a proper reflection of the problem space. We also know, however,
that the interpretation of the law is subject to change over time and precedent may be
overturned. When this happens, the older cases can be overruled by newer cases. Looking too far
into the past may therefore not be optimal when trying to predict new court cases. We therefore
also investigate the difference between using all cases or only a subset of the more recent
cases.</p>
        <p>We train our models on cases from the past but only on a limited number of recent years.
This window of years represents how far we look back into the past, and will be varied between
5 and 35 years in steps of 5. We test on only a single Test Year of cases. For example, training
the model to predict cases from 2022 using a window of 5 years means that the model is trained
on the cases from 2016 up to and including 2021. Because we only have cases from 1978 until
2022, we will test our models on cases from 2013 to 2022. This way, we can train our model using
windows of up to 35 years for each given Test Year.</p>
        <p>We create a model for each combination of type (SVM, NB, RF and BERT), Test Year (from
2013 to 2022), and window of training data (5 to 35 in steps of 5). There are therefore
4 × 10 × 7 = 280 different setups in this experiment. To represent the results, we average the
MCC of each of these models across the years, so as to show the effects of the window size on
the performance of each of the model types. We train on the facts of cases from Article 6 and
use the ECHR-OD dataset. The results of this experiment are shown in Table 8, where the best
window for each model is shown in bold. The last column in Table 8 reports the standard
deviation across the different windows.</p>
        <p>Table 8: Mean MCC of the window experiment.</p>
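        <p>The training sets used in Experiment 3 can be sketched as follows. This is a minimal
illustration, assuming that each case is a (year, facts, label) tuple; the function names, the
fixed random seed and the toy data are hypothetical, and the class balancing described in the
text is omitted for brevity. The lower bound of the window mirrors the worked example in the
text (a window of 5 for Test Year 2022 spans 2016 up to and including 2021); whether the
earliest year is meant to be inclusive is an assumption.</p>
        <preformat><![CDATA[
```python
import random
from typing import List, Tuple

Case = Tuple[int, str, int]  # (year, facts, label); label 1 = violation, 0 = non-violation


def past_cases(cases: List[Case], test_year: int) -> List[Case]:
    """All cases decided before the Test Year."""
    return [c for c in cases if c[0] < test_year]


def random_cases(cases: List[Case], test_year: int, size: int, seed: int = 0) -> List[Case]:
    """Randomly selected cases from before or after the Test Year.

    `size` is set to the size of the Past Cases set so that both training sets match.
    """
    pool = [c for c in cases if c[0] != test_year]
    return random.Random(seed).sample(pool, min(size, len(pool)))


def window_cases(cases: List[Case], test_year: int, window: int) -> List[Case]:
    """Past cases restricted to a window of recent years.

    The lower bound mirrors the example in the text: window=5 and
    test_year=2022 gives 2016 up to and including 2021 (an assumption).
    """
    first_year = test_year - window - 1
    return [c for c in cases if first_year <= c[0] < test_year]


# Toy data: one hypothetical case per year from 1978 to 2022.
cases = [(year, f"facts of a case from {year}", year % 2) for year in range(1978, 2023)]

past = past_cases(cases, 2015)
rand = random_cases(cases, 2015, size=len(past))
recent = window_cases(cases, 2022, 5)

# Number of setups in the window experiment: 4 model types, 10 Test Years, 7 windows.
n_setups = len(["SVM", "NB", "RF", "BERT"]) * len(range(2013, 2023)) * len(range(5, 36, 5))
print(len(past), len(rand), [c[0] for c in recent], n_setups)
```
]]></preformat>
        <p>Matching the size of the Random Cases set to the Past Cases set keeps the amount of
training data equal, so that any remaining difference in MCC can be attributed to the temporal
selection rather than to the number of cases.</p>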
        <p>Discussion. Based on Table 8 we cannot infer a clear relationship between the number of
past years of cases in training and the performance of the models. The NB and BERT models
perform best with 35 years’ worth of cases. However, there does not seem to be a clear positive
relationship between the window size and MCC, as the MCC increases and decreases slightly across
the window sizes. A similar observation holds for the SVM and RF models, which do not perform
much better or worse with more years’ worth of cases from the past. The overall impact of the
window is therefore small, as also indicated by the low standard deviations. The BERT model is
impacted most by the window size, but the standard deviation in MCC is still only 2.3. In this
scenario, using cases from further in the past does therefore not seem to have a significant
impact on the performance of the models. This can, of course, be different if the interpretation
of the law has changed significantly. If we want to take the law more seriously, we should
investigate the temporal effects of the legislation and, if precedent is overturned, adjust and
evaluate our training data accordingly.</p>
        <p>5. Conclusion</p>
        <p>The approaches to court case predictions are diverse and difficult to compare [1]. While
some methods yield better results, they may also raise concerns about how reasonably they align
with the characteristics of legal decision-making. For a proper analysis of court case
prediction research, we should consider the unique characteristics of the law and the effects
that they can have on the models.</p>
        <p>While justification, explainability and responsibility are major issues in machine
learning and law, our scope did not include these aspects and focused instead on design
choices.</p>
        <p>If we want to take the law more seriously in machine learning research, we should measure
the effect of relevant design choices and effects. We therefore propose to use the Matthews
Correlation Coefficient rather than the accuracy or F1-score, as the latter two metrics tend to
yield inflated results and can incorrectly attribute a much higher performance to a model.</p>
        <p>Based on our results of Experiment 1, the facts are the most important part of a case
when it comes to court case predictions (see Table 3). Including the procedure of the case can
increase performance, but this is dependent on the article and on the model used. For the best
results, the parts used should therefore be included in the parameter optimization
pipeline.</p>
        <p>In Experiment 2, our Generalist model, trained on all articles at once, outperforms our
Ensemble of specialist models each trained on a specific article. While the specialist models
had a reduced problem space, more data appeared to still be more important. By including some
additional cases of other articles in the training phase of each specialist model of the
Ensemble, we are able to increase performance for most model types. While we only included a
small sample of these additional cases, future research could investigate whether including
more additional cases in the training data of specialist models of the Ensemble could increase
performance further.</p>
        <p>Experiment 3 shows that training on past cases to predict the future is more difficult
than training on randomly selected instances from both the past and the future. Taking into
account the effects of time may therefore have an effect on performance, especially when you
consider much older cases. For that reason, randomly splitting data into a training and test
set, or running a k-fold cross validation, might show unrealistic results. In these scenarios,
it is more than likely that the model is predicting past cases using future cases, which is
impossible in reality and does not account for the temporal aspects of the law.</p>
        <p>We show that using only a limited number of years’ worth of cases, rather than all cases,
does not seem to have an impact on the performance of our models, as shown in Table 8. This
suggests that the interpretation of the law, in particular regarding article 6 of the European
Court of Human Rights, remained stable enough for machine learning predictions. There are,
however, legal considerations that might suggest the removal of certain older cases, especially
after certain landmark cases or changes in society. In those cases, we should investigate the
effects that this has on the legislation and adjust our training data accordingly.</p>
        <p>We have explored legally reasonable design choices and effects in court case predictions,
and have shown their impact on performance. We conclude that taking the law more seriously in
machine learning research requires that the relevant, unique characteristics of the law are
taken into account. Our findings are by no means enough to address inherent limitations (in
particular with respect to justification), and future research has to remain critical of the
choices that are being made in order to remain legally reasonable.</p>
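        <p>To illustrate why the Matthews Correlation Coefficient is proposed in this work over
accuracy and F1-score, the following minimal sketch computes all three metrics from a single
confusion matrix. The counts are hypothetical and only loosely mimic a test set that is heavily
skewed towards violation; they are not results from the experiments.</p>
        <preformat><![CDATA[
```python
from math import sqrt


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all cases that were predicted correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive (violation) class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews Correlation Coefficient; 0.0 when a row or column of the matrix is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical skewed test set: 911 violation and 89 non-violation cases.
tp, fn = 855, 56  # violation cases predicted correctly / incorrectly
fp, tn = 57, 32   # non-violation cases predicted incorrectly / correctly

print(accuracy(tp, fp, fn, tn))
print(f1_score(tp, fp, fn))
print(mcc(tp, fp, fn, tn))
```
]]></preformat>
        <p>With these hypothetical counts, accuracy and F1-score both exceed 0.88 even though most
non-violation cases are misclassified, whereas the MCC is roughly 0.3.</p>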
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <p>This research was funded by the Hybrid Intelligence Center, a 10-year programme funded
by the Dutch Ministry of Education, Culture and Science through the Netherlands Organisation
for Scientific Research, https://hybridintelligence-centre.nl.</p>
      </sec>
      <sec id="sec-3-2">
        <title>References</title>
        <p>[1] M. Medvedeva, M. Wieling, M. Vols, Rethinking the field of automatic prediction of court decisions, Artificial Intelligence and Law (2022) 1–18.</p>
        <p>[2] T. Bench-Capon, The need for good old fashioned AI and Law, International Trends in Legal Informatics: A Festschrift for Erich Schweighofer (2020) 22–36.</p>
        <p>[3] C. Steging, S. Renooij, B. Verheij, Discovering the rationale of decisions: towards a method for aligning learning and reasoning, in: J. Maranhão, A. Z. Wyner (Eds.), ICAIL ’21: Eighteenth International Conference for Artificial Intelligence and Law, São Paulo Brazil, June 21 - 25, 2021, ACM, 2021, pp. 235–239.</p>
        <p>[4] T. Santosh, S. Xu, O. Ichim, M. Grabmair, Deconfounding legal judgment prediction for European Court of Human Rights cases: Towards better alignment with experts, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 1120–1138. URL: https://aclanthology.org/2022.emnlp-main.74.</p>
        <p>[5] I. Chalkidis, A. Jana, D. Hartung, M. Bommarito, I. Androutsopoulos, D. Katz, N. Aletras, LexGLUE: A benchmark dataset for legal language understanding in English, 2021. arXiv:2110.00976.</p>
        <p>[6] A. Quemy, R. Wrembel, ECHR-OD: On building an integrated open repository of legal documents for machine learning applications, Information Systems 106 (2022) 101822.</p>
        <p>[7] N. Aletras, D. Tsarapatsanis, D. Preoţiuc-Pietro, V. Lampos, Predicting judicial decisions of the European Court of Human Rights: A natural language processing perspective, PeerJ Computer Science 2 (2016) e93.</p>
        <p>[8] I. Chalkidis, I. Androutsopoulos, N. Aletras, Neural legal judgment prediction in English, 2019. arXiv:1906.02059.</p>
        <p>[9] M. Medvedeva, M. Vols, M. Wieling, Using machine learning to predict decisions of the European Court of Human Rights, Artificial Intelligence and Law 28 (2020) 237–266.</p>
        <p>[10] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, ArXiv preprint arXiv:2010.02559 (2020).</p>
        <p>[11] J. Mumford, K. Atkinson, T. Bench-Capon, Reasoning with legal cases: A hybrid ADF-ML approach, in: Legal Knowledge and Information Systems, volume 362 of Frontiers in Artificial Intelligence and Applications, IOS Press, 2022, pp. 93–102.</p>
        <p>[12] L. Zheng, N. Guha, B. Anderson, P. Henderson, D. Ho, When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings, Association for Computing Machinery, New York, NY, USA, 2021, pp. 159–168.</p>
        <p>[13] B. Clavié, M. Alphonsus, The unreasonable effectiveness of the baseline: Discussing SVMs in legal text classification, in: S. Erich (Ed.), Legal Knowledge and Information Systems - JURIX 2021: The Thirty-fourth Annual Conference, Vilnius, Lithuania, 8-10 December 2021, volume 346 of Frontiers in Artificial Intelligence and Applications, IOS Press, 2021, pp. 58–61.</p>
        <p>[14] T. Pranckevičius, V. Marcinkevičius, Comparison of naive bayes, random forest, decision tree, support vector machines, and logistic regression classifiers for text reviews classification, Baltic Journal of Modern Computing 5 (2017) 221.</p>
        <p>[15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.</p>
        <p>[16] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Brew, Huggingface’s transformers: State-of-the-art natural language processing, CoRR abs/1910.03771 (2019).</p>
        <p>[17] S. Bird, E. Klein, E. Loper, Natural language processing with Python: analyzing text with the natural language toolkit, O’Reilly Media, Inc., 2009.</p>
        <p>[18] B. Matthews, Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochimica et Biophysica Acta (BBA)-Protein Structure 405 (1975) 442–451.</p>
        <p>[19] D. Chicco, G. Jurman, The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation, BMC Genomics 21 (2020) 1–13.</p>
        <p>[20] M. Medvedeva, A. Üstün, X. Xu, M. Vols, M. Wieling, Automatic judgement forecasting for pending applications of the European Court of Human Rights, in: ASAIL/LegalAIIA@ICAIL, 2021, pp. 12–23.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>