Automatic Judgement Forecasting for Pending
Applications of the European Court of Human Rights
Masha Medvedeva1,2 , Ahmet Üstun1 , Xiao Xu1,3 , Michel Vols2 and Martijn Wieling1
1
  Centre for Language and Cognition, University of Groningen, the Netherlands
2
  Department of Legal Methods, University of Groningen, the Netherlands
3
  Netherlands Interdisciplinary Demographic Institute, The Netherlands


                                             Abstract
                                             Judicial decision classification using Natural Language Processing and machine learning has received much attention in the
                                             last decade. While many studies claim to ‘predict judicial decisions‘, most of them only classify already made judgements.
                                             Likely due to the lack of data, there have been only a few studies that discuss the data and the methods to forecast future
                                             judgements of the courts on the basis of data available before the court judgement is known. Besides proposing a more
                                             consistent and precise terminology, as classification and forecasting each have different uses and goals, we release a first
                                             benchmark dataset consisting of documents of the European Court of Human Rights to address this task. The dataset includes
                                             raw data as well as pre-processed text of final judgements, admissibility decisions and communicated cases. The latter are
                                             published by the Court for pending applications (generally) many years before the case is judged, allowing one to forecast
                                             judgements for pending cases. We establish a baseline for this task and illustrate that it is a much harder task than simply
                                             classifying judgements.

                                             Keywords
                                             judicial decisions, machine learning, text classification, datasets, neural networks


1. Introduction                                                                                                       promote accessibility to justice [5], while others suggest
                                                                                                                      that the courts, such as the European Court of Human
Digital access to case law (i.e. court judgements) pro-                                                               Rights (ECtHR), may eventually use it to prioritise vi-
vides us with a unique opportunity to process legal data                                                              olations cases [1, 4]. Additionally, it has been argued
automatically on a large scale using natural language                                                                 that these type of systems will eventually be able reduce
processing techniques. It is, therefore, not surprising                                                               human error of the judges [6]. While each of these sug-
that using machine learning for judicial outcome classi-                                                              gestions can be scrutinised from the legal perspective, it
fication has seen a substantial increase in recent years.                                                             is still clear there are a large number of potential applica-
If we rely on the presumption that legal systems and                                                                  tions for a successful classification system.
legal decision-making are consistent and predictable, we                                                                 While many of the currently proposed systems show
should be able to ultimately create a system that would                                                               promising results with a classification performance of
be able to automatically predict judicial decisions cor-                                                              about 80 percent correct, this is an overly optimistic view
rectly. Consequently, such a system could also be used                                                                of their performance. One of the reasons for this is that
to identify patterns which might be less consistent and                                                               classification performance is generally evaluated by pre-
perhaps reveal biases in the legal system and judicial                                                                dicting the outcome for a random subset of cases which
decision-making.                                                                                                      were already known but not considered when creating
   At present, much work has been done on classifying                                                                 the model. While this may seem fair, an arguably more
the outcomes of final judgements [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].                                                     interesting task is to predict future judgements.1
Classification of final judgements is in principle a use-                                                                Importantly, however, all of the aforementioned stud-
ful task, as it may be used to identify important factors                                                             ies claim to ‘predict judicial decisions‘, which suggests
and arguments of the court, and thereby may provide in-                                                               these systems are able to predict (future) rulings on the
sight into the process of decision-making. Some previous                                                              basis of the available information. Unfortunately, classi-
research even suggests that one day such classification                                                               fying future judgements causes performance to suffer [9].
systems will be able to provide legal assistance [2] and                                                              This lower performance may be caused by, for example,
                                                                                                                      changes in the interpretation of the law, or new social
Proceedings of the Fifth Workshop on Automated Semantic Analysis of
Information in Legal Text (ASAIL 2021), June 25, 2021, São Paulo,
Brazil.                                                                                                                    1
                                                                                                                             It is important to note that predicting court judgements is
Envelope-Open m.medvedeva@rug.nl (M. Medvedeva); a.ustun@rug.nl                                                       a very different task from actual decision-making. The machine
(A. Üstun); xu@nidi.nl (X. Xu); m.vols@rug.nl (M. Vols);                                                              learning systems which are the focus of this study make pattern-
m.b.wieling@rug.nl (M. Wieling)                                                                                       based guesses on the basis of (sequences of) words in the text of a
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative
                                       Commons License Attribution 4.0 International (CC BY 4.0).                     case. We discuss the ethical consideration for making this distinction
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)                                        later in the paper.
phenomena and developments due to changing societies.                   both sets of algorithms on the same cases, we aim to
In addition, almost all classification systems rely on data             compare the relative performance of algorithms previ-
about the case which is made available when the out-                    ously used for classifying court judgements, both for the
come of the case is known. Having knowledge about the                   task of classification as well as forecasting (using the the
outcome of a case may influence how the facts of the case               information published in the communicated cases; see
are described (e.g., irrelevant facts for the outcome may               Section 3.2). We do not introduce any new algorithms, as
be removed, or facts identified after an investigation and              the purpose of this study is to determine the difference
relevant to the outcome may be highlighted), compared                   in performance for the two types of tasks.
to a situation in which the outcome would not have been                    In the following section we will discuss earlier work in-
known. This would mean that systems which use in-                       volving the latest attempts at classifying and forecasting
formation composed when the outcome was not known                       court judgements. Section 3 is dedicated to describing
may be disadvantaged compared to systems which use                      the data we have used for our experiments and the larger
information extracted from documents composed when                      dataset we release with this paper. In Section 4, we dis-
the outcome was known. One goal of this paper is to                     cuss various methods that can be used for forecasting
evaluate whether this indeed is the case.                               decisions, their power and limitations. In Section 5, we
   A further goal of this paper is to propose making a                  report the results of the experiments that we have con-
distinction between forecasting judgements and classify-                ducted for this study. In Section 6, we discuss the results
ing judgements. For the former, available textual data                  and make suggestions regarding future work. Finally,
describing the (facts of the) case is required which was                in Sections 7 and 8, respectively, we make a note about
created before the decision was reached, so that the input              ethical issues when conducting this type of research, and
of the forecasting system is not influenced by the out-                 draw conclusions.
come. For the latter, available textual data about the (facts
of the) case was created after the decision was reached.
Being explicit about this distinction is important, as many             2. Related work
current studies in the field claim to ‘predict judicial de-
                                                                        In this paper, we exclusively focus on the closed-class
cisions‘, which suggests that they are forecasting future
                                                                        (often binary) tasks of outcome classification and out-
judgements, while instead they are classifying previously
                                                                        come forecasting. These tasks are different from charge
made judgements. For example, while Medvedeva et al.
                                                                        prediction, which predicts an open set of outcomes, such
[9] predict the performance for future cases (by training
                                                                        as the duration of a prison term [12, 13, 14, 15, 16, 17, 18].
a model on the basis of data from past cases), this is still a
                                                                           While a growing number of courts share their data
classification task as the input data is (a subset of) textual
                                                                        online, not many courts publish all of their cases online.
data which was created after the decision was reached.
                                                                        Furthermore, for many published cases it may be hard to
   Forecasting thus requires data related to a judgement
                                                                        determine a binary or at least a small set of pre-set out-
that are published before the actual judgement was de-
                                                                        comes, making it hard to use the data from these courts
livered. While the courts publish more and more case
                                                                        for the type of machine learning models discussed in this
law every day [11], only little access is provided to doc-
                                                                        study. The most recent papers that use machine learning
uments that are available before the judgements were
                                                                        approaches for classifying judicial decisions are there-
made. Forecasting future judgements is therefore a task
                                                                        fore generally focusing on a limited number of courts,
which is impossible for many online available datasets.
                                                                        mainly including the US Supreme Court [3, 19, 20], the
For this reason, the large majority of machine learning
                                                                        French Supreme Court [2, 21], and the European Court of
systems for legal data were built to provide a classifi-
                                                                        Human Rights [5, 6, 7, 9]. A few other courts around the
cation of court judgements, as opposed to forecasting
                                                                        world have also been the focus of this type of analysis,
judgements.
                                                                        including courts of the UK [22], Canada [8, 23], India
   In this study, however, we concentrate on the ECtHR,
                                                                        [10], and Thailand [24].
as it publishes all of its final judgements online together
                                                                           There is a tradition of using statistical techniques to
with many supplementary documents, including admissi-
                                                                        analyse the case law of the US Supreme Court (SCOTUS).
bility cases, press-releases, summaries of cases, et cetera.
                                                                        The advantage of working with the SCOTUS database
Several of these documents were created before the de-
                                                                        is that due to the attention it attracts, all trial data has
cision was reached, and therefore this specific dataset
                                                                        systematically and manually been annotated with hun-
enables both classification as well as forecasting of the
                                                                        dreds of variables by legal experts, shortly after the case
judgements.2 Besides evaluating whether forecasting is
                                                                        has been tried. Katz et al. [19] used variables which are
indeed a harder task than classification by evaluating
                                                                        in principle available before SCOTUS reached its deci-
    2
      To enable reproducibility, we provide our dataset containing      sion in an approach called extremely randomised trees
pending ECtHR applications, as well admissibility decisions and final   to forecast the decision of the court. Their approach re-
judgements of the court that can be used for a variety of tasks.
sults in predicting 70% of the cases correctly, which is      directly comparable to the experiments conducted on
a somewhat lower performance than achieved by some            court judgements.
of the state-of-the-art classification approached applied        There are currently only very few studies that focus
to data from other courts. However, as Katz et al. [19]       on forecasting judgements, and most show a lower per-
performed the task of forecasting court decisions, rather     formance level than studies on judgement classification.
than classifying court decisions, their lower performance     Specifically Sharma et al. [20] and Katz et al. [19] forecast
may also be indicative of the potentially higher difficulty   court decisions of the US Supreme Court. They reported
of forecasting.                                               an accuracy of around 70%. For courts in Europe, only
    Most of the courts in Europe, unfortunately, do not       Waltl et al. [27] forecast the outcome of appeal decisions
have the advantage of being able to generate such scrupu-     involving German tax law (reporting a relatively low
lously annotated datasets, and often provide no access        performance, with an average F-score of 0.57). Further-
to all case law. For the European Court of Human Right        more, Medvedeva et al. [28] forecast decisions on the
a baseline model for classifying judgements using a so-       basis of data from the ECtHR with their online system
called Support Vector Machine (SVM) on the basis of           JURI (yielding an accuracy of around 70%).3 The latter
n-grams (i.e. sequences of one or more words extracted        study is the approach we follow and extend in this pa-
from the text) has been put forward by Medvedeva et al.       per. Specifically, we aim to investigate how the more
[9]. Their model classified court decisions with an av-       advanced machine learning approaches of Chalkidis et al.
erage accuracy of 75% for nine articles. Their work ex-       [5] and Chalkidis et al. [29] perform when forecasting
tended and corrected some data extraction issues (i.e. ar-    the ECtHR judgements.
guments of the court referencing the outcome were still
included in the input training data, thereby resulting in
overly optimistic performance) of an earlier study by         3. Data
Aletras et al. [1]. Additional work on the topic has been
conducted by Chalkidis et al. [5], where they tested new      3.1. The Court
methods and additional tasks, such as predicting the im-      The European Court of Human Rights was established in
portance of a court case and identifying the articles that    1959 as an international court that deals with individual
may (not) have been violated. While Chalkidis et al. [5]      and State applications claiming violation of various rights
also trained the system on cases up to 2013 and tested on     laid out in the European Convention on Human Rights
2014-2018 (following the approach of Medvedeva et al.         (ECHR) [30, 31]. Applications are always brought by an
[25]), they extracted their data from the judgements,         individual/institution or multiple individuals/institutions
thereby making their approach a classification task in-       against a State or multiple States that have ratified the
stead of a forecasting task.                                  Convention. No applications are considered between
    To our knowledge, only one study has tried to show        individuals, or from a State against an individual. Only
that using documents from the early stages of the legal       five cases of a State against a State have been judged so
process may not always be as useful and predictive as final   far in the history of the Court. In 2020 the Court pro-
judgements. Specifically, Branting et al. [26] conducted      cessed 41,700 applications, which were added to already
experiments using statements from attorney misconduct         pending applications. A total of 37,289 applications were
complaints submitted to the Bar Association in the USA.       dismissed based on the admissibility criteria, while the
The researchers set up a task of predicting whether the       rest were decided by a Chamber or a Grand Chamber
case would be investigated or closed. Using six different     (762 cases based on 1,901 applications). From those cases,
machine learning systems the authors showed that the          880 were found to represent a violation of human rights.
text of the complaints themselves had very low predic-        The majority of the documents produced by the court
tive accuracy (maximum weighted f1-score: 0.52), and          during the process are published online by the Court.4
also adding additional metadata (i.e. extra information
filled in during the complaint, attorney history, sentiment
score, etc) was not very beneficial (maximum weighted         3.2. Communicated cases
f1-score: 0.55). Only data from later stages in the pro-      In order to describe the data that we use for our system
cess, specifically allegation codes assigned by the intake    it is important to clarify what the application process of
staff substantially improved results (maximum weighted        the Court entails.
f1-score: 0.70). Nevertheless, these scores are still sub-        A resident of a country that ratified the ECHR can
stantially lower than the scores reported by many studies     claim a potential violation within a certain time frame.
classifying final decisions (see above). While Branting       The application is submitted via mail. On arrival, it is
et al. [26] also deal with legal documents, they are not
judicial decisions, but rather disciplinary proceedings           3
                                                                      http://www.jurisays.com
conducted by the Bar Association, and therefore are not           4
                                                                      https://hudoc.echr.coe.int
registered by the Court and sent to the legal division that      but eventually judged together. Each year thousands of
deals with the cases of a particular State, as they are famil-   applications are communicated (i.e. 6,442 in 2019 and
iar with the legislation of the country. Subsequently, the       7,681 in 2020). Only communicated cases from the year
case is allocated to one of the Court’s judicial formations.     2000 and later are available online. The Court decides on
   Most of the cases are found inadmissible without mer-         the order in which the cases are dealt with, based on the
iting an investigation, due to not meeting the formal            importance and urgency of the issues raised (Rule 41 of
admissibility criteria. For example, often the applica-          the Court – Order of Dealing with Cases)5 . Therefore,
tion is dismissed because the applicant did not file the         the cases being judged may be mixed up and do not al-
complaint within the required time frame. A decision             ways respect the chronological order of when they were
regarding these cases is normally rendered by a single           submitted.
judge. If the application was not dismissed directly, the           For the machine learning systems created in our study,
decision on admissibility is taken by a Committee of             we will only use communicated documents that have
three judges (in case the Court has dealt with a number          judgements or have been found inadmissible based on
of similar cases before) or the Chamber of seven judges.         merit for training and testing.
In some cases admissibility decisions may even be made
by the Grand Chamber (consisting of seventeen judges).           3.3. Data collection
Those usually concern the interpretation of the Conven-
                                                          We collected the data for this study in the following way.
tion itself, or if there is a risk of inconsistency compared
to the previous judgements of the Court.                  We scraped the ECtHR’s ‘HUDOC’ website6 and down-
   When an application is judged to be admissible based   loaded all communicated cases. We did the same for
on formal parameters, the Chamber will examine its mer-   the judgements and admissibility decision documents,
its. Before doing so, the Court will communicate the      such as the admissibility cases from the Chamber and the
application to the government that is the potential vio-  Committee. We filtered the cases on the website to only
lator of the rights of the applicant (Rule 60 of the Courtdownload English versions of the documents. As the
– Claims for just satisfaction). This is not done for all filter did not always work adequately, we also filtered us-
applications, but only for a part (approx. 15-20%). Such  ing Google’s language detection (l a n g d e t e c t ) library.7 In
communicated cases contain the summary of the facts of    addition, we extracted all available metadata, such as the
the case, as well as questions to the government pertain- application number, state, importance level, et cetera. We
ing to the applicant’s complaint. This document allows    used the application number of each communicated case
the government concerned to submit its observations on    to link the associated documents to corresponding admis-
the matter of dispute. These documents are often commu-   sibility decisions and judgements. We then extracted the
nicated years before the case is judged, which provides a conclusion of the court proceedings (‘violation’ or ‘no
unique opportunity to use them for predicting the judge-  violation’), as well as the facts of the cases from the judge-
ments of future cases. Moreover, the questions posted to  ment text. We use these facts in a classification model, so
the state often reflect on the Court’s legal characterisa-we can compare its performance to the performance of
tion of the complaint. See, for instance, a question from a forecasting model using data from the communicated
a case of Arki against Hungary (application no. 10755/14, cases.
communicated on June 6, 2014):                                While the facts in communicated cases are the sum-
                                                          mary of the events as described by the applicant, the
        1. Have the applicants been subjected to          facts that end up in the final judgement are compiled
        inhuman or degrading treatment on ac-             after the investigation and therefore also include the side
        count of their cramped prison conditions,         of State. We only use the facts of the case from the fi-
        in breach of Article 3 of the Convention?         nal judgements since these are most comparable to the
                                                          communicated cases. Specifically, these have also been
   As a consequence, these documents can potentially be argued to potentially be available before the outcome was
used to identify the facts or even (parts of) arguments reached [9] and do not contain references to the outcome
related to certain judgements before those judgements [9, 5]. This also mirrors the set up in Chalkidis et al. [5]
are made.                                                 that we follow.
   Cases concerning repetitive issues do not merit a com-     To enable a fair comparison, the cases (but not the ex-
municated case, and not every communicated case cor- tracted information about these cases) used for training
responds directly to a specific judgement. Multiple ap- and testing are identical for both models. We assume
plications concerning the same events can be merged that cases that were found to be inadmissible based on
into a single case during the communication stage, but
                                                               5
may be separated during final decision-making. Similarly,      6
                                                                 https://www.echr.coe.int/Documents/Rules_Court_ENG.pdf
                                                                 https://hudoc.echr.coe.int/
multiple applications can be communicated separately,          7
                                                                 https://pypi.org/project/langdetect/
merit are similar to cases that were judged as having no          average number of words associated with the extracted
violation. From a legal point of view, these cases can be         facts from each judgement are not much higher (i.e. 2000
characterised as simply more clear ‘non-violation’ cases.         words) than the number of words of the associated com-
The court has made judgements on similar applications             municated case (i.e. 1800 words).
many times before, and hence these do not merit a full
judgement. For cases that went though to the final judge-         Table 1
ment stage, we assign the ‘violation’ label for all cases         Distribution of training and testing data for different setups.
that were judged to show a violation of at least one article
                                                                                                setup 1:       setup 2:       setup 3:
of the ECHR.                                                                                      2020           2019           2018
   As we mentioned before, individual communicated                    train (balanced)            2264           1806           1386
cases do not always directly correspond to unique cases               test (no violation)          167            229            210
which received a judgement or admissibility decision, as              test (violation)             342            311            309
communicated cases can be split or merged during the
process. For the split cases, the assigned label of the asso-
ciated communicated case depended on whether any of
the split cases resulted in a violation of at least one article   3.4. Dataset
(‘violation’ label), or not (‘non-violation’ label, i.e. none     In addition to the data used in this study, we have ex-
of the split cases exhibits a violation of any article). To       tracted data for a large set of additional cases, which
ensure the set of cases considered for the classification         were not taken into account in our analysis. This dataset
task and the forecasting task is identical, we randomly           is released together with this paper.8 Specifically, this
selected a single judged case (from the associated split          dataset contains all of the communicated cases, admissi-
cases) where the assigned label matched that of the as-           bility cases and final judgements of the Court published
signed label to the communicated case. For judgements             between 1960 and 2020. We provide raw text, the meta-
associated with multiple merged communicated cases,               data (e.g., date, court-assigned importance, parties, and
we randomly chose one of the communicated cases and               section) as well as the preprocessed text of communi-
removed the rest. Finally, duplicate cases and judged             cated cases (split into facts and questions), admissibility
cases which did not have (correctly formatted) facts were         decisions (extracted facts) and final judgements (split into
excluded from the dataset used for both tasks. In this            sections: Procedure, Facts, Relevant domestic law, Law
way, the set of cases considered for the classification task      - including arguments of the court, Outcome, and Dis-
and the forecasting task is identical.                            senting opinions) in order to facilitate further research in
   Subsequently, we split the data into training and test         ECtHR judgement forecasting and classification. In addi-
sets (on average a 77%-23% split). We trained each sys-           tion, the case numbers are linked throughout each stage
tem three times, with different setups (with a decreasing         of the court proceedings (where applicable). This dataset
amount of training data) to assess the robustness of the          may be suitably used for a number of classification tasks
results. Setup 1 concerns model training with cases that          in legal analysis, including judgement classification based
received judgement in years 2000-2019, whereas model              on facts (using Facts and possibly Procedure sections)
testing was conducted with cases that received judge-             and/or arguments (using the Law sections).
ment in the year 2020. Setup 2 uses 2000-2018 data for
training and 2019 data for testing. Setup 3 uses 2000-
2017 data for training and 2018 data for testing. Each            4. Methodology
setup is used once for forecasting judgements using data
from the communicated cases, and once for classifying             As we mentioned before, the approach most relevant
judgements using data from the facts extracted from the           work for our study is that of Chalkidis et al. [5]. Specifi-
final judgement. As in each setup the number of viola-            cally, in one of their tasks they focused on classifying the
tion cases exceeded the number of non-violation cases,            court judgements depending on whether at least one arti-
we balanced the training set in each setup by removing            cle of the ECHR was violated or not.9 In addition, they ex-
older violation cases until the same number of documents          perimented with using anonymized vs. non-anonymized
was present for each label. Table 1 shows the number                   8
                                                                         https://drive.google.com/drive/folders/1ltIpHlcqcRlT_
of documents available for training and testing for each          JDebHsyLgvgoa4Vbxo8?usp=sharing
setup.                                                                 9
                                                                         The purpose of their Chalkidis et al. [5] second task was to iden-
   The data used for the two different tasks differs some-        tify all of the violated articles for a single court document (i.e. multi-
what. For the communicated cases, we used all data                label classification). However, as the involved articles are known as
                                                                  soon as the application is submitted, it is not clear what the practical
available (i.e. the facts and the questions as they were
                                                                  use is of predicting the list of articles potentially violated. A realistic
presented in the text), whereas for the judgements, we            scenario for the ECtHR would only involve deciding whether or not
only used data from the facts section. In general, the            a given article was violated.
data. While we perform the same task as Chalkidis et al.               texts from different sources.
[5], enabling us to benefit from more data than when                      BERT and many of its variations, including H-BERT,
we would predict (non-)violation per article separately,               have shown to result in substantial improvements com-
we only use non-anonymized data. For the anonymized                    pared to the state-of-the-art in a large variety of text
setup, Chalkidis et al. [5] have removed named entities                classification tasks. Specifically, Chalkidis et al. [5] have
(such as names or locations) from the text to make sure                shown that using H-BERT resulted in a very high perfor-
the model was not biased towards demographic informa-                  mance (macro F-score of 0.82) for the binary task (viola-
tion. While removing this potential bias is understand-                tion of at least 1 article of ECHR vs. no violation), and an
able when building a decision-making system, forecast-                 even higher macro F-score of 0.83 for LEGAL-BERT on
ing or classifying judgements is different. Specifically,              the same dataset [29].
given that locations may offer relevant information about                 In the following, we report the results per class for each
the case (i.e. some countries are notorious violators of               model. Our main evaluation metric is the macro F-score.
specific rights), models used for forecasting or classifica-           This measure can be described as a mean of the average
tion benefit from keeping this information (also known                 precision and recall across all classes (i.e. ‘violation’ and
to judges) in.                                                         ‘no violation’). Precision is the percentage of cases given
   In our study, we implement three systems used by                    a certain label (i.e., ‘violation’ or ‘no violation’) that was
Chalkidis et al. [5] and compare their performance on                  correct. Recall is the percentage of cases having a certain
the classification and forecasting task. Specifically, we              (correct) label, that were identified as such.11
include the SVM model, the Hierarchical-BERT (H-BERT)
model and the LEGAL-BERT model (see below for more
details). All models were re-created on the basis of the               5. Results
description provided by Chalkidis et al. [5] and Chalkidis
                                                                       We started our experiments with setup 1, by testing on
et al. [29]. As not all settings and (hyper)parameters were
                                                                       all data from 2020. To our surprise, results for classi-
specified in their paper, our reproduction of their models
                                                                       fying the final judgements were very low compared to
may be slightly different. However, we believe these
                                                                       Chalkidis et al. [5] (see tables 2 and 3 for the performance
differences to be minor. Our goal is to see how some of
                                                                       per class). In contrast to our expectations, forecasting
the state-of-the-art models which have been shown to
                                                                       final judgements on the basis of communicated cases in-
perform very well when applied to final judgements of
                                                                       stead of on the basis of the facts of the final judgements
the ECtHR perform when they are only being provided
                                                                       yielded better results when using H-BERT. Compared
with data from the applicants to the ECtHR (i.e. victims
                                                                       to Chalkidis et al. [5], however, since not all cases are
of a alleged human right violation).
                                                                       communicated by the court, our training set was much
   Our SVM classifier is a Linear SVC model including
                                                                       smaller (2264 cases vs. 7100 cases, respectively).
1-5 n-grams. For a detailed explanation about text classi-
fication using machine learning (including Linear SVC),
see Medvedeva et al. [9].                                              Table 2
   BERT or Bidirectional Encoder Representations from                  Performance (precision, recall, f1-score and accuracy) for Lin-
Transformers [32] is a popular pre-trained transformer-                ear SVC, H-BERT, and LEGAL-BERT models per class for final
                                                                       judgement classification, trained on cases between 2000 and
based [33] machine-learning technique resulting in a
                                                                       2019 and tested on cases decided in 2020
so-called language model. The method also allows fine-
                                                                                         2020 - Final judgements
tuning the language model for a specific task, i.e. adapt-
                                                                                                       P     R            F1      #
ing the pre-trained model to the target task, in our case
                                                                          SVM            no viol.    0.46   0.93         0.62    167
classifying and forecasting ECtHR judgements.                                           violation    0.93   0.46         0.62    342
   To use BERT on long case documents without hav-                                     macro avg.    0.70   0.70         0.62    509
ing a maximum text length restriction, H-BERT [5] pro-                                  accuracy                         0.62    509
cesses each fact separately and combines them by using                    H-BERT         no viol.    0.42   0.92         0.58    167
a self-attention layer to generate an embedding for a                                   violation    0.91   0.38         0.53    342
case. This resulting embedding is then used for classifi-                              macro avg.    0.66   0.65         0.56    509
cation and forecasting.10 Instead of the standard BERT                                  accuracy                         0.56    509
model (which [5] reported to have sub-par performance),                   LEGAL-         no viol.    0.42   0.90         0.58    167
we used LEGAL-BERT [29] in our experiments. LEGAL-                        BERT          violation    0.89   0.40         0.55    342
BERT is a BERT model which was pre-trained on legal                                    macro avg.    0.66  0.65          0.57    509
                                                                                        accuracy                         0.57    509
   10
      While BERT can process each case by including up to 512
                                                                           11
tokens (i.e. meaningful word parts), our H-BERT implementation                Exact definition of the F-score can be found here:
can use up to 1024 tokens (i.e. 128 tokens for each of the the first   https://scikit-learn.org/stable/modules/generated/sklearn.metrics.
eight facts).                                                          f1_score.html
Table 3                                                           Table 5
Performance (precision, recall, f1-score and accuracy) for Lin-   Performance (precision, recall, f1-score and accuracy) for Lin-
ear SVC, H-BERT, and LEGAL-BERT models per class for fore-        ear SVC, H-BERT, and LEGAL-BERT models per class for fore-
casting judgements, trained on communicated cases between         casting judgements, trained on communicated cases between
2000 and 2019 and tested on communicated cases that received      2000 and 2018 and tested on communicated cases that received
a judgement in 2020.                                              a judgement in 2019.
               2020 - Communicated cases                                         2019 - Communicated cases
                           P      R       F1            #                                    P      R       F1            #
   SVM           no viol. 0.47   0.51    0.49          167           SVM           no viol. 0.62   0.53    0.57          229
                violation 0.75   0.72    0.73          342                        violation 0.69   0.77    0.73          311
               macro avg. 0.61   0.61    0.61          509                       macro avg. 0.66   0.65    0.65          540
                accuracy                 0.65          509                        accuracy                 0.67          540
   H-BERT        no viol. 0.45   0.61    0.52          167           H-BERT        no viol. 0.57   0.67    0.61          229
                violation 0.77   0.63    0.69          342                        violation 0.72   0.63    0.67          311
               macro avg. 0.61   0.62    0.60          509                       macro avg. 0.64   0.65    0.64          540
                accuracy                 0.62          509                        accuracy                 0.65          540
   LEGAL-        no viol. 0.42   0.54    0.47          167           LEGAL-        no viol. 0.55   0.50    0.52          229
   BERT         violation 0.74   0.63    0.68          342           BERT         violation 0.66   0.70    0.68          311
               macro avg. 0.58   0.58    0.57          509                       macro avg. 0.60   0.60    0.60          540
                accuracy                 0.60          509                        accuracy                 0.61          540

   However, when trying setup 2, where we trained using Table 6
less data (i.e. until 2018) and tested on all data of 2019, Performance (precision, recall, f1-score and accuracy) for Lin-
results were as expected. Specifically, macro F-scores ear SVC, H-BERT, and LEGAL-BERT models per class for final
ranged between 0.79 and 0.92 for the classification task judgement classification, trained on cases between 2000 and
(see Table 4), and performance was much lower for the 2017 and tested on cases decided in 2018.
forecasting task with macro F-scores ranging from 0.60                      2018 - Final judgements
to 0.65 (see Table 5).                                                                      P        R      F1       #
                                                                     SVM         no viol.       0.67     0.91    0.77    210
Table 4                                                                          violation      0.92     0.70    0.79    309
Performance (precision, recall, f1-score and accuracy) for Lin-                  macro avg.     0.79     0.80    0.78    519
ear SVC, H-BERT, and LEGAL-BERT models per class for final                       accuracy                        0.78    519
judgement classification, trained on cases between 2000 and          H-BERT      no viol.       0.86     0.72    0.78    210
2018 and tested on cases decided in 2019.                                        violation      0.83     0.92    0.87    309
                                                                                 macro avg.     0.84     0.82    0.83    519
                 2019 - Final judgements
                                                                                 accuracy                        0.84    519
                               P     R          F1      #
                                                                     LEGAL-      no viol.       0.88     0.78    0.83    210
   SVM           no viol.    0.69   0.95       0.80    229
                                                                     BERT        violation      0.86     0.93    0.89    309
                violation    0.95   0.68       0.79    311
                                                                                 macro avg.     0.87     0.85    0.86    519
               macro avg.    0.82   0.81       0.79    540
                                                                                 accuracy                        0.87    519
                accuracy                       0.79    540
   H-BERT        no viol.    0.90   0.92       0.91    229
                violation    0.94   0.93       0.93    311
               macro avg.    0.92   0.92       0.92    540    When running the same experiments using succes-
                accuracy                       0.92    540 sively smaller datasets (i.e. testing on data from 2017, and
   LEGAL-        no viol.    0.87   0.90       0.88    229 2016), the same pattern is visible as for setups 2 and 3.
   BERT         violation    0.92   0.90       0.91    311 That is, performance when classifying final judgements
               macro avg.    0.90   0.91       0.90    540 is much higher than when forecasting final judgements.
                accuracy                       0.90    540 Table 8 shows the macro F-scores for both tasks for all
                                                           years (of the test set) ranging from 2016 to 2020 and all
   To determine which of the two setups resulted in repre- three algorithms. Besides showing that classification per-
sentative results, we conducted a final experiment (setup formance is generally (except for 2020) higher than fore-
3), by training with even less data (i.e. until 2017) and casting performance, these results also show that while
testing on all data of 2018. The results showed a simi- H-BERT and LEGEL-BERT generally outperforms SVM
lar pattern (with slightly better performance, despite the in classification (except for 2020), they do not improve
reduced amount of training data) as the results of setup over SVM in forecasting.
2 (2019). See Tables 6 and 7 for an overview of these
results.
Table 7                                                              The SVM model allows us to inspect the top coeffi-
Performance (precision, recall, f1-score and accuracy) for Lin-   cients (weights) of n-grams assigned by the system. We
ear SVC, H-BERT, and LEGAL-BERT models per class for fore-        observe that for final judgements the system often pri-
casting judgements, trained on communicated cases between         oritises longer n-grams (the average length for the 100
2000 and 2017 and tested on communicated cases that received      top features is 2.475), while for communicated cases it
a judgement in 2018.                                              prioritises unigrams and common collocations consisting
               2018 - Communicated cases                          of two words such as public prosecutor or minor offences
                           P      R       F1            #         (the average length for the 100 top features is 1.405).
   SVM           no viol. 0.62   0.55    0.58          210           We should also take into account that the communi-
                violation 0.72   0.77    0.74          309        cated case is a summary of an applicant’s complaint. As
               macro avg. 0.67   0.66    0.66          519        a result, it only reflects this party’s side of the events,
                accuracy                 0.68          519
                                                                  and may be subjective and incomplete. After sending
   H-BERT        no viol. 0.60   0.63    0.61          210
                                                                  the communicated case to the State involved, the Court
                violation 0.73   0.71    0.72          309
               macro avg. 0.67   0.67    0.67          519        conducts the investigation and inspects the side of the
                accuracy                 0.68          519        State as well. Consequently, the final judgement contains
   LEGAL-        no viol. 0.59   0.52    0.55          210        a more thorough and objective description of the facts
   BERT         violation 0.69   0.75    0.72          309        that takes the sides of both parties into account. This
               macro avg. 0.64   0.63    0.64          519        explains why the facts available in communicated cases
                accuracy                 0.66          519        can differ considerably from the set of facts presented in
                                                                  the final judgement.
                                                                     This bias towards a violation of human rights can also
6. Discussion                                                     be observed in the results. For the forecasting task, all
                                                                  models show a higher performance when predicting the
Our results clearly show that our intuition regarding             ‘violation’ label than when predicting the ‘non-violation’
the increased difficulty of the task of forecasting judge-        label (see Tables 3, 5, and 7). In contrast, the gap in
ments as opposed to classifying judgements is confirmed.          performance when predicting the two labels for the clas-
However, the tasks are conceptually very different, and           sification task is considerably smaller (see Tables 2, 4,
therefore comparing them in terms of accuracy may not             and 6), which confirms the intuition that the description
be entirely fair. Nevertheless, both fall under ‘predicting       of the facts in final judgements are a better representa-
court decisions’ in the existing literature. Our results          tion of the events and therefore better predictors of the
illustrate that predicting court decisions which have not         outcome. Nevertheless, for the 2018 and 2019 data, the
been made yet is a much harder task than current aca-             performance predicting the ‘violation’ label using the
demic research may suggest.                                       communicated cases data (i.e. the forecasting task) is still
   One potential explanation for the higher performance           lower than the overall performance (or the ‘violation’ la-
of the classification approach compared to the forecast-          bel performance) using the extracted facts from the final
ing approach may be the higher amount of data (i.e. an            judgements (i.e. the classification task).
average of 2000 words for the facts part of the judgement            The only case when forecasting judgements shows a
versus 1800 words for the communicated case). Since               higher performance than classifying judgements is on
LEGAL-BERT and H-BERT have a limited input length of              the 2020 test data. However, this is caused by the much
up to 512 or 1024 tokens (respectively), this difference will     lower than usual classification performance. Unfortu-
not play a role for these models. However, this is different      nately, we have no explanation for this pattern, despite
for the SVM which does not have such a limit. Conse-              the effort we spent on trying to investigate whether the
quently, we evaluated an SVM on the ‘shortened’ facts of          2020 data showed deviating patterns compared to the
the final judgements. Specifically, we removed the facts          data from earlier years. For example, the average length
from the middle of the text (under the presumption that           of the 2020 cases, and overall vocabulary is consistent
the most important information is present at the begin-           with the previous years, as well as the distribution of
ning and at the end) until the text was approximately the         cases between different States and therefore different
same length as the text of the the corresponding commu-           Chambers. The court has judged only slightly (4%) fewer
nicated case. This change, however, did not affect the            cases in 2020 than in 2019, and did not adopt any new
performance, as the SVM on this trimmed data yielded              policies compared to the previous years. There is no indi-
macro F-scores of 0.61, 0.83 and 0.77 for 2020, 2019, and         cation that the court used a different selection approach
2018, respectively (compared to 0.62, 0.79 and 0.78). This        for the cases it ruled on. Since the cases originated in the
suggests that the facts are formulated in a way that is           years before 2020, it is also unlikely that this pattern has
affected by the final ruling, rather than that there is a         any relationship with human rights violations related
tangible benefit of the higher amount of data.                    to COVID-19. Finally, the format of case law has also
Table 8
Macro F-scores for Linear SVC, H-BERT, and LEGAL-BERT models for both tasks between 2016 and 2020, including size of
training and testing sets.
                                                   F-score (macro)
                                                        2020   2019      2018    2017   2016
                          Training set size             2264   1806      1386     976    640
                          Test set size                  509    540       519     503    447
                          SVM (forecasting)              0.61   0.65     0.66    0.65   0.64
                          H-BERT (forecasting)           0.60   0.64     0.67    0.66   0.66
                          LEGAL-BERT (forecasting)       0.57   0.60     0.64    0.64   0.58
                          SVM (classification)           0.62   0.79     0.78    0.78   0.75
                          H-BERT (classification)        0.56   0.92     0.83    0.84   0.82
                          LEGAL-BERT (classification)    0.57   0.90     0.86    0.84   0.82


remained the same. For now, we are therefore forced to       itself, it is beneficial to also gain insight into how the
consider performance on 2020 data (as so many other          system reaches a certain outcome, and therefore to take
things in 2020) an anomaly. Whether this deviating pat-      a step toward explainable AI [34, 35] and large-scale au-
tern will continue in 2021 remains to be evaluated.          tomatic legal analysis. This requires, for example, un-
                                                             derstanding which facts lead to which judgement. Par-
6.1. Future work                                             ticularly for the classification task, where determining a
                                                             judgement of the court which is already known is of no
We have discussed a range of approaches to forecast practical use, determining the basis of the classification
outcomes of pending applications. Each of these could be is important.
improved through more careful tuning, pre-processing,           Several methods that are often used in classification
data selection, feature selection, et cetera. Furthermore, tasks allow determining the classification basis (to some
additional classification or forecasting algorithms could extent). Linear SVC, for example, allows the inspection
be used as well. However, this was not the goal of the of its coefficients to evaluate which words and phrases
present paper. By releasing our dataset, together with are more characteristic for assigning one class than an-
a number of baselines reported in this paper, we hope other. Medvedeva et al. [28] also suggest evaluating such
to have provided a new starting point for the task of a system at the sentence level to identify and highlight
forecasting ECtHR judgements.                                the sentences that have the highest probability of belong-
   Regarding future research, it would be interesting to ing to a specific class. Furthermore, the architecture of
assess whether selecting the last tokens, or tokens from H-BERT, for example, allows one to assess which of the
specifically chosen facts would be beneficial for BERT- eight included facts (or questions) had the largest impact
like models. For example, these models might yield better on classification on the basis of so-called attention [33].
results as initial facts generally are about the procedure Unfortunately LEGAL-BERT by itself cannot be used for
and the applicant themselves, while facts from the end of this. While it often produces very high scores, especially
the document often are more closely related to the events for final judgement classification, and it may function as a
relating to the alleged violation of human rights. Due to good reference point for high classification performance,
limited available data, we have only investigated whether one cannot see within the black box.
or not a case violated any article of the ECHR. However,
it would be interesting to assess how the difference in per-
formance between forecasting and classification would 7. Ethical considerations
be affected when individual articles are investigated.
   While we can forecast pending applications using data We believe it is important to emphasise that our goal with
from communicated cases, this does not allow us to fore- this work is only to (try to) forecast and classify court
cast the judgements for any future cases as this data may judgements. Our interest is scientific and is focused on
not always available (e.g., not all cases are communicated assessing whether Natural Language Processing systems
to the State). Forecasting using other data available be- are able to identify certain patterns in legal judgements.
fore the judgement is known (i.e. from other sources) We do not think that any of the models described in this
may likely be even harder, as the uniform documents cre- paper can or should be used for making decisions in courts,
ated by the court for the communicated cases are likely especially those where human rights are at stake (which
beneficial.                                                  concerns the majority of the courts around the world).
   While predicting judgements is an interesting task in Moreover, we are opposed to the use of such models in
                                                             other high-stakes situations, due to the inability of these
types of models to deal with new legal developments                                ceedings of the 57th Annual Meeting of the As-
and interpretations, previously unobserved issues [36,                             sociation for Computational Linguistics, Associa-
37], lacking transparency [38, 39, 40], and cybersecurity                          tion for Computational Linguistics, Florence, Italy,
concerns [41].                                                                     2019, pp. 4317–4323. URL: https://www.aclweb.org/
                                                                                   anthology/P19-1424. doi:1 0 . 1 8 6 5 3 / v 1 / P 1 9 - 1 4 2 4 .
                                                                               [6] A. Kaur, B. Bozic, Convolutional neural network-
8. Conclusion                                                                      based automatic prediction of judgments of the eu-
                                                                                   ropean court of human rights, in: AICS, 2019.
In this paper we have proposed to make a distinction
                                                                               [7] C. Condevaux, Neural legal outcome prediction
between forecasting court judgements and classifying
                                                                                   with partial least squares compression, Stats 3
judgements. Forecasting judgements is based on data
                                                                                   (2020) 396–411.
which is available before the outcome is known (such as
                                                                               [8] O. Salaün, P. Langlais, A. Lou, H. Westermann,
the communicated cases of the ECtHR), whereas classi-
                                                                                   K. Benyekhlef, Analysis and multilabel classifi-
fying judgements is based on (a subset of) data compiled
                                                                                   cation of quebec court decisions in the domain of
when the outcome was known (such as the facts from
                                                                                   housing law, in: International Conference on Ap-
the ECtHR ruling). Making this distinction is impor-
                                                                                   plications of Natural Language to Information Sys-
tant, as earlier research [26], and the experiments con-
                                                                                   tems, Springer, 2020, pp. 135–143.
ducted in this paper show that performance seems to be
                                                                               [9] M. Medvedeva, M. Vols, M. Wieling, Using machine
substantially lower when forecasting future judgements
                                                                                   learning to predict decisions of the european court
compared to classifying decisions which were already
                                                                                   of human rights, Artificial Intelligence and Law 28
made by the court, and the terminology of current papers
                                                                                   (2020) 237–266.
(i.e. ‘predicting court judgements’) suggests a forecasting
                                                                              [10] R. A. Shaikh, T. P. Sahu, V. Anand, Predicting out-
task whereas it actually most often is a task of classify-
                                                                                   comes of legal cases based on legal factors using
ing final judgements. Classification performance should
                                                                                   classifiers, Procedia Computer Science 167 (2020)
therefore not be used as an indication of how well these
                                                                                   2393–2402.
types of systems are able to forecast judgements of the
                                                                              [11] M. Marković, S. Gostojić, Open judicial data: A
court. Interestingly, while more sophisticated models
                                                                                   comparative analysis, Social Science Computer
appeared to be beneficial for the simpler classification
                                                                                   Review (2018).
task, this was not the case for the harder forecasting task.
                                                                              [12] B. Luo, Y. Feng, J. Xu, X. Zhang, D. Zhao, Learn-
                                                                                   ing to predict charges for criminal cases with le-
References                                                                         gal basis, in: Proceedings of the 2017 Conference
                                                                                   on Empirical Methods in Natural Language Pro-
 [1] N. Aletras, D. Tsarapatsanis, D. Preoţiuc-Pietro,                             cessing, Association for Computational Linguis-
     V. Lampos, Predicting judicial decisions of the eu-                           tics, Copenhagen, Denmark, 2017, pp. 2727–2736.
     ropean court of human rights: A natural language                              URL: https://www.aclweb.org/anthology/D17-1289.
     processing perspective, PeerJ Computer Science 2                              doi:1 0 . 1 8 6 5 3 / v 1 / D 1 7 - 1 2 8 9 .
     (2016).                                                                  [13] H. Zhong, Z. Guo, C. Tu, C. Xiao, Z. Liu, M. Sun,
 [2] O.-M. Şulea, M. Zampieri, M. Vela, J. van Gen-                                Legal judgment prediction via topological learning,
     abith, Predicting the law area and decisions of                               in: Proceedings of the 2018 Conference on Empir-
     French Supreme Court cases, in: Proceedings of                                ical Methods in Natural Language Processing, As-
     the International Conference Recent Advances in                               sociation for Computational Linguistics, Brussels,
     Natural Language Processing, RANLP 2017, IN-                                  Belgium, 2018, pp. 3540–3549. URL: https://www.
     COMA Ltd., Varna, Bulgaria, 2017, pp. 716–722.                                aclweb.org/anthology/D18-1390. doi:1 0 . 1 8 6 5 3 / v 1 /
     URL: https://doi.org/10.26615/978-954-452-049-6_                              D18- 1390.
     092. doi:1 0 . 2 6 6 1 5 / 9 7 8 - 9 5 4 - 4 5 2 - 0 4 9 - 6 _ 0 9 2 .   [14] X. Jiang, H. Ye, Z. Luo, W. Chao, W. Ma, Inter-
 [3] A. Kaufman, P. Kraft, M. Sen, Machine learning,                               pretable rationale augmented charge prediction sys-
     text data, and supreme court forecasting, Project                             tem, in: Proceedings of the 27th International
     Report, Harvard University (2017).                                            Conference on Computational Linguistics: Sys-
 [4] C. O’Sullivan, J. Beel, Predicting the outcome of                             tem Demonstrations, Association for Computa-
     judicial decisions made by the european court of                              tional Linguistics, Santa Fe, New Mexico, 2018, pp.
     human rights, in: AICS 2019 - 27th AIAI Irish                                 146–151. URL: https://www.aclweb.org/anthology/
     Conference on Artificial Intelligence and Cognitive                           C18-2032.
     Science., 2019.                                                          [15] Y. Li, T. He, G. Yan, S. Zhang, H. Wang, Using
 [5] I. Chalkidis, I. Androutsopoulos, N. Aletras, Neu-                            case facts to predict penalty with deep learning, in:
     ral legal judgment prediction in English, in: Pro-                            International Conference of Pioneering Computer
     Scientists, Engineers and Educators, Springer, 2019,                [24] K. Kowsrihawat, P. Vateekul, P. Boonkwan, Predict-
     pp. 610–617.                                                             ing judicial decisions of criminal cases from thai
[16] H. Ye, X. Jiang, Z. Luo, W. Chao, Interpretable                          supreme court using bi-directional gru with atten-
     charge predictions for criminal cases: Learning to                       tion mechanism, in: 2018 5th Asian Conference on
     generate court views from fact descriptions, in:                         Defense Technology (ACDT), IEEE, 2018, pp. 50–55.
     Proceedings of the 2018 Conference of the North                     [25] M. Medvedeva, M. Vols, M. Wieling, Judicial deci-
     American Chapter of the Association for Compu-                           sions of the european court of human rights: Look-
     tational Linguistics: Human Language Technolo-                           ing into the crystal ball, in: Proceedings of the
     gies, Volume 1 (Long Papers), Association for Com-                       Conference on Empirical Legal Studies, 2018.
     putational Linguistics, New Orleans, Louisiana,                     [26] K. Branting, C. Balhana, C. Pfeifer, J. Aberdeen,
     2018, pp. 1854–1864. URL: https://www.aclweb.org/                        B. BROWN, Judges are from mars, pro se litigants
     anthology/N18-1168. doi:1 0 . 1 8 6 5 3 / v 1 / N 1 8 - 1 1 6 8 .        are from venus: Predicting decisions from lay text,
[17] H. Chen, D. Cai, W. Dai, Z. Dai, Y. Ding, Charge-                        in: Legal Knowledge and Information Systems: JU-
     based prison term prediction with deep gating net-                       RIX 2020: The Thirty-third Annual Conference,
     work, in: Proceedings of the 2019 Conference on                          Brno, Czech Republic, December 9-11, 2020, vol-
     Empirical Methods in Natural Language Processing                         ume 334, IOS Press, 2020, p. 215.
     and the 9th International Joint Conference on Nat-                  [27] B. Waltl, G. Bonczek, E. Scepankova, J. Landthaler,
     ural Language Processing (EMNLP-IJCNLP), Asso-                           F. Matthes, Predicting the outcome of appeal de-
     ciation for Computational Linguistics, Hong Kong,                        cisions in germany’s tax law, in: International
     China, 2019, pp. 6361–6366. URL: https://www.                            Conference on Electronic Participation, Springer,
     aclweb.org/anthology/D19-1667. doi:1 0 . 1 8 6 5 3 / v 1 /               2017, pp. 89–99.
     D19- 1667.                                                          [28] M. Medvedeva, X. Xu, M. Wieling, M. Vols, Juri
[18] W. Chao, X. Jiang, Z. Luo, Y. Hu, W. Ma, Inter-                          says: Prediction system for the european court of
     pretable charge prediction for criminal cases with                       human rights, in: Legal Knowledge and Informa-
     dynamic rationale attention, Journal of Artificial                       tion Systems: JURIX 2020: The Thirty-third Annual
     Intelligence Research 66 (2019) 743–764.                                 Conference, Brno, Czech Republic, December 9-11,
[19] D. M. Katz, M. J. Bommarito II, J. Blackman, A                           2020, volume 334, IOS Press, 2020, p. 277.
     general approach for predicting the behavior of the                 [29] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Ale-
     supreme court of the united states, PloS one 12                          tras, I. Androutsopoulos, LEGAL-BERT: The mup-
     (2017).                                                                  pets straight out of law school, in: Findings of the
[20] R. D. Sharma, S. Mittal, S. Tripathi, S. Acharya, Us-                    Association for Computational Linguistics: EMNLP
     ing modern neural networks to predict the decisions                      2020, Association for Computational Linguistics,
     of supreme court of the united states with state-of-                     Online, 2020, pp. 2898–2904. URL: https://www.
     the-art accuracy, in: International Conference on                        aclweb.org/anthology/2020.findings-emnlp.261.
     Neural Information Processing, Springer, 2015, pp.                       doi:1 0 . 1 8 6 5 3 / v 1 / 2 0 2 0 . f i n d i n g s - e m n l p . 2 6 1 .
     475–483.                                                            [30] S. Greer, J. Gerards, R. Slowe, Human Rights in
[21] O.-M. Sulea, M. Zampieri, S. Malmasi, M. Vela, L. P.                     the Council of Europe and the European Union:
     Dinu, J. Van Genabith, Exploring the use of text                         Achievements, Trends and Challenges, Cambridge
     classification in the legal domain, in: In Proceed-                      Studies in European Law and Policy, Cambridge
     ings of the 2nd Workshop on Automated Semantic                           University Press, 2018. doi:1 0 . 1 0 1 7 / 9 7 8 1 1 3 9 1 7 9 0 4 1 .
     Analysis of Information in Legal Texts (ASAIL 2017),                [31] D. J. Harris, M. O’Boyle, E. Bates, C. Buckley, Harris,
     2017.                                                                    O’Boyle & Warbrick: Law of the European conven-
[22] B. Strickson, B. De La Iglesia, Legal judgement                          tion on human rights, Oxford University Press, USA,
     prediction for uk courts, in: Proceedings of the 2020                    2014.
     The 3rd International Conference on Information                     [32] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT:
     Science and System, 2020, pp. 204–209.                                   Pre-training of deep bidirectional transformers for
[23] H. Westermann, V. R. Walker, K. D. Ashley,                               language understanding, in: Proceedings of the
     K. Benyekhlef, Using factors to predict and an-                          2019 Conference of the North American Chap-
     alyze landlord-tenant decisions to increase ac-                          ter of the Association for Computational Linguis-
     cess to justice, in: Proceedings of the Seven-                           tics: Human Language Technologies, Volume 1
     teenth International Conference on Artificial In-                        (Long and Short Papers), Association for Com-
     telligence and Law, ICAIL ’19, Association for                           putational Linguistics, Minneapolis, Minnesota,
     Computing Machinery, New York, NY, USA, 2019,                            2019, pp. 4171–4186. URL: https://www.aclweb.org/
     p. 133–142. URL: https://doi.org/10.1145/3322640.                        anthology/N19-1423. doi:1 0 . 1 8 6 5 3 / v 1 / N 1 9 - 1 4 2 3 .
     3326732. doi:1 0 . 1 1 4 5 / 3 3 2 2 6 4 0 . 3 3 2 6 7 3 2 .        [33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkor-
     eit, L. Jones, A. N. Gomez, L. u. Kaiser, I. Polo-
     sukhin, Attention is all you need, in: I. Guyon,
     U. V. Luxburg, S. Bengio, H. Wallach, R. Fer-
     gus, S. Vishwanathan, R. Garnett (Eds.), Ad-
     vances in Neural Information Processing Sys-
     tems, volume 30, Curran Associates, Inc., 2017.
     URL: https://proceedings.neurips.cc/paper/2017/
     file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[34] T. Bench-Capon, The need for good old fashioned
     ai and law, International Trends in Legal Informat-
     ics: A Festschrift for Erich Schweighofer. Editions
     Weblaw, Bern (2020) 23–36.
[35] J. Collenette, K. Atkinson, T. Bench-Capon, An
     explainable approach to deducing outcomes in eu-
     ropean court of human rights cases using adfs, Fron-
     tiers in Artificial Intelligence and Applications 326
     (2020) 21–32.
[36] R. W. Campbell, Artificial intelligence in the court-
     room: The delivery of justice in the age of machine
     learning, Colo. Tech. LJ 18 (2020) 323.
[37] R. Berk, Berk, Drougas, Machine learning risk as-
     sessments in criminal justice settings, Springer,
     2019.
[38] A. Završnik, Criminal justice, artificial intelligence
     systems, and human rights, in: ERA Forum, vol-
     ume 20, Springer, 2020, pp. 567–583.
[39] F. Thomsen, Iudicium ex machinae – the ethical
     challenges of automated decision-making in crim-
     inal sentencing, in: J. Roberts, J. Ryberg (Eds.),
     Principled Sentencing and Artificial Intelligence,
     Oxford: Oxford University Press, forthcoming.
[40] A. Deeks, N. Lubell, D. Murray, Machine learning,
     artificial intelligence, and the use of force by states,
     J. Nat’l Sec. L. & Pol’y 10 (2019) 1.
[41] P. M. Nichols, Bribing the machine: Protecting
     the integrity of algorithms as the revolution begins,
     American Business Law Journal 56 (2019) 771–814.