Automatic Judgement Forecasting for Pending Applications of the European Court of Human Rights

Masha Medvedeva¹˒², Ahmet Üstün¹, Xiao Xu¹˒³, Michel Vols² and Martijn Wieling¹

¹ Centre for Language and Cognition, University of Groningen, the Netherlands
² Department of Legal Methods, University of Groningen, the Netherlands
³ Netherlands Interdisciplinary Demographic Institute, the Netherlands

m.medvedeva@rug.nl (M. Medvedeva); a.ustun@rug.nl (A. Üstün); xu@nidi.nl (X. Xu); m.vols@rug.nl (M. Vols); m.b.wieling@rug.nl (M. Wieling)

Proceedings of the Fifth Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2021), June 25, 2021, São Paulo, Brazil.
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
Judicial decision classification using Natural Language Processing and machine learning has received much attention in the last decade. While many studies claim to 'predict judicial decisions', most of them only classify judgements that have already been made. Likely due to the lack of data, only a few studies discuss the data and methods needed to forecast future judgements of the courts on the basis of data available before the court judgement is known. Besides proposing a more consistent and precise terminology, as classification and forecasting each have different uses and goals, we release a first benchmark dataset consisting of documents of the European Court of Human Rights to address this task. The dataset includes raw data as well as pre-processed text of final judgements, admissibility decisions and communicated cases. The latter are published by the Court for pending applications, generally many years before the case is judged, allowing one to forecast judgements for pending cases. We establish a baseline for this task and illustrate that it is a much harder task than simply classifying judgements.

Keywords
judicial decisions, machine learning, text classification, datasets, neural networks

1. Introduction

Digital access to case law (i.e. court judgements) provides us with a unique opportunity to process legal data automatically on a large scale using natural language processing techniques. It is, therefore, not surprising that using machine learning for judicial outcome classification has seen a substantial increase in recent years. If we rely on the presumption that legal systems and legal decision-making are consistent and predictable, we should ultimately be able to create a system that automatically predicts judicial decisions correctly. Consequently, such a system could also be used to identify patterns which might be less consistent and perhaps reveal biases in the legal system and judicial decision-making.

At present, much work has been done on classifying the outcomes of final judgements [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. Classification of final judgements is in principle a useful task, as it may be used to identify important factors and arguments of the court, and thereby may provide insight into the process of decision-making. Some previous research even suggests that one day such classification systems will be able to provide legal assistance [2] and promote accessibility to justice [5], while others suggest that courts, such as the European Court of Human Rights (ECtHR), may eventually use them to prioritise violation cases [1, 4]. Additionally, it has been argued that these types of systems will eventually be able to reduce human error by judges [6]. While each of these suggestions can be scrutinised from a legal perspective, it is clear that there are a large number of potential applications for a successful classification system.

While many of the currently proposed systems show promising results, with a classification performance of about 80 percent correct, this is an overly optimistic view of their performance. One of the reasons for this is that classification performance is generally evaluated by predicting the outcome for a random subset of cases which were already known but not considered when creating the model. While this may seem fair, an arguably more interesting task is to predict future judgements.¹

¹ It is important to note that predicting court judgements is a very different task from actual decision-making. The machine learning systems which are the focus of this study make pattern-based guesses on the basis of (sequences of) words in the text of a case. We discuss the ethical considerations of making this distinction later in the paper.

Importantly, however, all of the aforementioned studies claim to 'predict judicial decisions', which suggests these systems are able to predict (future) rulings on the basis of the available information. Unfortunately, classifying future judgements causes performance to suffer [9]. This lower performance may be caused by, for example, changes in the interpretation of the law, or new social phenomena and developments due to changing societies. In addition, almost all classification systems rely on data about the case which is made available when the outcome of the case is known. Having knowledge about the outcome of a case may influence how the facts of the case are described (e.g., facts irrelevant to the outcome may be removed, or facts identified after an investigation and relevant to the outcome may be highlighted), compared to a situation in which the outcome would not have been known. This would mean that systems which use information composed when the outcome was not known may be disadvantaged compared to systems which use information extracted from documents composed when the outcome was known. One goal of this paper is to evaluate whether this is indeed the case.

A further goal of this paper is to propose making a distinction between forecasting judgements and classifying judgements.
For the former, textual data describing the (facts of the) case is required which was created before the decision was reached, so that the input of the forecasting system is not influenced by the outcome. For the latter, the available textual data about the (facts of the) case was created after the decision was reached. Being explicit about this distinction is important, as many current studies in the field claim to 'predict judicial decisions', which suggests that they are forecasting future judgements, while instead they are classifying previously made judgements. For example, while Medvedeva et al. [9] predict the performance for future cases (by training a model on the basis of data from past cases), this is still a classification task, as the input data is (a subset of) textual data which was created after the decision was reached.

Forecasting thus requires data related to a judgement that are published before the actual judgement was delivered. While the courts publish more and more case law every day [11], only little access is provided to documents that are available before the judgements were made. Forecasting future judgements is therefore impossible for many datasets available online. For this reason, the large majority of machine learning systems for legal data were built to provide a classification of court judgements, as opposed to forecasting judgements.

In this study, however, we concentrate on the ECtHR, as it publishes all of its final judgements online together with many supplementary documents, including admissibility cases, press releases, summaries of cases, et cetera. Several of these documents were created before the decision was reached, and therefore this specific dataset enables both classification and forecasting of the judgements.² Besides evaluating whether forecasting is indeed a harder task than classification by evaluating both sets of algorithms on the same cases, we aim to compare the relative performance of algorithms previously used for classifying court judgements, both for the task of classification as well as forecasting (using the case information published in the communicated cases; see Section 3.2). We do not introduce any new algorithms, as the purpose of this study is to determine the difference in performance for the two types of tasks.

² To enable reproducibility, we provide our dataset containing pending ECtHR applications, as well as admissibility decisions and final judgements of the court, which can be used for a variety of tasks.

In the following section we will discuss earlier work involving the latest attempts at classifying and forecasting court judgements. Section 3 is dedicated to describing the data we have used for our experiments and the larger dataset we release with this paper. In Section 4, we discuss various methods that can be used for forecasting decisions, their power and limitations. In Section 5, we report the results of the experiments that we have conducted for this study. In Section 6, we discuss the results and make suggestions regarding future work. Finally, in Sections 7 and 8, respectively, we make a note about ethical issues when conducting this type of research, and draw conclusions.

2. Related work

In this paper, we exclusively focus on the closed-class (often binary) tasks of outcome classification and outcome forecasting. These tasks are different from charge prediction, which predicts an open set of outcomes, such as the duration of a prison term [12, 13, 14, 15, 16, 17, 18].

While a growing number of courts share their data online, not many courts publish all of their cases online. Furthermore, for many published cases it may be hard to determine a binary or at least a small set of pre-set outcomes, making it hard to use the data from these courts for the type of machine learning models discussed in this study. The most recent papers that use machine learning approaches for classifying judicial decisions therefore generally focus on a limited number of courts, mainly including the US Supreme Court [3, 19, 20], the French Supreme Court [2, 21], and the European Court of Human Rights [5, 6, 7, 9]. A few other courts around the world have also been the focus of this type of analysis, including courts of the UK [22], Canada [8, 23], India [10], and Thailand [24].

There is a tradition of using statistical techniques to analyse the case law of the US Supreme Court (SCOTUS). The advantage of working with the SCOTUS database is that, due to the attention it attracts, all trial data has been systematically and manually annotated with hundreds of variables by legal experts, shortly after the case has been tried. Katz et al. [19] used variables which are in principle available before SCOTUS reached its decision in an approach called extremely randomised trees to forecast the decision of the court. Their approach results in predicting 70% of the cases correctly, which is a somewhat lower performance than achieved by some of the state-of-the-art classification approaches applied to data from other courts. However, as Katz et al. [19] performed the task of forecasting court decisions, rather than classifying court decisions, their lower performance may also be indicative of the potentially higher difficulty of forecasting.

Most of the courts in Europe, unfortunately, do not have the advantage of being able to generate such scrupulously annotated datasets, and often provide no access to all case law. For the European Court of Human Rights, a baseline model for classifying judgements using a so-called Support Vector Machine (SVM) on the basis of n-grams (i.e. sequences of one or more words extracted from the text) has been put forward by Medvedeva et al. [9]. Their model classified court decisions with an average accuracy of 75% for nine articles. Their work extended and corrected some data extraction issues (i.e. arguments of the court referencing the outcome were still included in the input training data, thereby resulting in overly optimistic performance) of an earlier study by Aletras et al. [1]. Additional work on the topic has been conducted by Chalkidis et al. [5], who tested new methods and additional tasks, such as predicting the importance of a court case and identifying the articles that may (not) have been violated. While Chalkidis et al. [5] also trained the system on cases up to 2013 and tested on 2014-2018 (following the approach of Medvedeva et al. [25]), they extracted their data from the judgements, thereby making their approach a classification task instead of a forecasting task.

To our knowledge, only one study has tried to show that using documents from the early stages of the legal process may not always be as useful and predictive as final judgements. Specifically, Branting et al. [26] conducted experiments using statements from attorney misconduct complaints submitted to the Bar Association in the USA. The researchers set up a task of predicting whether the case would be investigated or closed. Using six different machine learning systems, the authors showed that the text of the complaints themselves had very low predictive accuracy (maximum weighted F1-score: 0.52), and that adding additional metadata (i.e. extra information filled in during the complaint, attorney history, sentiment score, etc.) was not very beneficial either (maximum weighted F1-score: 0.55). Only data from later stages in the process, specifically allegation codes assigned by the intake staff, substantially improved results (maximum weighted F1-score: 0.70). Nevertheless, these scores are still substantially lower than the scores reported by many studies classifying final decisions (see above). While Branting et al. [26] also deal with legal documents, these are not judicial decisions, but rather disciplinary proceedings conducted by the Bar Association, and therefore are not directly comparable to the experiments conducted on court judgements.

There are currently only very few studies that focus on forecasting judgements, and most show a lower performance level than studies on judgement classification. Specifically, Sharma et al. [20] and Katz et al. [19] forecast court decisions of the US Supreme Court, reporting an accuracy of around 70%. For courts in Europe, only Waltl et al. [27] forecast the outcome of appeal decisions involving German tax law (reporting a relatively low performance, with an average F-score of 0.57). Furthermore, Medvedeva et al. [28] forecast decisions on the basis of data from the ECtHR with their online system JURI (yielding an accuracy of around 70%).³ The latter study is the approach we follow and extend in this paper. Specifically, we aim to investigate how the more advanced machine learning approaches of Chalkidis et al. [5] and Chalkidis et al. [29] perform when forecasting ECtHR judgements.

³ http://www.jurisays.com

3. Data

3.1. The Court

The European Court of Human Rights was established in 1959 as an international court that deals with individual and State applications claiming violation of various rights laid out in the European Convention on Human Rights (ECHR) [30, 31].
Applications are always brought by an individual/institution or multiple individuals/institutions against a State or multiple States that have ratified the Convention. No applications are considered between individuals, or from a State against an individual. Only five cases of a State against a State have been judged so far in the history of the Court. In 2020 the Court processed 41,700 applications, which were added to already pending applications. A total of 37,289 applications were dismissed based on the admissibility criteria, while the rest were decided by a Chamber or a Grand Chamber (762 cases based on 1,901 applications). From those cases, 880 were found to represent a violation of human rights. The majority of the documents produced by the Court during the process are published online by the Court.⁴

⁴ https://hudoc.echr.coe.int

3.2. Communicated cases

In order to describe the data that we use for our system, it is important to clarify what the application process of the Court entails.

A resident of a country that ratified the ECHR can claim a potential violation within a certain time frame. The application is submitted via mail. On arrival, it is registered by the Court and sent to the legal division that deals with the cases of the particular State, as they are familiar with the legislation of the country. Subsequently, the case is allocated to one of the Court's judicial formations.

Most of the cases are found inadmissible without meriting an investigation, due to not meeting the formal admissibility criteria. For example, the application is often dismissed because the applicant did not file the complaint within the required time frame. A decision regarding these cases is normally rendered by a single judge. If the application was not dismissed directly, the decision on admissibility is taken by a Committee of three judges (in case the Court has dealt with a number of similar cases before) or a Chamber of seven judges. In some cases admissibility decisions may even be made by the Grand Chamber (consisting of seventeen judges). Those usually concern the interpretation of the Convention itself, or cases where there is a risk of inconsistency with previous judgements of the Court.

When an application is judged to be admissible based on formal parameters, the Chamber will examine its merits. Before doing so, the Court will communicate the application to the government that is the potential violator of the rights of the applicant (Rule 60 of the Court – Claims for just satisfaction). This is not done for all applications, but only for a part (approx. 15-20%). Such communicated cases contain the summary of the facts of the case, as well as questions to the government pertaining to the applicant's complaint. This document allows the government concerned to submit its observations on the matter of dispute. These documents are often communicated years before the case is judged, which provides a unique opportunity to use them for predicting the judgements of future cases. Moreover, the questions posed to the State often reflect the Court's legal characterisation of the complaint. See, for instance, a question from a case of Arki against Hungary (application no. 10755/14, communicated on June 6, 2014):

    1. Have the applicants been subjected to inhuman or degrading treatment on account of their cramped prison conditions, in breach of Article 3 of the Convention?

As a consequence, these documents can potentially be used to identify the facts or even (parts of) arguments related to certain judgements before those judgements are made.

Cases concerning repetitive issues do not merit a communicated case, and not every communicated case corresponds directly to a specific judgement. Multiple applications concerning the same events can be merged into a single case during the communication stage, but may be separated during final decision-making. Similarly, multiple applications can be communicated separately, but eventually judged together. Each year thousands of applications are communicated (i.e. 6,442 in 2019 and 7,681 in 2020). Only communicated cases from the year 2000 and later are available online. The Court decides on the order in which the cases are dealt with, based on the importance and urgency of the issues raised (Rule 41 of the Court – Order of Dealing with Cases).⁵ Therefore, the order in which cases are judged does not always follow the chronological order in which they were submitted.

⁵ https://www.echr.coe.int/Documents/Rules_Court_ENG.pdf

For the machine learning systems created in our study, we will only use communicated documents that have judgements or have been found inadmissible based on merit for training and testing.

3.3. Data collection

We collected the data for this study in the following way. We scraped the ECtHR's 'HUDOC' website⁶ and downloaded all communicated cases. We did the same for the judgements and admissibility decision documents, such as the admissibility cases from the Chamber and the Committee. We filtered the cases on the website to only download English versions of the documents. As the filter did not always work adequately, we also filtered using Google's language detection (langdetect) library.⁷ In addition, we extracted all available metadata, such as the application number, state, importance level, et cetera. We used the application number of each communicated case to link the associated documents to corresponding admissibility decisions and judgements. We then extracted the conclusion of the court proceedings ('violation' or 'no violation'), as well as the facts of the cases from the judgement text. We use these facts in a classification model, so we can compare its performance to the performance of a forecasting model using data from the communicated cases.

⁶ https://hudoc.echr.coe.int/
⁷ https://pypi.org/project/langdetect/
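To make the secondary language-filtering step concrete, the sketch below shows how documents can be checked with langdetect. It assumes the downloaded documents are available as plain-text strings keyed by application number; the variable and function names are ours, chosen for illustration, and are not part of HUDOC or the released dataset.

```python
# Minimal sketch of the secondary language filter, assuming `documents`
# maps application numbers to raw document text (illustrative names only).
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def keep_english(documents):
    """Keep only the documents that langdetect identifies as English."""
    english = {}
    for app_no, text in documents.items():
        try:
            if detect(text) == "en":
                english[app_no] = text
        except LangDetectException:
            # Very short or empty texts cannot be detected reliably; skip.
            pass
    return english
```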
While the facts in communicated cases are the summary of the events as described by the applicant, the facts that end up in the final judgement are compiled after the investigation and therefore also include the side of the State. We only use the facts of the case from the final judgements, since these are most comparable to the communicated cases. Specifically, these have also been argued to potentially be available before the outcome was reached [9] and do not contain references to the outcome [9, 5]. This also mirrors the setup in Chalkidis et al. [5] that we follow.

To enable a fair comparison, the cases (but not the extracted information about these cases) used for training and testing are identical for both models. We assume that cases that were found to be inadmissible based on merit are similar to cases that were judged as having no violation. From a legal point of view, these cases can be characterised as simply more clear 'non-violation' cases. The Court has made judgements on similar applications many times before, and hence these do not merit a full judgement. For cases that went through to the final judgement stage, we assign the 'violation' label to all cases that were judged to show a violation of at least one article of the ECHR.

As we mentioned before, individual communicated cases do not always directly correspond to unique cases which received a judgement or admissibility decision, as communicated cases can be split or merged during the process. For the split cases, the assigned label of the associated communicated case depended on whether any of the split cases resulted in a violation of at least one article ('violation' label) or not ('non-violation' label, i.e. none of the split cases exhibits a violation of any article). To ensure the set of cases considered for the classification task and the forecasting task is identical, we randomly selected a single judged case (from the associated split cases) whose assigned label matched the label assigned to the communicated case. For judgements associated with multiple merged communicated cases, we randomly chose one of the communicated cases and removed the rest. Finally, duplicate cases and judged cases which did not have (correctly formatted) facts were excluded from the dataset used for both tasks. In this way, the set of cases considered for the classification task and the forecasting task is identical.

Subsequently, we split the data into training and test sets (on average a 77%-23% split). We trained each system three times, with different setups (with a decreasing amount of training data) to assess the robustness of the results. Setup 1 concerns model training with cases that received judgement in the years 2000-2019, whereas model testing was conducted with cases that received judgement in the year 2020. Setup 2 uses 2000-2018 data for training and 2019 data for testing. Setup 3 uses 2000-2017 data for training and 2018 data for testing. Each setup is used once for forecasting judgements using data from the communicated cases, and once for classifying judgements using data from the facts extracted from the final judgement. As in each setup the number of violation cases exceeded the number of non-violation cases, we balanced the training set in each setup by removing older violation cases until the same number of documents was present for each label. Table 1 shows the number of documents available for training and testing for each setup (a code sketch of this split follows at the end of this subsection).

Table 1
Distribution of training and testing data for different setups.

                      setup 1: 2020   setup 2: 2019   setup 3: 2018
train (balanced)           2264            1806            1386
test (no violation)         167             229             210
test (violation)            342             311             309

The data used for the two different tasks differs somewhat. For the communicated cases, we used all data available (i.e. the facts and the questions as they were presented in the text), whereas for the judgements, we only used data from the facts section. In general, the average number of words associated with the extracted facts from each judgement (i.e. 2000 words) is not much higher than the number of words of the associated communicated case (i.e. 1800 words).
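A minimal sketch of one such setup is given below, assuming each case is represented as a dictionary with 'year', 'label' (1 = violation, 0 = no violation) and 'text' fields; these field names are ours, chosen purely for illustration.

```python
import random

def make_setup(cases, test_year):
    """Train on 2000 up to (but excluding) test_year, test on test_year."""
    train = [c for c in cases if 2000 <= c["year"] < test_year]
    test = [c for c in cases if c["year"] == test_year]

    # Balance the training set by removing the oldest violation cases
    # until both labels occur equally often (violations are assumed to
    # be the majority class, as in each of our setups).
    violations = sorted((c for c in train if c["label"] == 1),
                        key=lambda c: c["year"])
    non_violations = [c for c in train if c["label"] == 0]
    balanced = violations[len(violations) - len(non_violations):] + non_violations
    random.shuffle(balanced)
    return balanced, test

# Setup 1: train on 2000-2019 judgements, test on 2020.
# train_cases, test_cases = make_setup(cases, test_year=2020)
```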
We provide raw text, the meta- associated with multiple merged communicated cases, data (e.g., date, court-assigned importance, parties, and we randomly chose one of the communicated cases and section) as well as the preprocessed text of communi- removed the rest. Finally, duplicate cases and judged cated cases (split into facts and questions), admissibility cases which did not have (correctly formatted) facts were decisions (extracted facts) and final judgements (split into excluded from the dataset used for both tasks. In this sections: Procedure, Facts, Relevant domestic law, Law way, the set of cases considered for the classification task - including arguments of the court, Outcome, and Dis- and the forecasting task is identical. senting opinions) in order to facilitate further research in Subsequently, we split the data into training and test ECtHR judgement forecasting and classification. In addi- sets (on average a 77%-23% split). We trained each sys- tion, the case numbers are linked throughout each stage tem three times, with different setups (with a decreasing of the court proceedings (where applicable). This dataset amount of training data) to assess the robustness of the may be suitably used for a number of classification tasks results. Setup 1 concerns model training with cases that in legal analysis, including judgement classification based received judgement in years 2000-2019, whereas model on facts (using Facts and possibly Procedure sections) testing was conducted with cases that received judge- and/or arguments (using the Law sections). ment in the year 2020. Setup 2 uses 2000-2018 data for training and 2019 data for testing. Setup 3 uses 2000- 2017 data for training and 2018 data for testing. Each 4. Methodology setup is used once for forecasting judgements using data from the communicated cases, and once for classifying As we mentioned before, the approach most relevant judgements using data from the facts extracted from the work for our study is that of Chalkidis et al. [5]. Specifi- final judgement. As in each setup the number of viola- cally, in one of their tasks they focused on classifying the tion cases exceeded the number of non-violation cases, court judgements depending on whether at least one arti- we balanced the training set in each setup by removing cle of the ECHR was violated or not.9 In addition, they ex- older violation cases until the same number of documents perimented with using anonymized vs. non-anonymized was present for each label. Table 1 shows the number 8 https://drive.google.com/drive/folders/1ltIpHlcqcRlT_ of documents available for training and testing for each JDebHsyLgvgoa4Vbxo8?usp=sharing setup. 9 The purpose of their Chalkidis et al. [5] second task was to iden- The data used for the two different tasks differs some- tify all of the violated articles for a single court document (i.e. multi- what. For the communicated cases, we used all data label classification). However, as the involved articles are known as soon as the application is submitted, it is not clear what the practical available (i.e. the facts and the questions as they were use is of predicting the list of articles potentially violated. A realistic presented in the text), whereas for the judgements, we scenario for the ECtHR would only involve deciding whether or not only used data from the facts section. In general, the a given article was violated. data. While we perform the same task as Chalkidis et al. texts from different sources. 
BERT, or Bidirectional Encoder Representations from Transformers [32], is a popular pre-trained transformer-based [33] machine learning technique resulting in a so-called language model. The method also allows fine-tuning the language model for a specific task, i.e. adapting the pre-trained model to the target task, in our case classifying and forecasting ECtHR judgements.

To use BERT on long case documents without having a maximum text length restriction, H-BERT [5] processes each fact separately and combines them by using a self-attention layer to generate an embedding for a case. This resulting embedding is then used for classification and forecasting.¹⁰ Instead of the standard BERT model (which [5] reported to have sub-par performance), we used LEGAL-BERT [29] in our experiments. LEGAL-BERT is a BERT model which was pre-trained on legal texts from different sources.

¹⁰ While BERT can process each case by including up to 512 tokens (i.e. meaningful word parts), our H-BERT implementation can use up to 1024 tokens (i.e. 128 tokens for each of the first eight facts).

BERT and many of its variations, including H-BERT, have been shown to result in substantial improvements compared to the state of the art in a large variety of text classification tasks. Specifically, Chalkidis et al. [5] have shown that using H-BERT resulted in a very high performance (macro F-score of 0.82) for the binary task (violation of at least one article of the ECHR vs. no violation), and an even higher macro F-score of 0.83 was obtained for LEGAL-BERT on the same dataset [29].
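To make the hierarchical setup concrete, the sketch below shows one way such a model can be assembled with PyTorch and the Hugging Face transformers library. It reflects our reading of the architecture described above; the attention-head count and the mean-pooling over facts are assumptions for illustration, not specifics reported by Chalkidis et al. [5].

```python
import torch.nn as nn
from transformers import AutoModel

class HierarchicalBert(nn.Module):
    """Encode each fact with (LEGAL-)BERT, combine facts via self-attention."""

    def __init__(self, model_name="nlpaueb/legal-bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size  # 768 for BERT-base models
        self.fact_attention = nn.MultiheadAttention(hidden, num_heads=8,
                                                    batch_first=True)
        self.classifier = nn.Linear(hidden, 2)  # violation vs. no violation

    def forward(self, input_ids, attention_mask):
        # input_ids: (batch, n_facts, seq_len), e.g. 8 facts of 128 tokens.
        batch, n_facts, seq_len = input_ids.shape
        cls = self.bert(
            input_ids=input_ids.view(-1, seq_len),
            attention_mask=attention_mask.view(-1, seq_len),
        ).last_hidden_state[:, 0]               # one [CLS] vector per fact
        facts = cls.view(batch, n_facts, -1)
        attended, _ = self.fact_attention(facts, facts, facts)
        case_embedding = attended.mean(dim=1)   # pool the facts into one case
        return self.classifier(case_embedding)
```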
In the following, we report the results per class for each model. Our main evaluation metric is the macro F-score: the unweighted mean of the per-class F1-scores, where the F1-score of a class is the harmonic mean of its precision and recall.¹¹ Precision is the percentage of cases given a certain label (i.e., 'violation' or 'no violation') that were correct. Recall is the percentage of cases having a certain (correct) label that were identified as such.

¹¹ The exact definition of the F-score can be found at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html.
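These metrics correspond directly to scikit-learn's implementations; the short example below, with toy labels of our own invention, shows how the reported numbers can be computed.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy example: 1 = violation, 0 = no violation.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1]

macro_f1 = f1_score(y_true, y_pred, average="macro")   # unweighted class mean
per_class_precision = precision_score(y_true, y_pred, average=None)
per_class_recall = recall_score(y_true, y_pred, average=None)
```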
5. Results

We started our experiments with setup 1, testing on all data from 2020. To our surprise, the results for classifying the final judgements were very low compared to Chalkidis et al. [5] (see Tables 2 and 3 for the performance per class). In contrast to our expectations, forecasting final judgements on the basis of communicated cases instead of on the basis of the facts of the final judgements yielded better results when using H-BERT. Note that, compared to Chalkidis et al. [5], our training set was much smaller (2264 cases vs. 7100 cases, respectively), since not all cases are communicated by the Court.

Table 2
Performance (precision, recall, F1-score and accuracy) for Linear SVC, H-BERT, and LEGAL-BERT models per class for final judgement classification, trained on cases between 2000 and 2019 and tested on cases decided in 2020.

2020 - Final judgements
                          P      R      F1     #
SVM         no viol.     0.46   0.93   0.62   167
            violation    0.93   0.46   0.62   342
            macro avg.   0.70   0.70   0.62   509
            accuracy                   0.62   509
H-BERT      no viol.     0.42   0.92   0.58   167
            violation    0.91   0.38   0.53   342
            macro avg.   0.66   0.65   0.56   509
            accuracy                   0.56   509
LEGAL-BERT  no viol.     0.42   0.90   0.58   167
            violation    0.89   0.40   0.55   342
            macro avg.   0.66   0.65   0.57   509
            accuracy                   0.57   509

Table 3
Performance (precision, recall, F1-score and accuracy) for Linear SVC, H-BERT, and LEGAL-BERT models per class for forecasting judgements, trained on communicated cases between 2000 and 2019 and tested on communicated cases that received a judgement in 2020.

2020 - Communicated cases
                          P      R      F1     #
SVM         no viol.     0.47   0.51   0.49   167
            violation    0.75   0.72   0.73   342
            macro avg.   0.61   0.61   0.61   509
            accuracy                   0.65   509
H-BERT      no viol.     0.45   0.61   0.52   167
            violation    0.77   0.63   0.69   342
            macro avg.   0.61   0.62   0.60   509
            accuracy                   0.62   509
LEGAL-BERT  no viol.     0.42   0.54   0.47   167
            violation    0.74   0.63   0.68   342
            macro avg.   0.58   0.58   0.57   509
            accuracy                   0.60   509

However, when trying setup 2, where we trained using less data (i.e. until 2018) and tested on all data of 2019, results were as expected. Specifically, macro F-scores ranged between 0.79 and 0.92 for the classification task (see Table 4), and performance was much lower for the forecasting task, with macro F-scores ranging from 0.60 to 0.65 (see Table 5).

Table 4
Performance (precision, recall, F1-score and accuracy) for Linear SVC, H-BERT, and LEGAL-BERT models per class for final judgement classification, trained on cases between 2000 and 2018 and tested on cases decided in 2019.

2019 - Final judgements
                          P      R      F1     #
SVM         no viol.     0.69   0.95   0.80   229
            violation    0.95   0.68   0.79   311
            macro avg.   0.82   0.81   0.79   540
            accuracy                   0.79   540
H-BERT      no viol.     0.90   0.92   0.91   229
            violation    0.94   0.93   0.93   311
            macro avg.   0.92   0.92   0.92   540
            accuracy                   0.92   540
LEGAL-BERT  no viol.     0.87   0.90   0.88   229
            violation    0.92   0.90   0.91   311
            macro avg.   0.90   0.91   0.90   540
            accuracy                   0.90   540

Table 5
Performance (precision, recall, F1-score and accuracy) for Linear SVC, H-BERT, and LEGAL-BERT models per class for forecasting judgements, trained on communicated cases between 2000 and 2018 and tested on communicated cases that received a judgement in 2019.

2019 - Communicated cases
                          P      R      F1     #
SVM         no viol.     0.62   0.53   0.57   229
            violation    0.69   0.77   0.73   311
            macro avg.   0.66   0.65   0.65   540
            accuracy                   0.67   540
H-BERT      no viol.     0.57   0.67   0.61   229
            violation    0.72   0.63   0.67   311
            macro avg.   0.64   0.65   0.64   540
            accuracy                   0.65   540
LEGAL-BERT  no viol.     0.55   0.50   0.52   229
            violation    0.66   0.70   0.68   311
            macro avg.   0.60   0.60   0.60   540
            accuracy                   0.61   540

To determine which of the two setups resulted in representative results, we conducted a final experiment (setup 3), training with even less data (i.e. until 2017) and testing on all data of 2018. The results showed a similar pattern (with slightly better performance, despite the reduced amount of training data) as the results of setup 2 (2019). See Tables 6 and 7 for an overview of these results.
Table 6
Performance (precision, recall, F1-score and accuracy) for Linear SVC, H-BERT, and LEGAL-BERT models per class for final judgement classification, trained on cases between 2000 and 2017 and tested on cases decided in 2018.

2018 - Final judgements
                          P      R      F1     #
SVM         no viol.     0.67   0.91   0.77   210
            violation    0.92   0.70   0.79   309
            macro avg.   0.79   0.80   0.78   519
            accuracy                   0.78   519
H-BERT      no viol.     0.86   0.72   0.78   210
            violation    0.83   0.92   0.87   309
            macro avg.   0.84   0.82   0.83   519
            accuracy                   0.84   519
LEGAL-BERT  no viol.     0.88   0.78   0.83   210
            violation    0.86   0.93   0.89   309
            macro avg.   0.87   0.85   0.86   519
            accuracy                   0.87   519

Table 7
Performance (precision, recall, F1-score and accuracy) for Linear SVC, H-BERT, and LEGAL-BERT models per class for forecasting judgements, trained on communicated cases between 2000 and 2017 and tested on communicated cases that received a judgement in 2018.

2018 - Communicated cases
                          P      R      F1     #
SVM         no viol.     0.62   0.55   0.58   210
            violation    0.72   0.77   0.74   309
            macro avg.   0.67   0.66   0.66   519
            accuracy                   0.68   519
H-BERT      no viol.     0.60   0.63   0.61   210
            violation    0.73   0.71   0.72   309
            macro avg.   0.67   0.67   0.67   519
            accuracy                   0.68   519
LEGAL-BERT  no viol.     0.59   0.52   0.55   210
            violation    0.69   0.75   0.72   309
            macro avg.   0.64   0.63   0.64   519
            accuracy                   0.66   519

When running the same experiments using successively smaller datasets (i.e. testing on data from 2017 and 2016), the same pattern is visible as for setups 2 and 3. That is, performance when classifying final judgements is much higher than when forecasting final judgements. Table 8 shows the macro F-scores for both tasks for all years (of the test set) ranging from 2016 to 2020 and all three algorithms. Besides showing that classification performance is generally (except for 2020) higher than forecasting performance, these results also show that while H-BERT and LEGAL-BERT generally outperform the SVM in classification (except for 2020), they do not improve over the SVM in forecasting.

Table 8
Macro F-scores for Linear SVC, H-BERT, and LEGAL-BERT models for both tasks between 2016 and 2020, including size of training and testing sets.

F-score (macro)              2020   2019   2018   2017   2016
Training set size            2264   1806   1386    976    640
Test set size                 509    540    519    503    447
SVM (forecasting)            0.61   0.65   0.66   0.65   0.64
H-BERT (forecasting)         0.60   0.64   0.67   0.66   0.66
LEGAL-BERT (forecasting)     0.57   0.60   0.64   0.64   0.58
SVM (classification)         0.62   0.79   0.78   0.78   0.75
H-BERT (classification)      0.56   0.92   0.83   0.84   0.82
LEGAL-BERT (classification)  0.57   0.90   0.86   0.84   0.82

The SVM model allows us to inspect the top coefficients (weights) of the n-grams assigned by the system. We observe that for final judgements the system often prioritises longer n-grams (the average length of the 100 top features is 2.475), while for communicated cases it prioritises unigrams and common two-word collocations such as "public prosecutor" or "minor offences" (the average length of the 100 top features is 1.405).
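This inspection is straightforward for a linear model; a minimal sketch, reusing the pipeline names from the Section 4 sketch, is shown below.

```python
import numpy as np

def top_features(pipeline, n=100):
    """Return the n n-grams with the largest absolute Linear SVC weights."""
    vocab = np.array(
        pipeline.named_steps["vectorizer"].get_feature_names_out())
    weights = pipeline.named_steps["classifier"].coef_[0]
    order = np.argsort(np.abs(weights))[::-1][:n]
    # Positive weights pull towards one class, negative towards the other.
    return list(zip(vocab[order], weights[order]))
```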
6. Discussion

Our results clearly confirm our intuition regarding the increased difficulty of the task of forecasting judgements as opposed to classifying judgements. Admittedly, the tasks are conceptually very different, and therefore comparing them in terms of accuracy may not be entirely fair. Nevertheless, both fall under 'predicting court decisions' in the existing literature. Our results illustrate that predicting court decisions which have not been made yet is a much harder task than current academic research may suggest.

One potential explanation for the higher performance of the classification approach compared to the forecasting approach may be the higher amount of data (i.e. an average of 2000 words for the facts part of the judgement versus 1800 words for the communicated case). Since LEGAL-BERT and H-BERT have a limited input length of up to 512 or 1024 tokens (respectively), this difference will not play a role for these models. However, this is different for the SVM, which does not have such a limit. Consequently, we evaluated an SVM on 'shortened' facts of the final judgements. Specifically, we removed the facts from the middle of the text (under the presumption that the most important information is present at the beginning and at the end) until the text was approximately the same length as the text of the corresponding communicated case. This change, however, did not affect the performance, as the SVM on this trimmed data yielded macro F-scores of 0.61, 0.83 and 0.77 for 2020, 2019, and 2018, respectively (compared to 0.62, 0.79 and 0.78). This suggests that the facts are formulated in a way that is affected by the final ruling, rather than that there is a tangible benefit of the higher amount of data.
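The trimming procedure can be expressed compactly; the sketch below is a simplified word-level version, assuming whitespace tokenisation (the actual unit of trimming in our experiment may differ).

```python
def trim_middle(text, target_length):
    """Keep the start and end of `text`, dropping words from the middle."""
    words = text.split()
    if len(words) <= target_length:
        return text
    head = target_length // 2
    tail = target_length - head
    return " ".join(words[:head] + words[-tail:])

# e.g. shorten judgement facts to the length of the communicated case:
# trimmed = trim_middle(judgement_facts, len(communicated_text.split()))
```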
We should also take into account that the communicated case is a summary of an applicant's complaint. As a result, it only reflects this party's side of the events, and may be subjective and incomplete. After sending the communicated case to the State involved, the Court conducts its investigation and inspects the side of the State as well. Consequently, the final judgement contains a more thorough and objective description of the facts that takes the sides of both parties into account. This explains why the facts available in communicated cases can differ considerably from the set of facts presented in the final judgement.

This bias towards a violation of human rights can also be observed in the results. For the forecasting task, all models show a higher performance when predicting the 'violation' label than when predicting the 'non-violation' label (see Tables 3, 5, and 7). In contrast, the gap in performance between the two labels for the classification task is considerably smaller (see Tables 2, 4, and 6), which confirms the intuition that the descriptions of the facts in final judgements are a better representation of the events and therefore better predictors of the outcome. Nevertheless, for the 2018 and 2019 data, the performance when predicting the 'violation' label using the communicated cases data (i.e. the forecasting task) is still lower than the overall performance (or the 'violation' label performance) using the extracted facts from the final judgements (i.e. the classification task).

The only case where forecasting judgements shows a higher performance than classifying judgements is on the 2020 test data. However, this is caused by the much lower than usual classification performance. Unfortunately, we have no explanation for this pattern, despite the effort we spent on trying to investigate whether the 2020 data showed deviating patterns compared to the data from earlier years. For example, the average length of the 2020 cases and the overall vocabulary are consistent with the previous years, as is the distribution of the cases between different States and therefore different Chambers. The Court judged only slightly (4%) fewer cases in 2020 than in 2019, and did not adopt any new policies compared to the previous years. There is no indication that the Court used a different selection approach for the cases it ruled on. Since the cases originated in the years before 2020, it is also unlikely that this pattern has any relationship with human rights violations related to COVID-19. Finally, the format of the case law has also remained the same. For now, we are therefore forced to consider performance on the 2020 data (like so many other things in 2020) an anomaly. Whether this deviating pattern will continue in 2021 remains to be evaluated.

6.1. Future work

We have discussed a range of approaches to forecast outcomes of pending applications. Each of these could be improved through more careful tuning, pre-processing, data selection, feature selection, et cetera. Furthermore, additional classification or forecasting algorithms could be used as well. However, this was not the goal of the present paper. By releasing our dataset, together with the baselines reported in this paper, we hope to have provided a new starting point for the task of forecasting ECtHR judgements.

Regarding future research, it would be interesting to assess whether selecting the last tokens, or tokens from specifically chosen facts, would be beneficial for BERT-like models. For example, these models might yield better results, as initial facts generally are about the procedure and the applicants themselves, while facts from the end of the document often are more closely related to the events relating to the alleged violation of human rights. Due to limited available data, we have only investigated whether or not a case violated any article of the ECHR. However, it would be interesting to assess how the difference in performance between forecasting and classification would be affected when individual articles are investigated.

While we can forecast pending applications using data from communicated cases, this does not allow us to forecast the judgements for all future cases, as this data may not always be available (e.g., not all cases are communicated to the State). Forecasting using other data available before the judgement is known (i.e. from other sources) is likely to be even harder, as the uniform documents created by the Court for the communicated cases are likely beneficial.

While predicting judgements is an interesting task in itself, it is beneficial to also gain insight into how the system reaches a certain outcome, and therefore to take a step toward explainable AI [34, 35] and large-scale automatic legal analysis. This requires, for example, understanding which facts lead to which judgement. Particularly for the classification task, where determining a judgement of the court which is already known is of no practical use, determining the basis of the classification is important.

Several methods that are often used in classification tasks allow determining the basis of the classification (to some extent). Linear SVC, for example, allows the inspection of its coefficients to evaluate which words and phrases are more characteristic of one class than another (see the sketch in Section 5). Medvedeva et al. [28] also suggest evaluating such a system at the sentence level to identify and highlight the sentences that have the highest probability of belonging to a specific class. Furthermore, the architecture of H-BERT, for example, allows one to assess which of the eight included facts (or questions) had the largest impact on the classification, on the basis of so-called attention [33]. Unfortunately, LEGAL-BERT by itself cannot be used for this. While it often produces very high scores, especially for final judgement classification, and may function as a good reference point for high classification performance, one cannot see within the black box.

7. Ethical considerations

We believe it is important to emphasise that our goal with this work is only to (try to) forecast and classify court judgements. Our interest is scientific and is focused on assessing whether Natural Language Processing systems are able to identify certain patterns in legal judgements. We do not think that any of the models described in this paper can or should be used for making decisions in courts, especially those where human rights are at stake (which concerns the majority of the courts around the world). Moreover, we are opposed to the use of such models in other high-stakes situations, due to the inability of these types of models to deal with new legal developments and interpretations, and previously unobserved issues [36, 37], their lacking transparency [38, 39, 40], and cybersecurity concerns [41].
8. Conclusion

In this paper we have proposed to make a distinction between forecasting court judgements and classifying judgements. Forecasting judgements is based on data which is available before the outcome is known (such as the communicated cases of the ECtHR), whereas classifying judgements is based on (a subset of) data compiled when the outcome was known (such as the facts from the ECtHR ruling). Making this distinction is important: earlier research [26] and the experiments conducted in this paper show that performance is substantially lower when forecasting future judgements than when classifying decisions which were already made by the court, while the terminology of current papers (i.e. 'predicting court judgements') suggests a forecasting task whereas it actually most often is a task of classifying final judgements. Classification performance should therefore not be used as an indication of how well these types of systems are able to forecast judgements of the court. Interestingly, while more sophisticated models appeared to be beneficial for the simpler classification task, this was not the case for the harder forecasting task.

References

[1] N. Aletras, D. Tsarapatsanis, D. Preoţiuc-Pietro, V. Lampos, Predicting judicial decisions of the European Court of Human Rights: A natural language processing perspective, PeerJ Computer Science 2 (2016).
[2] O.-M. Şulea, M. Zampieri, M. Vela, J. van Genabith, Predicting the law area and decisions of French Supreme Court cases, in: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, INCOMA Ltd., Varna, Bulgaria, 2017, pp. 716–722. doi:10.26615/978-954-452-049-6_092.
[3] A. Kaufman, P. Kraft, M. Sen, Machine learning, text data, and Supreme Court forecasting, Project Report, Harvard University (2017).
[4] C. O'Sullivan, J. Beel, Predicting the outcome of judicial decisions made by the European Court of Human Rights, in: AICS 2019 – 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science, 2019.
[5] I. Chalkidis, I. Androutsopoulos, N. Aletras, Neural legal judgment prediction in English, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 4317–4323. doi:10.18653/v1/P19-1424.
[6] A. Kaur, B. Bozic, Convolutional neural network-based automatic prediction of judgments of the European Court of Human Rights, in: AICS, 2019.
[7] C. Condevaux, Neural legal outcome prediction with partial least squares compression, Stats 3 (2020) 396–411.
[8] O. Salaün, P. Langlais, A. Lou, H. Westermann, K. Benyekhlef, Analysis and multilabel classification of Quebec court decisions in the domain of housing law, in: International Conference on Applications of Natural Language to Information Systems, Springer, 2020, pp. 135–143.
[9] M. Medvedeva, M. Vols, M. Wieling, Using machine learning to predict decisions of the European Court of Human Rights, Artificial Intelligence and Law 28 (2020) 237–266.
[10] R. A. Shaikh, T. P. Sahu, V. Anand, Predicting outcomes of legal cases based on legal factors using classifiers, Procedia Computer Science 167 (2020) 2393–2402.
[11] M. Marković, S. Gostojić, Open judicial data: A comparative analysis, Social Science Computer Review (2018).
[12] B. Luo, Y. Feng, J. Xu, X. Zhang, D. Zhao, Learning to predict charges for criminal cases with legal basis, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 2017, pp. 2727–2736. doi:10.18653/v1/D17-1289.
[13] H. Zhong, Z. Guo, C. Tu, C. Xiao, Z. Liu, M. Sun, Legal judgment prediction via topological learning, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 2018, pp. 3540–3549. doi:10.18653/v1/D18-1390.
[14] X. Jiang, H. Ye, Z. Luo, W. Chao, W. Ma, Interpretable rationale augmented charge prediction system, in: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, Santa Fe, New Mexico, 2018, pp. 146–151.
[15] Y. Li, T. He, G. Yan, S. Zhang, H. Wang, Using case facts to predict penalty with deep learning, in: International Conference of Pioneering Computer Scientists, Engineers and Educators, Springer, 2019, pp. 610–617.
[16] H. Ye, X. Jiang, Z. Luo, W. Chao, Interpretable charge predictions for criminal cases: Learning to generate court views from fact descriptions, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, 2018, pp. 1854–1864. doi:10.18653/v1/N18-1168.
[17] H. Chen, D. Cai, W. Dai, Z. Dai, Y. Ding, Charge-based prison term prediction with deep gating network, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 6361–6366. doi:10.18653/v1/D19-1667.
[18] W. Chao, X. Jiang, Z. Luo, Y. Hu, W. Ma, Interpretable charge prediction for criminal cases with dynamic rationale attention, Journal of Artificial Intelligence Research 66 (2019) 743–764.
[19] D. M. Katz, M. J. Bommarito II, J. Blackman, A general approach for predicting the behavior of the Supreme Court of the United States, PLoS ONE 12 (2017).
[20] R. D. Sharma, S. Mittal, S. Tripathi, S. Acharya, Using modern neural networks to predict the decisions of Supreme Court of the United States with state-of-the-art accuracy, in: International Conference on Neural Information Processing, Springer, 2015, pp. 475–483.
[21] O.-M. Şulea, M. Zampieri, S. Malmasi, M. Vela, L. P. Dinu, J. van Genabith, Exploring the use of text classification in the legal domain, in: Proceedings of the 2nd Workshop on Automated Semantic Analysis of Information in Legal Texts (ASAIL 2017), 2017.
[22] B. Strickson, B. De La Iglesia, Legal judgement prediction for UK courts, in: Proceedings of the 2020 3rd International Conference on Information Science and System, 2020, pp. 204–209.
[23] H. Westermann, V. R. Walker, K. D. Ashley, K. Benyekhlef, Using factors to predict and analyze landlord-tenant decisions to increase access to justice, in: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law, ICAIL '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 133–142. doi:10.1145/3322640.3326732.
[24] K. Kowsrihawat, P. Vateekul, P. Boonkwan, Predicting judicial decisions of criminal cases from Thai Supreme Court using bi-directional GRU with attention mechanism, in: 2018 5th Asian Conference on Defense Technology (ACDT), IEEE, 2018, pp. 50–55.
[25] M. Medvedeva, M. Vols, M. Wieling, Judicial decisions of the European Court of Human Rights: Looking into the crystal ball, in: Proceedings of the Conference on Empirical Legal Studies, 2018.
[26] K. Branting, C. Balhana, C. Pfeifer, J. Aberdeen, B. Brown, Judges are from Mars, pro se litigants are from Venus: Predicting decisions from lay text, in: Legal Knowledge and Information Systems: JURIX 2020: The Thirty-third Annual Conference, Brno, Czech Republic, December 9-11, 2020, volume 334, IOS Press, 2020, p. 215.
[27] B. Waltl, G. Bonczek, E. Scepankova, J. Landthaler, F. Matthes, Predicting the outcome of appeal decisions in Germany's tax law, in: International Conference on Electronic Participation, Springer, 2017, pp. 89–99.
[28] M. Medvedeva, X. Xu, M. Wieling, M. Vols, JURI SAYS: Prediction system for the European Court of Human Rights, in: Legal Knowledge and Information Systems: JURIX 2020: The Thirty-third Annual Conference, Brno, Czech Republic, December 9-11, 2020, volume 334, IOS Press, 2020, p. 277.
[29] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 2898–2904. doi:10.18653/v1/2020.findings-emnlp.261.
[30] S. Greer, J. Gerards, R. Slowe, Human Rights in the Council of Europe and the European Union: Achievements, Trends and Challenges, Cambridge Studies in European Law and Policy, Cambridge University Press, 2018. doi:10.1017/9781139179041.
[31] D. J. Harris, M. O'Boyle, E. Bates, C. Buckley, Harris, O'Boyle & Warbrick: Law of the European Convention on Human Rights, Oxford University Press, USA, 2014.
[32] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.
[34] T. Bench-Capon, The need for good old fashioned AI and law, in: International Trends in Legal Informatics: A Festschrift for Erich Schweighofer, Editions Weblaw, Bern, 2020, pp. 23–36.
[35] J. Collenette, K. Atkinson, T. Bench-Capon, An explainable approach to deducing outcomes in European Court of Human Rights cases using ADFs, Frontiers in Artificial Intelligence and Applications 326 (2020) 21–32.
[36] R. W. Campbell, Artificial intelligence in the courtroom: The delivery of justice in the age of machine learning, Colo. Tech. LJ 18 (2020) 323.
[37] R. Berk, Machine Learning Risk Assessments in Criminal Justice Settings, Springer, 2019.
[38] A. Završnik, Criminal justice, artificial intelligence systems, and human rights, in: ERA Forum, volume 20, Springer, 2020, pp. 567–583.
[39] F. Thomsen, Iudicium ex machinae – the ethical challenges of automated decision-making in criminal sentencing, in: J. Roberts, J. Ryberg (Eds.), Principled Sentencing and Artificial Intelligence, Oxford University Press, forthcoming.
[40] A. Deeks, N. Lubell, D. Murray, Machine learning, artificial intelligence, and the use of force by states, J. Nat'l Sec. L. & Pol'y 10 (2019) 1.
[41] P. M. Nichols, Bribing the machine: Protecting the integrity of algorithms as the revolution begins, American Business Law Journal 56 (2019) 771–814.