=Paper=
{{Paper
|id=Vol-2962/paper13
|storemode=property
|title=Analysis and Prediction of Legal Judgements in the Slovak Criminal Proceedings
|pdfUrl=https://ceur-ws.org/Vol-2962/paper13.pdf
|volume=Vol-2962
|authors=Dávid Varga,Zoltán Szoplák,Pavol Sokol,Stanislav Krajči,Peter Gurský
|dblpUrl=https://dblp.org/rec/conf/itat/VargaSSKG21
}}
==Analysis and Prediction of Legal Judgements in the Slovak Criminal Proceedings ==
Dávid Varga, Zoltán Szoplák, Stanislav Krajči, Pavol Sokol, and Peter Gurský

Institute of Computer Science, Faculty of Science, P. J. Šafárik University in Košice, Jesenná 5, 040 01 Košice, Slovakia, www.ics.science.upjs.sk

david.varga@student.upjs.sk, zoltan.szoplak@student.upjs.sk, stanislav.krajci@upjs.sk, pavol.sokol@upjs.sk, peter.gursky@upjs.sk

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: This paper uses machine learning to analyze criminal judgements in the Slovak Republic to determine their adequacy and set a baseline for predicting their outcomes. First, we summarize past and recent advancements in predicting verdicts and other attributes of legal text written in different languages. We then demonstrate data preparation of all publicly available Slovak judgements, extraction of their verdicts and separation into main parts using a Slovak word inflexion dictionary called Tvaroslovník. Later we use this data to classify the judgements into acquittal or conviction using several known machine learning methods, ranging from simple statistical methods such as SVM and random forests to deep learning networks based on convolution, recurrence and their combinations. We evaluate their efficiency, identify and analyze the terms that are highly correlated with each result class, and offer a hypothesis as to why these terms are correlated with these results. We have found that a sequential input of word2vec embeddings combined with convolution-based deep learning methods produces the best results, achieving over 99% accuracy.

Keywords: judgement, reasoning, text analysis, Slovak, classification, verdict, machine learning

===1 Introduction===

Since 2016, the Ministry of Justice of the Slovak Republic has published more than 3 million publicly available court decisions online. These court decisions contain some structured data, e.g. the name of the judge or court, but mostly free text. This free text contains the most relevant parts of court decisions: the final verdict and the reasoning behind the verdict. We aim to find a method to identify court decisions that are not sufficiently reasoned and provide such decisions to lawyers for a more detailed analysis.

In this paper, we examine several statistical and machine learning methods of text representation and classification, intending to correctly predict court decisions based on the reasoning alone.

After our model is trained, the reasoning and the verdict of the court decision will become inputs for this model. The model predicts the verdict from the input justification and compares it with the true verdict received at the input. Subsequently, two situations can occur. If the predicted verdict is identical to the true verdict, we will take this court decision as sufficiently reasoned. If the predicted verdict differs from the true verdict, we will take such a decision as insufficiently reasoned. The model justifies its prediction by extracting the parts of the court’s reasoning that most influenced the prediction of the verdict. This paper is based on the research stated in Sokol et al. [29], in which the authors formulated their conclusions on the current state and developing trends in the use of digital evidence in judicial proceedings and the usage of the in dubio pro reo principle in criminal proceedings.

To achieve a better understanding of how judgments are reasoned, this paper aims to:

* create a classification model which can predict the verdict of the judgement from its reasoning part;
* identify significant terms in the judgments’ reasonings closely related to the results of judgements in the criminal proceedings (innocence or guilt).

This paper is organized into six sections. Section 2 focuses on the review of past and recent advancements in the classification of legal documents. Section 3 is devoted to data preprocessing and judgement extraction. Section 4 describes the different methods of text representation and the learning algorithms that will use them. The results produced by these algorithms and their subsequent analysis are presented in section 5, followed by the last section containing conclusions and future work.

===2 Related works===

====2.1 A statistical approach====

Predicting the results of court decisions from a statistical point of view was addressed by Kort [17] in 1957. He aimed to predict the cases concerning the right to counsel from the Supreme Court of the United States. He constructed a table with various facts of the cases paired with certain values. A composite value was calculated for each case by adding up all the facts’ values. If the composite value of a particular case exceeded a certain threshold, then the defendant was wrongly denied the assignment of a lawyer. That way, he was able to successfully predict 12 of the 14 cases.

Later, Nagel (1960 [25] and 1963 [26]) applied correlation analysis to court decisions. He predicted outcomes by calculating correlation coefficients for key variables, i.e. those which, according to the court, had the greatest influence on the determination of the judgment.

Mackaay and Robillard [22] applied the nearest neighbour rule method to predict judicial decisions, which was later verified by Keown [14] and compared with a linear model.

====2.2 An approach based on artificial intelligence====

In 2018, the European Commission for the Efficiency of Justice (CEPEJ) wrote the first European Ethical Charter on the use of Artificial Intelligence in judicial systems and their environment [7]. The charter summarized the basic principles that must be respected by artificial intelligence (AI). According to CEPEJ, AI can contribute to the efficiency of processing a large number of documents or resolving disputes. Still, it must be implemented responsibly, taking into account all human rights and personal data protection.

Currently, the most commonly used methods for predicting court decisions belong to machine learning, which is a part of artificial intelligence. Ruger et al. [28] and Katz et al. (2014 [12] and 2017 [13]) worked on a dataset from the United States Supreme Court called the Supreme Court Database (SCDB). They used methods such as classification trees, extremely randomized trees, LibLinear SVM and random forest.

Ashley et al. [3] worked on computer programs called SMILE and IBP that united case-based reasoning and information extraction from legal texts. By extracting information from previously decided cases, they attempted to predict the verdicts of new cases.

Aletras et al. [1] used machine learning to predict the rulings of the European Court of Human Rights (ECHR). Their data contained 584 judgments in English. They extracted the main features from each decision using n-grams and trained a Support Vector Machine (SVM) classifier on these extracted n-grams. However, they did not remove from the judgments the part of the decision in which the texts of the applicable laws were listed. From such lists of laws, it was easier to predict the results of decisions. The success rate of classification was 79%.

In Medvedeva et al. [23], the authors decided to address this limitation while also dealing with the judicial decisions of the European Court of Human Rights. They removed the list of applicable laws from court decisions and used a larger number of decisions. The success of the classification deteriorated to 77%, using the same machine learning methods as Aletras et al. [1].

Another group of researchers, Chalkidis et al. [5], also predicted outcomes of decisions from ECHR using BiGRU with attention, a hierarchical attention network and a Label-Wise attention network. Attention scores provided indications of which part of the case affected the prediction the most.

Sulea et al. (2017) [31] decided to predict the verdicts of court decisions of the French Court of Cassation. They trained a linear SVM classifier on a bag of words instead of n-grams. They attempted to predict verdicts, the area of law and the length of court proceedings. Later that year, they [30] managed to increase the F1 score for each prediction using a system based on classifier ensembles.

Using the dataset from China Judgements Online (CJO), Luo et al. [21] attempted to predict the most frequent criminal charges and applied articles. The dataset already contained fact descriptions from which the authors extracted the applied articles by multiple SVM classifiers.

The previous dataset was also used by Hu et al. [10], but their task was a prediction of few-shot charges and a prediction of ten chosen attributes. They outperformed SVM, CNN, LSTM and the model created by Luo et al. [21] on few-shot charges by 50%.

In 2018, Xiao et al. [32] built a large dataset called CAIL 2018. It contains more than two million Chinese judicial decisions. The authors attempted to predict charges, applied articles and length of imprisonment using only baseline models such as SVM with TF-IDF, fastText [11] and CNN. To make predictions easier, they used only decisions with one defendant and decisions with frequent charges.

Using judgements from China Judgements Online, CAIL 2018 and Peking University Law Online, Zhong et al. [34] created a multi-task framework called TopJudge. It uses a directed acyclic graph for subtask dependencies and an RNN for each subtask. Their subtasks were to predict applied articles, charges, fines and terms of penalty.

Long et al. [20] developed their own legal reading comprehension model named AutoJudge, which aims to model complex interactions among case materials and predicts the final verdict based on the fact description, plaintiffs’ pleas and law articles.

In 2020, Luz de Araujo et al. [2] created a new Brazilian dataset called VICTOR, which contained about 692,000 annotated legal documents. Legal experts annotated the themes and document types (e.g. judgement and lower court decisions) of about 6,800 documents, which became a training dataset for further extraction. They used Naïve Bayes, SVM, BiLSTM and CNN for each type of classification, but the prediction of verdicts was not one of their tasks.

{| class="wikitable"
|+ Table 1: Table of related works using the machine learning approach.
! Literature !! Datasets !! Methods !! Data representation
|-
| Ruger et al. (2004) || SCDB || classification trees || extracted variables
|-
| Katz et al. (2014) || SCDB || extremely randomized trees || extracted variables
|-
| Katz et al. (2017) || SCDB || random forest, LibLinear SVM, multilayer perceptron || extracted variables
|-
| Ashley et al. (2009) || custom database || SMILE + IBP || factor representation
|-
| Aletras et al. (2016) || ECHR || SVM || BoW, n-grams
|-
| Medvedeva et al. (2018) || ECHR || SVM || n-grams, TF-IDF
|-
| Chalkidis et al. (2019) || ECHR || BiGRU, HAN, LWAN, BERT, HIER-BERT || word embeddings
|-
| Sulea et al. (2017) [31] || The French Supreme Court || LibLinear SVM || BoW, n-grams
|-
| Sulea et al. (2017) [30] || The French Supreme Court || ensemble of multiple SVMs || BoW, n-grams
|-
| Luo et al. (2017) || CJO || custom method using SVM, softmax || sequence embeddings
|-
| Hu et al. (2018) || CJO || attentive attribute predictor, softmax || fact embeddings
|-
| Xiao et al. (2018) || CAIL 2018 || SVM, fastText, CNN || skip-gram, TF-IDF
|-
| Zhong et al. (2018) || CJO, CAIL 2018, PKU Law Online || TopJudge || fact embeddings
|-
| Long et al. (2019) || CJO || AutoJudge || sentence embeddings
|-
| Araujo et al. (2020) || VICTOR || Naïve Bayes, SVM, BiLSTM, CNN, XGBoost || BoW, TF-IDF
|}

===3 Dataset===

The dataset presented in this work contained more than 3 million court decisions issued between 2016 and the end of 2020. The court decisions covered all areas of legislation, such as civil, family, commercial and criminal law.

These court decisions are formatted as JSON objects, which contain attributes such as the type of the court, the name of the court, the name of the judge and the area of legislation. Each object has the document_fulltext attribute, which contains the anonymized court decision in its original version. During the preprocessing phase, we worked only with this attribute and the area of legislation attribute.

There were several types of verdicts in these court decisions, such as the obligation to pay a sum of money, the acquittal of the defendant, the defendant’s conviction, the rejection of the plaintiff’s pursuit and many others. To simplify our work, we have decided to deal only with criminal law decisions containing a verdict of conviction or acquittal.

There were 226,500 court decisions concerning criminal law. We obtained these court decisions by searching for the value "Trestné právo" (criminal law) in the mentioned area of legislation attribute. From these decisions, it was necessary to extract the reasoning and the verdict, i.e. acquittal or conviction, which were used to train our models. After a more thorough filtration of these court decisions, explained in subsection 3.3, we ended up with 43,254 decisions with a conviction verdict and 3,139 decisions with an acquittal verdict.

====3.1 Dividing court decisions into main parts====

The part of the justification that is important for training the model was not present in the attributes of the original JSON files. Therefore, we have decided to split each judgment in its original form present in the document_fulltext attribute. We divided every judgment into these parts:

* details - contains semi-structured information about the court, the judge and the court decision. This information is the same as the values in the mentioned attributes of the JSON object;
* introduction - contains an introductory sentence in the judgment, the name of the court, the names of judges and defendants;
* statement - the section mentioning the verdict and the circumstances of the indictment;
* reasoning - the part in which the judgment is reasoned;
* judicial notice - instruction of the defendant, admissibility of the appeal and others.

The division of the judgments into the mentioned parts was not problematic because their original texts were structured well.

Subsequently, we have replaced the original document_fulltext attribute of each JSON object with the newly created document_divided attribute, whose value was a JSON object with the attributes details, intro, statement, reasoning, and judicial notice.

====3.2 Extraction of the verdict====

From observations, we have noticed that certain words are often spelt with a space between each letter. The main verdicts were often written in this "spaced" style, e.g. the phrase "j e v i n n ý" (is guilty).

Figure 1: Segment of decision containing the verdict

We decided to extract all words longer than two letters from the decision parts written in this style, and we wanted to find out how the conviction and the acquittal were formulated.

We created a finite state automaton to extract such words in one text pass. These words are then stored in the field of the newly created wide_words attribute.

The conviction always contained in its wide_words field a word starting with "vinn-", i.e. the beginning part of the word "vinný" (guilty).

The acquittal always contained in the wide_words field a word starting with "oslobod-", i.e. the beginning part of the word "oslobodzuje" (freed of charges).

Based on the occurrence and co-occurrence of these two terms, we divided the court decisions into four groups.

The first group, named none, contained all court decisions in which neither a word beginning with "vinn-" nor a word beginning with "oslobod-" was mentioned. Such court decisions, for example, were requests for parole.

The second group, named both, contained court decisions which included both a word beginning with "vinn-" and a word beginning with "oslobod-". Such court decisions often concerned several persons, several of whom were acquitted and others convicted.

The third group, named guilty, contained court decisions that contained words beginning with "vinn-" and did not contain a word beginning with "oslobod-". The fourth group, named innocent, contained words beginning with "oslobod-" and did not contain a word beginning with "vinn-". These two groups clearly define the verdict, and we used these two groups to train the model.

Due to the inconsistency of court decisions, it happened that a verdict was not written in the "spaced" style but was written normally. For example, the verdict "j e v i n n ý" was written as "je vinný". We have also extracted these forms of verdicts by searching for words beginning with "vinn-" and "oslobod-".

====3.3 Filtration of court decisions based on reasonings and verdicts====

The first group of the court decisions that we excluded from the training set contained those that did not have the reasoning part. That is because, in certain cases, judges are not obliged to fill out the reasoning section. This group contained 130,289 decisions.

We also excluded those decisions that mentioned paragraph 172 article 2 of the Code of Criminal Procedure in their reasoning section. This article states that if both the prosecutor and the accused have waived their right to appeal or have made such a statement within three working days of the judgment, a simplified written judgment may be issued, not stating the reasons. This meant that even though the reasoning was present in the judgment, the reasoning itself stated that there is no justification stated in the judgment. We searched for mentions of this article using a regular expression and removed a further 15,953 judgements.

The last two groups removed from the training set were groups based on the type of verdict, specifically the none group, which contained 33,483 judgements, and the both group, which contained 382 judgements.

====3.4 Further preprocessing====

For each judgment, we have split the reasoning text into words and lemmatized them using a Slovak word form dictionary called Tvaroslovník, described in [18]. We have also removed any non-alphabetic words and words shorter than three characters. We have used this text as the input and the verdict as the label. The data was split into a training and testing set, using two-thirds as training data. Due to the imbalance of target labels, we have downsampled the number of guilty verdicts in the training data to match the number of innocent examples.

===4 Algorithms===

====4.1 Text representation====

This section describes various representations of text and algorithms for predicting the outcome of court decisions. Most machine learning algorithms are incompatible with strings of characters as input data; thus, it is necessary to create numeric representations that preserve the syntactic and semantic relations between words.

A simple yet effective method of encoding is the Tf-Idf metric described in [27]. Tf-Idf (term frequency-inverse document frequency) is the combination of term frequency - the number of times a given term occurs within a document - and inverse document frequency - a metric that describes how unique or specific a given word is to a document. Our vocabulary of terms contained not only individual words but also all bigrams and trigrams. This resulted in a large number of features even after excluding terms that occur fewer than five times in total in the corpus. Therefore, we performed a χ² test to find the top 6500 terms that are most correlated with our target classes and used them as features, calculating their Tf-Idf values for each document.

While effective, this kind of encoding does not tell us much about any spatiotemporal relations of the words themselves. Thus, we have opted to use vector embedding methods, namely Word2Vec and Doc2Vec, which excel at encoding context for given words and documents. Word2Vec, described in [24], is a method for creating embeddings from each word by concatenating two prediction networks: CBOW, which tries to predict a word given the words surrounding it, and Skip-Gram, which tries to predict the surrounding words from the input word. We have trained a Word2Vec encoder with an embedding size of 300 on our dataset and used it in two distinct ways. For algorithms designed to work with sequential inputs, we merely encoded each word of the padded judicial decisions. For algorithms that require encoding of the entire document, we calculated the element-wise mean, min and max values of all the word vectors of the decision and concatenated them into an embedding with the size of 900. This simplistic method of pooling allows us to create a representation of a collection of words while still retaining semantic and syntactic information.

While the method above is somewhat effective, there is a more relevant method of creating embeddings from a sequence of words based on a similar principle, namely the Doc2Vec algorithm described in [19], a modification of the Word2Vec model to encode documents instead of words. Using this method, we have created an embedding of each judicial decision with a vector size of 500.

These representations can be used in conjunction with several machine learning algorithms to predict the verdict of judicial decisions.

====4.2 Learning algorithms====

There are several well-known, if slightly outdated, classifiers that have been used in NLP tasks and that will serve as our baseline.

Logistic regression, as described in [16], is a method of classification that uses linear regression equations to produce discrete binary outputs.

A Support vector machine, described in [33], is an algorithm tasked with finding an optimal hyperplane that divides two or more classes with the greatest possible margin.

A random forest, described in [8], is a model that in itself is an ensemble of several decision trees.

These models can be used with representations that encode the reasoning as a singular input, meaning that the Tf-Idf, the concatenated Word2Vec and the Doc2Vec encodings can all be used.

In addition to these methods, we have decided to explore algorithms that use the sequence of words that make up the reasoning, encoded by the Word2Vec method, instead of taking in a singular input.

Convolutional Neural Networks or CNNs, described in [15], are based on the idea of using alternating layers of convolution - a sliding window function applied to a matrix - and pooling layers to subsample the input. While more well-known for their applications in computer vision, they can be applied to NLP tasks quite successfully due to their nature of capturing spatial dependencies and their ability to compose higher-level features from low-level features. We have used a single convolutional layer with 128 features and a kernel size of 5 with a maxpooling layer fed into a dense layer with ten neurons.

Recurrent Neural Networks or RNNs, on the other hand, have an internal state that can represent context information from an unspecified amount of past inputs. Long Short-Term Memory networks or LSTMs, described in [9], are able to deal with vanishing and exploding gradients better than traditional RNNs since they possess two gated units that open and close based on the relevance of the data, allowing them to better retain information over longer sequences. One shortcoming of conventional RNNs is that they are only able to make use of the previous context. Bidirectional RNNs are designed to process the data in both directions with two separate hidden layers, one processing the information going from the beginning forward in time and one from the end backwards. This approach allows us to have complete sequential information for each input about all points before and after. We use a single bidirectional block of LSTMs, each with 100 cells.

Some methods combine Recurrent Neural Networks with Convolutional Neural Networks in order to preserve both the spatial information retaining capabilities of convolutional networks and the temporal dependency capturing capabilities of recurrent networks.

The first is to create an ensemble model combining a convolutional network and a Bidirectional Gated Recurrent Unit described in [6]. The same input is presented to a CNN model with 100 features and a kernel size of 3 followed by a maxpool layer, as well as to a BiGRU model with a layer size of 64. The outputs of the two separately trained networks are concatenated into a single result.

Another, more indirect way of combining the attributes and strengths of RNNs and CNNs are Temporal Convolutional Networks or TCNs, described in [4]. TCNs use dilated causal convolution, meaning that the output at time t is convolved only with elements from time t and earlier in the previous layer. This feature allows for parallel computation of convolutions rather than the sequential computation of RNNs and requires less memory than RNNs. As for the implementation, we make use of 2 stacked TCN blocks with a kernel size of 3 and dilation factors of 1, 2, and 4, the first containing 128 filters and the second 64 filters. The sequential output of the 2nd block is passed to 2 separate layers of pooling - max and average - the results of which are concatenated into a dense layer of 16 neurons and then passed to the output.

In section 5, we describe the results of using these algorithms on the dataset described in section 3. Section 5.1 contains the evaluation of performance and subsequent comparison of these algorithms, whereas section 5.3 analyses what features and terms were used to make the predictions.

===5 Results and discussion===

====5.1 Performance evaluation====

We used the data described in section 3 and split it into three parts, using two for training and one for testing. We have implemented the methods described above and, after training, evaluated their performance using standard statistical metrics. These metrics consider the conviction samples as the Positives and the acquittal samples as Negatives. We have then organized these results into Table 2.

{| class="wikitable"
|+ Table 2: Table of classification results on the testing data. The rows represent the 14 different representation and algorithm combinations while the columns are the metrics we used to evaluate the performance of the given classifier.
! Representation + Classifier !! Accuracy !! Precision !! Recall !! F1 score !! ROC_AUC
|-
| word2vec + logistic regression || 87.09 || 90.43 || 69.46 || 78.57 || 82.83
|-
| word2vec + svm || 89.71 || 92.34 || 76.10 || 83.44 || 86.42
|-
| word2vec + random forest || 94.87 || 99.52 || 85.35 || 91.89 || 92.57
|-
| doc2vec + logistic regression || 97.83 || 98.10 || 95.34 || 96.70 || 97.21
|-
| doc2vec + svm || 97.46 || 97.07 || 95.28 || 96.16 || 96.92
|-
| doc2vec + random forest || 94.97 || 99.23 || 85.58 || 91.90 || 92.63
|-
| tf-idf + logistic regression || 95.64 || 98.69 || 88.18 || 93.14 || 93.80
|-
| tf-idf + svm || 98.21 || 98.25 || 96.39 || 97.31 || 97.76
|-
| tf-idf + random forest || 99.05 || 99.78 || 97.38 || 98.57 || 98.64
|-
| word2vec + CNN || 99.24 || 99.78 || 97.89 || 98.83 || 98.89
|-
| word2vec + BiLSTM || 98.72 || 99.20 || 96.83 || 98.00 || 98.23
|-
| word2vec + TCN || 99.08 || 99.57 || 97.65 || 98.60 || 98.72
|-
| word2vec + Ensemble(CNN + BiGRU) || 98.40 || 99.60 || 95.68 || 97.60 || 97.74
|}

As we can see, regarding algorithms that use a singular representation (rows 1-9), the embedding models offer generally poorer performance, with the concatenated pooled Word2Vec being the least efficient since the algorithm is used in a way it is not designed to be used. Doc2Vec has better performance, especially when used in conjunction with Logistic Regression, where the relatively small number of features (500 as opposed to 900 and 6500) is less of a hindrance. However, the best results were achieved by using the Tf-Idf representation. We assume the reason for this is that the reasoning text has a somewhat formalized structure that uses certain standardized keywords and phrases from which basic information is more readily deducible than from a sequence of justifications presented within the reasoning.

This is somewhat further evidenced by the results obtained from methods reliant on the encoded sequence of words (rows 11-14). RNNs, which are more heavily reliant on the sequential order of ideas, have lower performance than CNNs, which have a property of location invariance and thus are better suited to detect the presence of individual terms that are by and large independent and highly correlated with the result class. The performance of such algorithms is quite high, achieving an accuracy of over 99%. We believe that this may be due to the relatively simple task of binary classification, combined with semi-structured data. We expect this to change as we try to predict more complex information from the dataset.

We can further observe from Table 2 that the precision for the prediction of conviction decisions is better than the recall metric for every single representation and model combination. Since precision is a metric that determines the percentage of predicted convictions that are actual convictions, while recall tells us the percentage of actual convictions found by our algorithm, it stands to reason that a more significant number of convictions was classified as acquittal than the other way around.

Such bias may be the result of several possible causes. One of them is simply the consideration that there are suspicious cases within the dataset where the verdict should have been a conviction but ended up being an acquittal. However, a more likely hypothesis is that many individual terms are highly correlated with the target classes and that many of them are, in actuality, more correlated with the conviction class of samples. So the decision process itself might try to detect values that are correlated more with conviction decisions, and upon their absence, it tends to classify the decision as an acquittal. Unsure of the reason, we investigated what features contributed most to the prediction. Since embedding vectors are difficult to interpret, we used the feature selection method for the Tf-Idf representation using a bag of words and the χ² test. We calculated, for each target class, the percentage of documents from the training and testing corpus in which the most relevant terms are present. We organized these results into tables to determine which terms are used to make such decisions and how.

====5.2 Definition of term categories====

The terms (unigrams, bigrams and trigrams) can be divided into three categories according to their meaning and usage in a judicial decision:

* terms related to legal principles;
* terms used in legal arguments;
* other general legal terms, including terms describing the legal language.

The first group of terms is represented by terms related to the application of legal principles, or the exercise of rights under these principles. Judges often rely on legal principles to justify judicial decisions. An example is the principle of fair trial and the right to a fair trial.

The second group consists of terms that are used in legal arguments. These are terms expressing the usage and interrelationships of the evidence submitted in the criminal proceedings. Examples are general terms related to indication, such as to prove, proof. Another example is the use of evidence such as expert evidence, real evidence, documentary evidence.

The last group are general legal terms that do not fall into the groups mentioned above. These terms are part of the legal language and relate to legal institutes with a specific criminal offence (e.g. legal qualification, theft, breach of personal data protection), compensation or punishment. It also includes terms related to the procedure regulation of the court and law enforcement authorities (e.g. to accuse, hear, propose).

Certain legal principles are important for these proceedings, among which we can include the presumption of innocence of the defendant and the in dubio pro reo principle. This principle stipulates the obligation of the court to decide in favour of the defendant if there are doubts about his guilt that cannot be removed. It is this principle that creates a specific imbalance in thinking about guilt or innocence. The presumed result of judgement is innocence, and it is necessary to prove the defendant’s guilt. It is a specific feature of the judgements in criminal proceedings, which is also reflected in the reasoning of the judgments. The judge needs to justify the guilt of the defendant and not his innocence.

====5.3 Analysis of relevant terms====

The second goal of this paper was to identify essential words or phrases associated with the decision on the merits in criminal proceedings. In other words, the aim was to determine the strength of the correlation between unigrams, bigrams and trigrams and the result in guilt or innocence.

In Table 3, we can see unigrams, bigrams and trigrams that have a significant relationship with judgments on the defendant’s innocence, with the chi-square value and the count of occurrences in judgements that point to the defendant’s innocence or guilt. In contrast, Table 4 shows interesting unigrams, bigrams and trigrams which are closely connected with judgements that result in finding the defendant guilty. As we can see from these tables, specific terms correlate significantly more with the particular result of the judgement. In the judgements’ reasoning, judges use terms such as reason, unequivocally, female witness, situation etc. (Table 3) in the cases that result in acquittal of the defendant. On the other hand, expressions such as free choice, advise option, choice, voluntarily commit which, willingly etc. (Table 4) are important in the judgements condemning the defendant. The interesting finding is that groups of specific terms are closely connected with a specific type of verdict. The sets of terms prepared in this way can then be analyzed in terms of their mutual correlation or used as attributes for the classification of the judgements.

Within the used corpus of the judgements, we have focused on terms that are closely related to the evidence (evidence, prove, testimony, paper, etc.). The results show that these terms are strongly connected with judgements about the innocence of the defendant. Table 5 shows these terms with the chi-square value and the count of occurrences in judgements that point to the defendant’s innocence or guilt.

These results suggest that for a judge to find someone innocent, a much more detailed evidence-based argumentation must be used in the reasoning. At this point, it is necessary to return to the principle in dubio pro reo, which implies that the presumed result of judgment is innocence, and it is required to prove the defendant’s guilt. It follows that the evidence and its representation in decision reasoning should be more closely linked to decisions with a guilt verdict since guilt must be proved. However, here, we come to a disagreement between these claims and a dispute between the law in the book ("rules of the game"

{| class="wikitable"
|+ Table 3: Table of terms relevant to judgments of innocence. The first column is a term in the Slovak language, the second column represents the translation of the term to English, in the third column the Chi-square value is listed, and the last columns are the percentages of judgments of innocence and guilt, respectively.
! Term (Slovak) !! Term (English) !! Chi-square !! Percentage_innocence !! Percentage_guilt
|-
| obžalovať obžaloba || to charge the indictment || 5295.07 || 39.6% || 2.2%
|-
| svedkyňa || witness (female) || 3026.24 || 37.6% || 9.0%
|-
| pojednávanie || trial || 2950.20 || 59.1% || 22.4%
|-
| dôvod || reason || 2756.17 || 48.3% || 16.6%
|-
| jednoznačne || unequivocally || 2653.22 || 33.0% || 7.8%
|-
| prísť || come || 2563.86 || 35.9% || 9.8%
|-
| situácia || situation || 2328.96 || 25.9% || 5.2%
|-
| pamätať || to remember || 2053.86 || 25.1% || 5.6%
|-
| obdobie || period || 1928.83 || 24.4% || 5.8%
|-
| polícia || police || 1920.33 || 29.8% || 9.2%
|}

Table 4: Table of terms relevant to judgments of guilt. The first column is a term in the Slovak language, the second column represents the translation of the term to English, in the third column the Chi-square value is listed, and the last columns are the percentages of judgments of innocence and guilt, respectively.
Term (Slovak) Term (English) Chi-square Percentage_innocence Percentage_guilt slobodný vol’ba free choice 9921.81 0.3% 34.4% možnost’ slobodný vol’ba free choice option 9911.82 0.3% 33.8% skrátit’ vzdávat’ to shorten give up 9530.09 0.0% 32% súhlasit’ návrh to agree to a proposal 9452.09 0.0% 31.8% radit’ spôsob advise option 9402.72 0.3% 32.4% dobrovol’ne spáchat’ voluntarily commit 9261.44 0.3% 32.2% vol’ba choice 9241.86 0.9% 34.5% dobrovol’ne spáchat’ ktorý voluntarily commit which 9157.98 0.2% 31.6% dobrovol’ne willingly 5206.03 7.2% 37.1 % for all cases) and law in action (judgment in the individual 6 Conclusion and future works case). Based on the findings we have found, it appears that the judges do not presume the innocent of the defendant. In this paper, we have shown how to split a judicial de- cision into its relevant parts and extract the verdict of the judgments. In addition, we have shown how to create a representation of the reasoning text using various text rep- This specificity contained in the argumentation can then resentation methods and combined them with several clas- be seen in the algorithms that learn to recognize significant sification algorithms. We evaluated the performance of strings for two groups of decisions (guilty, innocent). This these models and found that methods that are more reliant is evident from the precision and recall ratio as well as Ta- on detecting specific terms than a stream of thoughts pro- ble 4 and Table 3, where the higher χ 2 values and thus duce the most satisfactory results. Multiple models pre- the features better suited for classification are correlated dict most cases with sufficient accuracy so that the outly- with judgements where the verdict was guilty. We have ing cases can be manually examined by a team of experts. 
also calculated which of the top 300 terms occurs more Furthermore, it can be demonstrated that all representa- in which class and have found that 223 of them had more tions and models are prone to classify conviction as ac- occurrences in the guilty class, and only 77 had more in quittal more often than the other way around, which may the innocent class. This supports the conclusion that we be because our models tend to look for features present in have arrived at after making observations from Table 2. At convictions and interpret their absence as an acquittal. the same time, however, the conclusions of the paper [29], As part of the analysis of significant terms, we have according to which more used evidence correlates with de- identified the groups of specific terms closely connected cisions on the innocence of the defendant, are confirmed. with a specific type of verdict (acquittal or conviction). In the paper [29], authors focused only on the corpus of the Also, we have focused on the terms used in legal argu- judgements concerning digital evidence and IP addresses. ments (judgements’ reasoning) in more detail. According In this paper, we use the extended corpus of the judgments, to results, the in dubio pro reo principle in criminal pro- which covers various areas of criminal law. ceedings affect judgement’s reasonings and the subsequent Table 5: Table of terms used in evidence-based argumentation. The first column is a term in the Slovak language, the second column represents the translation of the term to English, in the third column Chi-square value is listed, and the last columns are the percentages of judgment on innocence, resp. guilt. 
Term (Slovak) Term (English) Chi-square value Percentage_acquittal Percentage_conviction výpoved’ testimony 3652.45 52.1% 14.4% preukázat’ to prove 3071.27 52% 17.1% dokázat’ to prove 3064.15 26% 2.5% dôkaz ktorý evidence which 2929.94 28% 4% výsluch hearing 2839.05 42.2% 12.4% listinný documentary 2638.38 42.3% 13.3% dôkaz evidence 2625.49 55.3% 21.7% listinný dôkaz documentary evidence 2588.78 41.3% 13% znalecký expert 1998.48 28.6% 8% dokazovanie proving 1917.96 27.5% 7.9% analysis of this legal text. [7] European Commission for the Efficiency of Justice As an extension of this research, we plan to examine (CEPEJ): European ethical Charter on the use of Ar- the cases where the labels and predictions differ and con- tificial Intelligence in judicial systems and their envi- sult a lawyers team. Their task would be to determine for ronment (2018), https://rm.coe.int/ethical-charter-en-for- individual cases whether the failure is caused by the pre- publication-4-december-2018/16808f699c dictor, in which case we will research ways to improve our [8] Ho, T.K.: Random decision forests. In: Proceedings of 3rd methods further. We will also replace all article references international conference on document analysis and recog- nition. vol. 1, pp. 278–282. IEEE (1995) with the actual text of the articles to increase our predic- tive capability. We plan to make further predictions where [9] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997) in addition to determining the presence of guilt, we will also attempt to predict the severity of the sentence (e.g. [10] Hu, Z., Li, X., Tu, C., Liu, Z., Sun, M.: Few-shot charge prediction with discriminative legal attributes. In: Proceed- jail time or fine amount). In case there are multiple de- ings of the 27th International Conference on Computational fendants, we will try to determine the sentence for each of Linguistics. pp. 487–498 (2018) them. 
[11] Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., Mikolov, T.: Fasttext.zip: Compressing text classification References models (2016) [12] Katz, D.M., au2, M.J.B.I., Blackman, J.: Predicting the be- [1] Aletras, N., Tsarapatsanis, D., Preoţiuc-Pietro, D., Lampos, havior of the supreme court of the united states: A general V.: Predicting judicial decisions of the european court of approach (2014) human rights: A natural language processing perspective. [13] Katz, D.M., Bommarito, M.J., Blackman, J.: A general ap- PeerJ Computer Science 2, e93 (2016) proach for predicting the behavior of the supreme court of [2] Luz de Araujo, P.H., de Campos, T.E., Ataides Braz, F., the united states. PloS one 12(4), e0174698 (2017) Correia da Silva, N.: VICTOR: a dataset for Brazilian legal [14] Keown, R.: Mathematical models for legal prediction. documents classification. In: Proceedings of the 12th Lan- Computer/lj 2, 829 (1980) guage Resources and Evaluation Conference. pp. 1449– [15] Kim, Y.: Convolutional neural networks for sentence clas- 1458. European Language Resources Association, Mar- sification (2014) seille, France (May 2020), https://www.aclweb. [16] Kleinbaum, D.G., Dietz, K., Gail, M., Klein, M., Klein, M.: org/anthology/2020.lrec-1.181 Logistic regression. Springer (2002) [3] Ashley, K.D., Brüninghaus, S.: Automatically classifying [17] Kort, F.: Predicting supreme court decisions mathemat- case texts and predicting outcomes. Artificial Intelligence ically: A quantitative analysis of the" right to counsel" and Law 17(2), 125–165 (2009) cases. The American Political Science Review 51(1), 1–12 [4] Bai, S., Kolter, J.Z., Koltun, V.: An empirical evaluation of (1957) generic convolutional and recurrent networks for sequence [18] Krajči, S., Novotný, R.: Tvaroslovník–databáza tvarov slov modeling. arXiv:1803.01271 (2018) slovenského jazyka. 
In: Proceedings of international con- [5] Chalkidis, I., Androutsopoulos, I., Aletras, N.: Neural le- ference ITAT 2012. pp. 57–61. SAIA (2012) gal judgment prediction in english. CoRR abs/1906.02059 [19] Lau, J.H., Baldwin, T.: An empirical evaluation of doc2vec (2019), http://arxiv.org/abs/1906.02059 with practical insights into document embedding gener- [6] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., ation. In: Proceedings of the 1st Workshop on Repre- Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase sentation Learning for NLP. pp. 78–86. Association for representations using rnn encoder-decoder for statistical Computational Linguistics, Berlin, Germany (Aug 2016). machine translation (2014) https://doi.org/10.18653/v1/W16-1609, https://www. aclweb.org/anthology/W16-1609 [20] Long, S., Tu, C., Liu, Z., Sun, M.: Automatic judgment prediction via legal reading comprehension. In: Sun, M., Huang, X., Ji, H., Liu, Z., Liu, Y. (eds.) Chinese Com- putational Linguistics. pp. 558–572. Springer International Publishing, Cham (2019) [21] Luo, B., Feng, Y., Xu, J., Zhang, X., Zhao, D.: Learn- ing to predict charges for criminal cases with legal basis. CoRR abs/1707.09168 (2017), http://arxiv.org/ abs/1707.09168 [22] Mackaay, E., Robillard, P.: Predicting judicial decisions: The nearest neighbour rule. November, 1974 41, 302 (2020) [23] Medvedeva, M., Vols, M., Wieling, M.: Using machine learning to predict decisions of the european court of hu- man rights. Artificial Intelligence and Law 28(2), 237–266 (2020) [24] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient es- timation of word representations in vector space (2013) [25] Nagel, S.: Using simple calculations to predict judicial de- cisions. American Behavioral Scientist 4(4), 24–28 (1960) [26] Nagel, S.S.: Applying correlation analysis to case predic- tion. Tex. L. Rev. 42, 1006 (1963) [27] Ramos, J., et al.: Using tf-idf to determine word relevance in document queries. 
In: Proceedings of the first instruc- tional conference on machine learning. vol. 242, pp. 29–48. Citeseer (2003) [28] Ruger, T.W., Kim, P.T., Martin, A.D., Quinn, K.M.: The supreme court forecasting project: Legal and political sci- ence approaches to predicting supreme court decisionmak- ing. Columbia Law Review pp. 1150–1210 (2004) [29] Sokol, P., Rózenfeldová, L., Lučivjanská, K., Harašta, J.: Ip addresses in the context of digital evidence in the criminal and civil case law of the slovak republic. Forensic Science International: Digital Investigation 32, 300918 (2020) [30] Sulea, O., Zampieri, M., Malmasi, S., Vela, M., Dinu, L.P., van Genabith, J.: Exploring the use of text classification in the legal domain. CoRR abs/1710.09306 (2017), http: //arxiv.org/abs/1710.09306 [31] Sulea, O.M., Zampieri, M., Vela, M., Van Genabith, J.: Pre- dicting the law area and decisions of french supreme court cases. arXiv preprint arXiv:1708.01681 (2017) [32] Xiao, C., Zhong, H., Guo, Z., Tu, C., Liu, Z., Sun, M., Feng, Y., Han, X., Hu, Z., Wang, H., Xu, J.: CAIL2018: A large-scale legal dataset for judgment pre- diction. CoRR abs/1807.02478 (2018), http://arxiv. org/abs/1807.02478 [33] Zhang, Y.: Support vector machine classification algo- rithm and its application. In: International Conference on Information Computing and Applications. pp. 179–186. Springer (2012) [34] Zhong, H., Guo, Z., Tu, C., Xiao, C., Liu, Z., Sun, M.: Le- gal judgment prediction via topological learning. In: Pro- ceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 3540–3549 (2018)
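
As an illustration of the term analysis in Section 5.3, the following minimal Python sketch scores terms by a Pearson chi-square statistic on a 2x2 table of term presence against verdict, and reports the share of acquittals and convictions that contain each term, in the spirit of Tables 3 to 5. The exact chi-square variant behind those tables is not stated in this section, so the formula, the function names (`chi_square_2x2`, `rank_terms`), the verdict labels and the toy corpus are all assumptions made for illustration and not the authors' implementation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TermStats:
    term: str
    chi_square: float
    pct_acquittal: float   # share of acquittals containing the term
    pct_conviction: float  # share of convictions containing the term


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    a, b: acquittals with / without the term
    c, d: convictions with / without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # an empty row or column carries no signal
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def rank_terms(documents):
    """Rank terms by chi-square between term presence and verdict.

    documents: list of (terms, verdict) pairs, where terms is a set of
    lemmatized n-grams and verdict is "acquittal" or "conviction"
    (hypothetical labels chosen for this sketch).
    """
    totals = Counter(verdict for _, verdict in documents)
    with_term = {"acquittal": Counter(), "conviction": Counter()}
    for terms, verdict in documents:
        with_term[verdict].update(set(terms))  # count each document once

    n_acq, n_conv = totals["acquittal"], totals["conviction"]
    vocabulary = set(with_term["acquittal"]) | set(with_term["conviction"])
    stats = []
    for term in vocabulary:
        a = with_term["acquittal"][term]
        c = with_term["conviction"][term]
        stats.append(TermStats(
            term=term,
            chi_square=chi_square_2x2(a, n_acq - a, c, n_conv - c),
            pct_acquittal=100.0 * a / n_acq if n_acq else 0.0,
            pct_conviction=100.0 * c / n_conv if n_conv else 0.0,
        ))
    return sorted(stats, key=lambda s: s.chi_square, reverse=True)


# Toy corpus invented for illustration; it is not data from the paper.
toy = (
    [({"evidence", "testimony"}, "acquittal")] * 3
    + [({"evidence"}, "acquittal")] * 1
    + [({"voluntarily", "choice"}, "conviction")] * 3
    + [({"voluntarily", "evidence"}, "conviction")] * 1
)

for s in rank_terms(toy):
    print(f"{s.term:12s} chi2={s.chi_square:6.2f} "
          f"acquittal={s.pct_acquittal:5.1f}% conviction={s.pct_conviction:5.1f}%")
```

In this toy corpus the term that appears in every conviction and in no acquittal receives the highest score, which mirrors the pattern in Table 4, where the strongest terms occur almost exclusively in one class. A ranked list of this kind could also be cut at a fixed size (for example the top 300 terms) to count how many terms lean towards each class.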