=Paper= {{Paper |id=Vol-2962/paper13 |storemode=property |title=Analysis and Prediction of Legal Judgements in the Slovak Criminal Proceedings |pdfUrl=https://ceur-ws.org/Vol-2962/paper13.pdf |volume=Vol-2962 |authors=Dávid Varga,Zoltán Szoplák,Pavol Sokol,Stanislav Krajči,Peter Gurský |dblpUrl=https://dblp.org/rec/conf/itat/VargaSSKG21 }} ==Analysis and Prediction of Legal Judgements in the Slovak Criminal Proceedings == https://ceur-ws.org/Vol-2962/paper13.pdf
Analysis and Prediction of Legal Judgements in the Slovak Criminal Proceedings

Dávid Varga, Zoltán Szoplák, Stanislav Krajči, Pavol Sokol, and Peter Gurský

Institute of Computer Science
Faculty of Science, P. J. Šafárik University in Košice
Jesenná 5, 040 01 Košice, Slovakia
www.ics.science.upjs.sk
david.varga@student.upjs.sk, zoltan.szoplak@student.upjs.sk, stanislav.krajci@upjs.sk, pavol.sokol@upjs.sk, peter.gursky@upjs.sk

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: This paper uses machine learning to analyze criminal judgements in the Slovak Republic to determine their adequacy and set a baseline for predicting their outcomes. First, we summarize past and recent advancements in predicting verdicts and other attributes of legal text written in different languages. We then demonstrate data preparation of all publicly available Slovak judgements, extraction of their verdicts and separation into main parts using a Slovak word-inflexion dictionary called Tvaroslovník. Later we use this data to classify the judgements into acquittal or conviction using several known machine learning methods ranging from simple statistical methods such as SVM and random forests to deep learning networks based on convolution, recurrence and their combinations. We evaluate their efficiency, identify terms that are highly correlated with each result class, and offer a hypothesis as to why these terms are correlated with these results. We have found that a sequential input of word2vec embeddings combined with convolution-based deep learning methods produces the best results, achieving over 99% accuracy.

Keywords: judgement, reasoning, text analysis, Slovak, classification, verdict, machine learning

1 Introduction

Since 2016, the Ministry of Justice of the Slovak Republic has published more than 3 million publicly available court decisions online. These court decisions contain some structured data, e.g. name of the judge or court, but mostly free text. This free text contains the most relevant parts of court decisions: the final verdict and the reasoning behind the verdict. We aim to find a method to identify court decisions that are not sufficiently reasoned and provide such decisions to lawyers for a more detailed analysis.
   In this paper, we examine several statistical and machine learning methods of text representation and classification, intending to correctly predict court decisions based on the reasoning alone.
   After our model is trained, the reasoning and the verdict of the court decision will become inputs for this model. The model predicts the verdict from the input justification, comparing it with the true verdict received at the input. Subsequently, two situations can occur. If the predicted verdict is identical to the true verdict, we will take this court decision as sufficiently reasoned. If the predicted verdict differs from the true verdict, we will take such a decision as insufficiently reasoned. The model justifies its prediction by extracting the parts of the court’s reasoning that most influenced the prediction of the verdict. This paper is based on the research stated in Sokol et al. [29], in which the authors formulated their conclusions on the current state and developing trends in the use of digital evidence in judicial proceedings and usage of the in dubio pro reo principle in criminal proceedings.
   To achieve a better understanding of how judgments are reasoned, this paper aims to:

   • create a classification model which can predict the verdict of the judgement from its reasoning part;
   • identify significant terms in the judgments’ reasonings closely related to the results of judgements in the criminal proceedings (innocence or guilt).

   This paper is organized into six sections. Section 2 focuses on the review of past and recent advancements in the classification of legal documents. Section 3 is devoted to data preprocessing and judgement extraction. Section 4 describes the different methods of text representation and the learning algorithms that will use them. The results produced by these algorithms and their subsequent analysis are presented in section 5, followed by the last section containing conclusions and future work.

2 Related works

2.1 A statistical approach

Predicting the results of court decisions from a statistical point of view was addressed by Kort [17] in 1957. He aimed to predict the cases concerning the right to counsel from the Supreme Court of the United States. He constructed a table with various facts of the cases paired with certain values. A composite value was calculated for each
case by adding up all the facts’ values. If the composite value of a particular case exceeded a certain threshold, then the defendant was wrongly denied the assignment of a lawyer. That way, he was able to predict successfully 12 of the 14 cases.
   Later, Nagel (1960 [25] and 1963 [26]) applied correlation analysis to court decisions. He predicted outcomes by calculating correlation coefficients for key variables, i.e. those which, according to the court, had the greatest influence on the determination of the judgment.
   Mackaay and Robillard [22] applied the nearest neighbour rule method to predict judicial decisions, which was later verified by Keown [14] and compared with a linear model.

2.2 An approach based on artificial intelligence

In 2018, the European Commission for the Efficiency of Justice (CEPEJ) wrote the first European Ethical Charter on the use of Artificial Intelligence in judicial systems and their environment [7]. The charter summarized the basic principles that must be respected by artificial intelligence (AI). According to CEPEJ, AI can contribute to the efficiency of processing a large number of documents or resolving disputes. Still, it must be implemented responsibly, taking into account all human rights and personal data protection.
   Currently, the most commonly used methods for predicting court decisions belong to machine learning, which is a part of artificial intelligence. Ruger et al. [28] and Katz et al. (2014 [12] and 2017 [13]) worked on a dataset from the United States Supreme Court called the Supreme Court Database (SCDB). They used methods such as classification trees, extremely randomized trees, LibLinear SVM and random forest.
   Ashley et al. [3] worked on computer programs called SMILE and IBP that united case-based reasoning and information extraction from legal texts. By extracting information from previously decided cases, they attempted to predict the verdicts of new cases.
   Aletras et al. [1] used machine learning to predict the rulings of the European Court of Human Rights (ECHR). Their data contained 584 judgments in English. They extracted main features from each decision using n-grams and trained a Support Vector Machine (SVM) classifier on these extracted n-grams. However, they did not remove from the judgments the part of the decision in which the texts of the applicable laws were listed. From such lists of laws, it was easier to predict the results of decisions. The success rate of classification was 79%.
   In Medvedeva et al. [23], the authors decided to address this limitation while also dealing with the judicial decisions of the European Court of Human Rights. They removed the list of applicable laws from court decisions and used a larger number of decisions. The success of the classification deteriorated to 77%, using the same machine learning methods as Aletras et al. [1].
   Another group of researchers, Chalkidis et al. [5], also predicted outcomes of decisions from ECHR using BiGRU with attention, a hierarchical attention network and a Label-Wise attention network. Attention scores provided indications of which part of the case affected the prediction the most.
   Sulea et al. (2017) [31] decided to predict the verdicts of court decisions of the French Court of Cassation. They trained a linear SVM classifier on a bag of words instead of n-grams. They attempted to predict verdicts, the area of law and the length of court proceedings. Later that year, they [30] managed to increase the F1 score for each prediction using a system based on classifier ensembles.
   Using the dataset from China Judgements Online (CJO), Luo et al. [21] attempted to predict the most frequent criminal charges and applied articles. The dataset already contained fact descriptions from which the authors extracted the applied articles by multiple SVM classifiers.
   The previous dataset was also used by Hu et al. [10], but their task was a prediction of few-shot charges and a prediction of ten chosen attributes. They outperformed SVM, CNN, LSTM and the model created by Luo et al. [21] on few-shot charges by 50%.
   In 2018, Xiao et al. [32] built a large dataset called CAIL 2018. It contains more than two million Chinese judicial decisions. The authors attempted to predict charges, applied articles and length of imprisonment only using baseline models such as SVM with TF-IDF, fastText [11] and CNN. To make predictions easier, they used only decisions with one defendant and decisions with frequent charges.
   Using judgements from China Judgements Online, CAIL 2018 and Peking University Law Online, Zhong et al. [34] created a multi-task framework called TopJudge. It uses a directed acyclic graph for subtask dependencies and an RNN for each subtask. Their subtasks were to predict applied articles, charges, fines and terms of penalty.
   Long et al. [20] developed their own legal reading comprehension model named AutoJudge, which aims to model complex interactions among case materials and predicts the final verdict based on fact description, plaintiffs’ pleas and law articles.
   In 2020, Luz de Araujo et al. [2] created a new Brazilian dataset called VICTOR, which contained about 692,000 annotated legal documents. Legal experts annotated the themes and the type of document (e.g. judgement and lower court decisions) of about 6,800 documents, which became a training dataset for further extraction. They used Naïve Bayes, SVM, BiLSTM and CNN for each type of classification, but the prediction of verdicts was not one of their tasks.

3 Dataset

The dataset presented in this work contained more than 3 million court decisions issued between 2016 and the end of
                              Table 1: Table of related works using the machine learning approach.
 Literature                 Datasets                         Methods                                               Data representation
 Ruger et al. (2004)        SCDB                             classification trees                                  extracted variables
 Katz et al. (2014)         SCDB                             extremely randomized trees                            extracted variables
 Katz et al. (2017)         SCDB                             random forest, LibLinear SVM, multilayer perceptron   extracted variables
 Ashley et al. (2009)       custom database                  SMILE + IBP                                           factor representation
 Aletras et al. (2016)      ECHR                             SVM                                                   BoW, n-grams
 Medvedeva et al. (2018)    ECHR                             SVM                                                   n-grams, TF-IDF
 Chalkidis et al. (2019)    ECHR                             BiGRU, HAN, LWAN, BERT, HIER-BERT                     word embeddings
 Sulea et al. (2017) [31]   The French Supreme Court         LibLinear SVM                                         BoW, n-grams
 Sulea et al. (2017) [30]   The French Supreme Court         ensemble of multiple SVMs                             BoW, n-grams
 Luo et al. (2017)          CJO                              custom method using SVM, softmax                      sequence embeddings
 Hu et al. (2018)           CJO                              attentive attribute predictor, softmax                fact embeddings
 Xiao et al. (2018)         CAIL 2018                        SVM, fastText, CNN                                    skip-gram, TF-IDF
 Zhong et al. (2018)        CJO, CAIL 2018, PKU Law Online   TopJudge                                              fact embeddings
 Long et al. (2019)         CJO                              AutoJudge                                             sentence embeddings
 Araujo et al. (2020)       VICTOR                           Naïve Bayes, SVM, BiLSTM, CNN, XGBoost                BoW, TF-IDF



2020. The court decisions covered all areas of legislation, such as civil, family, commercial and criminal law.
   These court decisions are formatted as JSON objects, which contain attributes such as the type of the court, the name of the court, the name of the judge and the area of legislation. Each object has the document_fulltext attribute, which contains the anonymized court decision in its original version. During the preprocessing phase, we worked only with this attribute and the area of legislation attribute.
   There were several types of verdicts in these court decisions, such as the obligation to pay a sum of money, the acquittal of the defendant, the defendant’s conviction, the rejection of the plaintiff’s pursuit and many others. To simplify our work, we have decided to deal only with criminal law decisions containing a verdict of conviction or acquittal.
   There were 226,500 court decisions concerning criminal law. We obtained these court decisions by searching for the value "Trestné právo" (criminal law) in the mentioned area of legislation attribute. From these decisions, it was necessary to extract the reasoning and the verdict, i.e. acquittal or conviction, which were used to train our models. After a more thorough filtration of these court decisions, explained in subsection 3.3, we ended up with 43,254 decisions with a conviction verdict and 3,139 decisions with an acquittal verdict.

3.1 Dividing court decisions into main parts

The part of the justification that is important for training the model was not present in the attributes of the original JSON files. Therefore, we have decided to split each judgment in its original form present in the document_fulltext attribute. We divided every judgment into these parts:

   • details - contains semi-structured information about the court, the judge and the court decision. This information is the same as the values in the mentioned attributes of the JSON object;
   • introduction - contains an introductory sentence in the judgment, the name of the court, the names of judges and defendants;
   • statement - the section mentioning the verdict and the circumstances of the indictment;
   • reasoning - the part in which the judgment is reasoned;
   • judicial notice - instruction of the defendant, admissibility of the appeal and others.

   The division of the judgments into the mentioned parts was not problematic because their original texts were structured well.
   Subsequently, we have replaced the original document_fulltext attribute of each JSON object with the newly created document_divided attribute, whose value was a JSON object with the attributes details, intro, statement, reasoning, and judicial notice.

3.2 Extraction of the verdict

From observations, we have noticed that certain words are often spelt with a space between each letter. The main verdicts were often written in this "spaced" style, e.g. the phrase "j e       v i n n ý" (is guilty). We decided to extract all words longer than two letters from the decision parts written in this style, and we wanted to find out how the conviction and the acquittal were formulated.
   We created a finite state automaton to extract such words in one text pass. These words are then stored in the field of the newly created wide_words attribute.
   The conviction always contained in its wide_words field a word starting with "vinn-", i.e. the beginning part of the word "vinný" (guilty).
   The acquittal always contained in the wide_words field a word starting with "oslobod-", i.e. the beginning part of the word "oslobodzuje" (freed of charges).
   Based on the occurrence and co-occurrence of these two terms, we divided the court decisions into four groups.
                                  Figure 1: Segment of decision containing the verdict
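
The extraction of "spaced" words described in section 3.2 can be illustrated with a minimal single-pass scanner. This is only a sketch in the spirit of the finite state automaton mentioned there, not the authors' implementation: the function name extract_spaced_words, the rule that letters of one word are separated by exactly one space, and the handling of trailing punctuation are all assumptions made for illustration.

```python
import re


def extract_spaced_words(text, min_len=3):
    """Collect words written with one space between letters, in one pass.

    Assumption: letters of a spaced word are separated by exactly one
    whitespace character; a wider gap or any longer token ends the word.
    """
    words = []
    letters = []      # letters of the spaced word currently being built
    prev_end = None   # end offset of the previous token

    def flush():
        # Keep only words longer than two letters, as in section 3.2.
        if len(letters) >= min_len:
            words.append("".join(letters))
        letters.clear()

    for match in re.finditer(r"\S+", text):
        token = match.group()
        core = token.rstrip(".,;:!?")
        gap = match.start() - prev_end if prev_end is not None else None
        prev_end = match.end()
        if len(core) == 1 and core.isalpha():
            if letters and gap != 1:
                flush()            # a wide gap starts a new spaced word
            letters.append(core)
            if core != token:
                flush()            # trailing punctuation closes the word
        else:
            flush()                # a normally written token closes the word
    flush()
    return words


print(extract_spaced_words("obžalovaný j e       v i n n ý, že"))  # ['vinný']
```

In this sketch the two-letter word "je" is dropped by the length filter, so only "vinný" would end up in a wide_words-style list.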


   The first group, named none, contained all court decisions in which neither a word beginning with "vinn-" nor a word beginning with "oslobod-" was mentioned. Such court decisions were, for example, requests for parole.
   The second group, named both, contained court decisions which included both a word beginning with "vinn-" and a word beginning with "oslobod-". Such court decisions often concerned several persons, some of whom were acquitted and others convicted.
   The third group, named guilty, contained court decisions that contained words beginning with "vinn-" and did not contain a word beginning with "oslobod-". The fourth group, named innocent, contained words beginning with "oslobod-" and did not contain a word beginning with "vinn-". These two groups clearly define the verdict, and we used these two groups to train the model.
   Due to the inconsistency of court decisions, it happened that a verdict was not written in "spaced" style but was written normally. For example, the verdict "j e       v i n n ý" was written as "je vinný". We have also extracted these forms of verdicts by searching for words beginning with "vinn-" and "oslobod-".

3.3 Filtration of court decisions based on reasonings and verdicts

The first group of the court decisions that we excluded from the training set contained those that did not have the reasoning part. That is because, in certain cases, judges are not obliged to fill out the reasoning section. This group contained 130,289 decisions.
   We also excluded those decisions that mentioned paragraph 172 article 2 of the Code of Criminal Procedure in their reasoning section. This article states that if both the prosecutor and the accused have waived their right to appeal or have made such a statement within three working days of the judgment, a simplified written judgment may be issued, not stating the reasons. This meant that even though the reasoning was present in the judgment, the reasoning itself stated that there is no justification stated in the judgment. We searched for mentions of this article using a regular expression and removed a further 15,953 judgements.
   The last two groups removed from the training set were groups based on the type of verdict, specifically the none group, which contained 33,483 judgements, and the both group, which contained 382 judgements.

3.4 Further preprocessing

For each judgment, we have split the reasoning text into words and lemmatized them using a Slovak word form dictionary called Tvaroslovník described in [18]. We have also removed any non-alphabetic words and words shorter than three characters. We have used this text as the input and the verdict as the label. The data was split into a training and testing set, using two-thirds as training data. Due to the imbalance of target labels, we have downsampled the number of guilty verdicts in the training data to match the number of innocent examples.
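
The verdict grouping of section 3.2 and the token filtering and downsampling of section 3.4 can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names verdict_group, clean_tokens and downsample are hypothetical, the verdict search is applied to whatever text is passed in, lemmatization with Tvaroslovník is left out, and the random seed is an arbitrary choice.

```python
import random
import re


def verdict_group(text):
    """Assign one of the four groups from the stems "vinn-" and "oslobod-"."""
    words = re.findall(r"\w+", text.lower())
    has_guilty = any(w.startswith("vinn") for w in words)
    has_innocent = any(w.startswith("oslobod") for w in words)
    if has_guilty and has_innocent:
        return "both"
    if has_guilty:
        return "guilty"
    if has_innocent:
        return "innocent"
    return "none"


def clean_tokens(text):
    """Drop non-alphabetic words and words shorter than three characters."""
    return [w for w in re.findall(r"\w+", text) if w.isalpha() and len(w) >= 3]


def downsample(examples, seed=0):
    """Trim the "guilty" class of a training set to the size of "innocent".

    examples is a list of (tokens, label) pairs; other labels are ignored.
    """
    rng = random.Random(seed)
    guilty = [e for e in examples if e[1] == "guilty"]
    innocent = [e for e in examples if e[1] == "innocent"]
    kept = rng.sample(guilty, min(len(guilty), len(innocent)))
    balanced = kept + innocent
    rng.shuffle(balanced)
    return balanced


train = [(["text"], "guilty")] * 5 + [(["text"], "innocent")] * 2
balanced = downsample(train)
print(len(balanced))  # 4
```

Only the guilty and innocent groups would be passed on to training; applying downsample to the training split alone keeps the test set at its original class ratio.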
4 Algorithms

4.1 Text representation

This section describes various representations of text and algorithms for predicting the outcome of court decisions. Most machine learning algorithms are incompatible with strings of characters as input data; thus, it is necessary to create numeric representations that preserve the syntactic and semantic relations between words.
   A simple yet effective method of encoding is the Tf-Idf metric described in [27]. Tf-Idf (term frequency-inverse document frequency) is the combination of term frequency - the number of times a given term occurs within a document - and inverse document frequency - a metric that describes how unique or specific a given word is to a document. Our vocabulary of terms contained not only individual words but also all bigrams and trigrams. This resulted in a large number of features even after excluding terms that occur fewer than five times in total in the corpus. Therefore, we performed a χ² test to find the top 6500 terms that are most correlated with our target classes and used them as features, calculating their Tf-Idf values for each document.
   While effective, this kind of encoding does not tell us much about any spatiotemporal relations of the words themselves. Thus, we have opted to use vector embedding methods, namely Word2Vec and Doc2Vec, which excel at encoding context for given words and documents. Word2Vec, described in [24], is a method for creating an embedding of each word using one of two prediction networks: CBOW, which tries to predict a word given the words surrounding it, and Skip-Gram, which tries to predict the surrounding words from the input word. We have trained a Word2Vec encoder with an embedding size of 300 on our dataset and used it in two distinct ways. For algorithms designed to work with sequential inputs, we merely encoded each word of the padded judicial decisions. For algorithms that require encoding of the entire document, we calculated the element-wise mean, min and max values of all the word vectors of the decision and concatenated them into an embedding of size 900. This simplistic method of pooling allows us to create a representation of a collection of words while still retaining semantic and syntactic information.
   While the method above is somewhat effective, there is a more relevant method of creating embeddings from a sequence of words based on a similar principle, namely the Doc2Vec algorithm described in [19], a modification of the Word2Vec model to encode documents instead of words. Using this method, we have created an embedding of each judicial decision with a vector size of 500.
   These representations can be used in conjunction with several machine learning algorithms to predict the verdict of judicial decisions.

4.2 Learning algorithms

There are several well-known, if slightly outdated, classifiers used in NLP tasks that will serve as our baseline.
   Logistic regression, as described in [16], is a method of classification that uses linear regression equations to produce discrete binary outputs.
   A support vector machine, described in [33], is an algorithm tasked with finding an optimal hyperplane that divides two or more classes with the greatest possible margin.
   A random forest, described in [8], is a model that in itself is an ensemble of several decision trees.
   These models can be used with representations that encode the reasoning as a singular input, meaning that the Tf-Idf, the concatenated Word2Vec and the Doc2Vec encodings can all be used.
   In addition to these methods, we have decided to explore algorithms that use the sequence of words that make up the reasoning encoded by the Word2Vec method instead of taking in a singular input.
   Convolutional Neural Networks or CNNs, described in [15], are based on the idea of using alternating layers of convolution - a sliding window function applied to a matrix - and pooling layers to subsample the input. While better known for their applications in computer vision, they can be applied to NLP tasks quite successfully due to their nature of capturing spatial dependencies and their ability to compose higher-level features from low-level features. We have used a single convolutional layer with 128 features and a kernel size of 5 with a maxpooling layer fed into a dense layer with ten neurons.
   Recurrent Neural Networks or RNNs, on the other hand, have an internal state that can represent context information from an unspecified amount of past inputs. Long Short-Term Memory networks or LSTMs, described in [9], are able to deal with vanishing and exploding gradients better than traditional RNNs since they possess gated units that open and close based on the relevance of the data, allowing them to better retain information over longer sequences. One shortcoming of conventional RNNs is that they are only able to make use of the previous context. Bidirectional RNNs are designed to process the data in both directions with two separate hidden layers, one processing the information going from the beginning forward in time and one from the end backwards. This approach allows us to have complete sequential information for each input about all points before and after. We use a single bidirectional block of LSTMs, each with 100 cells.
   Some methods combine Recurrent Neural Networks with Convolutional Neural Networks in order to preserve both the spatial information retaining capabilities of convolutional networks and the temporal dependency capturing capabilities of recurrent networks.
   The first is to create an ensemble model combining a convolutional network and a Bidirectional Gated Recurrent Unit described in [6]. The same input is presented to a CNN model with 100 features and a kernel size of 3 followed by a maxpool layer, as well as a BiGRU model with a layer size of 64. The outputs of the two separately trained networks are concatenated into a single result.
   Another, more indirect way of combining the attributes and strengths of RNNs and CNNs is Temporal Convolutional Networks or TCNs, described in [4]. TCNs use dilated causal convolution, meaning that the output at time t is computed only from elements at time t and earlier in the previous layer. This feature allows for parallel computation of convolutions rather than the sequential computation of RNNs and requires less memory than RNNs. As for the implementation, we use two stacked TCN blocks with a kernel size of 3 and dilation factors of 1, 2 and 4, the first containing 128 filters and the second 64 filters. The sequential output of the second block is passed to two separate pooling layers - max and average - the results of which are concatenated into a dense layer of 16 neurons and then passed to the output.
   In section 5, we describe the results of using these algorithms on the dataset described in section 3. Section 5.1 contains the evaluation of performance and subsequent comparison of these algorithms, whereas section 5.3 analyses what features and terms were used to make the predictions.

5 Results and discussion

on the sequential order of ideas have lower performance than CNNs, which have a property of location invariance and thus are better suited to detect the presence of individual terms that are by and large independent and highly correlated with the result class. The performance of such algorithms is quite high, achieving an accuracy of over 99%. We believe that this may be due to the relatively simple task of binary classification, combined with semi-structured data. We expect this to change as we try to predict more complex information from the dataset.
   We can further observe from Table 2 that the precision for the prediction of conviction decisions is better than the recall metric for every single representation and model combination. Since precision is a metric that determines the percentage of predicted convictions that are actual convictions, while recall tells us the percentage of actual convictions found by our algorithm, it stands to reason that a larger number of convictions was classified as acquittal than the other way around.
   Such bias may be the result of several possible causes. One of them is simply the consideration that there are suspicious cases within the dataset where the verdict should have been a conviction but ended up being an acquittal. However, a more likely hypothesis is that many individual terms are highly correlated with the target classes and that many of them are, in actuality, more correlated with the conviction class of samples. So the decision process itself might try to detect values that are correlated more with conviction decisions, and upon their absence, it tends
5.1   Performance evaluation                                   to classify acquittal. Unsure of the reason, we investigated
                                                               what features contributed most to the prediction. Since
We used the data described in section 3 and split it into      embedding vectors are difficult to interpret, we used the
three parts, using two for training and one for testing. We    feature selection method for the Tf-Idf representation us-
have implemented the methods described above and, af-          ing a bag of words and the χ 2 test. We calculated what
ter training, evaluated their performance using standard       percentage of documents from the training and testing cor-
statistical metrics. These metrics consider the conviction     pus is the most relevant terms present for each target class.
samples as the Positives and the acquittal samples as          We organized these results into tables to determine which
Negatives. We have then organized these results into           terms are used and how to make such decisions.
Table 2.
   As we can see, regarding algorithms that use a singu-
lar representation(rows 1-9), the embedding models of-         5.2   Definition of term categories
fer generally poorer performance, with the concatenated
pooled Word2Vec being the least efficient since the al-        The terms (unigrams, bigrams and trigrams) can be di-
gorithm is used in a way it is not designed to be used.        vided into three categories according to their meaning and
Doc2Vec has better performance, especially when used           usage in a judicial decision:
in conjunction with Logistic Regression, where the rela-
                                                                  • terms related to legal principles;
tively small number of features (500 as opposed to 900
and 6500) is less of a hindrance. However, the best re-           • terms used in legal arguments;
sults were achieved by using the Tf-Idf representation. We
assume the reason for this is that the reasoning text has a       • other general legal terms, including terms describing
somewhat formalized structure that uses certain standard-           the legal language.
ized keywords and phrases from which basic information
is more readily deductible than from a sequence of justifi-       The first group of terms is represented by terms related
cations presented within the reasoning.                        to the application of legal principles, resp. the exercise of
   This is somewhat further evidenced by the results ob-       rights under these principles. Judges often rely on legal
tained from methods reliant on the encoded sequence of         principles to justify judicial decisions. An example is the
words (rows 11-14). RNNs that are more heavily reliant         principle of fair trial and the right to a fair trial.
Table 2: Table of classification results on the testing data. The rows represent the 14 different representation and algorithm
combinations while the columns are the metrics we used to evaluate the performance of the given classifier.
     Representation + Classifier         Accuracy   Precision   Recall   F1 score   ROC_AUC
     word2vec + logistic regression      87.09      90.43       69.46    78.57      82.83
     word2vec + svm                      89.71      92.34       76.10    83.44      86.42
     word2vec + random forest            94.87      99.52       85.35    91.89      92.57
     doc2vec + logistic regression       97.83      98.10       95.34    96.70      97.21
     doc2vec + svm                       97.46      97.07       95.28    96.16      96.92
     doc2vec + random forest             94.97      99.23       85.58    91.90      92.63
     tf-idf + logistic regression        95.64      98.69       88.18    93.14      93.80
     tf-idf + svm                        98.21      98.25       96.39    97.31      97.76
     tf-idf + random forest              99.05      99.78       97.38    98.57      98.64
     word2vec + CNN                      99.24      99.78       97.89    98.83      98.89
     word2vec + BiLSTM                   98.72      99.20       96.83    98.00      98.23
     word2vec + TCN                      99.08      99.57       97.65    98.60      98.72
     word2vec + Ensemble(CNN + BiGRU)    98.40      99.60       95.68    97.60      97.74
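
   As a minimal sketch of how the metrics reported in Table 2 relate to one another, the
following Python fragment derives accuracy, precision, recall and F1 from raw label lists,
treating conviction as the positive class as in section 5.1. The function names and the toy
labels are illustrative assumptions and not the evaluation code used in the experiments.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive="conviction"):
    """Return (tp, fp, fn, tn), with `positive` treated as the positive class."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy labels: one conviction is misclassified as an acquittal.
y_true = ["conviction"] * 4 + ["acquittal"] * 6
y_pred = ["conviction"] * 3 + ["acquittal"] * 7

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
scores = classification_metrics(tp, fp, fn, tn)
print(scores)
```

   Because precision and recall share the numerator TP, precision exceeds recall exactly
when there are more false negatives than false positives, that is, when more convictions are
labelled as acquittals than the reverse.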


   The second group consists of terms that are used in legal arguments. These are terms
expressing the usage and interrelationships of the evidence submitted in the criminal
proceedings. Examples are general terms related to indication, such as to prove and proof.
Another example is the use of evidence such as expert evidence, real evidence and documentary
evidence.
   The last group consists of general legal terms that do not fall into the groups mentioned
above. These terms are part of the legal language and relate to legal institutes with a
specific criminal offence (e.g. legal qualification, theft, breach of personal data
protection), compensation or punishment. It also includes terms related to the procedural
regulation of the court and law enforcement authorities (e.g. to accuse, hear, propose).
   Certain legal principles are important for these proceedings, among which we can include
the presumption of innocence of the defendant and the in dubio pro reo principle. This
principle stipulates the obligation of the court to decide in favour of the defendant if
there are doubts about his guilt that cannot be removed. It is this principle that creates a
specific imbalance in thinking about guilt or innocence. The presumed result of judgement is
innocence, and it is necessary to prove the defendant's guilt. It is a specific feature of
the judgements in criminal proceedings, which is also reflected in the reasoning of the
judgements. The judge needs to justify the guilt of the defendant and not his innocence.

5.3   Analysis of relevant terms

The second goal of this paper was to identify essential words or phrases associated with the
decision on the merits in criminal proceedings. In other words, the aim was to determine the
strength of the correlation between unigrams, bigrams and trigrams and the resulting verdict
of guilt or innocence.
   In Table 3, we can see unigrams, bigrams and trigrams that have a significant relationship
with judgements on the defendant's innocence, together with the chi-square value and the
percentage of judgements of innocence or guilt in which they occur. In contrast, Table 4
shows interesting unigrams, bigrams and trigrams that are closely connected with judgements
finding the defendant guilty. As we can see from these tables, specific terms correlate
significantly more with a particular result of the judgement. In the reasoning of judgements
that result in the acquittal of the defendant, judges use terms such as reason,
unequivocally, female witness, situation etc. (Table 3). On the other hand, expressions such
as free choice, advise option, choice, voluntarily commit which, willingly etc. (Table 4) are
important in the judgements condemning the defendant. The interesting finding is that groups
of specific terms are closely connected with a specific type of verdict. The sets of terms
prepared in this way can then be analyzed in terms of their mutual correlation or used as
attributes for the classification of the judgements.
   Within the used corpus of the judgements, we have focused on terms that are closely
related to the evidence (evidence, prove, testimony, paper, etc.). The results show that
these terms are strongly connected with judgements about the innocence of the defendant.
Table 5 shows these terms with the chi-square value and the percentage of judgements of
innocence or guilt in which they occur.
   These results suggest that for a judge to find someone innocent, a much more detailed
evidence-based argumentation must be used in the reasoning. At this point, it is necessary to
return to the principle in dubio pro reo, which implies that the presumed result of judgement
is innocence, and it is required to prove the defendant's guilt. It follows that the evidence
and its representation in decision reasoning should be more closely linked to decisions with
a guilty verdict since guilt must be proved. However, here, we come to a disagreement between
these claims and a dispute between the law in the book ("rules of the game"
Table 3: Table of terms relevant to judgments of innocence. The first column is a term in the Slovak language, the
second column represents the translation of the term to English, in the third column Chi-square value is listed, and the last
columns are the percentages of judgment on innocence, resp. guilt.
     Term (Slovak)         Term (English)            Chi-square   Percentage_innocence   Percentage_guilt
     obžalovať obžaloba    to charge the indictment  5295.07      39.6%                  2.2%
     svedkyňa              witness (female)          3026.24      37.6%                  9.0%
     pojednávanie          trial                     2950.20      59.1%                  22.4%
     dôvod                 reason                    2756.17      48.3%                  16.6%
     jednoznačne           unequivocally             2653.22      33.0%                  7.8%
     prísť                 come                      2563.86      35.9%                  9.8%
     situácia              situation                 2328.96      25.9%                  5.2%
     pamätať               to remember               2053.86      25.1%                  5.6%
     obdobie               period                    1928.83      24.4%                  5.8%
     polícia               police                    1920.33      29.8%                  9.2%


Table 4: Table of terms relevant to judgments of guilt. The first column is a term in the Slovak language, the second
column represents the translation of the term to English, in the third column Chi-square value is listed, and the last
columns are the percentages of judgment on innocence, resp. guilt.
     Term (Slovak)               Term (English)              Chi-square   Percentage_innocence   Percentage_guilt
     slobodný voľba              free choice                 9921.81      0.3%                   34.4%
     možnosť slobodný voľba      free choice option          9911.82      0.3%                   33.8%
     skrátiť vzdávať             to shorten give up          9530.09      0.0%                   32%
     súhlasiť návrh              to agree to a proposal      9452.09      0.0%                   31.8%
     radiť spôsob                advise option               9402.72      0.3%                   32.4%
     dobrovoľne spáchať          voluntarily commit          9261.44      0.3%                   32.2%
     voľba                       choice                      9241.86      0.9%                   34.5%
     dobrovoľne spáchať ktorý    voluntarily commit which    9157.98      0.2%                   31.6%
     dobrovoľne                  willingly                   5206.03      7.2%                   37.1%
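
   To illustrate how values of the kind listed in Tables 3 and 4 can be obtained, the sketch
below computes the classical Pearson χ² statistic of a 2x2 contingency table (term present or
absent versus guilt or innocence) together with the percentage of documents of each class
that contain the term. It assumes binary presence of a term in a document; the exact
variant of the χ² test applied to the Tf-Idf representation is not specified in the text, so
the function names, the class labels and the toy corpus are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table

                      guilt   innocence
        term present    a        b
        term absent     c        d
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, the statistic is undefined
    return n * (a * d - b * c) ** 2 / denominator


def term_statistics(docs, labels, term):
    """Return (chi2, pct_innocence, pct_guilt) for one term.

    `docs` is a list of sets of terms, `labels` a list of "guilt"/"innocence".
    """
    a = b = c = d = 0
    for terms, label in zip(docs, labels):
        present = term in terms
        if label == "guilt":
            a, c = a + present, c + (not present)
        else:
            b, d = b + present, d + (not present)
    pct_guilt = 100.0 * a / (a + c) if a + c else 0.0
    pct_innocence = 100.0 * b / (b + d) if b + d else 0.0
    return chi_square_2x2(a, b, c, d), pct_innocence, pct_guilt


# Hypothetical toy corpus: four convictions followed by four acquittals.
docs = [
    {"free choice", "reason"}, {"free choice"},
    {"free choice", "evidence"}, {"evidence"},
    {"evidence", "reason"}, {"evidence"}, {"reason"}, {"evidence", "reason"},
]
labels = ["guilt"] * 4 + ["innocence"] * 4

chi2, pct_innocence, pct_guilt = term_statistics(docs, labels, "free choice")
print(chi2, pct_innocence, pct_guilt)
```

   Ranking all terms by this statistic and keeping the highest-scoring ones would give a
list comparable in form to the tables above, although the absolute values depend on the
corpus size and on the variant of the test used.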


for all cases) and law in action (judgement in the individual case). Based on our findings,
it appears that the judges do not presume the innocence of the defendant.
   This specificity contained in the argumentation can then be seen in the algorithms that
learn to recognize significant strings for the two groups of decisions (guilty, innocent).
This is evident from the precision and recall ratio as well as from Table 4 and Table 3,
where the higher χ² values, and thus the features better suited for classification, are
correlated with judgements where the verdict was guilty. We have also calculated which of the
top 300 terms occurs more in which class and have found that 223 of them had more occurrences
in the guilty class, and only 77 had more in the innocent class. This supports the conclusion
that we have arrived at after making observations from Table 2. At the same time, however,
the conclusions of the paper [29], according to which more used evidence correlates with
decisions on the innocence of the defendant, are confirmed. In the paper [29], the authors
focused only on the corpus of the judgements concerning digital evidence and IP addresses. In
this paper, we use the extended corpus of the judgements, which covers various areas of
criminal law.

6   Conclusion and future works

In this paper, we have shown how to split a judicial decision into its relevant parts and
extract the verdict of the judgements. In addition, we have shown how to create a
representation of the reasoning text using various text representation methods and combined
them with several classification algorithms. We evaluated the performance of these models and
found that methods that are more reliant on detecting specific terms than on a stream of
thoughts produce the most satisfactory results. Multiple models predict most cases with
sufficient accuracy so that the outlying cases can be manually examined by a team of experts.
Furthermore, it can be demonstrated that all representations and models are prone to classify
conviction as acquittal more often than the other way around, which may be because our models
tend to look for features present in convictions and interpret their absence as an acquittal.
   As part of the analysis of significant terms, we have identified the groups of specific
terms closely connected with a specific type of verdict (acquittal or conviction). Also, we
have focused on the terms used in legal arguments (judgements' reasoning) in more detail.
According to the results, the in dubio pro reo principle in criminal proceedings affects the
reasoning of judgements and the subsequent
Table 5: Table of terms used in evidence-based argumentation. The first column is a term in the Slovak language, the
second column represents the translation of the term to English, in the third column Chi-square value is listed, and the last
columns are the percentages of judgment on innocence, resp. guilt.
     Term (Slovak)     Term (English)          Chi-square value   Percentage_acquittal   Percentage_conviction
     výpoveď           testimony               3652.45            52.1%                  14.4%
     preukázať         to prove                3071.27            52%                    17.1%
     dokázať           to prove                3064.15            26%                    2.5%
     dôkaz ktorý       evidence which          2929.94            28%                    4%
     výsluch           hearing                 2839.05            42.2%                  12.4%
     listinný          documentary             2638.38            42.3%                  13.3%
     dôkaz             evidence                2625.49            55.3%                  21.7%
     listinný dôkaz    documentary evidence    2588.78            41.3%                  13%
     znalecký          expert                  1998.48            28.6%                  8%
     dokazovanie       proving                 1917.96            27.5%                  7.9%


analysis of this legal text.
   As an extension of this research, we plan to examine the cases where the labels and
predictions differ and consult a team of lawyers. Their task would be to determine for
individual cases whether the failure is caused by the predictor, in which case we will
research ways to improve our methods further. We will also replace all article references
with the actual text of the articles to increase our predictive capability. We plan to make
further predictions where, in addition to determining the presence of guilt, we will also
attempt to predict the severity of the sentence (e.g. jail time or fine amount). In case
there are multiple defendants, we will try to determine the sentence for each of them.

References

 [1] Aletras, N., Tsarapatsanis, D., Preoţiuc-Pietro, D., Lampos, V.: Predicting judicial
     decisions of the european court of human rights: A natural language processing
     perspective. PeerJ Computer Science 2, e93 (2016)
 [2] Luz de Araujo, P.H., de Campos, T.E., Ataides Braz, F., Correia da Silva, N.: VICTOR: a
     dataset for Brazilian legal documents classification. In: Proceedings of the 12th
     Language Resources and Evaluation Conference. pp. 1449–1458. European Language Resources
     Association, Marseille, France (May 2020),
     https://www.aclweb.org/anthology/2020.lrec-1.181
 [3] Ashley, K.D., Brüninghaus, S.: Automatically classifying case texts and predicting
     outcomes. Artificial Intelligence and Law 17(2), 125–165 (2009)
 [4] Bai, S., Kolter, J.Z., Koltun, V.: An empirical evaluation of generic convolutional and
     recurrent networks for sequence modeling. arXiv:1803.01271 (2018)
 [5] Chalkidis, I., Androutsopoulos, I., Aletras, N.: Neural legal judgment prediction in
     english. CoRR abs/1906.02059 (2019), http://arxiv.org/abs/1906.02059
 [6] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H.,
     Bengio, Y.: Learning phrase representations using rnn encoder-decoder for statistical
     machine translation (2014)
 [7] European Commission for the Efficiency of Justice (CEPEJ): European ethical Charter on
     the use of Artificial Intelligence in judicial systems and their environment (2018),
     https://rm.coe.int/ethical-charter-en-for-publication-4-december-2018/16808f699c
 [8] Ho, T.K.: Random decision forests. In: Proceedings of 3rd international conference on
     document analysis and recognition. vol. 1, pp. 278–282. IEEE (1995)
 [9] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8),
     1735–1780 (1997)
[10] Hu, Z., Li, X., Tu, C., Liu, Z., Sun, M.: Few-shot charge prediction with
     discriminative legal attributes. In: Proceedings of the 27th International Conference on
     Computational Linguistics. pp. 487–498 (2018)
[11] Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., Mikolov, T.: Fasttext.zip:
     Compressing text classification models (2016)
[12] Katz, D.M., Bommarito II, M.J., Blackman, J.: Predicting the behavior of the supreme
     court of the united states: A general approach (2014)
[13] Katz, D.M., Bommarito, M.J., Blackman, J.: A general approach for predicting the
     behavior of the supreme court of the united states. PloS one 12(4), e0174698 (2017)
[14] Keown, R.: Mathematical models for legal prediction. Computer/lj 2, 829 (1980)
[15] Kim, Y.: Convolutional neural networks for sentence classification (2014)
[16] Kleinbaum, D.G., Dietz, K., Gail, M., Klein, M., Klein, M.: Logistic regression.
     Springer (2002)
[17] Kort, F.: Predicting supreme court decisions mathematically: A quantitative analysis of
     the "right to counsel" cases. The American Political Science Review 51(1), 1–12 (1957)
[18] Krajči, S., Novotný, R.: Tvaroslovník–databáza tvarov slov slovenského jazyka. In:
     Proceedings of international conference ITAT 2012. pp. 57–61. SAIA (2012)
[19] Lau, J.H., Baldwin, T.: An empirical evaluation of doc2vec with practical insights into
     document embedding generation. In: Proceedings of the 1st Workshop on Representation
     Learning for NLP. pp. 78–86. Association for Computational Linguistics, Berlin, Germany
     (Aug 2016). https://doi.org/10.18653/v1/W16-1609,
     https://www.aclweb.org/anthology/W16-1609
[20] Long, S., Tu, C., Liu, Z., Sun, M.: Automatic judgment
     prediction via legal reading comprehension. In: Sun, M.,
     Huang, X., Ji, H., Liu, Z., Liu, Y. (eds.) Chinese Com-
     putational Linguistics. pp. 558–572. Springer International
     Publishing, Cham (2019)
[21] Luo, B., Feng, Y., Xu, J., Zhang, X., Zhao, D.: Learn-
     ing to predict charges for criminal cases with legal basis.
     CoRR abs/1707.09168 (2017), http://arxiv.org/
     abs/1707.09168
[22] Mackaay, E., Robillard, P.: Predicting judicial decisions:
     The nearest neighbour rule. November, 1974 41, 302
     (2020)
[23] Medvedeva, M., Vols, M., Wieling, M.: Using machine
     learning to predict decisions of the european court of hu-
     man rights. Artificial Intelligence and Law 28(2), 237–266
     (2020)
[24] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient es-
     timation of word representations in vector space (2013)
[25] Nagel, S.: Using simple calculations to predict judicial de-
     cisions. American Behavioral Scientist 4(4), 24–28 (1960)
[26] Nagel, S.S.: Applying correlation analysis to case predic-
     tion. Tex. L. Rev. 42, 1006 (1963)
[27] Ramos, J., et al.: Using tf-idf to determine word relevance
     in document queries. In: Proceedings of the first instruc-
     tional conference on machine learning. vol. 242, pp. 29–48.
     Citeseer (2003)
[28] Ruger, T.W., Kim, P.T., Martin, A.D., Quinn, K.M.: The
     supreme court forecasting project: Legal and political sci-
     ence approaches to predicting supreme court decisionmak-
     ing. Columbia Law Review pp. 1150–1210 (2004)
[29] Sokol, P., Rózenfeldová, L., Lučivjanská, K., Harašta, J.: Ip
     addresses in the context of digital evidence in the criminal
     and civil case law of the slovak republic. Forensic Science
     International: Digital Investigation 32, 300918 (2020)
[30] Sulea, O., Zampieri, M., Malmasi, S., Vela, M., Dinu, L.P.,
     van Genabith, J.: Exploring the use of text classification in
     the legal domain. CoRR abs/1710.09306 (2017), http:
     //arxiv.org/abs/1710.09306
[31] Sulea, O.M., Zampieri, M., Vela, M., Van Genabith, J.: Pre-
     dicting the law area and decisions of french supreme court
     cases. arXiv preprint arXiv:1708.01681 (2017)
[32] Xiao, C., Zhong, H., Guo, Z., Tu, C., Liu, Z., Sun,
     M., Feng, Y., Han, X., Hu, Z., Wang, H., Xu, J.:
     CAIL2018: A large-scale legal dataset for judgment pre-
     diction. CoRR abs/1807.02478 (2018), http://arxiv.
     org/abs/1807.02478
[33] Zhang, Y.: Support vector machine classification algo-
     rithm and its application. In: International Conference on
     Information Computing and Applications. pp. 179–186.
     Springer (2012)
[34] Zhong, H., Guo, Z., Tu, C., Xiao, C., Liu, Z., Sun, M.: Le-
     gal judgment prediction via topological learning. In: Pro-
     ceedings of the 2018 Conference on Empirical Methods in
     Natural Language Processing. pp. 3540–3549 (2018)