Words Matter – Using Machine Learning to Verify How Words Absolve or Condemn Defendants⋆

Vithor G. F. Bertalan (a) and Evandro E. S. Ruiz (b)

(a) Polytechnique Montreal, Departement de Genie Informatique et Genie Logiciel, 2500 Chem. de Polytechnique, Montreal, Canada
(b) Universidade de São Paulo, Faculdade de Filosofia, Ciências e Letras de Ribeirão Preto, Av. Bandeirantes, 3900, Ribeirão Preto, SP, Brazil

Abstract
Legal prediction is one of the most critical subfields of Natural Language Processing. Researchers use state-of-the-art machine learning and artificial intelligence methodologies to predict specific judicial facets, such as the judicial outcome. For this research, we built a web text crawler to extract homicide case data from Brazilian electronic legal systems. Then, we used word embeddings, processed in several neural networks, to test the ability of our model to predict the outcome of the cases based on their textual characteristics and features, finding that Gated Recurrent Units (GRU) showed the best performance. Afterwards, we applied Hierarchical Attention Networks (HAN) to extract a sample of the most important words used to absolve or convict defendants. We also analyzed those results to find whether we could track patterns in each of the outcomes.

Keywords: Legal prediction, attention weights, natural language processing, machine learning

SCIA-2022: 1st International Workshop on Social Communication and Information Activity in Digital Humanities, October 20, 2022, Lviv, Ukraine
Email: vithor.bertalan@polymtl.ca (V. G. F. Bertalan); evandro@usp.br (E. E. S. Ruiz)
ORCID: 0000-0002-1585-7694 (V. G. F. Bertalan); 0000-0002-7434-897X (E. E. S. Ruiz)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction

Law is among the most text-dependent areas of human activity. Every day, new court decisions, appeals, and other written legal instruments are produced by specialized professionals, such as lawyers, judges, defendants, and plaintiffs. All of them have different needs that intelligent systems could meet. Branting et al. [1] exemplify this potential for using new technologies: the authors argue that the automation of legal reasoning and problem-solving has long been an objective of research in Computer Science. Therefore, it is legitimate to consider that Artificial Intelligence (AI) can be used to assist and optimize the daily work of legal professionals.

With computing resources progressively present in society, both criminals and law enforcement are increasingly using the opportunities offered by AI. In criminal law, deepfake technologies and algorithmic profiling give rise to new types of crime. In criminal procedural law, AI can be used as a law enforcement technology, for example, for predictive policing or cyber agent technology. Machine Learning (ML) is also widely used to improve the legal sector. Ashley and Brüninghaus [2] have reported that two long-standing goals in ML and law research are the automatic classification of case texts and more precise prediction of case outcomes, in a way that attorneys can understand. In the article by Sulea and colleagues [3], we also find arguments that legal professionals benefit significantly from the kind of automation provided by machine learning.

Concerning the data used to feed these AI and ML algorithms, it is worth mentioning that judges have very particular writing styles, as stated in Alarie et al. [4], and often develop particular writing
skills to customize how they present information. Le and his co-authors [5] also characterize legal documents as being very formal, with a structure that is sometimes as essential as their readability. Branting et al. [6] state that even if the laws are faithfully represented in the texts, the terms of those same laws are often impossible for a layperson to interpret. In addition, legal texts have specific properties and structures. The work of Aletras et al. [7] shows that the formal facts of a court case are the most important predictors of a lawsuit. These descriptive passages are essential because, beyond them, legal texts also have other specific characteristics that differentiate them from other types of narratives. For example, references in a legal text have a specific structure that differs from references in the public domain [8]. In [7], we also see that the textual content and the various parts of the case are vital factors influencing the results of judicial tribunals.

In this project, we applied Hierarchical Attention Networks (HAN) to extract a sample of the most important words used to absolve or convict defendants. We also analyzed those results to find whether we could track patterns in each of the outcomes. Our main inspiration for this work is the research performed by Aletras et al. [7], in which textual features were used to predict judicial decisions. To the best of our knowledge, our work is the first to use this methodology, adapted to our final goal of checking the weights of each word in the full documents, to predict judicial decisions in Brazilian Portuguese. The complete research can be found in [9].

In this first section of the paper, we present the introduction and motivation for our research. In Section 2, we review similar research that has also inspired our main work. In Section 3, we describe the methodology used in this work, as well as the algorithms chosen. Section 4 presents the results, including the attention weights of the tokens in our judicial texts. In Section 5, we discuss the numbers given in the previous section. Finally, in Section 6, we close with the conclusion, the problems found, and possible future research directions.

2. Related Works

The main objective of this research lies in the field of sentence prediction. The primary reference for this project is the work of Aletras et al. [7]. The authors used a dataset of European Court of Human Rights cases involving alleged violations of three articles of its Convention: Article 3, which prohibits torture and inhuman and degrading treatment; Article 6, which protects the right to a fair trial; and Article 8, which provides for the right to respect for private and family life, home, and correspondence. From this set of decisions, an equal number of cases that violate (+1) and do not violate (-1) each of the articles were selected and labeled. Pre-processing tools and regular expressions were used to extract the texts. N-grams were tagged and retrieved from the sections Procedure, Circumstances, Facts, Relevant Law, Law, and Full Case, and were then grouped in a vector space model to find the main topics of each article.
Classification using Support Vector Machines (SVM) was 78% accurate in predicting topics and circumstances for Article 3, 84% for Article 6, and 78% for Article 8.

The Random Forests methodology was used in the work of Katz, Bommarito, and Blackman [10] to predict the behavior of the US Supreme Court. The dataset came from the United States Supreme Court and contains 240 variables, among them chronological variables, case antecedent variables, justice-specific variables, and outcome variables. Qualitative texts were labeled 'Reversed', 'Affirmed', or 'Other'. The remaining categorical variables were coded with binary or indicator attributes. The authors used the Support Vector Machines and Random Forests methods to classify and predict sentences and pointed to Random Forests as the best algorithm for their dataset. Using this methodology, the authors achieved a recall of up to 77% for the 'Affirmed' class, while in binary labeling, they had a recall of 78% for the 'Not Reversed' cases. The authors also studied the weights assigned to terms by the SVM method to find the terms that most impacted each violation of the laws.

Sulea and colleagues [3] carried out an investigation similar to the previous ones in sentence prediction, but now involving the areas of Criminal Law and Social or Commercial Law. This group drew on the texts of French Supreme Court (Cour de Cassation) decisions and also used the SVM methodology. The decision collection had 131,830 decision documents combined with descriptive metadata. This standardized metadata consisted of a label for the legal field, a timestamp, the case decision (e.g., forfeiture, rejection, expropriation, etc.), a description of the case, and the cited laws. After pre-processing, the dataset was reduced to 126,865 different court decisions. Each of these documents contained a description of the case and four different types of labels: a) the legal area, b) the date of the decision, c) the process itself, and d) a list of articles and laws cited in the description. The most significant attributes were then selected using hierarchical clustering, and SVMs were used to classify the dataset. The authors reached a precision of 90.2% in classifying the area of law, 96.9% in classifying the judicial decision using a 6-class SVM, and 74.3% using a 7-class SVM to estimate the date of the process and the decision.

In another study, Gokhale and Fasli [11] developed a learning algorithm for classifying human rights abuses using SVM and logistic regression. They based their work on a domain ontology developed for human rights as background knowledge, from which initial seed terms were extracted to generate labeled data for classifier training. In the paper by Le and colleagues [5], the authors used index extraction with sentence structure information for Japanese legal documents to assign each token a weight, a statistical score showing its importance. Court verdicts have distinct structures that are particularly suitable for text analysis. Ashley and Brüninghaus [2] followed a similar line of thought: they developed a model that extracts information from textual descriptions of the facts of decided cases and applies this information to predict the outcomes of the issues involved. In a similar study, Bertalan and Ruiz [12] used the textual characteristics of tokens to predict court outcomes.
However, analysts should not forget, as the team led by Nicolas Sannier [13] mentions, that a crucial complexity in analyzing legal texts is that legal provisions are usually interconnected and spread over different texts that cannot be taken in isolation from each other. In the paper by Sukanya and Priyadarshini [14], we can see several methods that use attention to predict judicial outcomes.

All these academic works indicate that Artificial Intelligence and Natural Language Processing applied to Law form a growing field that offers promising opportunities for research and application in the future. Law is an area of intense activity within the applied social sciences, providing a wide range of applications and generally producing a significant amount of text. Consequently, researchers can produce substantial results with specific applications. The study of computational models for sentence prediction that offer satisfactory results for a sizeable judicial court, such as the São Paulo Justice Court (TJSP), can be of great use to any other court.

3. Methods

This section characterizes the domain and explains the steps necessary to complete the research.

3.1. Data Collection

We based our work on the data provided by the eSAJ system, the electronic document management system used by the São Paulo Justice Court. The São Paulo Justice Court is the world's largest judicial court in terms of the number of legal processes [15]. A crawler was designed and implemented to query the system and save the retrieved documents. When querying the case database through eSAJ, the user can select many fields to filter judicial opinions, such as subjects, judge names, process numbers, and other fields, in order to choose the appropriate texts to be retrieved. From the many attributes, we have selected the following: Judicial Class, Judicial Subject, Date of Availability, and Text. Classes are types of judicial documents; for example, repeals and terminations of contracts would be judicial classes under Brazilian law. Subjects are the type of judicial case being conducted, such as drug trafficking or feminicide. The field Text contains the court decision. Every process is labeled with its outcome (absolved or condemned). The average length of each document is 1,019 words, with the longest text containing 5,846 words.

The crawler can gather data in a predefined date interval to keep the number of selected cases manageable. The initial and final dates of availability mentioned above are input arguments used to capture a restricted corpus. The documents retrieved were made available between July 2018 and March 2019. We collected 1,681 homicide cases, comprising 844 absolutions and 837 condemnations.

For the labeling process, we hired two different Brazilian criminal lawyers, who were responsible for labeling 40% and 60% of the cases, respectively. This step was necessary due to the complexity of legal terminology and because eSAJ provides no explicit information on the outcome of a judicial decision. Hence, the lawyers read the whole content of each decision to determine whether the defendant had been absolved or convicted.

3.2. Data Transformation

Instead of using the raw words in the documents, the words were transformed into an embedded form. Turian, Ratinov, and Bengio [16] define word embeddings as vectors of real numbers distributed over a multidimensional space induced by semi-supervised learning. Since machine learning algorithms work with numbers, not words, word embeddings are a very effective way of transforming words into mathematical values. The rationale behind word embeddings is that the similarity between two words can be calculated as the cosine similarity between their vectors. Each vector dimension represents a characteristic intended to capture a word's semantics, that is, its syntactic or morphological properties, in a distributed way.

GloVe [17] was the methodology chosen to construct the embedded words used in the project. GloVe learns word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence. Rather than using a window to define local context, GloVe constructs an explicit word-context or word co-occurrence matrix using statistics across the whole text corpus. It can combine local and global representations of a term by mixing the features of two model families: global matrix factorization and local context window methods. We used the pre-trained GloVe corpora developed by the team led by Sandra Aluisio [18] for Brazilian Portuguese.
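As a concrete illustration of this transformation step, the sketch below reads pre-trained vectors from a plain-text file in the common word/vector-per-line format and turns a tokenized decision into a matrix of word vectors. It is a minimal sketch rather than the exact pipeline used in the experiments; the file name, the vocabulary limit, and the out-of-vocabulary strategy (a zero vector) are illustrative assumptions.

```python
import numpy as np

def load_vectors(path, limit=None):
    """Load word vectors from a plain-text embedding file:
    one word per line, followed by its floating-point components."""
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip().split(" ")
            if len(parts) <= 2:          # skip a possible "<vocab> <dim>" header line
                continue
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed_document(tokens, vectors, dim=600):
    """Map a tokenized decision to a (len(tokens), dim) matrix of embeddings.
    Out-of-vocabulary tokens are represented by a zero vector (an assumption)."""
    zero = np.zeros(dim, dtype=np.float32)
    return np.stack([vectors.get(tok.lower(), zero) for tok in tokens])

# Hypothetical file name for the 600-dimensional Portuguese GloVe vectors.
emb = load_vectors("glove_s600.txt", limit=200_000)
doc = embed_document("o réu foi absolvido pelo júri".split(), emb)
print(doc.shape)   # (6, 600): one 600-dimensional vector per token
```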
3.3. Machine Learning Processing

We used five different neural network algorithms to process the dataset: Multilayer Perceptron (MLP), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Hierarchical Attention Network (HAN). We applied a 10-fold cross-validation procedure to all five algorithms. Every file was embedded using the Brazilian Portuguese GloVe word vectors with 600 embedding dimensions, and those embeddings were processed by the neural networks to test the ability of each model to predict the outcome of the cases based on their textual characteristics and features.

Both the learning rate and the loss function define the convergence criteria. The learning rate was lowered by 0.2 every three epochs. Concerning the loss function, training stops if the loss does not decrease by at least 0.001 over five epochs. Results were evaluated through the standard quality measures: accuracy, precision, recall, and F-measure. In our tests, we tried several different combinations of hyperparameters to see which ones offered the best results and to find the optimal combination for each of the algorithms.
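To make this training setup concrete, the following is a minimal Keras sketch of the GRU configuration reported in Section 4.1 (one hidden layer with 128 units, dropout of 0.2, frozen 600-dimensional GloVe embeddings, a sigmoid output, and binary cross-entropy). The Adam optimizer and the interpretation of the learning-rate decay as a multiplicative factor are assumptions made for illustration, not settings reported in the original experiments.

```python
from tensorflow.keras import callbacks, initializers, layers, models

EMB_DIM = 600   # dimensionality of the pre-trained Portuguese GloVe vectors

def build_gru_classifier(vocab_size, embedding_matrix):
    """One GRU hidden layer with 128 units, dropout of 0.2, and a sigmoid
    output for the binary absolved/convicted label."""
    model = models.Sequential([
        layers.Embedding(vocab_size, EMB_DIM,
                         embeddings_initializer=initializers.Constant(embedding_matrix),
                         trainable=False),   # keep the GloVe vectors frozen
        # Dropout on the inputs and on the recurrent state -- an interpretation
        # of the reported "0.2 for both the hidden layer and the inputs".
        layers.GRU(128, dropout=0.2, recurrent_dropout=0.2),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",           # optimizer choice is an assumption
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Convergence criteria as described above: the learning rate is lowered every
# three epochs (interpreted here as multiplication by 0.2), and training stops
# when the loss fails to improve by at least 0.001 over five epochs.
lr_schedule = callbacks.LearningRateScheduler(
    lambda epoch, lr: lr * 0.2 if epoch > 0 and epoch % 3 == 0 else lr)
early_stop = callbacks.EarlyStopping(monitor="loss", min_delta=0.001, patience=5)

# model = build_gru_classifier(vocab_size, embedding_matrix)
# model.fit(x_train, y_train, epochs=25, batch_size=32,   # x_train: padded token ids
#           callbacks=[lr_schedule, early_stop])
```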
3.4. Using Hierarchical Attention Networks

Although we used several different neural networks to evaluate the dataset, the core of this research revolves around Hierarchical Attention Networks (HANs). Yang et al. [19] describe a HAN as a neural network architecture that highlights the importance of individual words, or sentences, in building the representation of a document. This model emphasizes the sequences of terms that most affect the classification of a document. In principle, not all terms are equally important for the classification of a text, and not all sentences carry the same meaning either. The result of processing a HAN over a text is the association of an attention coefficient with each term, indicating the importance of that term within its sentence.

Still considering the article by Yang et al. [19], we can see the application of a HAN in Figure 1. Note that sentence 1 ("The woman was present at the crime scene") is more important than sentence 2 for the overall text label (the distinct shades of gray illustrate this difference). This sentence was highlighted because the network ranked it among the most important in the text, that is, it received a relatively greater attention weight than the others. Also note that some words in this sentence are marked in red. These words constitute essential terms, as they carry the highest word-level attention weights.

Figure 1: Example of word attention weights generated by a Hierarchical Attention Network (HAN) [19].

4. Results

In this section, we describe the numerical results and the textual analysis performed on the attention weights of the HANs.

4.1. Numerical Evaluation of Results

Besides the HAN, we evaluated four other networks: a Multilayer Perceptron (MLP), a recurrent neural network (RNN), a Long Short-Term Memory (LSTM) network, and a Gated Recurrent Unit (GRU) network. For the Multilayer Perceptron, we adopted an architecture of 3 hidden layers with 512 × 512 × 250 neurons, respectively, trained for 25 epochs. For the RNN, we used a single hidden layer with 128 units, a dropout probability of 0.5 for the hidden layer, and a dropout probability of 0.2 for the inputs; the sigmoid function was used as the activation function and binary cross-entropy as the loss function, trained for 25 epochs. For the LSTM network, a single hidden layer with 128 units was also used, with a dropout probability of 0.2 for both the hidden layer and the inputs. For the GRU network, we likewise used a single hidden layer with 128 units and a dropout probability of 0.2 for both the hidden layer and the inputs. Finally, the HAN was used with 600 embedding dimensions and word and sentence encoders. The results can be found in Table 1.

Table 1
Evaluation of different neural network approaches.

Neural Network | Precision | Recall | F-Score | Accuracy
MLP            | 0.98      | 0.98   | 0.98    | 0.98
RNN            | 0.86      | 0.88   | 0.86    | 0.85
LSTM           | 0.98      | 0.98   | 0.98    | 0.98
GRU            | 0.99      | 0.99   | 0.99    | 0.99
HAN            | 0.96      | 0.98   | 0.97    | 0.98

Our results are similar to those of Chung et al. [20]: both show that the GRU is faster than other neural network architectures while achieving comparable accuracy. Chung et al. also mention that the choice of network type may depend heavily on the dataset and the corresponding task. For this particular dataset, our research shows that the GRU is the best choice.

4.2. Computation of Attention Weights

Attention weights were calculated for each word in the datasets. Homonymous words appear several times in the dataset, corresponding to their different meanings, and the same word can have different attention weights for each class involved, acquittal or condemnation. We recorded 248,460 unique word tokens in the acquittal texts for alleged homicides and 466,461 unique word tokens in the homicide condemnation texts. For each token, its attention weight was calculated.

Figure 2: Histogram of word attention weights for the homicide condemnations.

A histogram of word attention weights can be seen in Figure 2. The slope of this histogram can be explained through Zipf's Law: considering a relevance metric, the frequency of any word is inversely proportional to its rank. Accordingly, the relevant terms account for only a small proportion of our dataset.
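The post-processing in this subsection reduces to sorting (token, attention weight) pairs per class and binning the weights into a histogram such as the one in Figure 2. The sketch below illustrates this step under the assumption that the per-token attention weights have already been read out of the trained HAN; the few example pairs are shown only for illustration.

```python
import numpy as np

def rank_tokens(weighted_tokens):
    """Sort (token, attention_weight) pairs by weight, highest first. The same
    token may appear several times, one occurrence per sentence context."""
    return sorted(weighted_tokens, key=lambda pair: pair[1], reverse=True)

def weight_histogram(weighted_tokens, bins=50):
    """Histogram of attention weights over [0, 1], analogous to Figure 2."""
    weights = np.array([w for _, w in weighted_tokens], dtype=np.float64)
    return np.histogram(weights, bins=bins, range=(0.0, 1.0))

# Hypothetical per-class data: (token, weight) pairs read out of the trained HAN.
condemnation_tokens = [("homicídio", 0.442), ("disparos", 0.440), ("sala", 0.407)]
print(rank_tokens(condemnation_tokens)[:50])   # top-ranked tokens for the class
counts, edges = weight_histogram(condemnation_tokens)
print(int(counts.sum()))                       # number of tokens binned
```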
The plot in Figure 3 omits the top 10% and the bottom 10% of the words, which were removed, and shows only the remaining middle 80%. We can see that Zipf's Law continues to explain the behavior of the terms. Although Figure 3 refers to the condemnation documents, the same behavior was observed in the absolution documents.

After classification, the words in each dataset were ordered by their attention weights. Each word had a value ranging from 0 (a word with no importance for document classification) to 1 (a word with maximum importance for document classification). It is worth mentioning that a word might have different attention weights in distinct sentences. As an example, consider these two sentences: "The defendant robbed a bank" and "The defendant did not participate in the robbery, because he was going to a blood bank". Both contain the word bank, but in two very different contexts. In the first sentence, the word would be a vital contributor to the condemnation, while the same word would contribute to the absolution in the second sentence. Therefore, words may appear more than once in our final calculations, with different attention weights. The top 50 words for each of the two outcomes are listed in Table 2.

Figure 3: Histogram of word attention weights restricted to the middle 80% interval.

5. Discussion

The first thing we can analyze from the results is the difference in the number of words in absolutions and condemnations: 248,460 unique word tokens for absolutions and 466,461 unique tokens for condemnations, an increase of roughly 88% between the two groups. We therefore infer that judges, as well as their clerks, tend to write more in condemnation texts and dispatches, or at least that they use a more refined and distinctive vocabulary in those cases. This may reflect a larger amount of established jargon in condemnation cases, indicating that this type of outcome has its own particular words, or that legal professionals tend to write more, and differently, in this type of judicial case.

Also, when we analyze the magnitude of the weights, we see that the condemnation tokens have a stronger influence on the outcome of the case. The top token ("bo", coincidentally the same in both lists) has the same weight for both outcomes; however, further down the lists, the condemnation words keep a higher weight: the 50th condemnation token weighs 0.373, while the 50th token in the absolution list weighs 0.310. From this, we observe that the unique tokens in the condemnation list have a higher impact on the outcome than those in the absolution list. This matches the previous observation, indicating that the use of a more advanced vocabulary in those cases significantly affects the text as a whole. This difference can be seen graphically in Figure 4.

Figure 4: Boxplot of word attention weights.

Looking directly at the meaning of the words, we can also see patterns in the lists of both outcomes. In the absolution list, we find many words relating to the context of the defendant, such as bem (good), origem (origin), infância (childhood), social (social), tititi (gossip), cor (color), mãe (mother), anos (years), and soubessem (if they knew). Although further analysis is required, this suggests that many absolution texts consider the socio-cultural aspects of the defendant to substantiate the decision.
On the other hand, the unique condemnation tokens contain several references to violent terms, such as homicídio (homicide), qualificado (aggravated), disparos (gun shots), golpes (blows), socos (punches), lesões (lesions), and colisão (collision). There are also several terms relating to the judicial process, such as infração (infraction), penal (criminal), sentença (sentence), acusação (accusation), and comarca (county). From these terms, we can see that condemnation texts tend to heavily emphasize the nature of the crime and the judicial process itself to substantiate the criminal penalty. As noted in Section 4.2, the same word (such as bank in the earlier example) may receive different attention weights in distinct sentences, depending on its context.

6. Conclusions

By analyzing the attention weights for each outcome, we could see that words do have substantial effects on the meaning of the text. All the mathematical coefficients for the attention weights are shown in Table 2. A HAN analysis offers a significant advantage over a simple word count: with attention neural networks, we can mathematically capture the impact of each word within a sentence, and of each sentence within the overall text.

However, we have to consider the social and technical issues of such a technique. As for technical issues, as described by Surden [21], there are certain well-known limitations to applying AI in Law. First, the model will be helpful only if future cases share standard features with the previously analyzed cases in the training set. Consequently, the model does not capture subtle changes in judicial thinking over time, except when these changes are reflected in a significant volume of training data. Surden also presents an example: not every law firm has several cases that are similar enough to each other for previous cases to contain elements helpful in predicting future outcomes. One might therefore infer that only large law firms will possess the financial and technological power to develop such models.

Regarding the social issues, Katz, Bommarito, and Blackman [10] state that qualitatively oriented legal experts tend to suggest improvements to a model based on anecdotes or their untested mental models rather than on reliable facts and factual data. The authors suggest that, to make a case for the future applicability of a model, it must consistently outperform a baseline comparison. This prerequisite is necessary not only for scientific purposes but also to gain lawyers' trust in the model. Potential strategies for addressing the social issues raised by Artificial Intelligence are discussed by Santoni de Sio and Mecacci [22].

A future path of research could also consider the sentences, not just the unique tokens, to perform a contextual analysis of the meaning of the texts. This way, the most important sentences for each outcome could also be identified and combined with the analysis performed in this paper to determine which complex expressions have the highest weight in each of the outcomes.
Another possibility is to extend the research to the legal systems of other countries or languages, to identify whether our findings hold in those different scenarios or whether each language has its own behavior pattern in its legal texts.

Table 2
Word attention weights for the homicide dataset, and their English translations.

Pos. | Homicide Absolutions: token (translation), weight | Homicide Condemnations: token (translation), weight
1 | bo (incident report), 0.521 | bo (incident report), 0.521
2 | mogi (Brazilian city), 0.428 | cristina (Brazilian name), 0.492
3 | comarca (county), 0.417 | horário (time), 0.479
4 | estado (state), 0.416 | infração (infraction), 0.464
5 | santos (Brazilian city), 0.414 | penal (criminal), 0.456
6 | justiça (justice), 0.413 | cf (Federal Constitution), 0.452
7 | bem (good), 0.411 | regime (regime), 0.445
8 | sala (room), 0.407 | xavier (Brazilian name), 0.444
9 | cep (postal code), 0.407 | homicídio (homicide), 0.442
10 | competência (competence), 0.403 | qualificado (aggravated), 0.442
11 | antes (before), 0.389 | disparos (gun shots), 0.440
12 | volta (return), 0.387 | sant (unknown token), 0.438
13 | origem (origin), 0.387 | exposto (exposed), 0.430
14 | infância (childhood), 0.379 | provisório (provisory), 0.424
15 | social (social), 0.373 | mediante (through), 0.422
16 | porque (why), 0.369 | philipe (Brazilian name), 0.421
17 | júri (jury), 0.363 | sentença (sentence), 0.419
18 | principal (main), 0.352 | toledo (Brazilian name), 0.417
19 | júri (jury), 0.352 | osmarina (Brazilian name), 0.416
20 | altura (height), 0.349 | juízo (in court), 0.415
21 | placas (signs), 0.349 | ip (police investigation), 0.415
22 | nunes (Brazilian name), 0.348 | narra (tells), 0.413
23 | p (page), 0.348 | golpes (blows), 0.409
24 | machado (Brazilian name), 0.346 | justiça (justice), 0.409
25 | porte (weapon carry), 0.344 | sala (room), 0.407
26 | agnaldo (Brazilian name), 0.343 | acusação (accusation), 0.403
27 | sp (Brazilian state), 0.338 | sentença (sentence), 0.401
28 | anos (years), 0.335 | marta (Brazilian name), 0.400
29 | regina (Brazilian name), 0.332 | estado (state), 0.398
30 | tititi (gossip), 0.328 | silva (Brazilian name), 0.394
31 | permitido (allowed), 0.327 | estado (state), 0.393
32 | cor (color), 0.326 | sassolli (Brazilian name), 0.391
33 | mãe (mother), 0.325 | prisão (prison), 0.389
34 | josé (Brazilian name), 0.322 | rua (street), 0.387
35 | instrução (instruction), 0.322 | sentença (sentence), 0.386
36 | cento (cent), 0.322 | justiça (justice), 0.385
37 | comum (common), 0.321 | socos (punches), 0.385
38 | réu (defendant), 0.320 | análise (analysis), 0.384
39 | cosmópolis (Brazilian city), 0.319 | flores (flowers), 0.382
40 | estado (state), 0.319 | estrita (strict), 0.382
41 | saído (gone), 0.318 | mínimo (minimum), 0.382
42 | soubessem (if they knew), 0.317 | competência (competence), 0.381
43 | ação (action), 0.317 | lesões (lesions), 0.379
44 | tribunal (court), 0.314 | infância (childhood), 0.379
45 | pública (public), 0.313 | artigos (articles), 0.377
46 | ordinário (ordinary), 0.313 | colisão (collision), 0.377
47 | todos (all), 0.312 | regime (regime), 0.376
48 | central (central), 0.311 | causaram (caused), 0.376
49 | sessenta (sixty), 0.311 | comarca (county), 0.374
50 | júri (jury), 0.310 | dinheiro (money), 0.373

7. References

[1] L. K. Branting, A. Yeh, B. Weiss, E. Merkhofer, B. Brown, Inducing Predictive Models for Decision Support in Administrative Adjudication, in: U. Pagallo, M. Palmirani, P. Casanovas, G. Sartor, S. Villata (Eds.), AI Approaches to the Complexity of Legal Systems, Springer International Publishing, Cham, 2018, pp. 465–477.
[2] K. D. Ashley, S. Brüninghaus, Automatically classifying case texts and predicting outcomes, Artificial Intelligence and Law 17 (2009) 125–165. doi:10.1007/s10506-009-9077-9.
[3] O.-M. Sulea, M. Zampieri, S. Malmasi, M. Vela, L. P. Dinu, J. van Genabith, Exploring the Use of Text Classification in the Legal Domain, in: Proceedings of the 2nd Workshop on Automated Semantic Analysis of Information in Legal Texts (ASAIL), 2017. URL: http://arxiv.org/abs/1710.09306. arXiv:1710.09306.
[4] B. Alarie, A. Niblett, A. Yoon, How Artificial Intelligence Will Affect the Practice of Law, in: Artificial Intelligence, Technology and the Future of Law, 2017, pp. 1–16. URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3066816.
[5] T. T. N. Le, K. Shirai, M. L. Nguyen, A. Shimazu, Extracting indices from Japanese legal documents, Artificial Intelligence and Law 23 (2015) 315–344. doi:10.1007/s10506-015-9168-8.
[6] L. K. Branting, A. Yeh, B. Weiss, E. Merkhofer, B. Brown, Inducing predictive models for decision support in administrative adjudication, in: AI Approaches to the Complexity of Legal Systems, Springer, 2015, pp. 465–477.
[7] N. Aletras, D. Tsarapatsanis, D. Preoţiuc-Pietro, V. Lampos, Predicting judicial decisions of the European Court of Human Rights: a Natural Language Processing perspective, PeerJ Computer Science 2 (2016) e93. URL: https://peerj.com/articles/cs-93. doi:10.7717/peerj-cs.93.
[8] O. T. Tran, B. X. Ngo, M. L. Nguyen, A. Shimazu, Automated reference resolution in legal texts, Artificial Intelligence and Law 22 (2014) 29–60. doi:10.1007/s10506-013-9149-8.
[9] V. G. F. Bertalan, Using natural language processing methods to predict judicial outcomes, Ph.D. thesis, Universidade de São Paulo, 2020.
[10] D. M. Katz, M. J. Bommarito II, J. Blackman, A general approach for predicting the behavior of the Supreme Court of the United States, PLoS ONE 12 (2017) e0174698.
[11] R. Gokhale, M. Fasli, Deploying A Co-training Algorithm to Classify Human-Rights Abuses, in: 2017 International Conference on the Frontiers and Advances in Data Science (FADS), 2017, pp. 108–113.
[12] V. G. F. Bertalan, E. E. S. Ruiz, Predicting judicial outcomes in the Brazilian legal system using textual features, in: DHandNLP@PROPOR, 2020, pp. 22–32.
[13] N. Sannier, M. Adedjouma, M. Sabetzadeh, L. Briand, An automated framework for detection and resolution of cross references in legal texts, Requirements Engineering 22 (2017) 215–237. doi:10.1007/s00766-015-0241-3.
[14] G. Sukanya, J. Priyadarshini, A meta analysis of attention models on legal judgment prediction system, International Journal of Advanced Computer Science and Applications 12 (2021).
[15] Tribunal de Justica de Sao Paulo – Quem Somos, 2022. URL: https://www.tjsp.jus.br/QuemSomos.
[16] J. Turian, L. Ratinov, Y. Bengio, Word representations: a simple and general method for semi-supervised learning, in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2010, pp. 384–394.
[17] J. Pennington, R. Socher, C. D. Manning, GloVe: Global Vectors for Word Representation, in: Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543. URL: http://www.aclweb.org/anthology/D14-1162.
[18] N. Hartmann, E. Fonseca, C. Shulby, M. Treviso, J. da Silva, S. Aluisio, Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks, arXiv preprint arXiv:1708.06025 (2017). URL: https://arxiv.org/abs/1708.06025.
[19] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical Attention Networks for Document Classification, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, California, 2016, pp. 1480–1489. URL: https://www.aclweb.org/anthology/N16-1174. doi:10.18653/v1/N16-1174.
[20] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555 (2014).
[21] H. Surden, Artificial Intelligence and Law, Washington Law Review 89 (2014) 87–116.
[22] F. Santoni de Sio, G. Mecacci, Four responsibility gaps with artificial intelligence: Why they matter and how to address them, Philosophy & Technology (2021) 1–28.