Topic Modelling of Legal Documents via LEGAL-BERT

Raquel Silveira¹, Carlos G. O. Fernandes², João A. Monteiro Neto³, Vasco Furtado⁴, and José Ernesto Pimentel Filho⁵

¹ Federal Institute of Education, Science and Technology of Ceará, Tianguá, Ceará, Brazil
² University of Fortaleza and Banco do Nordeste do Brasil S.A., Fortaleza, Ceará, Brazil
³ University of Fortaleza Law School and FUNCAP, Fortaleza, Ceará, Brazil
⁴ University of Fortaleza, Fortaleza, Ceará, Brazil
⁵ Federal University of Paraíba, João Pessoa, Paraíba, and FUNCAP, Fortaleza, Ceará, Brazil


Abstract
Legal text processing is a challenging task for modeling approaches due to the peculiarities of its features, such as long texts and technical vocabulary. Topic modeling consists of discovering a latent semantic structure in text and therefore requires approaches suited to this domain. The relevant topics depend strongly on the context in which the legal documents are presented. This work describes and evaluates the use of BERTopic for topic modeling of legal documents. The authors focus on a subset of landmark cases from the US Caselaw dataset to evaluate the impact of topic modeling via domain-specific embeddings pre-trained with LEGAL-BERT. The research investigated different variations of generating sentence embeddings from the cases. The results presented here demonstrate that considering the references to statutory law (e.g., the US Code) when generating text embeddings improves the quality of topic modeling.

Keywords
Natural Language Processing (NLP); American Case Law; Contextualized Embeddings

1. Introduction

    Topic Modeling has been successfully applied in Natural Language Processing (NLP) and is frequently used when a huge textual collection cannot reasonably be read and classified by one person. Given a set of text documents, a topic model is applied to discover the interpretable semantic concepts, or topics, present in the documents. Topics represent the theme, or subject, of a text and can be used to build high-level summaries of a massive collection of documents, to find documents of interest, and to group similar documents [1].
    The increasing volume of publicly available legal information has required a continuous effort in the field of automatic processing, intending to promote public access to relevant information; we have in mind the daily needs of students, legal scholars, lawyers, judges, and court officials. For long documents, a useful abstract with key information about content and context is an important step toward delivering legal services and creating an appropriate environment for improving the productivity of courts and of all agents involved in the process. Topic modeling may therefore make the analysis of legal documents more efficient: on the one hand, it reveals implied meanings; on the other, it discovers thematic relations among different legal documents [2, 3].
    Legal documents are often full of technical terminology. Law students are commonly asked to annotate a document with their own views, since a range of opinions may serve as

RELATED - Relations in the Legal Domain Workshop, in conjunction with ICAIL 2021, June 25, 2021, São Paulo, Brazil
EMAIL: raquel_silveira@ifce.edu.br (R. Silveira); carlosgustavo@edu.unifor.br (C. G. O. Fernandes); joaoneto@unifor.br (J. A. Monteiro
Neto); vasco@unifor.br (V. Furtado); jepf@academico.ufpb.br (J. E. Pimentel Filho)
ORCID: 0000-0001-7445-605X (R. Silveira); 0000-0003-0575-4509 (C. G. O. Fernandes); 0000-0002-0690-2449 (J. A. Monteiro Neto);
0000-0001-8721-4308 (V. Furtado); 0000-0002-7534-9405 (J. E. Pimentel Filho)
                               © 2021 Copyright for this paper by its authors.
                               Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                               CEUR Workshop Proceedings (CEUR-WS.org)
entries for understanding the case. A case is open to multiple interpretations, and interpreting it is not a simple task; this creates obstacles to dividing the document into themes. These features make topic modeling both challenging and resource-consuming [2]. Contextualized text representations have been used to capture semantics and, more recently, with BERT [5], have revolutionized natural language processing for many structured prediction problems [4]. As in other specialized domains, legal text (for instance, statutory law, lawsuits, contracts) has distinct characteristics compared to generic corpora, such as specialized vocabulary, particularly formal syntax, and semantics based on extensive domain-specific knowledge [6]. Thus, the content of juridical documents is better represented when a domain-specific model is applied [7].
    In this article, we investigate topic modeling approaches for legal documents using BERTopic [8], a topic modeling technique that represents text with BERT embeddings. Specifically, the proposed approach represents legal documents using LEGAL-BERT [7], a model pre-trained on legal data intended to assist legal NLP research, computational law, and legal technology applications. Domain-specific pre-trained contextual embeddings provide a more refined and semantically richer representation of the text. We then evaluate how much the quality of a document's topic modeling is influenced by the citations to laws in the body of the text. We do this by extending the semantic representation of the document with the text of the United States Code sections cited in the document. The results of different variations of this method show that adding the referenced laws to the embedding representation of the text improves the quality of topic modeling.

2. Related Work
    Efforts have been made to apply Natural Language Processing and Machine Learning techniques to legal text. Latent Dirichlet Allocation (LDA) [9] has been used to model legal corpora [3, 10, 11]. The approach proposed by [10] uses LDA to model the extraordinary appeals (recursos extraordinários) received by the Supreme Court of Brazil. The data consist of a corpus of lawsuits manually annotated with thematic labels by the Court's specialists. The semantic analysis of topics shows that models with 10 and 30 topics were able to capture some of the legal matters discussed by the Court. In addition, experiments show that the model with 300 topics was the best text vectorizer and that the interpretable, low-dimensional representations it generates achieve good classification results.
    [11] qualitatively evaluates the performance of topic models in summarizing and visualizing British legislation, intending to facilitate the navigation and identification of relevant legal topics and their respective sets of topic-specific terms. More specifically, the work evaluates Saffron (a software tool that can construct a model-free topic hierarchy), Non-Negative Matrix Factorization (NMF) [12], Latent Semantic Analysis (LSA) [13], LDA [9], and the Hierarchical Dirichlet Process (HDP) [14]. In the evaluation, Saffron was consistently ranked as the most favorable of all models, since its vocabulary pruning and use of multi-word expressions played a fundamental role in topic coherence.
    To explore the possibilities of finding topics in case law documents, [3] evaluates the use of LDA for extracting precise and useful topics, and whether legal experts and people without legal training agree in their judgments about them. Experts evaluated Dutch case law documents, finding that, for most documents, the model was unable to locate the main topic related to the subject of the document.
    LDA is still the most widely used model for topic modeling. Despite its popularity, it has several weaknesses: to achieve optimal results, it often requires the number of topics to be known in advance, custom stop-word lists, stemming, and lemmatization. Additionally, the method relies on a bag-of-words representation of documents, which ignores the ordering and semantics of words. Distributed representations of documents and words are gaining popularity due to their ability to capture the semantics of words and documents [1]. Pre-trained language models based on BERT [5] and its variants have achieved state-of-the-art results in several downstream NLP tasks. These models represent text in a complex multidimensional space that captures the characteristics of the language necessary for its comprehension. [7] release LEGAL-BERT, a family of BERT models for the legal domain, pre-trained on EU and UK legislation, European Court of Justice cases, European Court of Human Rights cases, US court cases, and US contracts.
    [4] use generalized contextualized language models (BERT [5], GPT-2 [15], and RoBERTa [16]) to obtain token-level contextualized word representations. These representations are clustered with the k-means algorithm to produce topics for English documents from Wikipedia articles, Supreme Court of the United States legal opinions, and Amazon product reviews. The resulting cluster models are simple, reliable, and can perform as well as, if not better than, LDA topic models while maintaining high topic quality.
    [1] developed Top2Vec, a model that uses document and word semantic embeddings to find topic vectors. Among the characteristics of this model: it does not require stop-word lists, stemming, or lemmatization, and it automatically identifies the number of topics. The resulting topic vectors are jointly embedded with the document and word vectors, and the distances between them represent their semantic similarity.
    BERTopic is a topic modeling technique that leverages transformer-based models to achieve robust text representations, HDBSCAN to create dense and relevant clusters, and a class-based TF-IDF (c-TF-IDF) to produce easily interpretable topics while keeping important words in the topic descriptions [8].
    The relevance of the topics modeled in legal documents depends heavily on the legal context and on the broader context of the laws cited. Legal documents belong to a specific domain: different real-world contexts can lead to the violation of the same law, while the same real-world context can violate different laws [2]. However, we are not aware of publications examining topic modeling of legal documents in which the documents are represented by language models of the legal domain.

3. Methodology

   This section describes the approach used to identify and evaluate topics in legal documents. It first presents the set of legal documents used to identify the topics and then describes the methodology used for topic modeling and evaluation.

3.1.    Data Collection

    We collected our primary set of legal documents from the Cornell Legal Information Institute (Cornell LII)'s repository of Historic US Supreme Court Decisions, which represents the list of landmark court decisions in the United States. We randomly selected 314 legal cases and submitted all text associated with them, available through the Cornell LII site, to a cleaning process. For each case, we removed all HTML markup and editorial information and split the remaining text into paragraphs.
    Landmark case categories and subcategories may be named in various ways, with labeling processes derived from different criteria. Cases are often grouped by experts, organizations, or citizens to create a gallery of society's historic values. We use these data both to represent the theme of each document and to check the coherence of the modeled topics. For that, experts in the legal field analyzed each document and labeled it with two columns, a division and a subdivision. In this way, we aim to select topics that are most useful to legal experts. Following patterns of historical analysis, in which the classification of documents must match the clustering purposes and fit previous experience with the text itself, this proposed division and subdivision carries a strong meaning.

3.2.    Topic Modelling

   At its most basic level, topic modeling aims to capture the words that represent the concept of a document. Given a legal document dealing with capital punishment, the topic modeling algorithm may, for example, identify the words "penalty, death, death penalty, punishment, capital punishment, execution, lethal, lethal injection, cruel, protocol" as the topic that the document represents.
    Our data are defined in terms of documents D = {d1, d2, ..., dN} and of the paragraphs of each document, Pdi = {p1, p2, ..., pn}. Given a document composed of a set of paragraphs, the objective is to cluster the paragraphs according to their contextual similarity (each cluster represents a topic) and then choose the topics that represent the main theme of the document di. Each paragraph is represented by a domain-specific contextual embedding; each topic is composed of the set of words that the approach identifies as most relevant to characterize it; and the words of the topics chosen to represent the document identify its theme. Therefore, the input of the approach is the text of a legal document and the output is the k top words of the topics that represent the theme of the document.
    We emphasize that our objective in this preliminary paper is not to discover the best architecture for
this task but to provide a baseline to be used in future works.
    Figure 1 shows the architecture of the topic modeling approach in legal documents used in this paper.




Figure 1: Architecture of the topic modeling approach in legal documents.

    The approach presented in this paper identifies the document topics using BERTopic [8], a topic modeling technique that takes advantage of BERT embeddings [5], dimensionality reduction and clustering algorithms, and a class-based TF-IDF to create dense clusters, allowing interpretable topics to be formed by extracting the most important words from each cluster. In the following, we explain the topic modeling process, as well as some of the specific features of BERTopic.
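    Before detailing each step, the sketch below shows how an equivalent pipeline could be assembled with the released bertopic package [8]. It is illustrative rather than necessarily the authors' exact implementation: the checkpoint name and minimum topic size mirror this section's description, and load_paragraphs is a hypothetical helper.

```python
# Illustrative end-to-end use of the bertopic package; the paper's
# pipeline (steps (a)-(g) below) customizes each stage individually.
from bertopic import BERTopic

# Hundreds of paragraph strings from one legal document (step (a));
# a handful of placeholder strings would be too few for UMAP/HDBSCAN.
paragraphs = load_paragraphs()  # hypothetical helper

topic_model = BERTopic(
    embedding_model="nlpaueb/legal-bert-base-uncased",  # public LEGAL-BERT checkpoint [7]
    min_topic_size=5,  # mirrors the minimum cluster size used in this section
)
topics, probs = topic_model.fit_transform(paragraphs)
print(topic_model.get_topic_info())  # one row per topic: its size and top words
```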
    We highlight that legal documents are known to be complex and written with a very peculiar structure and a specific set of words and expressions. They are often difficult to understand, are extensive, and may cite other cases and legislation. Initially, we split the documents into smaller units, that is, each document di ∈ D is split into paragraphs Pdi = {p1, p2, ..., pn}, according to the original structure of the document (corresponding to "(a) Split paragraphs" in Figure 1).
    One of our assumptions is that adding more information about the context of a document increases the quality of the extracted topics. In legal documents, citations to pre-existing cases and laws are as important as the content of the document itself. Thus, for each paragraph pj ∈ di, we check whether the paragraph contains a citation to the general and permanent laws of the United States (the United States Code). If it does, we add the text of the cited sections of law to the set of paragraphs of the document, Pdi = {p1, p2, ..., pn, pl1, pl2, ..., plk} (step "(b) Add laws" in Figure 1). Finally, we remove any duplicate paragraphs.
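    As an illustration of step (b), the sketch below detects US Code citations with a regular expression and appends the cited section text. Both the citation pattern and the fetch_usc_section helper are hypothetical stand-ins, since the paper does not specify how the law texts were retrieved.

```python
import re

# Rough pattern for US Code citations such as "28 U.S.C. § 2241";
# an illustrative assumption, not the paper's actual citation parser.
USC_PATTERN = re.compile(r"(\d+)\s+U\.?\s?S\.?\s?C\.?\s*§+\s*(\d+[a-z]*)")

def fetch_usc_section(title: str, section: str) -> str:
    """Hypothetical helper: return the text of a US Code section,
    e.g. from a local dump or a service such as Cornell LII."""
    raise NotImplementedError

def add_cited_laws(paragraphs: list[str]) -> list[str]:
    """Append the text of every US Code section cited in the document
    to its paragraph set, then drop duplicates (step (b) in Figure 1)."""
    law_paragraphs = []
    for p in paragraphs:
        for title, section in USC_PATTERN.findall(p):
            law_paragraphs.append(fetch_usc_section(title, section))
    # preserve order while removing duplicate paragraphs
    return list(dict.fromkeys(paragraphs + law_paragraphs))
```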
    Then, as shown in "(c) Generate embeddings by paragraphs (pj ∈ Pdi)" of Figure 1, we convert the elements of Pdi into contextualized numerical vector representations of the legal domain, EMBdi = {embp1, embp2, …, embplk}. We used LEGAL-BERT [7] (an extension of BERT pre-trained specifically for the legal domain) for this purpose, as it extracts embeddings based on the context of legal texts. In this way, we obtain the vector representation embpj ∈ EMBdi for each paragraph pj ∈ Pdi, using equations (1) to (3) below:

$$TK_{p_j} = \mathrm{tokenize}(p_j), \quad p_j \in P_{d_i} \tag{1}$$
$$E_{tk_{p_j}} = \mathrm{embeddingTokens}(tk_i), \quad tk_i \in TK_{p_j} \tag{2}$$
$$emb_{p_j} = \mathrm{embeddingParagraph}(E_{tk_{p_j}}) \tag{3}$$

where the tokenize(pj) function first adds the special tokens [CLS] and [SEP] at the beginning and end of the paragraph pj, respectively, and splits it into subword tokens TKpj using the WordPiece algorithm [17], following the LEGAL-BERT structure [7]. Then, the embeddingTokens(tki) function vectorizes each token tki ∈ TKpj, using the 768 hidden units of the last encoder layer returned by the pre-trained LEGAL-BERT model [7]. Finally, the embeddingParagraph(Etkpj) function averages the embeddings of the tokens embtk ∈ Etkpj, setting embpj to the embedding of the entire paragraph (a vector representation of 768 units).
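    Equations (1) to (3) can be realized with the Hugging Face transformers library. The sketch below assumes the publicly released nlpaueb/legal-bert-base-uncased checkpoint, since the paper does not name a specific LEGAL-BERT variant.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the pre-trained LEGAL-BERT encoder [7]; the checkpoint name is
# our assumption (the public base model released by the authors of [7]).
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")
model.eval()

def embed_paragraph(paragraph: str) -> torch.Tensor:
    """Equations (1)-(3): tokenize (adding [CLS]/[SEP]), take the 768-d
    last-layer hidden states, and average them into one paragraph vector."""
    inputs = tokenizer(paragraph, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.last_hidden_state has shape (1, num_tokens, 768)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)  # shape (768,)
```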
    Before using a clustering algorithm, we first need to reduce the dimensionality of the paragraph embeddings, since many clustering algorithms deal poorly with high dimensionality. Dimensionality reduction allows dense clusters of documents to be found more efficiently and accurately in the reduced space [1]. Among dimensionality reduction algorithms, Uniform Manifold Approximation and Projection (UMAP) [18] preserves the local and global structure of the high-dimensional data in lower dimensionality and is capable of scaling to very large datasets. We used UMAP to reduce the dimensionality to 5 (corresponding to "(d) Dimension reduction" in Figure 1). UMAP has several hyperparameters that determine how it performs the reduction; possibly the most important is the number of nearest neighbors, which controls the balance between preserving global and local structure in the low-dimensional embedding. We set the number of nearest neighbors to 15.
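    A minimal sketch of this step with the umap-learn package follows; the cosine metric is our assumption, as the paper reports only the target dimensionality of 5 and the 15 nearest neighbors.

```python
import numpy as np
import umap  # umap-learn package

# `embeddings` is the (num_paragraphs, 768) matrix produced by the
# LEGAL-BERT pooling step; random data stands in here for illustration.
embeddings = np.random.rand(200, 768).astype(np.float32)

# Reduce to 5 dimensions with 15 nearest neighbors, as in the paper
# (step "(d) Dimension reduction" in Figure 1).
reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine")
reduced = reducer.fit_transform(embeddings)  # shape (200, 5)
```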
    After reducing the dimensionality of the paragraph embeddings, we can cluster them, as shown in "(e) Find clusters" in Figure 1. The goal of density-based clustering is to find areas of highly similar embeddings in the semantic space, which indicate an underlying topic. This is performed on the UMAP-reduced embeddings. We used Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [19] to find dense areas of embeddings without forcing all data points into clusters; points outside dense areas are treated as outliers. The number of clusters is not fixed in advance, while the minimum number of paragraphs in each cluster is set to 5. In this way, the algorithm tries to find the ideal number of clusters by grouping similar paragraphs, and the resulting clusters represent the topics of the paragraphs.
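    The clustering step can be sketched with the hdbscan package as follows; only the minimum cluster size of 5 comes from the paper, the rest is the library's default behavior.

```python
import numpy as np
import hdbscan

# `reduced` is the UMAP output from the previous step; random data stands in.
reduced = np.random.rand(200, 5)

# Cluster without fixing the number of clusters; the label -1 marks
# outlier paragraphs that HDBSCAN does not force into any cluster.
clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(reduced)  # one cluster label per paragraph
print(f"found {labels.max() + 1} clusters, {np.sum(labels == -1)} outliers")
```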
    Then, we identify a set of words that represents the content of each cluster (step "(f) Find the most relevant words to each cluster" in Figure 1). For this, a variant of TF-IDF (Term Frequency - Inverse Document Frequency) structured over clusters, named c-TF-IDF, is used. The c-TF-IDF compares the importance of words to a specific cluster, revealing the most significant words of a topic according to their TF-IDF scores. The c-TF-IDF is calculated according to equation (4) below:

$$c\text{-}TF\text{-}IDF = \frac{f_i}{wd_i} \times \log \frac{m}{\sum_j f_j} \tag{4}$$

where the frequency of each word f is extracted for each cluster i and divided by the total number of words wd of cluster i; this can be seen as a way of normalizing the frequency of words within the cluster. Then, the number of clusters m is divided by the total frequency of the word f across all clusters.
    To create a topic representation, we take the top-10 most representative words of each topic, based on their c-TF-IDF scores. The higher the score, the more representative the word is expected to be for the cluster and, therefore, for the topic.
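    A compact implementation of equation (4) and of the top-10 word extraction might look as follows, assuming each cluster's paragraphs have been joined into one "class document".

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def c_tf_idf(cluster_docs):
    """Equation (4): word frequency per cluster, normalized by the
    cluster's word count and weighted by log(m / total frequency of
    the word across all clusters)."""
    m = len(cluster_docs)
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(cluster_docs).toarray()   # (m, vocab)
    tf = counts / counts.sum(axis=1, keepdims=True)             # f_i / wd_i
    idf = np.log(m / counts.sum(axis=0))                        # log(m / sum_j f_j)
    return tf * idf, vectorizer.get_feature_names_out()

# Toy clusters; the real input is the concatenated paragraphs of each cluster.
scores, vocab = c_tf_idf(["death penalty execution lethal injection",
                          "habeas corpus jurisdiction writ court"])
for i, row in enumerate(scores):
    print(f"topic {i}:", [vocab[j] for j in np.argsort(row)[::-1][:10]])
```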
    After grouping similar paragraphs and identifying the most representative words of each topic, we select the topics that characterize the document (corresponding to "(g) Topics selection for document di" in Figure 1). Each topic tk ∈ T receives a weight wk. Our intuition is that the clusters with the largest numbers of paragraphs best represent the theme of the document, since most paragraphs will be related to the main subject of the document, while a smaller number of paragraphs will be related to complementary subjects rather than the main one. Thus, the weight wk of topic tk is the number of paragraphs clustered in the topic, |tk|. TS denotes the topics of T sorted in descending order of weight. To represent the topics of the document, we select the first n topics tk ∈ TS until a threshold is reached.
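    This selection step can be sketched as below; the coverage threshold value is illustrative, as the paper does not report the one it used.

```python
from collections import Counter

def select_document_topics(labels, coverage=0.8):
    """Step (g): weight each topic by its paragraph count |tk|, sort in
    descending order (TS), and keep topics until their cumulative share
    of non-outlier paragraphs reaches the threshold (0.8 is illustrative)."""
    weights = Counter(l for l in labels if l != -1)  # ignore HDBSCAN outliers
    total = sum(weights.values())
    selected, covered = [], 0
    for topic, w in weights.most_common():
        selected.append(topic)
        covered += w
        if covered / total >= coverage:
            break
    return selected

# e.g. HDBSCAN labels for a small document: topics 0 and 1 cover >= 80%
print(select_document_topics([0, 0, 0, 0, 1, 1, 2, -1, -1]))  # -> [0, 1]
```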

3.3.    Evaluation

    The topic modeling approach described in this paper was applied to a set of legal documents characterized as US landmark cases. These documents have divisions and subdivisions that suggest the main theme of each document.
    Two variations of the approach were evaluated, associated with the representation of the document: (1) the document is represented only by the paragraphs that form it, Pdi = {p1, p2, ..., pn}; (2) the semantic representation of the document is extended by inserting the text of the laws cited in the document into its set of paragraphs, Pdi = {p1, p2, ..., pn, pl1, pl2, ..., plk}.
    The quality of topic models can be evaluated in different ways. We carry out a qualitative assessment under the criterion of interpretability, that is, whether the terms that define a topic convey a consistent and coherent meaning that humans can understand. For this, two experts in the legal field manually inspected the most representative words of the topics selected by the model (for example, the 10 most important words). From this inspection, the experts recorded whether these words semantically correspond to the main theme of each legal document analyzed (comparing them to the text and to the division and subdivision of the document), indicating, if so, that the topics selected by the model represent the main theme of the document.
    Then, the Kappa coefficient [20] was used to assess the degree of agreement between the experts, calculated using equation (5) below:

$$Kappa = \frac{p_o - p_e}{1 - p_e} \tag{5}$$

where po represents the observed proportion of concordances (the sum of the concordant responses divided by the total) and pe represents the expected proportion of concordances (the sum of the expected values of the concordant responses divided by the total).
   Although there is no objective threshold above which the Kappa coefficient should be considered adequate, some suggestions in the literature normally guide this decision, notably the proposal of [21]: Kappa < 0.40 indicates poor agreement; Kappa between 0.40 and 0.75 represents satisfactory to good agreement; and Kappa > 0.75 represents excellent agreement.
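    For reference, equation (5) corresponds to scikit-learn's cohen_kappa_score; the sketch below uses illustrative expert labels, not the study's data.

```python
from sklearn.metrics import cohen_kappa_score

# Per-document judgments ("topic matches the main theme": 1 = yes, 0 = no)
# from two experts; the labels are illustrative, not the study's data.
expert_a = [1, 1, 0, 1, 1, 0, 1, 1]
expert_b = [1, 1, 0, 1, 0, 0, 1, 1]
kappa = cohen_kappa_score(expert_a, expert_b)  # equation (5)
print(f"Kappa = {kappa:.2f}")  # ≈ 0.71: good, but not excellent, per [21]
```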

4. Results and Analysis

    In this section, we analyze the topics retrieved by the approach for each document. We first evaluated the topics modeled under the two variations of document representation. When representing a document only by the paragraphs that compose it, the approach failed to model topics for approximately 8% of the documents, all of which have fewer than 100 paragraphs. Our interpretation is that, in these cases, the approach does not have enough information to group paragraphs with the same semantics. By expanding the representation of the document with the text of the cited laws, we add more information on the subject of the document, and only 5% of the documents had no modeled topics. Considering that the addition of laws to the representation of the document improves the quality of topic modeling, the rest of the evaluation was carried out in this scenario.
    From the qualitative evaluation carried out by the specialists, 84.6% of the topics selected by the model correspond to the main theme of the document (considering the evaluations on which the experts agree). The level of agreement between the evaluators, measured by the Kappa coefficient, is 0.78, which qualitatively represents excellent agreement according to [21].
    Table 1 shows the top-10 representative words (according to the c-TF-IDF described in Section 3.2) extracted by the topic modeling approach presented in this paper for a subset of the dataset with ten legal documents of different themes. The words listed for each topic appear in descending order of c-TF-IDF, from highest to lowest. More specifically, the documents are associated with the following themes, respectively: capital punishment; detainment of terrorism suspects; passengers and interstate commerce; federal Native American law; the Amish; freedom of speech and of the press; end of life; copyright/patents; federalism; and birth control and abortion.

Table 1
Topics modeled for the legal documents.

| ID  | Division                    | Subdivision                        | Topics |
|-----|-----------------------------|------------------------------------|--------|
| D1  | Criminal law                | Capital punishment                 | death, execution, risk, id, injection, penalty, pain, lethal, punishment, protocol |
| D2  | Criminal law                | Detainment of terrorism suspects   | court, jurisdiction, habeas, states, united, united states, courts, district, eisentrager, writ |
| D3  | Equal Protection Clause     | Passengers and Interstate Commerce | statute, interstate, state, commerce, court, passengers, led, states, sct, virginia |
| D4  | Federal Native American law | Federal Native American law        | indian, non indians, jurisdiction, non indian, try, courts, congress, tribes, indian tribes, try nonindians |
| D5  | First Amendment rights      | Amish                              | amish, education, children, religious, school, life, state, child, parents, compulsory |
| D6  | First Amendment rights      | Freedom of speech and of the press | sct, states, united states, present, led2d, danger, present danger, clear present |
| D7  | Individual rights           | End of life                        | new, suicide, treatment, medical, health, sct, york, new york, ann, patients |
| D8  | Intellectual Property       | Copyright/Patents                  | copyright, work, facts, original, works, protection, originality, act, author, telephone |
| D9  | Tax Law                     | Federalism                         | direct, constitution, tax, taxes, apportioned, apportionment, cases, rule, present, indirect |
| D10 | Women's rights              | Birth control and abortion         | abortion, procedure, state, fetus, court, medical, law, statute, dx, id |

    One way to evaluate topic modeling is to analyze how well the topics describe the documents; this assessment measures how informative the topics are to a user. When inspecting the topic model, we can confirm that some topics provide information about the document (D1, D3, D4, D5, D7, D8, D9, and D10 in Table 1), that is, the words of the topic associated with the document are semantically related to the theme of that legal document (represented by its division and subdivision), making it possible to identify the subject of the document. For example, the words "death, execution, risk, id, injection, penalty, pain, lethal, punishment, protocol" allow us to summarize the subject of "capital punishment". Such a summary allows a user to identify the subject related to certain legal matters or simply to summarize the content of a legal document by analyzing its topic. It should be noted that the topics extracted for documents D2 and D6 do not provide information about the document; we assume these documents are short and therefore offer little information for clustering paragraphs significant to the theme of the document.
    To illustrate the visualization of the topics generated by the approach, Figure 2 shows a 2-dimensional projection (dimensionality reduction performed with UMAP) of the embeddings of the paragraphs of a legal document dealing with "capital punishment". Semantically similar texts should be close to each other in the embedding vector space, while dissimilar texts should be more distant from each other. In Figure 2, each circled area represents a cluster identified by the clustering algorithm (HDBSCAN). In this case, the document's paragraphs were clustered into 4 topics. Topic T1 has the largest number of paragraphs and is therefore chosen to represent the subject of the document. The other topics (T2, T3, and T4) are distant from T1, capturing relatively different topics in the legal document. Applying c-TF-IDF, we obtain the following top-5 most representative words for topic T1: "death, execution, risk, id, injection". The top-5 words for topics T2, T3, and T4 are, respectively, "503, 503 653, 653, 536 304, 536", "428 153, 153, 1994, 1976, 428", and "130 1879, 99 130 1879, 1879, 99 130, 99". Thus, the words of topic T1 (chosen to represent the document) are consistent with the theme of the document (capital punishment).




Figure 2: 2-dimensional projection of the vector space of the paragraphs of a legal document on the subject of capital punishment.

   The overview of the most significant words in the document topic enhances the understanding of the document's subject. A word cloud was also generated with the top-30 words of topic T1 of the document shown in Figure 2, according to the c-TF-IDF, to observe the most important terms of the topic, as shown in Figure 3 below.




Figure 3: Most relevant words for the topic, according to c-TF-IDF.
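    A word cloud like Figure 3 can be generated with the wordcloud package, sizing each word by its c-TF-IDF score; the scores below are illustrative.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Illustrative scores; in the pipeline they come from the c-TF-IDF step
# for topic T1 of the "capital punishment" document.
word_scores = {"death": 0.90, "execution": 0.72, "risk": 0.55, "id": 0.50,
               "injection": 0.48, "penalty": 0.40, "pain": 0.35,
               "lethal": 0.33, "punishment": 0.30, "protocol": 0.28}
cloud = WordCloud(background_color="white").generate_from_frequencies(word_scores)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```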

   Although the approach presented in this paper is still preliminary, it offers an attractive way to quickly automate the summarization of legal documents. It can be useful when we have a large amount of text data and want to identify the subject of a particular legal document; in this situation, we can classify and search a large number of documents more efficiently.

5. Conclusions
   We propose the use of BERTopic to build topic models of legal documents. Legal text has specific characteristics, such as specialized vocabulary, formal syntax, and semantics based on an extensive domain of specific knowledge, and it presents citations to other cases, statutory law, the Constitution, and amendments. Accordingly, we represent the text contextually with LEGAL-BERT (a model pre-trained for the legal domain) and add information about the laws cited in the document. In a qualitative assessment, the approach presents good results, revealing topics consistent with the documents' themes.
   This preliminary approach can be used as a baseline for future work. In the future, we intend to explore different strategies for choosing the topics of a document, to quantitatively evaluate the interpretability and coherence of the topics, and to compare the proposed approach with other state-of-the-art approaches. We also intend to extend the approach to clustering documents according to the modeled topics.
6. References

[1] Dimo Angelov. Top2Vec: Distributed Representations of Topics. arXiv:2008.09470v1, (2020).
[2] A. Kanapala, S. Pal, R. Pamula. Text summarization from legal documents: a survey. Artificial
     Intelligence Review 51(3), 371–402 (2019).
[3] Ylja Remmits. Finding the Topics of Case Law: Latent Dirichlet Allocation on Supreme Court
     Decisions. Thesis, Radboud Universiteit, (2017).
[4] Laure Thompson, David Mimno. Topic Modeling with Contextualized Word Representation
     Clusters. arXiv:2010.12626v1, (2020).
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
     bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of
     the North American Chapter of the Association for Computational Linguistics: Human Language
     Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.
     Association for Computational Linguistics, (2019).
[6] Christopher Williams. Tradition and change in legal English: Verbal constructions in prescriptive
     texts, volume 20. Peter Lang, (2007).
[7] Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion
     Androutsopoulos. LEGAL-BERT: The Muppets straight out of Law School, arXiv:2010.02559v1, (2020).
[8] Maarten Grootendorst. BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable
     topics. doi: 10.5281/zenodo.4381785, (2020).
[9] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn.
     Res. 3:993-1022 (2003).
[10] Pedro Henrique Luz de Araújo, and Teófilo de Campos. Topic Modelling Brazilian Supreme Court
     Lawsuits. JURI SAYS, 113, (2020).
[11] James O'Neill, Cécile Robin, Leona O'Brien, Paul Buitelaar. An Analysis of Topic Modelling
     for Legislative Texts. ASAIL 2017, London, UK, June 16, (2017).
[12] Daniel D. Lee, and H. Sebastian Seung. Learning the parts of objects by non-negative matrix
     factorization. Nature. 401 (6755): 788–791. (1999).
[13] Susan T. Dumais. Latent Semantic Analysis. Annual Review of Information Science and
     Technology. 38: 188–230, (2005).
[14] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet Processes. Journal of
     the American Statistical Association. 101 (476): pp. 1566–1581, (2006).
[15] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
     Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, (2019).
[16] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
     Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining
     approach. ArXiv, abs/1907.11692, (2019).
[17] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang
     Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah,
     Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo,
     Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason
     Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and
     Jeffrey Dean. Google's neural machine translation system: Bridging the gap between human and
     machine translation. ArXiv, abs/1609.08144, (2016).
[18] L. McInnes and J. Healy. UMAP: Uniform Manifold Approximation and Projection for Dimension
     Reduction, ArXiv e-prints 1802.03426, (2018).
[19] Leland McInnes, John Healy, and Steve Astels. hdbscan: Hierarchical density based clustering.
     Journal of Open Source Software, 2(11), 205, doi:10.21105/joss.00205, (2017).
[20] Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological
     Measurement, 20, 37-46, (1960).
[21] Fleiss, J. Statistical methods for rates and proportions (2nd ed.). New York: John Wiley &
     Sons, (1981).