A Two-Step Method based on Embedding and Clustering to Identify Regularities in Legal Case Judgements (Discussion Paper)

Graziella De Martino1, Gianvito Pio1,2 and Michelangelo Ceci1,2,3
1 Department of Computer Science - University of Bari Aldo Moro, Via Orabona, 4, 70125, Bari (Italy)
2 Big Data Laboratory, National Interuniversity Consortium for Informatics, Via Ariosto, 25, 00185, Rome (Italy)
3 Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana (Slovenia)

Abstract
In an era characterized by rapid technological progress that introduces new scenarios every day, working in the legal field can be very difficult without the support of the right tools. In this paper, we discuss a recently submitted work that proposes a novel method, called PRILJ, which identifies paragraph regularities in legal case judgments to support legal experts during the redaction of legal documents. Methodologically, PRILJ adopts a two-step approach that first groups documents into clusters, according to their semantic content, and then identifies regularities in the paragraphs of each cluster. Embedding-based methods are adopted to properly represent documents and paragraphs in a semantic numerical feature space, and an Approximated Nearest Neighbor Search method is adopted to efficiently retrieve the most similar paragraphs with respect to the paragraphs of a document under preparation. Our extensive experimental evaluation, performed on a real-world dataset, proves the effectiveness and the efficiency of the proposed method, even when documents contain noisy data.

Keywords
Legal Information Retrieval, Embedding, Clustering, Approximate Nearest Neighbor Search

1. Introduction
The legal sector is generally characterized by a slow response to the new scenarios that appear every day in modern society.
In this context, Artificial Intelligence (AI) methods can support the design of advanced (also automated) solutions to improve the efficiency of the processes in this field. Among the attempts in this direction, we can mention the work presented in [1], where the authors applied AI techniques to measure the similarity among legal case documents, which can be useful to speed up the identification and analysis of judicial precedents. Another relevant example is the work in [2], where the authors consider the semi-automation of some legal tasks, such as the prediction of judicial decisions of the European Court of Human Rights.

SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy
graziella.demartino@uniba.it (Graziella De Martino); gianvito.pio@uniba.it (Gianvito Pio); michelangelo.ceci@uniba.it (Michelangelo Ceci)
ORCID: 0000-0002-3492-6317 (Graziella De Martino); 0000-0003-2520-3616 (Gianvito Pio); 0000-0002-6690-7583 (Michelangelo Ceci)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Following this line of research, in this discussion paper we describe a novel method, called PRILJ, that identifies paragraph regularities in legal case judgements, to support legal experts during the redaction of legal documents. Methodologically, PRILJ adopts a two-step approach that first groups documents into clusters, according to their semantic content, and then identifies regularities in the paragraphs of each cluster. Embedding-based methods are adopted to properly represent documents and paragraphs in a semantic numerical feature space, and an Approximated Nearest Neighbor Search method is adopted to efficiently retrieve the most similar paragraphs.
Specifically, given a (possibly incomplete or under preparation) document, henceforth called target document, PRILJ supports the retrieval of similar paragraphs appearing in a set of reference documents related to previously transcribed legal case judgments.
Document clustering has received a lot of attention from the research community but, together with the design of advanced algorithms [3, 4, 5, 6], the most critical aspect lies in the design of a proper representation of the objects/items at hand [7, 8], as well as of suitable similarity measures. In the literature we can find several document similarity measures, implemented through a) network-based approaches [9, 10], b) text-based methods [11, 1], or c) hybrid approaches [11]. In this context, the main advantage of PRILJ is that it properly combines embedding methods, to catch the semantics, with a two-step approach that consists in learning a different representation for each group of documents, rather than one single model. This aspect allows us to capture the peculiarities of paragraphs according to the specific topic represented by each cluster of documents.
Our extensive experimental evaluation, performed on a real-world dataset, proves the effectiveness and the efficiency of the proposed method. In particular, its ability to model different topics of legal documents, as well as to capture the semantics of the textual content, appears very beneficial for the considered task, and makes PRILJ very robust to the possible presence of noise in the data.

2.
Method
Before describing PRILJ, in the following we provide some useful definitions:
• Training set D_T: a collection of legal judgments, represented as textual documents, adopted to train our models;
• Reference set D_R: a collection of legal judgments, represented as textual documents, from which we aim to identify paragraph regularities;
• Target document d: a legal judgment (possibly under preparation) for which we aim to identify paragraph regularities from D_R.
The training set and the reference set may fully (or partially) overlap, i.e., D_T = D_R (or D_T ∩ D_R ≠ ∅); namely, the set of documents adopted to train our models may be the same as (or overlap with) the collection from which we want to identify paragraph regularities with respect to the target document. Note that PRILJ is fully unsupervised and the target document d is never contained in either the training set or the reference set (i.e., d ∉ (D_T ∪ D_R)). The three phases of PRILJ are detailed in the following subsections.

Figure 1: Graphical overview of the training phase. Green- and red-dotted rectangles represent inputs and outputs, respectively.

2.1. Training phase
As shown in Fig. 1, PRILJ starts with the application of some pre-processing steps to the documents in D_T. Specifically, the pre-processing consists of: i) lowercasing the text, ii) removing punctuation and digits, iii) applying lemmatization, and iv) removing rare words. The pre-processed documents are then used to train a document embedding model M, which is subsequently exploited to represent each document of the training set D_T in the latent feature space, obtaining the set of embedded training documents E_T. Such documents are then partitioned into k clusters [C_1, C_2, ..., C_k] by adopting the k-means clustering algorithm.
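The pre-processing steps above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: step iii (lemmatization) is omitted, since it would require an external lemmatizer such as spaCy or NLTK, and the rare-word threshold min_count is a hypothetical parameter.

```python
import re
from collections import Counter

def preprocess(documents, min_count=2):
    """PRILJ-style pre-processing: i) lowercasing, ii) removal of
    punctuation and digits, iv) removal of rare words. Step iii
    (lemmatization) is intentionally omitted in this sketch."""
    tokenized = []
    for doc in documents:
        text = re.sub(r"[^a-z\s]", " ", doc.lower())  # keep letters only
        tokenized.append(text.split())
    # Corpus-wide word counts, used to drop rare words.
    counts = Counter(w for tokens in tokenized for w in tokens)
    return [[w for w in tokens if counts[w] >= min_count]
            for tokens in tokenized]

preprocess(["The Court ruled in 1998.", "The court RULED again!"])
# -> [['the', 'court', 'ruled'], ['the', 'court', 'ruled']]
```

The resulting token lists would then feed the document embedding model and, afterwards, the k-means step.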
Each cluster of documents becomes the input for a further learning step at the paragraph level: documents falling in the same cluster will contribute to the learning of a specific paragraph embedding model. Algorithmically, for each document cluster C_i, 1 ≤ i ≤ k, we extract the paragraphs (i.e., sentences delimited by a full stop) from the documents falling into C_i and train a paragraph embedding model P_i. This approach allows us to learn more specific paragraph embedding models, according to the topic possibly represented by each identified cluster.
The embedding models, both at the document level and at the paragraph level, are learned by PRILJ through neural network architectures based on Word2Vec Continuous Bag-of-Words (CBOW) [7] or Doc2Vec Distributed Memory (PV-DM) [8]. This choice is motivated by the fact that previous works demonstrated the superiority of Word2Vec and Doc2Vec over classical counting-based approaches, since they take into account both the syntax and the semantics of the text [12, 1]. In addition, their ability to catch the semantics and the context of single words and paragraphs allows them to properly represent new (previously unseen) documents whose features have not been explicitly observed during the training phase.

2.2. Paragraph embedding of the reference set
In Fig. 2, we show the workflow followed by PRILJ to represent the paragraphs of the documents belonging to the reference set in a latent feature space. Analogously to the training phase, we pre-process the documents of the reference set D_R. Then, each document of the reference set is embedded using the previously learned document embedding model M. The embedded representation of the document is then used to identify the closest document cluster, which corresponds to the optimal paragraph embedding model (i.e., P_c).

Figure 2: Graphical overview of the paragraph embedding of the reference set. Green- and red-dotted rectangles represent inputs and outputs, respectively.

We stress the fact that PRILJ performs this step to identify the most proper paragraph embedding model to represent the paragraphs of a given document. The set of all the embedded paragraphs E_R is finally returned. Paragraph regularities for a given target document d will be identified from such a set E_R.

2.3. Identification of paragraph regularities
The final phase, whose workflow is represented in Fig. 3, starts by following the same steps mentioned in Sec. 2.2 to represent each paragraph of the target document d in the paragraph embedding space. Specifically, the most proper paragraph embedding model, selected by identifying the closest document cluster with respect to d, is adopted to embed its paragraphs. For each embedded paragraph, we finally identify the top-n most similar paragraphs from the set of embedded paragraphs E_R belonging to the reference set.
It is noteworthy that their identification could straightforwardly be based on the computation of vector-based similarity/distance measures (e.g., cosine similarity, Euclidean distance, etc.) between the embedded paragraphs of the target document d and all the embedded paragraphs of the reference set E_R. However, such a pairwise comparison would be computationally intensive and would lead to inefficiencies during the adoption of the proposed system in a real-world scenario. To overcome this issue, we adopt a more advanced method for the identification of the top-n most similar paragraphs, based on random projections. In particular, we propose an approach based on Annoy [13], whose idea is to perform an approximated nearest neighbour search (ANNS) consisting of two phases: index construction, performed on the paragraphs of the reference set, and search, which occurs when we actually need to identify the top-n most similar paragraphs with respect to a paragraph of the target document.
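The core operation of the index construction is the random hyperplane split: given two randomly selected vectors a and b, the hyperplane equidistant from them has normal a - b and passes through their midpoint, and the sign of the projection decides the side a vector falls on. A minimal numpy sketch of this split test (the names are ours, not Annoy's internals):

```python
import numpy as np

def split_test(a, b):
    """Annoy-style split: the hyperplane equidistant from vectors a and b
    has normal n = a - b and passes through the midpoint m = (a + b) / 2.
    The sign of n . (x - m) tells on which side a vector x falls."""
    n = a - b
    m = (a + b) / 2.0
    return lambda x: float(np.dot(n, x - m))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
side = split_test(a, b)
side(np.array([0.9, 0.1]))  # positive: same side as a
side(np.array([0.2, 0.8]))  # negative: same side as b
```

Vectors close in the feature space tend to receive the same sign, which is why the recursive splits group similar paragraphs in the same leaves.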
During the index construction, we build T binary trees, where each tree is built by recursively partitioning the input set of vectors: at each step, two vectors are randomly selected and a hyperplane equidistant from them is defined. It is noteworthy that, even if the partitioning is random, vectors that are close to each other in the feature space are more likely to appear close to each other in the tree. During the search process, a priority queue is exploited and each tree is recursively traversed, where the priority of each split node is defined according to its distance to the query vector (in our case, a paragraph of the target document). This process leads to the identification of T leaf nodes into which the query vector falls. The distance between the query vector and the set of vectors falling into the identified leaves is finally exploited to return the top-n most similar paragraphs [14].

Figure 3: Graphical overview of the identification of paragraph regularities. Green- and red-dotted rectangles represent inputs and outputs, respectively.

3. Experiments
All the experiments were performed using a real-world dataset consisting of 4,181 official public EU legal documents, provided by EUR-Lex (https://eur-lex.europa.eu/homepage.html), in a 10-fold cross-validation setting. All the documents of the testing set were considered as target documents, while the reference set was built by constructing 20 replicas of each paragraph of the documents in the testing set, perturbed by introducing a controlled amount of noise. In particular, noise was introduced by replacing a given percentage of the words of each paragraph with random words selected from the Oxford dictionary (raw.githubusercontent.com/cduica/Oxford-Dictionary-Json/master/dicts.json). In our experiments, we considered different levels of noise, namely, 10%, 20%, 30%, 40%, 50% and 60%, in order to evaluate the robustness of the proposed approach to different amounts of noise.
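The generation of a perturbed replica can be sketched as follows. This is an illustrative sketch of the noise-injection scheme described above, with hypothetical names and a tiny stand-in vocabulary in place of the Oxford dictionary.

```python
import random

def perturb(tokens, vocabulary, noise_level, rng):
    """Build a noisy replica of a paragraph by replacing a given
    fraction of its words with random words from a vocabulary."""
    n_replace = round(len(tokens) * noise_level)
    positions = rng.sample(range(len(tokens)), n_replace)
    noisy = list(tokens)
    for pos in positions:
        noisy[pos] = rng.choice(vocabulary)
    return noisy

rng = random.Random(42)  # fixed seed, for reproducibility
vocab = ["apple", "tree", "river"]  # stand-in for the Oxford dictionary
perturb(["the", "court", "ruled", "in", "favour"], vocab, 0.4, rng)
# with noise_level = 0.4, two of the five words are replaced
```

In the experimental setting, this perturbation is applied 20 times per paragraph, once for each replica, at the chosen noise level.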
In order to assess the specific contribution of the adopted embedding strategies, we compared the results obtained through Word2Vec and Doc2Vec with those achieved using a baseline strategy, i.e., the classical TF-IDF. In all cases, we adopted a 50-dimensional feature vector. Note that we use 50 features since it is a commonly used dimensionality in other pre-trained embedding models. For TF-IDF, we selected the top-50 words showing the highest frequency across the set of legal judgments.
We specifically evaluated the contribution of the two-step model implemented in PRILJ with different numbers of clusters, i.e., k ∈ {√(|D_T|/2), √|D_T|, √(2·|D_T|)}, and compared the observed performance with that obtained without grouping training documents into clusters (henceforth denoted as one-step model).

Figure 4: F1-score@20 results obtained when using TF-IDF, Doc2Vec or Word2Vec as embedding strategies, with the two-step model (k = √(2·|D_T|)).

Figure 5: F1-score@20 results obtained with the two-step model (with different values of k) and with the one-step model. As embedding strategy, we considered Word2Vec.

Finally, we evaluated the effectiveness and the efficiency of the approach implemented in PRILJ for the identification of the top-n most similar paragraphs based on ANNS (with T = 100). Specifically, we performed an additional comparative analysis against a non-approximated solution based on the cosine similarity, on a subset of 100 documents randomly selected from the dataset. This analysis was performed considering the best number of clusters k, and also focused on evaluating the advantages in terms of computational efficiency.
As evaluation measures, we collected precision@n, recall@n and F1-score@n, averaged over the paragraphs of the target documents and over the 10 folds, with n ∈ {5, 10, 15, 20, 50, 100}.
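For a single query paragraph, these measures can be computed as follows (a minimal sketch with hypothetical names; in this experimental setting, the relevant set of a paragraph contains the ids of its 20 perturbed replicas):

```python
def scores_at_n(retrieved, relevant, n):
    """precision@n, recall@n and F1-score@n for one query paragraph:
    `retrieved` is the ranked list of returned paragraph ids,
    `relevant` the set of ids of its perturbed replicas."""
    tp = sum(1 for pid in retrieved[:n] if pid in relevant)  # True Positives
    precision = tp / n
    recall = tp / len(relevant)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# 20 relevant replicas; 15 of them appear in the top-20 retrieved ids.
relevant = set(range(20))
retrieved = list(range(15)) + [100, 101, 102, 103, 104]
scores_at_n(retrieved, relevant, 20)  # -> (0.75, 0.75, 0.75)
```

The reported figures would then be obtained by averaging these per-paragraph scores over the target documents and over the 10 folds.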
Specifically, for each paragraph of a target document in the testing set, we considered as True Positives the correctly retrieved (perturbed) replicas from the reference set. Note that, in this discussion paper, due to space constraints, we only show the results in terms of F1-score@20.

Table 1: Average running time (s) for the identification of the top-n most similar paragraphs, with the two-step model and k = √(2·|D_T|).

            ANNS     Cosine Similarity
TF-IDF      0.513    407.612
Doc2Vec     0.551    580.842
Word2Vec    0.610    668.040

3.1. Results
In Fig. 4 we can observe that, although the baseline based on TF-IDF obtained acceptable results, the adoption of the embedding methods implemented in PRILJ is significantly beneficial. Moreover, although Doc2Vec is natively able to work with word sequences, Word2Vec always obtains better results. This is possibly due to the fact that several paragraphs of different legal documents may share a similar topic, and the adoption of a unique sequence ID to associate the context with the document, as done by Doc2Vec (see [8] for details), may lead to overfitting issues.
In Fig. 5, it is possible to clearly observe the contribution of the two-step process we propose. Indeed, the results show that the proposed two-step model outperforms the one-step model in all the considered settings. In particular, the two-step model is much more robust to the presence of noise: although we can still observe a decrease in performance when the amount of noise increases, its impact is much less evident. We can also observe that, in general, the number of extracted clusters k does not seem to significantly affect the results, even if the best results are observed with k = √(2·|D_T|). This means that the documents are distributed among several topics and that learning a different (more specialized) paragraph embedding model for each of them is helpful to retrieve significant paragraph regularities.
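The non-approximated baseline used in the efficiency comparison, i.e., the exact retrieval of the top-n most similar paragraphs via cosine similarity against the whole reference set, can be sketched as follows (a minimal numpy sketch over hypothetical toy data):

```python
import numpy as np

def top_n_cosine(query, reference, n):
    """Exact top-n retrieval by cosine similarity: `reference` is an
    (m, d) matrix of embedded paragraphs, `query` a (d,) vector;
    returns the indices of the n most similar rows."""
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return np.argsort(-(ref @ q))[:n]  # rank all m paragraphs

reference = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
top_n_cosine(np.array([1.0, 0.0]), reference, 2)  # -> indices [0, 2]
```

This scan compares the query against every embedded paragraph, which is exactly the pairwise cost that the ANNS index avoids.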
Finally, the comparison between the adopted ANNS and the exact computation of the cosine similarity revealed a difference of only 0.6% in terms of F1-score@n, which can be considered negligible. On the other hand, the advantage in terms of efficiency is significant: the exact search required up to 1000x the time taken by the ANNS implemented in PRILJ (see Table 1).

4. Conclusions
In this work, we discussed PRILJ, a novel approach to identify paragraph regularities in legal judgments. PRILJ represents documents, and the paragraphs thereof, in a numerical feature space by exploiting embedding methods able to catch the context and the semantics of the text. Moreover, PRILJ is based on a two-step model that groups similar documents into clusters and, for each of them, learns a specific paragraph embedding model. This approach allows us to properly catch the peculiarities exhibited by paragraphs and documents of similar topics, and to handle the presence of noise in a robust manner. Finally, PRILJ is able to identify paragraph regularities very efficiently, thanks to an ANNS strategy. Our extensive experimental evaluation has shown the accuracy and the efficiency of the developed approach on real data. This means that PRILJ can be considered a useful tool in real-world scenarios, also when large collections of legal documents have to be analyzed.

Acknowledgements
GP acknowledges the support of the Ministry of Universities and Research through the project “Big Data Analytics”, AIM 1852414-1 (line 1).

References
[1] A. Mandal, R. Chaki, S. Saha, K. Ghosh, A. Pal, S. Ghosh, Measuring similarity among legal court case documents, in: Proc. of the 10th Annual ACM India Compute Conference, Association for Computing Machinery, 2017, pp. 1–9.
[2] M. Medvedeva, M. Vols, M. Wieling, Using machine learning to predict decisions of the European Court of Human Rights, Artificial Intelligence and Law 28 (2020).
[3] P.
Berkhin, A survey of clustering data mining techniques, in: Grouping Multidimensional Data: Recent Advances in Clustering (2002).
[4] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, KDD'96, 1996, pp. 226–231.
[5] G. Pio, M. Ceci, C. Loglisci, D. D'Elia, D. Malerba, Hierarchical and Overlapping Co-Clustering of mRNA: miRNA Interactions, in: ECAI 2012, volume 242 of Frontiers in Artificial Intelligence and Applications, IOS Press, 2012, pp. 654–659.
[6] R. Corizzo, G. Pio, M. Ceci, D. Malerba, DENCAST: distributed density-based clustering for multi-target regression, J. Big Data 6 (2019) 43.
[7] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (2013).
[8] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: 31st International Conference on Machine Learning, ICML 2014, volume 4 (2014).
[9] S. Kumar, P. K. Reddy, V. B. Reddy, A. Singh, Similarity analysis of legal judgments, in: Proceedings of the 4th Bangalore Annual Compute Conference, Compute 2011, Bangalore, India, March 25-26, 2011, ACM, 2011, p. 17.
[10] A. Minocha, N. Singh, A. Srivastava, Finding relevant Indian judgments using dispersion of citation network, in: Proceedings of the 24th International Conference on World Wide Web, Association for Computing Machinery, 2015, pp. 1085–1088.
[11] S. Kumar, P. K. Reddy, V. B. Reddy, M. Suri, Finding similar legal judgements under common law system, in: Databases in Networked Information Systems, Springer Berlin Heidelberg, 2013, pp. 103–116.
[12] D. Kim, D. Seo, S. Cho, P.
Kang, Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec, Information Sciences 477 (2018).
[13] E. Bernhardsson, Annoy at GitHub, https://github.com/spotify/annoy, 2015.
[14] W. Li, Y. Zhang, Y. Sun, W. Wang, W. Zhang, X. Lin, Approximate nearest neighbor search on high dimensional data - experiments, analyses, and improvement (v1.0), CoRR (2016).