<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Two-Step Method based on Embedding and Clustering to Identify Regularities in Legal Case Judgements</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Graziella</forename><surname>De Martino</surname></persName>
							<email>graziella.demartino@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>Via Orabona, 4</addrLine>
									<postCode>70125</postCode>
									<settlement>Bari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gianvito</forename><surname>Pio</surname></persName>
							<email>gianvito.pio@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>Via Orabona, 4</addrLine>
									<postCode>70125</postCode>
									<settlement>Bari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="laboratory">Big Data Laboratory</orgName>
								<orgName type="institution">National Interuniversity Consortium for Informatics</orgName>
								<address>
									<addrLine>Via Ariosto, 25</addrLine>
									<postCode>00185</postCode>
									<settlement>Rome</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Michelangelo</forename><surname>Ceci</surname></persName>
							<email>michelangelo.ceci@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>Via Orabona, 4</addrLine>
									<postCode>70125</postCode>
									<settlement>Bari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="laboratory">Big Data Laboratory</orgName>
								<orgName type="institution">National Interuniversity Consortium for Informatics</orgName>
								<address>
									<addrLine>Via Ariosto, 25</addrLine>
									<postCode>00185</postCode>
									<settlement>Rome</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="department">Department of Knowledge Technologies</orgName>
								<orgName type="institution">Jožef Stefan Institute</orgName>
								<address>
									<addrLine>Jamova cesta 39</addrLine>
									<postCode>1000</postCode>
									<settlement>Ljubljana</settlement>
									<country key="SI">Slovenia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Two-Step Method based on Embedding and Clustering to Identify Regularities in Legal Case Judgements</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">B4A6871567F4B0DB87145785FEFD776A</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T02:52+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Legal Information Retrieval</term>
					<term>Embedding</term>
					<term>Clustering</term>
					<term>Approximate Nearest Neighbor Search</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In an era characterized by rapid technological progress that introduces new scenarios every day, working in the legal field may appear very difficult without the support of the right tools. In this paper, we discuss a recently submitted work that proposes a novel method, called PRILJ, which identifies paragraph regularities in legal case judgments to support legal experts during the redaction of legal documents. Methodologically, PRILJ adopts a two-step approach that first groups documents into clusters, according to their semantic content, and then identifies regularities in the paragraphs of each cluster. Embedding-based methods are adopted to properly represent documents and paragraphs in a semantic numerical feature space, and an Approximated Nearest Neighbor Search method is adopted to efficiently retrieve the paragraphs most similar to those of a document under preparation. Our extensive experimental evaluation, performed on a real-world dataset, proves the effectiveness and the efficiency of the proposed method even when documents contain noisy data.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The legal sector is generally characterized by a slow response to the new scenarios that appear every day in modern society. In this context, Artificial Intelligence (AI) methods can support the design of advanced (also automated) solutions to improve the efficiency of the processes in this field. Among the attempts in this direction, we can mention the work presented in <ref type="bibr" target="#b0">[1]</ref>, where the authors applied AI techniques to measure the similarity among legal case documents, which can be useful to speed up the identification and analysis of judicial precedents. Another relevant example is the work in <ref type="bibr" target="#b1">[2]</ref>, where the authors consider the semi-automation of some legal tasks, such as the prediction of judicial decisions of the European Court of Human Rights.</p><p>Following this line of research, in this discussion paper, we describe a novel method, called PRILJ, that identifies paragraph regularities in legal case judgements to support legal experts during the redaction of legal documents. Methodologically, PRILJ adopts a two-step approach that first groups documents into clusters, according to their semantic content, and then identifies regularities in the paragraphs of each cluster. Embedding-based methods are adopted to properly represent documents and paragraphs in a semantic numerical feature space, and an Approximated Nearest Neighbor Search method is adopted to efficiently retrieve the most similar paragraphs. 
Therefore, given a (possibly incomplete or under preparation) document, henceforth called the target document, PRILJ supports the retrieval of similar paragraphs appearing in a set of reference documents related to previously transcribed legal case judgments.</p><p>Document clustering has received a lot of attention from the research community, but, together with the design of advanced algorithms <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6]</ref>, the most critical aspect lies in the design of a proper representation of the objects/items at hand <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref>, as well as of suitable similarity measures. In the literature, we can find several document similarity measures implemented through a) network-based approaches <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref>, b) text-based methods <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b0">1]</ref> or c) hybrid approaches <ref type="bibr" target="#b10">[11]</ref>.</p><p>In this context, PRILJ has the main advantage of properly combining embedding methods, to capture the semantics, with a two-step approach, which consists in learning a different representation for each group of documents, rather than a single model. This allows us to capture the peculiarities of paragraphs according to the specific topic represented by each cluster of documents.</p><p>Our extensive experimental evaluation, performed on a real-world dataset, proves the effectiveness and the efficiency of the proposed method. In particular, its ability to model different topics of legal documents, as well as to capture the semantics of the textual content, appears very beneficial for the considered task and makes PRILJ very robust to the possible presence of noise in the data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Method</head><p>Before describing PRILJ, in the following, we provide some useful definitions:</p><p>• Training set 𝐷 𝑇 : a collection of legal judgments, represented as textual documents, adopted to train our models; • Reference set 𝐷 𝑅 : a collection of legal judgments, represented as textual documents, from which we are interested in identifying paragraph regularities; • Target document 𝑑: a legal judgment (possibly under preparation) for which we are interested in identifying paragraph regularities from 𝐷 𝑅 .</p><p>The training set and the reference set may fully (or partially) overlap, i.e., 𝐷 𝑇 = 𝐷 𝑅 (or 𝐷 𝑇 ∩ 𝐷 𝑅 ≠ ∅), namely, the set of documents adopted to train our models may be the same as (or overlap with) the collection from which we want to identify paragraph regularities with respect to the target document. Note that PRILJ is fully unsupervised and the target document 𝑑 is never contained in either the training set or the reference set (i.e., 𝑑 ∉ (𝐷 𝑇 ∪ 𝐷 𝑅 )). The three phases of PRILJ are detailed in the following subsections. </p></div>
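These set-level constraints can be sketched with plain set operations; the document identifiers below are purely illustrative:

```python
# Illustrative sketch of the constraints on the training set D_T, the
# reference set D_R, and the target document d (identifiers are toy values).
D_T = {"doc1", "doc2", "doc3"}   # training set: documents used to fit the models
D_R = {"doc2", "doc3", "doc4"}   # reference set: documents mined for regularities
d = "doc5"                       # target document, possibly under preparation

# The two sets may fully or partially overlap ...
assert D_T & D_R                 # non-empty intersection: partial overlap allowed
# ... but the target document never belongs to either set.
assert d not in (D_T | D_R)
```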
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Training phase</head><p>As shown in Fig. <ref type="figure" target="#fig_0">1</ref>, PRILJ starts with the application of some pre-processing steps to the documents in 𝐷 𝑇 . Specifically, the pre-processing consists of: i) lowercasing the text, ii) removing punctuation and digits, iii) applying lemmatization, and iv) removing rare words. The pre-processed documents are then used to train a document embedding model 𝑀 , which is subsequently exploited to represent each document of the training set 𝐷 𝑇 in the latent feature space, obtaining the set of embedded training documents 𝐸 𝑇 . Such documents are then partitioned into 𝑘 clusters [𝐶 1 , 𝐶 2 , ..., 𝐶 𝑘 ] by adopting the 𝑘-means clustering algorithm. Each cluster of documents becomes the input for a further learning step at the paragraph level: documents falling in the same cluster will contribute to the learning of a specific paragraph embedding model. Algorithmically, for each document cluster 𝐶 𝑖 , 1 ≤ 𝑖 ≤ 𝑘, we extract the paragraphs (i.e., sentences delimited by a full stop) from the documents falling into 𝐶 𝑖 and train a paragraph embedding model 𝑃 𝑖 . This approach allows us to learn more specific paragraph embedding models, according to the topic possibly represented by the identified clusters.</p><p>The embedding models, both at the document level and at the paragraph level, are learned by PRILJ through neural network architectures based on Word2Vec Continuous-Bag-of-Words (CBOW) <ref type="bibr" target="#b6">[7]</ref> or Doc2Vec <ref type="bibr" target="#b7">[8]</ref> distributed memory (PV-DM). This choice is motivated by the fact that previous works demonstrated the superiority of Word2Vec and Doc2Vec over classical counting-based approaches, since they take into account both the syntax and the semantics of the text <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b0">1]</ref>. 
In addition, their ability to catch the semantics and the context of single words and paragraphs allows them to properly represent new (previously unseen) documents whose features have not been explicitly observed during the training phase.</p></div>
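Pre-processing steps i)–iv) can be sketched as follows; this is a minimal stand-in for PRILJ's actual pipeline, with lemmatization stubbed out and the rare-word threshold (`min_count`) chosen arbitrarily, since no specific threshold is stated here:

```python
import re
from collections import Counter

def preprocess(docs, min_count=2):
    """Pre-processing sketch: (i) lowercase, (ii) remove punctuation and
    digits, (iii) lemmatize (stubbed here), (iv) drop rare words."""
    tokenized = []
    for doc in docs:
        text = doc.lower()                      # (i) lowercasing
        text = re.sub(r"[^a-z\s]", " ", text)   # (ii) strip punctuation/digits
        # (iii) lemmatization stubbed: a real pipeline would use e.g. spaCy/NLTK
        tokenized.append(text.split())
    counts = Counter(t for doc in tokenized for t in doc)
    # (iv) remove rare words; min_count is an illustrative assumption
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]

docs = ["The Court, in 2020, ruled...", "The court ruled again!", "Unrelated text."]
print(preprocess(docs))  # → [['the', 'court', 'ruled'], ['the', 'court', 'ruled'], []]
```

The embedding and clustering steps that follow (Word2Vec/Doc2Vec training and 𝑘-means over the embedded documents) are omitted here, as they rely on external libraries.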
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Paragraph embedding of the reference set</head><p>In Fig. <ref type="figure" target="#fig_1">2</ref>, we show the workflow followed by PRILJ to represent the paragraphs of the documents belonging to the reference set in a latent feature space. Analogously to the training phase, we pre-process the documents of the reference set 𝐷 𝑅 . Then, each document of the reference set is embedded using the previously learned document embedding model 𝑀 . The embedded representation of the document is then used to identify the closest document cluster, which corresponds to the most suitable paragraph embedding model (i.e., 𝑃 𝑐 ) to embed its paragraphs. The set of all the embedded paragraphs 𝐸 𝑅 is finally returned. Paragraph regularities for a given target document 𝑑 will be identified from such set 𝐸 𝑅 .</p></div>
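The selection of the closest document cluster, and hence of the paragraph embedding model to use, can be sketched as below; the use of Euclidean distance to the 𝑘-means centroids is an assumption consistent with 𝑘-means, and all vectors are toy values:

```python
import math

def closest_cluster(doc_vec, centroids):
    """Return the index of the k-means centroid closest to an embedded
    document, i.e. the cluster whose paragraph embedding model P_i is used."""
    dists = [math.dist(doc_vec, c) for c in centroids]  # Euclidean distances
    return dists.index(min(dists))

centroids = [(0.0, 0.0), (10.0, 10.0)]       # toy cluster centroids
assert closest_cluster((1.0, 0.5), centroids) == 0
assert closest_cluster((9.0, 11.0), centroids) == 1
```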
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Identification of paragraph regularities</head><p>The final phase, whose workflow is represented in Fig. <ref type="figure" target="#fig_2">3</ref>, starts by following the same steps mentioned in Sec. 2.2 to represent each paragraph of the target document 𝑑 in the paragraph embedding space. Specifically, the most appropriate paragraph embedding model, selected by identifying the document cluster closest to 𝑑, is adopted to embed its paragraphs. For each embedded paragraph, we finally identify the top-𝑛 most similar paragraphs from the set of embedded paragraphs 𝐸 𝑅 belonging to the reference set.</p><p>It is noteworthy that their identification could straightforwardly be based on the computation of vector-based similarity/distance measures (e.g., cosine similarity, Euclidean distance, etc.) between the embedded paragraphs of the target document 𝑑 and all the embedded paragraphs of the reference set 𝐸 𝑅 . Such a pairwise comparison would be computationally intensive and would lead to inefficiencies during the adoption of the proposed system in a real-world scenario. To overcome this issue, we adopt a more advanced method for the identification of the top-𝑛 most similar paragraphs, based on random projections. In particular, we propose an approach based on Annoy <ref type="bibr" target="#b12">[13]</ref>, whose idea is to perform an approximated nearest neighbor search (ANNS) consisting of two phases: index construction, performed on the paragraphs of the reference set, and search, which occurs when we actually need to identify the top-𝑛 most similar paragraphs with respect to a paragraph of the target document. During the index construction, we build 𝑇 binary trees, where each tree is built by recursively partitioning the input set of vectors: at each step, two vectors are randomly selected and a hyperplane equidistant from them is defined. 
It is noteworthy that, even if the partitioning is random, vectors that are close to each other in the feature space are more likely to appear close to each other in the tree. During the search process, a priority queue is exploited and each tree is recursively traversed, where the priority of each split node is defined according to its distance to the query vector (i.e., a paragraph of the target document, in our case). This process leads to the identification of the 𝑇 leaf nodes into which the query vector falls. The distance between the query vector and the set of vectors falling into the identified leaves is finally exploited to return the top-𝑛 most similar paragraphs <ref type="bibr" target="#b13">[14]</ref>.</p></div>
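A minimal, single-tree sketch of this Annoy-style index construction and search is given below. Annoy itself builds 𝑇 trees and shares a priority queue across them; everything here, including the leaf size, is a simplified illustration rather than PRILJ's implementation:

```python
import random

def build_tree(vectors, ids, leaf_size=4):
    """Index construction: recursively split points by a hyperplane
    equidistant from two randomly chosen vectors (single-tree sketch)."""
    if len(ids) <= leaf_size:
        return ("leaf", ids)
    a, b = (vectors[i] for i in random.sample(ids, 2))
    # Hyperplane equidistant from a and b: normal a - b, through the midpoint
    normal = [x - y for x, y in zip(a, b)]
    mid = [(x + y) / 2 for x, y in zip(a, b)]
    offset = sum(n * m for n, m in zip(normal, mid))
    left = [i for i in ids if sum(n * x for n, x in zip(normal, vectors[i])) <= offset]
    right = [i for i in ids if i not in left]
    if not left or not right:        # degenerate split: stop recursing
        return ("leaf", ids)
    return ("node", normal, offset,
            build_tree(vectors, left, leaf_size),
            build_tree(vectors, right, leaf_size))

def search(tree, vectors, query, n):
    """Search: descend to the leaf the query falls into, then rank the
    vectors in that leaf by exact distance (no multi-tree priority queue)."""
    while tree[0] == "node":
        _, normal, offset, left, right = tree
        side = sum(a * b for a, b in zip(normal, query))
        tree = left if side <= offset else right
    leaf_ids = tree[1]
    return sorted(leaf_ids,
                  key=lambda i: sum((a - b) ** 2
                                    for a, b in zip(vectors[i], query)))[:n]

random.seed(0)
pts = [(float(i), float(i % 3)) for i in range(12)]
tree = build_tree(pts, list(range(len(pts))))
print(search(tree, pts, pts[5], n=3))  # pts[5] itself is always ranked first
```

Note that a query vector always reaches the leaf containing its own point, since build and search apply the same side test; approximation errors only concern the *other* near neighbours, which may fall into different leaves.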
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head><p>All the experiments were performed using a real-world dataset consisting of 4,181 official public EU legal documents, provided by EUR-Lex (https://eur-lex.europa.eu/homepage.html), in a 10-fold cross-validation setting. All the documents of the testing set were considered as target documents, while the reference set was built by constructing 20 replicas of each paragraph of the documents in the testing set, perturbed by introducing a controlled amount of noise. In particular, noise was introduced by replacing a given percentage of the words of each paragraph with random words selected from the Oxford dictionary (raw.githubusercontent.com/cduica/Oxford-Dictionary-Json/master/dicts.json). In our experiments, we considered different levels of noise, namely 10%, 20%, 30%, 40%, 50% and 60%, in order to evaluate the robustness of the proposed approach.</p><p>In order to assess the specific contribution of the adopted embedding strategies, we compared the results obtained through Word2Vec and Doc2Vec with those achieved using a baseline strategy, i.e., the classical TF-IDF. In all the cases, we adopted a 50-dimensional feature vector. Note that we use 50 features since it is a commonly used dimensionality in other pre-trained embedding models. For TF-IDF, we selected the top-50 words showing the highest frequency across the set of legal judgments.</p><p>We specifically evaluated the contribution of the two-step model implemented in PRILJ with different numbers of clusters, i.e., 𝑘 ∈ { √︀ |𝐷 𝑇 |/2, √︀ |𝐷 𝑇 |, √︀ |𝐷 𝑇 | • 2}, and compared the observed performance with that obtained without grouping training documents into clusters (henceforth denoted as one-step model).</p><p>Finally, we evaluated the effectiveness and the efficiency of the approach implemented in PRILJ for the identification of the top-𝑛 most similar paragraphs based on ANNS (with 𝑇 = 100). Specifically, we performed an additional comparative analysis against a non-approximated solution based on the cosine similarity, on a subset of 100 documents randomly selected from the dataset. 
This analysis was performed considering the best number of clusters 𝑘, and also focused on evaluating the advantages in terms of computational efficiency.</p><p>As evaluation measures, we collected precision@n, recall@n and f1-score@n, averaged over the paragraphs of the target documents and over the 10 folds, with 𝑛 ∈ {5, 10, 15, 20, 50, 100}. Specifically, for each paragraph of a target document in the testing set, we considered as True Positives the number of correctly retrieved (perturbed) replicas from the reference set. Note that, in this discussion paper, due to space constraints, we only show the results in terms of f1-score@20. </p></div>
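A sketch of how precision@n, recall@n and f1-score@n can be computed for a single target paragraph, under the 20-replicas setup described above (the function and identifiers are hypothetical, not from the paper's code):

```python
def prf_at_n(retrieved, relevant, n):
    """precision@n, recall@n and f1-score@n for one target paragraph:
    'relevant' is the set of its (perturbed) replicas in the reference set."""
    top = retrieved[:n]
    tp = sum(1 for p in top if p in relevant)   # correctly retrieved replicas
    precision = tp / n
    recall = tp / len(relevant)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Toy check: 20 replicas per paragraph, 15 of them retrieved in the top-20
relevant = {f"rep{i}" for i in range(20)}
retrieved = [f"rep{i}" for i in range(15)] + [f"other{i}" for i in range(5)]
p, r, f = prf_at_n(retrieved, relevant, n=20)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.75 0.75 0.75
```

In the experiments these per-paragraph scores are then averaged over all target paragraphs and over the 10 folds.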
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1: ANNS vs. Cosine (table body not extracted)</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Results</head><p>In Fig. <ref type="figure" target="#fig_4">4</ref> we can observe that, although the baseline based on TF-IDF obtained acceptable results, the adoption of the embedding methods implemented in PRILJ is significantly beneficial. Moreover, although Doc2Vec is natively able to work with word sequences, Word2Vec always obtains better results. This is possibly due to the fact that several paragraphs of different legal documents may share a similar topic, and the adoption of a unique sequence ID to associate the context with the document, as done by Doc2Vec (see <ref type="bibr" target="#b7">[8]</ref> for details), may lead to overfitting issues. In Fig. <ref type="figure" target="#fig_5">5</ref>, it is possible to clearly observe the contribution of the two-step process we propose. Indeed, the results show that the proposed two-step model outperforms the one-step model in all situations. In particular, the two-step model is much more robust to the presence of noise: although we can still observe a decrease when the noise amount increases, its impact is much less evident. We can also observe that, in general, the number of extracted clusters 𝑘 does not seem to significantly affect the results, even if the best results are observed with 𝑘 = √︀ |𝐷 𝑇 | • 2. This means that the documents are distributed among several topics and that learning a different (more specialized) paragraph embedding model for each of them is helpful to retrieve significant paragraph regularities.</p><p>Finally, the comparison between the adopted ANNS and the exact computation of the cosine similarity revealed a difference of 0.6% in terms of f1-score@n, which can be considered negligible. On the other hand, the advantage in terms of efficiency is significant: the exact search required up to 1000x the time taken by the ANNS implemented in PRILJ (see Table <ref type="table">1</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusions</head><p>In this work, we discussed PRILJ, a novel approach to identify paragraph regularities in legal judgments. PRILJ represents documents and paragraphs thereof in a numerical feature space by exploiting embedding methods able to catch the context and the semantics. Moreover, PRILJ is based on a two-step model, which groups similar documents into clusters and, for each of them, learns a specific paragraph embedding model. This approach allows us to properly catch peculiarities exhibited by paragraphs and documents of similar topics and to handle the presence of noise in a robust manner. Finally, PRILJ is able to identify paragraph regularities very efficiently, thanks to an ANNS strategy.</p><p>Our extensive experimental evaluation has shown the accuracy and the efficiency of the developed approach on real data. This means that PRILJ can be considered a useful tool in real-world scenarios, even when large collections of legal documents have to be analyzed.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Graphical overview of the training phase. Green-and red-dotted rectangles represent inputs and outputs, respectively.</figDesc><graphic coords="3,89.29,84.19,416.70,119.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Graphical overview of the paragraph embedding of the reference set. Green-and red-dotted rectangles represent inputs and outputs, respectively.</figDesc><graphic coords="4,89.29,84.18,416.69,146.70" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Graphical overview of the identification of paragraph regularities. Green-and red-dotted rectangles represent inputs and outputs, respectively.</figDesc><graphic coords="5,98.46,84.19,395.86,178.64" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: F1-score@20 results obtained when using TF-IDF, Doc2Vec or Word2Vec as embedding strategies, with the two-step model (𝑘 = √︀ |𝐷 𝑇 | • 2).</figDesc><graphic coords="6,140.13,84.19,312.54,181.99" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: F1-score@20 results obtained with the two-step model (with different values of 𝑘) and with the one-step model. As embedding strategy, we considered Word2Vec.</figDesc><graphic coords="6,140.13,313.12,312.53,158.73" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>GP acknowledges the support of the Ministry of Universities and Research through the project "Big Data Analytics", AIM 1852414-1 (line 1).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Measuring similarity among legal court case documents</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mandal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Chaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Saha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ghosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ghosh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 10th Annual ACM India Compute Conference</title>
				<meeting>of the 10th Annual ACM India Compute Conference</meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1" to="9" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Using machine learning to predict decisions of the european court of human rights</title>
		<author>
			<persName><forename type="first">M</forename><surname>Medvedeva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vols</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wieling</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence and Law</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">A Survey of Clustering Data Mining Techniques</title>
		<author>
			<persName><forename type="first">P</forename><surname>Berkhin</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page">10</biblScope>
		</imprint>
	</monogr>
	<note>Grouping Multidimensional Data: Recent Advances in Clustering</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A density-based algorithm for discovering clusters in large spatial databases with noise</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H.-P</forename><surname>Kriegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sander</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, KDD&apos;96</title>
				<meeting>of the 2nd International Conference on Knowledge Discovery and Data Mining, KDD&apos;96</meeting>
		<imprint>
			<date type="published" when="1996">1996</date>
			<biblScope unit="page" from="226" to="231" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Hierarchical and Overlapping Co-Clustering of mRNA: miRNA Interactions</title>
		<author>
			<persName><forename type="first">G</forename><surname>Pio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ceci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Loglisci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>D'Elia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Malerba</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Frontiers in Artificial Intelligence and Applications</title>
				<imprint>
			<publisher>IOS Press</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">242</biblScope>
			<biblScope unit="page" from="654" to="659" />
		</imprint>
	</monogr>
	<note>ECAI 2012</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">DENCAST: distributed density-based clustering for multi-target regression</title>
		<author>
			<persName><forename type="first">R</forename><surname>Corizzo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ceci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Malerba</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Big Data</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page">43</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Distributed representations of words and phrases and their compositionality</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Distributed representations of sentences and documents</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 31st International Conference on Machine Learning</title>
				<imprint>
			<date type="published" when="2014">2014. 2014</date>
			<biblScope unit="page">4</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Similarity analysis of legal judgments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Reddy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">B</forename><surname>Reddy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Singh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 4th Bangalore Annual Compute Conference, Compute 2011</title>
				<meeting>the 4th Bangalore Annual Compute Conference, Compute 2011<address><addrLine>Bangalore, India</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2011">March 25-26, 2011</date>
			<biblScope unit="page">17</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Finding relevant indian judgments using dispersion of citation network</title>
		<author>
			<persName><forename type="first">A</forename><surname>Minocha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Srivastava</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 24th International Conference on World Wide Web</title>
				<meeting>the 24th International Conference on World Wide Web</meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1085" to="1088" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Finding similar legal judgements under common law system</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Reddy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">B</forename><surname>Reddy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Suri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Databases in Networked Information Systems</title>
				<meeting><address><addrLine>Berlin Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="103" to="116" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Multi-co-training for document classification using various document representations: TF-IDF, LDA, and Doc2Vec</title>
		<author>
			<persName><forename type="first">K</forename><surname>Donghwa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Seo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Sciences</title>
		<imprint>
			<biblScope unit="volume">477</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Bernhardsson</surname></persName>
		</author>
		<ptr target="https://github.com/spotify/annoy" />
		<title level="m">Annoy at github</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lin</surname></persName>
		</author>
		<title level="m">Approximate nearest neighbor search on high dimensional data -experiments, analyses, and improvement</title>
				<imprint>
			<publisher>CoRR</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
