
BioLaw Journal

1613-0073

10.1007/978-3-030-89811-3\_3

Topic Similarity of Heterogeneous Legal Sources Supporting the Legislative Process

Michele Corazza

michele.corazza@unibo.it

Leonardo Zilli

leonardo.zilli@studio.unibo.it

Monica Palmirani

monica.palmirani@unibo.it

University of Bologna, ALMA-AI, via Galliera 3, Bologna, Italy

Keywords: Unsupervised learning, Sentence Transformers, Hybrid AI, Legal NLP

2019

13048 79 84

The legislative process starts with a deep analysis of the existing regulations at the European and national levels, in order to avoid conflicts and to foster the norms in force. The decisions of the Constitutional Court also play a fundamental role in this analysis, both for checking compliance with the constitutional framework and for including the inputs coming from this court in the law-making process. Finally, it is also significant to compare the forthcoming proposal with the bills already presented on the same topic. This comparison is crucial to avoid overlaps and to coordinate the democratic dialogue among the different parties. In this light, this paper presents an unsupervised approach for calculating the similarity between heterogeneous documents annotated in Akoma Ntoso XML, with the aim of supporting the retrieval of similar documents using a thematic taxonomy used in the legal domain. The prototype has been developed in response to a call for expressions of interest launched by the Italian Chamber of Deputies in order to adopt hybrid AI in the legislative process. It uses a completely unsupervised approach based on Sentence Transformers, meaning that neither annotated data nor any fine-tuning process is required.

1. Introduction

The legislative process inside parliaments and official assemblies includes an initial phase of preliminary discovery of the existing regulations and rules in the same domain as the proposal, in order to synchronize conflicting norms.

Secondly, a preliminary legal study must be conducted in order to apply legislative drafting techniques that aim to create transparent and evidence-based legislation (e.g., Better Regulation, planning-and-proposing-law/better-regulation_en). On the other hand, the fragmentation of the legal system imposes on the legislative department the task of an accurate preliminary legal analysis and research at different levels of legislation: at the European level, in order to discover the norms in Regulations and Directives; at the national level, to avoid overlapping with other existing acts; at the ministerial level, to synchronize the technical and operative rules. Notably, it is crucial to check the decisions of the Constitutional Court to avoid producing norms that are unconstitutional. On the other hand, the legal sources, considering their [...] Akoma Ntoso XML [1] for creating a common framework for their representation that is capable of capturing the legal knowledge and metadata (e.g., jurisdiction, hierarchy, temporal model).

Additionally, we provide an unsupervised approach for classifying legal documents according to their topic, which is used to retrieve, from a user input, the relevant legal documents concerning some main legal topics (e.g., the subjects of the Chamber of Deputies Committees defined by law, or the EuroVoc top-level thematic classes). This work was conducted on the use case of the needs and documents of the Italian Chamber of Deputies, answering the call for interests launched in February 2024 concerning the use of AI in Parliament (https://comunicazione.camera.it/archivio-prima-pagina/19-37666).

Legislative language is a peculiar language that includes qualified parts of the text such as the preamble, normative part, definitions, normative references, exceptions, transitional norms, etc. For this reason, the task is not trivial and should take these peculiarities into consideration.

2. Related Work

The creation of models and methods for the legal domain is a challenging endeavour, as this field is characterized by some peculiar aspects that might lead general-purpose approaches to be inaccurate. Nevertheless, a multitude of different models and strategies have been proposed in this field, including models that have been trained specifically on this domain like LEGAL-BERT [2], which was fine-tuned from BERT [3] on legislative documents from the UK, US and EU, and on court documents from the European Court of Justice. Another model, called custom LEGAL-BERT [4], was instead trained on a corpus composed entirely of case law from the Harvard Law Library. Another prominent example of ad-hoc models for the legal domain is called Pile-of-Law (PoL), from the name of the dataset that was used to fine-tune it, which comprises data from 35 different sources in English [5].

Interestingly, in terms of natural language processing applications for the legal domain, most approaches appear to be targeted at the judiciary rather than the legislative branch. Additionally, some approaches include common-law corpora (UK/US) that, for our purpose (EU), could create relevant distortions in the dataset. In particular, a common task is the prediction of a judgment for a given case. This task has been attempted using multiple methods, including the use of a consistency graph and a transformer model to determine which articles have been violated in a given case [6]. The research is not limited to the English language, as there are contributions for Chinese court judgments [7] and rulings from the Indian Supreme Court [8].

Another crucial aspect of research in the wider field of legal informatics is the creation of formats, ontologies and tools that support the machine-readable representation of legal documents, from both the legislative and judiciary branches. Among these, one of the founding elements of our approach is the usage of the Akoma Ntoso XML standard [1, 9], which has been adopted by many international institutions [10, 11, 12, 13, 14] to represent legal documents. This standard allows the annotation of legal definitions, references, the hierarchical structure of legal documents, as well as the temporal aspects of legal documents.

3. Datasets and resources

The documents used for the project have been collected from different sources, resulting in four distinct datasets:

• Corte Costituzionale: Contains the orders and judgments of the Italian Constitutional Court, spanning from 1956 to 2018 (10725 documents), which have been downloaded and converted to Akoma Ntoso using an ad-hoc tool (https://gitlab.com/CIRSFID/cortecostituzionale-py);
• Progetti di Legge (PDL): A collection of Italian legislative bills from the legislatures XVIII and XIX (March 2018 to May 2024, 3615 documents), extracted from the official website of the Chamber of Deputies of the Italian Parliament (https://www.camera.it/) in the HTML format and converted to Akoma Ntoso using a batch Python parser (https://gitlab.com/CIRSFID/html2aknPDL);
• EUR-Lex: A collection of Regulations and Directives from the European Union, spanning from 2010 to 2021, extracted from the EUR-Lex website (https://eur-lex.europa.eu) and converted from Formex to the Akoma Ntoso format using our conversion tool (http://u2.cirsfid.unibo.it/formexplus2akn/frontend/);
• Normattiva: A collection of Italian legislative acts extracted from the Normattiva portal (https://www.normattiva.it/), which contains all legislative documents from the Italian parliament in Akoma Ntoso format. The documents from 2010 to May 2024 were selected, including Primary and Secondary Law.

When not already in the Akoma Ntoso XML format, as is the case for the PDL and EUR-Lex datasets, the documents have been converted to this format. Through this conversion, it is possible for us to extract portions of the document according to its hierarchical structure (articles, paragraphs, lists, etc.). This structural information is very important for the legal domain, as it allows documents to be chunked while considering their structure (e.g., legal definitions, articles, lists of points). Furthermore, normative references are also annotated as such, and a unique URI is used to indicate them. The Akoma Ntoso standard also follows the FRBR conceptual model, which is used to distinguish between works (i.e., a specific law), expressions (the various consolidated versions of each law that have been amended over time) and manifestations (the physical embodiment of an expression or work). Through the annotation of the hierarchical structure of documents, the references and the URI naming convention based on FRBR, it is possible to resolve normative references, even when they refer to a part of a document, like a single article or paragraph. Furthermore, the FRBR model allows us to retrieve the consolidated version of a document which is temporally relevant for a given reference. Akoma Ntoso also includes legal metadata (e.g., jurisdiction, temporal information, modifications, definitions, law-making process, life-cycle of the document, classification), which improves the expressiveness of legal knowledge in the XML representation.
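As an illustration of how the hierarchical structure and the annotated references can be read from an Akoma Ntoso document, the following is a minimal sketch using only the Python standard library. The sample document is a hypothetical, heavily simplified fragment written for this example (it is not taken from the datasets), and the function name `extract_components` is likewise an assumption; only the namespace URI and the element names (`docTitle`, `article`, `ref`) come from the Akoma Ntoso vocabulary.

```python
import xml.etree.ElementTree as ET

# Namespace of the OASIS Akoma Ntoso 1.0 vocabulary.
AKN_NS = {"akn": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0"}

# Hypothetical, simplified document used only for illustration.
SAMPLE = """<akomaNtoso xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0">
  <act name="legge">
    <preface><p><docTitle>Disposizioni in materia di trasporti</docTitle></p></preface>
    <body>
      <article eId="art_1">
        <num>Art. 1</num>
        <paragraph eId="art_1__para_1">
          <content><p>Si applica l'<ref href="/akn/eu/act/regulation/2016-04-27/679/!main#art_3">articolo 3 del regolamento</ref>.</p></content>
        </paragraph>
      </article>
      <article eId="art_2">
        <num>Art. 2</num>
        <paragraph eId="art_2__para_1">
          <content><p>La presente legge entra in vigore il giorno successivo.</p></content>
        </paragraph>
      </article>
    </body>
  </act>
</akomaNtoso>"""


def extract_components(xml_text):
    """Return the document title and, per article, its text and reference URIs."""
    root = ET.fromstring(xml_text)
    title_el = root.find(".//akn:docTitle", AKN_NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""
    articles = {}
    for art in root.iterfind(".//akn:article", AKN_NS):
        # Collect the href of every annotated normative reference in the article.
        refs = [ref.get("href") for ref in art.iterfind(".//akn:ref", AKN_NS)]
        # Flatten the article text and normalize whitespace.
        text = " ".join("".join(art.itertext()).split())
        articles[art.get("eId")] = {"text": text, "refs": refs}
    return title, articles


title, articles = extract_components(SAMPLE)
# A reference URI can be split into the referenced document and the portion
# it points to (here, a single article).
document_uri, _, fragment = articles["art_1"]["refs"][0].partition("#")
```

In this sketch, the fragment after `#` identifies the referenced portion, which is the information needed to resolve a reference to a single article rather than to a whole document.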

Each dataset follows semantically descriptive naming conventions for the documents, which facilitate subsequent data handling and processing steps in the pipeline of the project. Table 1 summarizes the number of documents contained in each dataset.

Table 1: Number of documents contained in each dataset.

Dataset                 N. of Documents
Corte Costituzionale    10725
PDL                     3615
EUR-Lex                 14305
Normattiva              3195

In order to deal with the highly heterogeneous nature of the datasets, labels describing a number of topics have been used to categorize the documents. The documents concerning Italy have been classified according to the labels of the Committees of the Chamber of Deputies. Each Committee is represented as a string that contains its title (shown in Table 2), as well as its description as presented in the Circolare del Presidente della Camera (16 ottobre 1996, n. 3), the official document that regulates the matters of competence of each Committee. For the Constitutional Court dataset only, the “Giustizia” (Justice) and “Affari costituzionali, della Presidenza del consiglio e interni della Camera dei deputati” (Constitutional Affairs, Presidency of the Council and Internal Affairs of the Chamber of Deputies) commissions were excluded, as they apply to the vast majority of Constitutional Court documents.

Concerning the EUR-Lex dataset, the classification leveraged the European multilingual thesaurus, EuroVoc, using the top-level terms (shown in Table 3) and their immediate subcategories separated by semicolons. As was done for the Constitutional Court dataset, an overly general label was excluded: the term “Unione Europea” (European Union) has been removed, as it is relevant to all documents in the dataset.
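To make the construction of the topic labels concrete, the following is a minimal sketch of how a label string could be assembled from a top-level term and its immediate subcategories, with overly general terms excluded. The exact string layout (the term followed by its subcategories, all joined by semicolons), the function name `build_topic_labels`, and the toy thesaurus entries are assumptions made for illustration; they are not the actual EuroVoc content.

```python
from typing import Dict, Iterable, List


def build_topic_labels(
    thesaurus: Dict[str, List[str]], excluded: Iterable[str] = ()
) -> Dict[str, str]:
    """Map each top-level term to one label string, skipping excluded terms."""
    skip = {term.casefold() for term in excluded}
    return {
        # Assumed layout: top-level term, then its subcategories, semicolon-separated.
        term: "; ".join([term, *subterms])
        for term, subterms in thesaurus.items()
        if term.casefold() not in skip
    }


# Toy entries for illustration only (not actual EuroVoc subcategories).
toy_thesaurus = {
    "Trasporti": ["politica dei trasporti", "trasporto terrestre"],
    "Finanze": ["bilancio", "fiscalità"],
    "Unione Europea": ["istituzioni dell'Unione europea"],
}

labels = build_topic_labels(toy_thesaurus, excluded=["Unione Europea"])
```

Each resulting string is what would later be embedded and compared with the documents.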

4. Document Classification

In order to classify documents according to their content, we used an approach based on the SentenceTransformers library [15], and selected the multilingual model “paraphrase-multilingual-mpnet-base-v2” [16]. This model is made multilingual from the monolingual Sentence Transformer model “paraphrase-mpnet-base-v2”, in turn based on MPNet [17], which was trained using a contrastive loss and an approach similar to siamese networks to allow the direct application of a metric (cosine similarity) to its output vectors in order to measure the semantic proximity of sentences. The monolingual model is then used as a teacher in a teacher-student configuration to train the multilingual one, so that both the original and translated versions of sentences have the same vector representation in the new model. The chosen model, in particular, was trained on parallel data and supports 50+ languages, including Italian and English. Crucially, the usage of a sentence transformer allows us to operate in a completely unsupervised way, without the need to use annotated data or to fine-tune the model for the classification task, since we can directly apply cosine similarity to measure semantic relatedness.

Table 2: Titles of the Committees of the Chamber of Deputies.

Affari esteri e comunitari
Difesa
Bilancio, tesoro e programmazione
Finanze
Cultura, scienza ed istruzione
Ambiente, territorio e lavori pubblici
Trasporti, poste e telecomunicazioni
Attività produttive, commercio e turismo
Lavoro pubblico e privato
Affari sociali
Agricoltura
Politiche dell’Unione Europea
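The idea of ranking topics by cosine similarity, without any training, can be sketched as follows. To keep the example runnable offline, a toy bag-of-words encoder stands in for the sentence transformer; this substitution is an assumption of the sketch, as are the function names and the short topic descriptions. With the SentenceTransformers library, the `encode` method of a loaded `SentenceTransformer` model would take the place of the toy `encode` function.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def make_bow_encoder(corpus: Sequence[str]) -> Callable[[str], List[float]]:
    """Toy stand-in for a sentence encoder: word counts over a fixed vocabulary."""
    vocab = sorted({tok for text in corpus for tok in text.lower().split()})

    def encode(text: str) -> List[float]:
        tokens = text.lower().split()
        return [float(tokens.count(word)) for word in vocab]

    return encode


def rank_topics(
    text: str, topics: Dict[str, str], encode: Callable[[str], Vector]
) -> List[Tuple[str, float]]:
    """Rank topic labels by the cosine similarity of their description to the text."""
    text_vec = encode(text)
    scored = [(label, cosine(text_vec, encode(desc))) for label, desc in topics.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical topic descriptions and title, for illustration only.
topics = {
    "Difesa": "difesa forze armate",
    "Finanze": "finanze tributi imposte",
    "Agricoltura": "agricoltura pesca foreste",
}
title = "Disposizioni in materia di imposte e tributi locali"
encode = make_bow_encoder([title, *topics.values()])
ranking = rank_topics(title, topics, encode)
```

Because no label is ever used to fit the encoder, swapping in a different embedding model only changes the `encode` callable.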

In order to produce a classification of the documents, we selected two components of the normative documents (EUR-Lex, Normattiva, PDL), namely their titles and articles. For the Corte Costituzionale dataset, we selected the introduction as a substitute for the title, while instead of the articles we used the decision portion of the documents, in addition to all textual content between parentheses, which contains brief descriptions of referenced documents. The text between parentheses is fed to the model and the results are averaged to produce a single vector. In the following sections, we use “titles” and “articles” for brevity, but these correspond to introduction and decision + parentheses for the Corte Costituzionale dataset.

These components were extracted by applying the appropriate XPath query to the Akoma Ntoso XML tree representing each document. The first step is to compute embeddings representing each title of the document. Then, we proceed to compute the article vectors. While in the case of titles we can just apply the sentence transformer directly to the text, the length of articles might prevent the model from producing accurate results, or even exceed the maximum number of tokens allowed by a given model. For this reason, our approach leverages the structure of articles, represented using Akoma Ntoso, to produce one embedding for each article. In particular, we proceed by traversing the XML tree in a recursive manner, until we reach the XML elements that are leaves of the tree. We exclude the elements that appear inline in the text (e.g., dates, references, etc.) in order to keep the textual content of each leaf node (e.g., paragraph, item of a list, etc.) intact. A visualization of the procedure is shown in Figure 1. In addition to its own textual content, each leaf node is associated with a list of the references in its text, which are resolved as follows:

• For punctual references (e.g., Article 3 of Regulation xx/yyyy/EU) we obtain the specific referenced portion of the document as an XML element;
• For generic references to an entire document (e.g., Regulation xx/yyyy/EU) we use the title and first article of the document to represent it.

Formally, then, an article $a$ having children and references is represented by an embedding obtained from the model $M$ using the following recursive procedure:

$$E(a) = \frac{1}{2 + |C(a)|} \left( M(T(a)) + \sum_{i} E(C_i(a)) + \frac{1}{|R(a)|} \sum_{j} F(R_j(a)) \right) \tag{1}$$

Where:

• $T(a)$ is the textual content of the article which is not included in any of its non-inline children;
• $C(a)$, $C_i(a)$ represent the set of all non-inline children of $a$ and the i-th child element of $a$, respectively;
• $R(a)$, $R_j(a)$ represent all the references in the text of the article, and the j-th reference in the text, respectively.

In order to represent references, then, we can define a function $F$ that works as follows:

$$F(r) = \begin{cases} E(r) & \text{if } r \text{ is a punctual reference} \\ \frac{1}{2} \left( M(t(r)) + E(a_1(r)) \right) & \text{otherwise} \end{cases} \tag{2}$$

Where $t(r)$ represents the title of the referenced document, while $a_1(r)$ is the first article of the document. Overall, the function $E$ as defined previously computes an average vector representation for each article, which aggregates the embeddings of all its children but also considers the normative references contained in the text.
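The recursive averaging over an article's own text, its non-inline children and its resolved references can be sketched as follows. This is one possible reading of the procedure rather than the authors' implementation: the `Node` and `Reference` classes, the function names, and the choice to average only the components that are actually present (so that nodes without text or without references do not contribute a term) are all assumptions of this sketch. The `model` argument is any callable mapping a string to a vector.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


@dataclass
class Reference:
    punctual: bool
    target: Optional["Node"] = None         # referenced portion (punctual reference)
    title: str = ""                         # title of the referenced document (generic)
    first_article: Optional["Node"] = None  # first article of that document (generic)


@dataclass
class Node:
    text: str = ""                          # text not contained in non-inline children
    children: List["Node"] = field(default_factory=list)
    refs: List[Reference] = field(default_factory=list)


def embed_reference(ref: Reference, model: Callable[[str], Vector]) -> Vector:
    """Punctual reference: embed the referenced portion; otherwise title + first article."""
    if ref.punctual:
        return embed_node(ref.target, model)
    return mean([model(ref.title), embed_node(ref.first_article, model)])


def embed_node(node: Node, model: Callable[[str], Vector]) -> Vector:
    """Average of own text, child embeddings and the mean reference embedding."""
    parts: List[Vector] = []
    if node.text.strip():
        parts.append(model(node.text))
    parts.extend(embed_node(child, model) for child in node.children)
    if node.refs:
        # All references of a node are first averaged into a single vector.
        parts.append(mean([embed_reference(ref, model) for ref in node.refs]))
    return mean(parts)


def toy_model(text: str) -> Vector:
    """Toy encoder for illustration: [number of words, 1.0]."""
    return [float(len(text.split())), 1.0]


generic_ref = Reference(punctual=False, title="t t", first_article=Node(text="x x x x"))
leaf_with_ref = Node(text="d e f", refs=[generic_ref])
article = Node(text="a b", children=[Node(text="c"), leaf_with_ref])
article_vector = embed_node(article, toy_model)
```

With the toy encoder, the generic reference evaluates to the mean of its title and first-article vectors, the leaf averages its own text with that reference vector, and the article averages its own text with its two children.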

Once we have obtained the vector representation of each article of each document, together with the embedding of its title, we can compare them with the vector representations of our topics: the EuroVoc terms for the European legislation and the Chamber commissions for the Italian documents.

Then, the similarity between each document and each subject is derived from the sum of the cosine similarity between the document's title and the topic and the average similarity between the topic and each article. Finally, the maximum similarity value obtained by this procedure is used to classify each document using one of the topics.
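The scoring rule described above (title similarity plus the average article similarity, followed by taking the maximum) can be sketched as follows. The function names and the two-dimensional toy vectors are assumptions made for illustration; in practice the vectors would be the title, article and topic embeddings computed beforehand.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def document_topic_scores(
    title_vec: Vector, article_vecs: List[Vector], topic_vecs: Dict[str, Vector]
) -> Dict[str, float]:
    """Score each topic as title similarity plus mean article similarity."""
    scores = {}
    for label, topic_vec in topic_vecs.items():
        title_sim = cosine(title_vec, topic_vec)
        if article_vecs:
            article_sim = sum(cosine(a, topic_vec) for a in article_vecs) / len(article_vecs)
        else:
            article_sim = 0.0  # assumption: documents without articles use the title only
        scores[label] = title_sim + article_sim
    return scores


def classify(scores: Dict[str, float]) -> str:
    """Return the topic with the maximum score."""
    return max(scores, key=scores.get)


# Toy vectors for illustration only.
topic_vecs = {"Difesa": [1.0, 0.0], "Finanze": [0.0, 1.0]}
scores = document_topic_scores(
    title_vec=[0.0, 1.0],
    article_vecs=[[0.0, 1.0], [1.0, 0.0]],
    topic_vecs=topic_vecs,
)
predicted = classify(scores)
```

Sorting the same scores in descending order, instead of taking only the maximum, would give a ranked list of topics for a document.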

5. Searching by topic

In order to provide a topic-based search that can be used in the Italian legislative process, the final step is to provide an interface (http://u2.cirsfid.unibo.it/portale-camera) to query each of the four datasets, by providing information about the most relevant topic for [...]

6. Evaluation and Results

In order to evaluate the performance of our subject-based classification, we asked three experts of the legal domain to annotate, between them, 100 random documents for each dataset, and proceeded to measure the accuracy of our classification when compared to the annotated ground truth (Table 4). The fact that experts were involved in the annotation of the results is crucial for the legal domain, since this allows the legal interpretation of the results, which can only be accomplished through an evaluation by legal experts [18].

While this is just a preliminary assessment of the classification performance of our unsupervised model, it is possible to derive that the label applied to the documents is correct in at least 39% of the cases, meaning that the approach is indeed able to link a document with its most relevant anchor with a good level of approximation.

When comparing the results, it is interesting to note that among the Italian datasets, which use the same categories, the accuracy on Normattiva and Corte Costituzionale seems higher, while the PDL dataset shows a lower performance. This suggests that the finalized versions of documents issued by the parliament and the Constitutional Court might be simpler to classify in an unsupervised way, while the more draft-like qualities of the PDL dataset hinder the classification efforts.

7. Conclusions and Future Work

In this article, we present an unsupervised approach that aims to support the Italian legislative process, by providing useful insights into documents from the relevant European and Italian institutions (European Union, Constitutional Court, Italian Parliament). The system does not only provide a ranking of relevant documents, but it also returns the two most relevant EuroVoc terms (for EU documents) and Chamber commissions (for Italian documents). This allows the user a more thorough exploration of the relevant subjects, while also supplying suggestions in terms of specific documents.

Our approach is completely unsupervised and does not rely on any form of annotation, meaning that scaling up the approach to more documents, or even using more performant models, does not require any fine-tuning, with the procedure consisting in obtaining the article and title vectors for all documents. Furthermore, the adopted approach leverages the hierarchical nature of legislative documents, as represented in Akoma Ntoso XML, in order to produce embeddings that are based on the structure of the document. Moreover, using a structured format as our input allows us to resolve normative references, without which some parts of a document would be impossible to understand for an automatic system.

The evaluation performed on the classification system showed a promising level of performance for an unsupervised model, which does not rely on any information about the specific task. Additionally, the multilingual model used in our method allows users to work in both English and Italian, both in terms of queries and in terms of results, with satisfying results. Nevertheless, it would be possible to improve the quality of the results by testing other models, which might yield better performance.

The search by topic task has been validated by two senior legal researchers in the team; however, it is advisable to organize a session with relevant end-users, using some concrete scenarios, to assess the documents and categories returned for a given user query.

For this task, it would be necessary to involve the relevant stakeholders, meaning experts involved in the drafting of legislative documents in Italy. Nevertheless, the project has been evaluated by scientific experts (https://comunicazione.camera.it/archivio-prima-pagina/19-41329) appointed by the Italian Chamber of Deputies in the context of its manifestation of interest, and it was included as part of the work of one of the two winning consortia.

The experimental results obtained in this paper constitute a study of the application of pre-existing Sentence Transformer models in an unsupervised way to the classification and search of Italian legal documents. While we achieved satisfactory results, our approach could still be improved by refining the base methodology and conducting a more thorough exploration of other multilingual models. Furthermore, a formal evaluation by the stakeholders would also improve our understanding of further specific parameters that arise during the legislative process.

Acknowledgments

This project is funded by the European Union - NextGenerationEU under the National Recovery and Resilience Plan (PNRR) - Mission 4 Education and research - Component 2 From research to business - Investment 1.1 Notice PRIN 2022 (DD N. 104 del 02/02/2022), title “Smart Legal Order in DigiTal Society (SLOTS)”, proposal code 2022LRL2C2, CUP J53D23005610006. We also thank Salvatore Sapienza, Chantal Bomprezzi and Pier Francesco Bresciani for validating the results.

References

[1] M. Palmirani, R. Sperberg, G. Vergottini, F. Vitali, Akoma Ntoso Version 1.0 Part 1: XML Vocabulary, Technical Report, OASIS Standard, 2018. URL: http://docs.oasis-open.org/legaldocml/akn-core/v1.0/akn-core-v1.0-part1-vocabulary.html.

[2] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: T. Cohn, Y. He, Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2898-2904. URL: https://aclanthology.org/2020.findings-emnlp.261. doi: 10.18653/v1/2020.findings-emnlp.261.

[3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

[4] L. Zheng, N. Guha, B. R. Anderson, P. Henderson, D. E. Ho, When does pretraining help? assessing self-supervised learning for law and the casehold dataset of 53,000+ legal holdings, in: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, ICAIL '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 159-168.

[5] P. Henderson, M. Krass, L. Zheng, N. Guha, C. D. Manning, D. Jurafsky, D. Ho, Pile of law: Learning responsible data filtering from the law and a 256gb open-source legal dataset, Advances in Neural Information Processing Systems 35 (2022) 29217-29234.

[6] Q. Dong, S. Niu, Legal judgment prediction via relational learning, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 983-992. URL: https://doi.org/10.1145/3404835.3462931. doi: 10.1145/3404835.3462931.