
Mining Information from Legal Sentences in KlonDikE

Andrea Gatti

Viviana Mascardi

Domenico Pellegrini

Ministry of Justice, Tribunale di Genova, Italy
University of Genova, DIBRIS, Italy

The “Next Generation UPP” (NGUPP) Italian project aims at devising and implementing new collaborative schemes between universities and judicial offices for improving the efficiency and performance of justice in north-west Italy. The University of Genova was assigned different tasks in NGUPP, including the semantic analysis of sentences and the study of potential bias on a massive scale for legal and analytical purposes. KlonDikE (for Knowledge Driven Extraction engine), with its ability to determine the type of divorce sentence, extract the master data of the parties and children, extract and visualize statistical information, and anonymize the original document, contributed to the achievement of the NGUPP semantic analysis objective.

Keywords: NGUPP, KlonDikE, NLP4Law, semantic analysis, divorce sentence, frame semantics

1. Introduction and Motivation

One of the tasks of the University of Genova in NGUPP was to create a tool for the semantic analysis of sentences and to study potential bias on a massive scale for legal and analytical purposes. KlonDikE, for Knowledge Driven Extraction engine, is the tool based on Natural Language Processing (NLP) that we built for this purpose.

The exploitation of NLP in the legal domain dates back to the 1970s [2, 3, 4, 5], and is still an extremely active research field [6, 7, 8, 9, 10, 11], as also witnessed by funded projects and scientific events.

As an example, the MIning and REasoning with Legal texts project (footnote 2), funded by the European Union’s Horizon 2020 research and innovation programme, closed at the end of 2019, while the AI4Lawyers project (footnote 3), awarded by the European Commission Directorate-General for Justice and Consumers, closed in March 2022; two of its five deliverables were related to NLP. The special issue “NLP for legal texts” of the AI and Law journal was published in 2019 [12], the fifth edition of the Natural Legal Language Processing Workshop will take place at the end of 2023 (footnote 4), and the first Workshop on Machine learning, Law and Society is expected to run in Torino in a few days (footnote 5). It comes as no surprise that the success of applying NLP techniques to the legal domain also raises ethical concerns, as shown by papers [13] and projects (footnote 6).

Following the FrEX approach for extracting semantic frames from PDF files of expropriation sentences [14], KlonDikE is likewise inspired by Frame Semantics [15] and exploits NLP to identify the main actors and their roles in divorce cases. Given that the FrameNet portal (footnote 7) provides no frame for the “Divorce” concept, we designed our own frame as follows, assuming that the type Person should at least be characterized by first name, family name, date and place of birth, and gender. The design of the Divorce_Frame was driven by magistrates from the Tribunal of Genova and takes their real needs into account. For example, knowing whether there are underage children and whether domestic violence was reported is fundamental to properly approach the case. As a side effect of being able to extract the values for filling the Divorce_Frame, KlonDikE can also identify those values in the document for anonymization purposes and collect them for statistical analysis.

Divorce_Frame
– Core Entities
  – Husband (type Person)
  – Wife (type Person)
  – Children (type Person)
  – Underage children (type boolean)
  – Domestic Violence (type boolean)
  – Involvement of Social Services (type boolean)
  – Share of Extraordinary Expenses (type integer in 0-100)
– Non-Core Entities
  – Lawyer (type Person)
  – Judge (type Person)

Footnote 2: https://www.mirelproject.eu/index.html, accessed on September 2023.
Footnote 3: https://ai4lawyers.eu/, accessed on September 2023.
Footnote 4: https://nllpw.org/workshop/, accessed on September 2023.
Footnote 5: https://sites.google.com/view/ws-ethics-ecml23/, accessed on September 2023.
Footnote 6: https://wwwmatthes.in.tum.de/pages/ztm206o67g3q/NLawP-Natural-Language-Processing-and-Legal-Tech, accessed on September 2023.
Footnote 7: FrameNet is a lexical database containing over 1,200 semantic frames, 13,000 lexical units, and more than 200,000 manually annotated sentences, http://berkeleyfn.framenetbr.ufjf.br/, accessed on September 2023.
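As an illustration only, the Divorce_Frame could be represented in Python roughly as follows. The class and field names are our own assumptions and not necessarily those used in the KlonDikE code base; representing lawyers and judges as lists is also an assumption, made because a sentence usually mentions more than one of each.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Person:
    """Minimal Person record with the attributes listed for the type Person."""
    first_name: str
    family_name: str
    birth_date: Optional[date] = None
    birth_place: Optional[str] = None
    gender: Optional[str] = None  # "M" or "F"; None when unknown


@dataclass
class DivorceFrame:
    """Hypothetical container mirroring the Divorce_Frame entities."""
    # Core entities
    husband: Person
    wife: Person
    children: List[Person] = field(default_factory=list)
    underage_children: bool = False
    domestic_violence: bool = False
    social_services_involved: bool = False
    extraordinary_expenses_share: Optional[int] = None  # percentage in 0-100
    # Non-core entities
    lawyers: List[Person] = field(default_factory=list)
    judges: List[Person] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the "integer in 0-100" constraint stated in the frame.
        share = self.extraordinary_expenses_share
        if share is not None and not 0 <= share <= 100:
            raise ValueError("share of extraordinary expenses must be in 0-100")


# Illustrative values only; family names are not taken from any real sentence.
example = DivorceFrame(
    husband=Person("Romeo", "Montecchi", gender="M"),
    wife=Person("Giulietta", "Capuleti", gender="F"),
    underage_children=True,
    extraordinary_expenses_share=50,
)
```

Keeping the frame as a plain data container makes it easy to fill it incrementally, one module at a time, and to serialize it later for statistics.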

The software architecture of KlonDikE is general enough to make the tool suitable for processing any kind of document once properly customized; at present, some of its modules are designed and implemented ad hoc for the divorce domain, with sentences written in Italian.

Although the adoption of generative AI approaches (GPT-3.5, GPT-4) on divorce sentences gave promising results, with the GPT-4 APIs performing as well as KlonDikE, the GPT-3.5 Turbo and GPT-4 APIs are not open source and data are processed on the OpenAI servers. The guidelines for acquiring and reusing software in Italian public administrations, delivered in May 2019, state that software developed for public administrations must be, by default, open source (footnote 8). While, according to the most permissive open source licenses, it is legal to use closed source libraries in an open source project, doing so has many practical and economic implications that argue against this practice. Also, although the transmission of data over the network for processing on the OpenAI servers is guaranteed to be secure and compliant with privacy requirements (footnote 9), computers in tribunals have many limitations on their access to the Internet.

For the two reasons above, KlonDikE exploits technologies that are fully open source and do not require any access to the network. This gives KlonDikE full control over the analyzed data, in compliance with current general regulations on data use and storage and with the more restrictive regulations holding for judicial offices. Being open source also makes it possible to properly address ethical concerns, and its modular and clean architecture makes KlonDikE easily inspectable, reusable, and extendable by third parties.

The paper is organized in the following way: Section 2 illustrates the KlonDikE architecture and implementation, experiments are presented in Section 3, and Section 4 concludes.

2. Architecture and Implementation

Given a PDF file, KlonDikE processes it to reach several objectives, namely:
1. Determine the type of sentence;
2. Extract the master data of the parties and any children;
3. Extract statistical information;
4. Anonymize the document;
5. Produce statistical graphs on the set of sentences considered.

To do this, KlonDikE features a modular architecture, with one module for each of the presented items as shown in Figure 1. It is implemented as a Python library that counts approximately 900 lines of code.

Footnote 8: https://www.agid.gov.it/it/design-servizi/riuso-open-source/linee-guida-acquisizione-riuso-software-pa, accessed on September 2023.
Footnote 9: https://openai.com/enterprise-privacy, accessed on September 2023.

[Figure 1: KlonDikE architecture. Modules: Classifier, Registry, Metadata, Anonymizer, Statistics. Data exchanged: sentence file, sentence text, sentence type (separation/divorce), address book of Person entries (name, birthday, birthplace, gender, role), person metadata (violence, minor age, social services, extra expenses), CSV file, charts and data, anonymized PDF file.]

Classifier. The classifier module performs a search within the document and classifies the sentence according to its type, namely separation or divorce. To do this, it takes the text of the sentence as input and performs a simple search in its initial part, looking for one of the predefined types. The type names are fixed, so no further checking is necessary.
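A minimal sketch of such a search is given below. The Italian keywords, the function name `classify_sentence`, and the size of the "initial part" (`head_chars`) are assumptions made for illustration; the fixed type names actually searched by KlonDikE are not listed here.

```python
import re
from typing import Optional

# Assumed keywords for the two sentence types (hypothetical, for illustration).
TYPE_PATTERNS = {
    "separation": re.compile(r"\bseparazione\b", re.IGNORECASE),
    "divorce": re.compile(
        r"\b(?:divorzio|cessazione\s+degli\s+effetti\s+civili"
        r"|scioglimento\s+del\s+matrimonio)\b",
        re.IGNORECASE,
    ),
}


def classify_sentence(text: str, head_chars: int = 2000) -> Optional[str]:
    """Return the type whose keyword occurs first in the initial part of the text.

    Returns None when no known keyword is found in the first `head_chars`
    characters.
    """
    head = text[:head_chars]
    best_type: Optional[str] = None
    best_pos = len(head) + 1
    for sentence_type, pattern in TYPE_PATTERNS.items():
        match = pattern.search(head)
        if match and match.start() < best_pos:
            best_type, best_pos = sentence_type, match.start()
    return best_type
```

Restricting the search to the head of the document keeps the check cheap and avoids being misled by later passages that merely mention an earlier separation.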

At the implementation level, KlonDikE creates a Python sentence object, defined for the project, for each sentence. This module adds the type information to this object.

Registry. The Registry module constructs a tagged address book of people within the sentence and a list of addresses. To do so, it combines Machine Learning tools and regular expressions to study the form of sentences. The architecture is shown in Figure 2.

[Figure 2: Architecture of the Registry module. Labels: Sentence Text, Type, Extraction, Address Book, Lawyer Filter, Child Filter, Address Book.]

First, using the spaCy (footnote 10) library, it extracts all strings identifiable as names and, using regular expressions, it finds the tax codes. After cleaning the names with a simple filter, a search for name and tax code matches is performed. For each match, a Person object containing all the information about the person is created and added to the address book.

Names for which no match was found undergo a second processing step. The module uses regex to find other names that appear in sentences in meaningful places. When the name list is complete we check whether they are present in sentences in which their place and/or date of birth are indicated. If this is the case, a Person object is added to the address book.

If, finally, no match is found in this way either, a Person object is added to the address book with the indication that only the name was found.
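The first and last of these steps can be sketched as follows, using only the standard library. The candidate names are assumed to come from a named-entity recognizer such as spaCy (not run here), and pairing a tax code with the nearest preceding name within a fixed window is our own simplifying assumption, not necessarily the matching rule used by KlonDikE. `PersonEntry` and `build_address_book` are hypothetical names, and the intermediate step based on date and place of birth is omitted.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

# Standard layout of an Italian tax code (codice fiscale): 6 letters, 2 digits,
# a month letter, 2 digits, a letter, 3 digits, a check letter.
# Codes altered for homonymy (digits replaced by letters) are not handled.
TAX_CODE_RE = re.compile(r"\b[A-Z]{6}\d{2}[A-EHLMPR-T]\d{2}[A-Z]\d{3}[A-Z]\b")


@dataclass
class PersonEntry:
    name: str
    tax_code: Optional[str] = None
    name_only: bool = True  # True when nothing but the name was found


def build_address_book(
    text: str, candidate_names: List[str], window: int = 120
) -> List[PersonEntry]:
    """Pair each tax code with the closest candidate name preceding it."""
    book: Dict[str, PersonEntry] = {n: PersonEntry(n) for n in candidate_names}
    for match in TAX_CODE_RE.finditer(text):
        context = text[max(0, match.start() - window):match.start()]
        best_name, best_pos = None, -1
        for name in candidate_names:
            pos = context.rfind(name)
            if pos > best_pos:
                best_name, best_pos = name, pos
        if best_name is not None:
            book[best_name].tax_code = match.group()
            book[best_name].name_only = False
    return list(book.values())
```

Names that never get paired with a tax code keep `name_only=True`, which corresponds to the fallback case in which only the name was found.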

A second filtering step is then performed on the address book to find the role of each person within the sentence. Only regular expressions are used to do this. The possible roles are: (1) Party, (2) Child, (3) Lawyer, (4) Judge and (5) Other.

As a last step, the module searches the text for all street addresses and saves them in a list to be anonymized.
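A regex-only role assignment of this kind might look like the sketch below. The cue patterns (Italian titles such as "avv." and "dott.", and party labels such as "ricorrente") are hypothetical examples chosen for illustration; the actual expressions used by KlonDikE are not reported here.

```python
import re
from typing import List, Tuple

# Hypothetical cue templates, tried in order; "{name}" is replaced by the
# escaped person name before compiling.
ROLE_CUES: List[Tuple[str, str]] = [
    ("Lawyer", r"\bavv\.\s*{name}"),
    ("Judge", r"\b(?:dott\.(?:ssa)?|giudice|presidente)\s+{name}"),
    ("Child", r"\bfigli[oaie]?\s+(?:minor[ei]\s+)?{name}"),
    ("Party", r"{name}[^.;]{0,80}?\b(?:ricorrente|resistente)\b"),
]


def assign_role(name: str, text: str) -> str:
    """Return the first role whose cue matches around the name, else 'Other'."""
    escaped = re.escape(name)
    for role, template in ROLE_CUES:
        pattern = template.replace("{name}", escaped)
        if re.search(pattern, text, flags=re.IGNORECASE):
            return role
    return "Other"
```

The order of the cues matters in this sketch: a lawyer's name is often followed by the name and label of the opposing party, so the more specific title-based cues are tried before the looser party cue.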

Metadata. The metadata module looks for information about the sentence for statistical purposes. In particular, it looks for the data that characterize the Divorce_Frame:
• the presence of any type of violence;
• the minor age of at least one of the children;
• the involvement of social services;
• the percentage of extra expenses attributed to one of the parties;
• the age of the two parties.

These pieces of information are stored inside a CSV file. For each sentence the module adds a new line and saves it.
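The following sketch shows one possible way to collect such values and append them as a CSV row. The Italian keyword patterns, the column names, and the helper names `extract_metadata` and `append_row` are assumptions made for this example; plain keyword matching like this ignores negation ("non risultano violenze"), and the age of the parties is not covered.

```python
import csv
import re
from pathlib import Path
from typing import Dict, Optional

# Hypothetical column names for the CSV file.
FIELDS = ["violence", "underage_children", "social_services", "extra_expenses_share"]

# Assumed Italian cues (illustrative only).
VIOLENCE_RE = re.compile(r"\b(?:violenz[ae]|maltrattament[oi])\b", re.IGNORECASE)
MINOR_RE = re.compile(r"\bminor(?:e|i|enne|enni)\b", re.IGNORECASE)
SOCIAL_RE = re.compile(r"\bservizi\s+sociali\b", re.IGNORECASE)
SHARE_RE = re.compile(r"(\d{1,3})\s*%[^.]{0,80}?spese\s+straordinarie", re.IGNORECASE)


def extract_metadata(text: str) -> Dict[str, object]:
    """Collect the statistical flags and the expense share from a sentence text."""
    share: Optional[int] = None
    match = SHARE_RE.search(text)
    if match and int(match.group(1)) <= 100:
        share = int(match.group(1))
    return {
        "violence": bool(VIOLENCE_RE.search(text)),
        "underage_children": bool(MINOR_RE.search(text)),
        "social_services": bool(SOCIAL_RE.search(text)),
        "extra_expenses_share": share,
    }


def append_row(csv_path: Path, row: Dict[str, object]) -> None:
    """Append one line per sentence, writing the header only for a new file."""
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Since the row contains no names, dates, or addresses, a CSV file built this way can feed the statistics step without exposing personal data.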

Anonymizer. The Anonymizer module takes as input the PDF file and the data extracted by the Registry module and anonymizes them inside the file. It removes the names of the two parties, all the dates and the addresses, and replaces them with appropriate anonymous placeholders that keep the text readable and coherent. It produces a new PDF file.

This is done using the PyMuPDF (footnote 11) library, which makes it possible to search for strings within the file and replace them.

Statistics. The statistics module takes as input the CSV file with the anonymous data and produces different kinds of plots.
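The replacement logic itself can be illustrated on plain text, independently of the PDF layer. The sketch below is a simplification under our own assumptions: it works on strings rather than on PDF pages (the PyMuPDF calls are deliberately not shown), it handles names only (dates and addresses would be treated analogously), and the function names are hypothetical. The placeholders "Parte N" and "Figlio N" follow those visible in KlonDikE's anonymized output.

```python
import re
from typing import Dict, List


def build_placeholders(parties: List[str], children: List[str]) -> Dict[str, str]:
    """Map each name to a stable placeholder such as 'Parte 1' or 'Figlio 2'."""
    mapping: Dict[str, str] = {}
    for index, name in enumerate(parties, start=1):
        mapping[name] = f"Parte {index}"
    for index, name in enumerate(children, start=1):
        mapping[name] = f"Figlio {index}"
    return mapping


def anonymize_text(text: str, mapping: Dict[str, str]) -> str:
    """Replace every occurrence of a known name with its placeholder."""
    if not mapping:
        return text
    # Longest names first, so a full name wins over a shorter name it contains.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda match: mapping[match.group()], text)
```

Using one placeholder per person, rather than a single generic mask, is what keeps the anonymized text readable: the reader can still tell which party did what.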
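As a rough illustration of the aggregation that precedes plotting, the sketch below computes the fraction of sentences with a given flag from a CSV file. The column names (`violence`, `underage_children`) and the `"True"`/`"False"` encoding are assumptions of this example, and the plotting step itself is not shown.

```python
import csv
from typing import Dict, TextIO


def summarize(handle: TextIO) -> Dict[str, float]:
    """Return the number of sentences and the fraction with each flag set."""
    rows = list(csv.DictReader(handle))
    total = len(rows)
    if total == 0:
        return {}

    def fraction(column: str) -> float:
        # Flags are assumed to be stored as the strings "True" / "False".
        return sum(row[column] == "True" for row in rows) / total

    return {
        "sentences": total,
        "violence": fraction("violence"),
        "underage_children": fraction("underage_children"),
    }
```

The resulting dictionary can be passed to any plotting library to draw, for example, a bar chart of the two fractions.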

3. Experiments

To show KlonDikE at work, we devised a different and less tragic conclusion to William Shakespeare’s “Romeo and Juliet” masterpiece. After eighteen years of marriage and three children, Romeo and Juliet decide to divorce.

The first page of the divorce sentence, written in Italian like all the documents managed in the NGUPP project, is shown in Figure 3 and follows the standard structure of such sentences. It mentions the magistrates in charge of the sentence, the involved parties along with their personal information and their lawyers, and what the first party asks of the second. The second page, not shown, states that the Tribunal of Verona
1. pronounces the personal separation of Romeo and Giulietta;
2. entrusts the children Valentino, Giovanna and Marisa in a shared form to both parents;
3. assigns the former family home located in Piazza Bra, 1, to Giulietta;
4. declares the maintenance allowance due to Giulietta by Romeo, and that Romeo is required to participate in the amount of 50% of the extraordinary expenses relating to the children.

As explained in the previous sections, KlonDikE’s goal is twofold: first, it looks for the values that fill the Divorce_Frame, and then it looks for these values in the document, to anonymize them.

The result of the first activity is shown in Figure 4: personal details of Giulietta and Romeo like date and place of birth are successfully extracted from their fiscal code (“codice fiscale”, abbreviated as “C.F.”, “CF”, “cf” in the sentence), children are recognized as all being underage at the time of running the system, extraordinary expenses are recognized to be shared in the amount of 50% among the parties, and neither domestic violence nor social services intervention emerges from the sentence.
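To make the role of the fiscal code concrete, the sketch below decodes date of birth, gender, and place code from a code in the standard 16-character layout: characters 7-8 hold the year, character 9 the month letter, characters 10-11 the day (increased by 40 for women), and characters 12-15 the cadastral code of the place of birth. The function name, the `reference_year` parameter, and the rule used to choose the century are our own assumptions; codes altered for homonymy are not handled, and mapping the place code to a municipality name requires a lookup table that is not included.

```python
from datetime import date
from typing import NamedTuple

# Month letters used in Italian tax codes, January to December.
MONTH_CODES = "ABCDEHLMPRST"


class TaxCodeData(NamedTuple):
    birth_date: date
    gender: str       # "M" or "F"
    place_code: str   # cadastral code of the place of birth


def decode_tax_code(code: str, reference_year: int = 2023) -> TaxCodeData:
    """Decode birth date, gender and place code from a standard tax code."""
    code = code.strip().upper()
    if len(code) != 16:
        raise ValueError("an Italian tax code has 16 characters")
    two_digit_year = int(code[6:8])
    month = MONTH_CODES.index(code[8]) + 1
    day_field = int(code[9:11])
    gender = "F" if day_field > 40 else "M"
    day = day_field - 40 if gender == "F" else day_field
    # Assumption: pick the most recent year not after the reference year.
    # A person born more than 100 years earlier would be decoded incorrectly.
    year = (reference_year // 100) * 100 + two_digit_year
    if year > reference_year:
        year -= 100
    return TaxCodeData(date(year, month, day), gender, code[11:15])
```

With the birth date available, checking whether a child is underage at the time of running the system reduces to a date comparison, which is why the tax code is such a valuable source for filling the frame.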

The anonymization is also accurate, as shown in Figure 5. Not all the details about places are removed (Verona and its postal code are kept visible), but the relevant entities are identified and the occurrences of their names are correctly and consistently replaced with “Parte 1”, “Parte 2”, “Figlio 1”, etc., across the document.

Although GPT-3.5 and GPT-4 cannot be used in NGUPP for the reasons already discussed in the introduction, we compared KlonDikE with ChatGPT accessed online, which, differently from the GPT-3.5 Turbo and GPT-4 APIs, is at least free. We fed ChatGPT the divorce sentence and asked it to extract the same pieces of information. The result in Figure 6 clearly shows that ChatGPT cannot infer data from the fiscal code, which is instead a precious source of information for KlonDikE.

On the other hand, ChatGPT’s anonymization capabilities are fairly good, as shown in Figure 7. We asked ChatGPT to remove all the names of persons and places; however, GPT-3.5 failed to remove the names of the magistrates, which were instead correctly anonymized in KlonDikE’s output.

To cope with the need to extract statistical data from sentences, KlonDikE also generates some charts from the anonymized data. Figure 8 shows the presence of underage children and domestic violence, and Figure 9 shows the average age at which the parties divorce, and domestic violence in relation to the age of the parties. The charts are produced from a set of 100 sentences.

So far, the proposed charts do not show complex data or historical trends. However, they demonstrate the feasibility of a statistical study, even starting from anonymized documents when the documents cannot be shared as they are. Given that sentences are time-stamped and that the data we can access, under NDA with the Tribunal of Genova, include thousands of sentences from the last 20 years, analyzing trends over the years is indeed possible.

A manual analysis of some of the outputs computed by KlonDikE was carried out by one of the authors, the magistrate Dr. Domenico Pellegrini from the Tribunal of Genova, who is also a final user of KlonDikE. Due to time constraints the analysis was not systematic, but it was thorough enough to suggest that KlonDikE satisfies the Tribunal’s needs. KlonDikE has been installed in the Tribunal of Genova to allow a deeper evaluation.

4. Conclusions and Future Work

In this paper we presented KlonDikE, one of the outputs of the NGUPP project, its technical features, and the results of some experiments. KlonDikE is inspired by frame semantics, and this represents an original approach. Apart from the already mentioned FrEX tool that we developed in 2020, in fact, very few, and rather dated, attempts adopt frame semantics for the legal domain [16, 17, 18].

KlonDikE is able to extract the master data of sentence participants and statistical data, anonymize documents, and produce charts. The tool is thus complete from a functionality point of view. Although GPT-4 performed very well on the same tasks faced by KlonDikE, it is not acceptable for use in judicial offices on sensitive sentences possibly involving underage children, and it lacks native support for reading PDF files. Also, GPT-4’s ability to produce an output in the required format, namely CSV for statistical analysis and PDF for the anonymized sentence, is very limited.

The application domain of KlonDikE is, at the moment, restricted to divorce and separation sentences only, but it can be expanded to other kinds of cases; in fact, the only module that should be modified for extracting data from other types of sentences is the one that assigns roles to the persons found. All computations are performed locally and with technologies that can be corrected and refined as needed.

Sentences anonymized using KlonDikE can be made public without infringing the privacy of the involved parties; anonymized sentences can still be used by other users to extract statistical data or as a benchmark for their own NLP applications in the Italian legal domain. Given that very few benchmarks in this domain exist, and even fewer for Italian law (footnote 12), this would represent a further useful application of KlonDikE.

Acknowledgments

The authors gratefully acknowledge the support of the “Next Generation UPP: nuovi schemi collaborativi fra Università e uffici giudiziari per il miglioramento dell’efficienza e delle prestazioni della giustizia nell’Italia nord-ovest” project, funded with the contribution of the European Union, Programma Operativo Nazionale Governance e Capacità Istituzionale 2014-2020, Fondo Sociale europeo and Fondo europeo di sviluppo regionale, project code CUP D19J22000240006.

Footnote 12: https://www.eui.eu/Research/Library/ResearchGuides/Law/Legal-Databases, accessed on September 2023.

References

[1] Bassignana, Brunato, Polignano, Ramponi, Preface to the Seventh Workshop on Natural Language for Artificial Intelligence (NL4AI), in: Proceedings of the Seventh Workshop on Natural Language for Artificial Intelligence (NL4AI 2023) co-located with 22th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2023), 2023.

[2] Haft, Jones, Wetter, A natural language based legal expert system for consultation and tutoring - the LEX project, in: Proceedings of the 1st International Conference on Artificial Intelligence and Law, 1987, pp. 75–83.

[3] Lambiris, G. Oberem, Natural language techniques in computer-assisted legal instruction: a comparison of alternative approaches, J. Legal Educ. 43 (1993) 60.

[4] J. A. Meldman, A preliminary study in computer-aided legal analysis, Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 1975. URL: http://hdl.handle.net/1721.1/27423.

[5] Turtle, Text retrieval in the legal world, Artificial Intelligence and Law 3 (1995) 5–54.

[6] M. J. B. II, D. M. Katz, E. M. Detterman, LexNLP: Natural language processing and information extraction for legal and regulatory texts, CoRR abs/1806.03688 (2018). URL: http://arxiv.org/abs/1806.03688. arXiv:1806.03688.

[7] Sleimi, Sannier, Sabetzadeh, L. C. Briand, Dann, Automated extraction of semantic legal metadata using natural language processing, in: G. Ruhe, W. Maalej, D. Amyot (Eds.), 26th IEEE International Requirements Engineering Conference, RE 2018, Banff, AB, Canada, August 20-24, 2018, IEEE Computer Society, 2018, pp. 124–135. URL: https://doi.org/10.1109/RE.2018.00022. doi:10.1109/RE.2018.00022.

[8] G. Ferraro, H. Lam, S. C. Tosatto, F. Olivieri, M. B. Islam, N. van Beest, G. Governatori, Automatic extraction of legal norms: Evaluation of natural language processing tools, in: M. Sakamoto, N. Okazaki, K. Mineshima, K. Satoh (Eds.), New Frontiers in Artificial Intelligence - JSAI-isAI International Workshops, JURISIN, AI-Biz, LENLS, Kansei-AI, Yokohama, Japan, November 10-12, 2019, Revised Selected Papers, volume 12331 of Lecture Notes in Computer Science, Springer, 2019, pp. 64–81. URL: https://doi.org/10.1007/978-3-030-58790-1_5. doi:10.1007/978-3-030-58790-1_5.

[9] F. Fagan, Natural language processing for lawyers and judges, Mich. L. Rev. 119 (2020) 1399.

[10] E. Mumcuoglu, C. E. Öztürk, H. M. Özaktas, A. Koç, Natural language processing in law: Prediction of outcomes in the higher courts of turkey, Inf. Process. Manag. 58 (2021) 102684. URL: https://doi.org/10.1016/j.ipm.2021.102684. doi:10.1016/j.ipm.2021.102684.

[11] D. M. Katz, D. Hartung, L. Gerlach, A. Jana, M. J. B. II, Natural language processing in the legal domain, CoRR abs/2302.12039 (2023). URL: https://doi.org/10.48550/arXiv.2302.12039. doi:10.48550/arXiv.2302.12039. arXiv:2302.12039.

[12] L. Robaldo, S. Villata, A. Wyner, M. Grabmair, Introduction for artificial intelligence and law: special issue "natural language processing for legal texts", Artif. Intell. Law 27 (2019) 113–115. URL: https://doi.org/10.1007/s10506-019-09251-2. doi:10.1007/s10506-019-09251-2.

[13] D. Tsarapatsanis, N. Aletras, On the ethical limits of natural language processing on legal text, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, volume ACL/IJCNLP 2021 of Findings of ACL, Association for Computational Linguistics, 2021, pp. 3590–3599. URL: https://doi.org/10.18653/v1/2021.findings-acl.314. doi:10.18653/v1/2021.findings-acl.314.

[14] R. Salvaneschi, D. Muradore, A. Stanchi, V. Mascardi, FrEX: Extracting property expropriation frame entities from real cases, in: P. Basile, V. Basile, D. Croce, E. Cabrio (Eds.), Proceedings of the 4th Workshop on Natural Language for Artificial Intelligence (NL4AI 2020) co-located with the 19th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2020), Anywhere, November 25th-27th, 2020, volume 2735 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 87–103. URL: https://ceur-ws.org/Vol-2735/paper35.pdf.

[15] C. J. Fillmore, Frame semantics and the nature of language, in: Annals of the New York Academy of Sciences, volume 280, 1976, pp. 20–32. doi:10.1111/j.1749-6632.1976.tb25467.x.

[16] G. Venturi, A. Lenci, S. Montemagni, E. M. Vecchi, M. T. Sagri, D. Tiscornia, T. Agnoloni, Towards a FrameNet resource for the legal domain, Number 465 in CEUR, 2009, pp. 67–76.

[17] A. Bertoldi, R. L. de Oliveira Chishman, The limits of using framenet frames to build a legal ontology, in: R. Vieira, G. Guizzardi, S. R. Fiorini (Eds.), Proceedings of Joint IV Seminar on Ontology Research in Brazil and VI International Workshop on Metamodels, Ontologies and Semantic Technologies, Gramado, Brazil, September 12-14, 2011, volume 776 of CEUR Workshop Proceedings, CEUR-WS.org, 2011, pp. 207–212. URL: http://ceur-ws.org/Vol-776/ontobras-most2011_paper26.pdf.

[18] A. Bertoldi, R. L. de Oliveira Chishman, Developing a frame-based lexicon for the brazilian legal language: The case of the criminal_process frame, in: M. Palmirani, U. Pagallo, P. Casanovas, G. Sartor (Eds.), AI Approaches to the Complexity of Legal Systems. Models and Ethical Challenges for Legal Systems, Legal Language and Legal Ontologies, Argumentation and Software Agents - International Workshop AICOL-III, Held as Part of the 25th IVR Congress, Frankfurt am Main, Germany, August 15-16, 2011. Revised Selected Papers, volume 7639 of Lecture Notes in Computer Science, Springer, 2011, pp. 256–270. URL: https://doi.org/10.1007/978-3-642-35731-2_18. doi:10.1007/978-3-642-35731-2_18.