Automated Knowledge Extraction from Legal Texts using ASKE⋆ (Discussion Paper) Silvana Castano1 , Alfio Ferrara1 , Stefano Montanelli1 , Sergio Picascia1,** and Davide Riva1 1 Università degli Studi di Milano, Department of Computer Science, Via Celoria, 18 - 20133 Milano, Italy Abstract In this paper, we present the ASKE (Automated System for Knowledge Extraction) approach to legal knowledge extraction, based on a combination of context-aware embedding models and zero-shot learning techniques into a three-phase extraction cycle, which is executed a number of times to progressively extract concepts representative of the different meanings of terminology used in legal documents chunks. We show ASKE in action in a case study of legal knowledge extraction from a real corpus of case law decisions in the framework of the NGUPP project. Keywords Legal Knowledge Extraction, Natural Language Processing, Digital Justice. 1. Introduction To cope with the growing volume, complexity, and articulation of legal documents as well as to foster digital justice and digital law, increasing effort is being devoted to AI-based techniques for legal knowledge extraction. The availability of techniques for extracting knowledge from legal documents is not only desirable but even necessary, and the benefits and concrete outcomes that could result from the diffusion of such technology are many and different for both legal practitioners (i.e., lawyers, judges and Courts), administrations, and general public. Legal search through legal knowledge extraction is an extremely important instrument for legal practitioners in both common law [2] and civil law [3] systems. For example, legal search over precedent case law may be useful for a lawyer to retrieve a decision rendered in a case similar to the case at hand, where the Court decided in a way that is favorable to its client position, or a decision rendered in a different case on the basis of a reasoning that, applied to the case at hand, leads to a favorable interpretation of its client position [4]. When conducting case law research, it is important to focus on both the decision of the case, but also the motivation and the reasoning SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy ⋆ This paper presents an extended abstract of [1]. ** Corresponding author. $ silvana.castano@unimi.it (S. Castano); alfio.ferrara@unimi.it (A. Ferrara); stefano.montanelli@unimi.it (S. Montanelli); sergio.picascia@unimi.it (S. Picascia); davide.riva1@unimi.it (D. Riva)  0000000238262407 (S. Castano); 0000-0002-4991-4984 (A. Ferrara); 0000-0002-6594-6644 (S. Montanelli); 0000-0001-6863-0082 (S. Picascia); 0009000396819423 (D. Riva) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings (called “rationale”) behind the decision. During this process, great help may come from “context- aware” knowledge extraction systems, based on Natural Language Processing (NLP), Machine Learning (ML), and Artificial Intelligence (AI), to deal with challenging requirements posed by the legal documentation (e.g., language complexity, significant length of the legal texts; lack of sufficiently-large annotated corpora for model training). In this paper, we present ASKE (Automated System for Knowledge Extraction), an approach to legal knowledge extraction with focus on abstract concept discovery using a combination of context-aware embedding models and zero-shot learning techniques. ASKE takes a corpus of legal documents as input and it extracts a graph of concepts which are used to classify the given documents at the chunk-level (e.g., paragraph) granularity. Through context-aware embedding, document chunks and concept definitions are projected in the same semantic space, to appropriately capture and manage the meaning of legal terminology by taking into account the context in which terms are used. Through zero-shot learning, a multi-label classification process is performed in an unsupervised way, without relying on any pre-existing annotation of legal documents. The distinguishing feature of ASKE is the implementation of a cyclic extraction process that, at each cycle, progressively incorporates newly extracted legal knowledge into the ASKE Conceptual Graph - ACG, a graph-based data structure initially populated through a data preparation step. After describing the ASKE process in a nutshell, we show ASKE in action in the context of two use cases of knowledge extraction considering a corpus of 50 Italian case law decisions in the framework of the Next Generation UPP (NGUPP) project. NGUPP, funded by the Italian Ministry of Justice, aims at providing artificial intelligence and advanced information management techniques for the digital transformation of Italian legal processes and digital justice in general. 2. ASKE in a nutshell ASKE is conceived to build a conceptual view over a considered corpus of legal documents. A data preparation step is initially executed, and it is followed by an iterative three-step extraction process characterized by i) document chunk classification, ii) terminological enrichment, and iii) concept derivation (see Figure 1). Data preparation consists in the application of conventional text processing techniques, that are tokenization, lemmatization, and embedding. Tokenization has the goal to separate a document 𝑑 into chunks. A chunk 𝑘 represents the text unit to consider for classification and it determines the granularity of the document that can be associated with a concept. A chunk consists of a few sentence/phrase detected in a document, up to a maximum size of 512 words1 . After tokenization, the terms appearing in chunks are lemmatized and a vector-based representation of each chunk is finally built. To this end, a chunk 𝑘 is associated with a set of terms 𝑊𝑘 therein contained. Any term 𝑤 ∈ 𝑊𝑘 is described as 𝑤 = (𝑤𝑙 , 𝑤𝑑 , 𝑤 ¯ ), where 𝑤𝑙 is the label of the term (i.e., the lemma), 𝑤𝑑 is a description of the term taken from a reference dictionary/vocabulary (e.g., WordNet), and 𝑤 ¯ is the corresponding vector-based representation 1 The size of the document chunk is experimentally determined according to the features of the considered corpus. A chunk should be large enough, so that the context can be captured, but not too much extended to avoid segments that are long to read and potentially noisy due to the presence of multiple concepts. Figure 1: The ASKE approach to legal knowledge extraction. according to Sentence-BERT [5], respectively. The use of embedding techniques to represent chunks allows to map the document contents on a semantic vector space where the similarity of two chunks can be measured by comparing the corresponding vector representations through a similarity metric (e.g., cosine similarity). Sentence-BERT has been chosen for ASKE since it is trained in such a way to ensure consistent representation of the meaning of entire paragraphs. This is particularly appropriate in the legal field, where the phrase structure can be highly articulated, and some common terms can have a precise technical meaning when used in a court (e.g., citation, clemency, designation). A chunk has the form 𝑘 = (𝑘𝑑 , 𝑘¯), where 𝑘𝑑 is the original textual content of the chunk and 𝑘¯ is the corresponding vector-based representation calculated as the mean of term vectors 𝑤 ¯ with 𝑤 ∈ 𝑊𝑘 . For triggering the knowledge extraction process, in data preparation, ASKE requires to specify a set of seed concepts. A seed concept can be expressed as a short text (e.g., one or two phrases) providing a gross-grained description of the target. In this case, as a common example, a seed concept can be specified by taking an excerpt from pertinent law/case law documentation. As an alternative, a seed concept can be defined as a list of keywords. As an example, for a seed concept about banking contract, a corresponding list of keywords could be the fol- lowing: bank deposit, safe deposit box, bank credit opening, bank advance, bank account, bank discount. Document chunk classification has the goal to annotate chunks with featuring concepts and zero-shot learning techniques are employed to this end. Zero-shot learning is an unsupervised classification technique, characterized by the ability to work without requiring any pre-existing annotation of the considered documents. Given a set of concepts (i.e., the seed concepts at the beginning of the process), a similarity measure 𝜎, e.g., cosine similarity, is calculated over any pair of embeddings between chunks and concepts. A chunk 𝑘 is classified with the concept 𝑐 when the similarity value satisfies 𝜎(k, c) ≥ 𝛼, with 𝛼 defined as a similarity threshold configured in the system. A concept 𝑐 in ASKE is defined as a pair 𝑐 = (𝑐𝑙 , ¯𝑐), where 𝑐𝑙 is a label featuring the meaning of the concept expressed in a synthetic and human-understandable way, and ¯𝑐 is a vector-based concept representation. Each concept 𝑐 is initially associated with the set of terms 𝑊𝑐 extracted from the textual description of 𝑐. The vector concept ¯𝑐 is built as the mean of the vectors of all the terms in 𝑊𝑐 . The label 𝑐𝑙 corresponds to the label 𝑤𝑙 of the term 𝑤 ∈ 𝑊𝑐 , whose vector representation 𝑤 ¯ is closest to the concept vector ¯𝑐. Terminological enrichment is then enforced to enrich the term set 𝑊𝑐 of a concept 𝑐 by considering the terms 𝑊𝑘 of any chunk 𝑘 classified with 𝑐. The idea is that the initial description of the concept 𝑐 can become more detailed if we add terminology taken from chunks that are pertinent (i.e., classified) with 𝑐. This is done by calculating the similarity between any pair of embeddings 𝑤 ¯ and ¯𝑐 in 𝑊𝑘 and 𝑊𝑐 . The most similar terms of 𝑊𝑘 are inserted in 𝑊𝑐 according to a system-defined 𝛽 similarity threshold. Concept derivation is finally executed to determine new and more fine-grained concepts that can emerge from existing ones after enrichment. Given a concept 𝑐, the Affinity Propagation (AP) algorithm is employed to cluster the embedding vectors 𝑤¯ of terms in 𝑊𝑐 . A new concept 𝑐 is created for each cluster returned by AP. A link is defined between a concept 𝑐′ and 𝑐 to ′ denote that 𝑐′ is derived from 𝑐 and they are somehow similar/related in content. The concept 𝑐 is then updated since the terms in 𝑊𝑐 can be changed due to enrichment. As a consequence, 𝑐𝑙 and ¯𝑐 are re-calculated. ASKE endpoint. The set of concepts obtained after derivation can trigger the execution of a new cycle based on the above three steps. Each cycle execution is called ASKE generation. The new concepts derived in a certain ASKE generation contribute to improve the classification of chunks in more fine-grained concepts. New concepts can also be discovered through a new execution of enrichment and derivation on the basis of a refined classification result. As such, concept extraction terminates when the number of new concepts created in the derivation step is lower than a predefined termination threshold. A final concept graph ACG is populated with all the concepts and corresponding derivation links extracted by ASKE. 3. ASKE in action To show ASKE in action, we consider a case study of legal knowledge extraction from a legal corpus in the context of the NGUPP project. The case study dataset is composed by 50 court decisions from several Italian courts, spanning from year 2008 to year 2022. All decisions are first-degree verdicts selected by legal experts of the project for their relevance to the subject matter unfair competition. We illustrate knowledge extraction from the unfair competition dataset by considering two use cases, modeling the use of ASKE by two categories of users with different levels of expertise, namely, UC-A, related to a legal practitioner user (e.g., a lawyer) with high legal expertise, and UC-B, related to a general subject (e.g., a citizen) with limited legal expertise. To trigger the ASKE extraction process, user A provides a legal definition as a seed concept, i.e., the expression “acts likely to cause confusion”, taken from art. 2598 of the Italian Civil Code. In the second case, user B provides a seed in form of general keywords like “bag, distinctive elements, imitation”. The seeds were in Italian, and translated here in English for readability. We asked two legal experts to evaluate the knowledge output extracted by ASKE in both use cases. In particular, we asked to qualitatively assess i) the pertinence of discovered chunks with respect to the initial seed concept, and ii) the appropriateness of the new ASKE concepts derived from the seed. In both cases, ASKE was running with hyperparameters 𝛼 = 𝛽 = 0.3, number of generations equal to 21, paraphrase-multilingual-MiniLM-L12-v22 as the embedding model and Open Multilingual WordNet as external dictionary to retrieve term definitions3 . 3.1. UC-A: ASKE for legal practitioners Figure 2 shows a portion of the concept graph produced by ASKE in case UC-A. Legal experts Figure 2: A portion of the ACG for the use-case UC-A positively noted that concepts derived from the seed are coherent with the topic of the law provision. Indeed, concepts like exemplify and relevant are related to proof of the uniqueness of a product, while hindrance can be interpreted as a consequence of the acts producing confusion. Concept clarification is related to exemplify, and other derivations, such as that of prudence, debate, caution and dispute from relevant, though not strictly related from a semantic point of view, were judged as appropriate in this context, since relevance is a key feature in legal debate and disputes, while prudence and caution are exercised to prevent the introduction of misleading, irrelevant or prejudicial information and to evaluate the reliability of evidence. 2 https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 3 Further details about the ASKE configuration in this case study are provided in [1]. Next, legal experts analyzed the chunks with the highest similarity with respect to the seed concept. We report below the top-two similar chunks as an example: Suitability to cause confusion, therefore, consists of two elements: 1) the originality of the imitated product, endowed with distinctive capacity, such as to become inherent, in the image with the consumer, of the product itself; 2) the absence of distinctive elements capable of showing that the origin of one product is different from that of the other. [...] b) conversely, infringement only exists where there is a likelihood of confusion for the public, consisting even in a mere danger of association between the distinctive elements. Prerequisites for the aforementioned discipline to operate, therefore, are: i) the existence of substantial identity or similarity between the signs; ii) their use for goods and services that belong to the same sector and are intended to satisfy the same market requirements; iii) identity of characteristics in the eyes of the same average consumer or only relative affinity. Both chunks are definitions of the conditions that characterize the “acts likely to cause confusion”. These and other similar chunks identified by ASKE can therefore be exploited by legal practitioners for interpretation of a new case. Indeed, their main aim is to verify that a certain fact fits or doesn’t fit some conditions established by the law. 3.2. UC-B: ASKE for general subjects Figure 3 shows a portion of the concept graph produced by ASKE in case UC-B. In this graph legal experts noticed the prevalence of general concepts over legal ones. Two concepts emerge among others, related to the two senses of the Italian word “borsa”: “bag” and “stock exchange”. Looking at concepts derived from these ones, it can be noticed that ASKE was able to perform a correct distinction between the two. Looking at the document chunks with the highest similarity with respect to the seed concept, we highlight two chunks that refer to each of the derived concepts. As noted in the aforementioned judgment No. 5443/2017, “in the present case, such reproduction also applies to details such as, for example, the slightly rounded flap situated between the two handles and covering part of the zip fastener which, if they constitute an integral part of the shape of the bag model, nevertheless also appear to be elements in themselves capable of impressing themselves on the mind of the consumer who will be able to distinguish between products even legitimately having similar shapes, the one attributable to the source of production constituted by the present plaintiffs.” Therefore, S.’s clients who intend to make investments of a financial nature first enter into a so-called ’placement contract’ with S. itself, and then enter into the actual contracts relating to their investment (subscription of units of mutual investment Figure 3: A portion of the ACG for the use-case UC-B funds, or shares in SICAVs, or conclusion of an insurance policy, or conclusion of a portfolio management contract) directly with the ’product companies’ contracted with the plaintiff. The retrieved chunks were evaluated positively by legal experts from the point of view of their pertinence to the seed. Also the relatedness of extracted concepts starting from initial seed has been evaluated as satisfactory. The application of ASKE for the exploration of a legal corpus proved promising, as the concept graph simplifies the navigation of the underlying corpus making it accessible to even non-expert users. 4. Related work With the recent progress of digital transformation, increasing interest and efforts have been devoted to the development of advanced, AI-based approaches to process huge volumes of legal digital documents and extract knowledge from them. In [6], an approach to assist legal professionals in comparing relevant precedents is presented; in [7], a method for similarity case retrieval based on the legal facts is proposed, whose model combines topic distribution and legal entity facts to make the document representation vector more suitable for legal scenarios, with focus on text similarity problem for Chinese. Semantic similarity is employed in [8], where documents are grouped into clusters, according to their content, and then regularities in the paragraphs are detected for each cluster. In [9], information extraction approach for named entity recognition has been presented, with focus on German legal documents. The increasing interest towards the application of artificial intelligence techniques to the legal field brought to the proposal of several competitions related to the analysis of legal documents and related datasets. The most relevant for the purpose of this paper is the COLIEE [10] competition, where tasks for legal information extraction from case law and statute law are proposed. It is worth noting that ASKE enforces document classification without the need to rely on pre- existing annotations, and without requiring to pre-define the number of target topics/concepts to discover. Thus, ASKE is particularly appropriate to satisfy exploratory information needs in those situations (e.g., the legal domain) where a-priori knowledge about the corpus is not available. We also note that a key component in ASKE is Sentence-BERT [5], a modification of BERT language model [11] that is specifically aimed at representing sentence meaning in a vector space. LEGAL-BERT, a version of BERT pre-trained on legal corpora [12], has been proposed for the English language, and Italian Legal BERT [13] is under evaluation for the Italian language. Another proposal for Italian legal documents is LamBERTa [14], with a focus on law article retrieval. We eventually decided to adopt Sentence-BERT because it has been trained in such a way to ensure consistent representation of the meaning of entire sentences, which was a major requirement in designing ASKE and for dealing with legal language complexity. With the consolidation of these latter models that combine consistent sentence representation with in-domain pre-training, an extended version of ASKE based on them could be evaluated as future work. 5. Concluding Remarks In the paper, we presented the ASKE approach to legal knowledge extraction, which is based on a combination of context-aware embedding models and zero-shot learning techniques. A deep evaluation of ASKE has been performed considering the EurLex dataset [15] containing 45, 000 EU legislative documents in English, each of which is annotated by the Publication Office of the EU with one or more labels from the EuroVoc thesaurus4 . The goal of the evaluation was twofold: i) to assess the quality of the knowledge extraction process, by assessing the capability of ASKE to reconstruct the EuroVoc labels as extracted concepts, and ii) to evaluate the quality of the document classification process, by assessing the correctness of ASKE concepts assigned to each document against the ground truth labels. The results of the evaluation are positive on both sides (see [1]). Ongoing work is related to the inclusion of ASKE in a service architecture for legal knowledge extraction [16]. Furthermore, we are also working on the use of ASKE for enforcing legal document building, where a new case law document for a target case at hand can be interactively composed starting from the most similar and prominent document chunks extracted by ASKE. Acknowledgments This work was supported in part by project SERICS (PE00000014) under the NRRP MUR program funded by the EU - NGEU. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the Italian MUR. Neither the European Union nor the Italian MUR can be held responsible for them. 4 https://op.europa.eu/s/yTaY. References [1] S. Castano, A. Ferrara, E. Furiosi, S. Montanelli, S. Picascia, D. Davide, C. Stefanetti, Enforcing Legal Information Extraction through Context-aware Techniques: the ASKE Approach, Computer Law & Security Review 52 (2024). [2] J. Waldron, Stare decisis and the rule of law: A layered approach, L. Rev 1 (2012). [3] R. Tomasino, Il valore del precedente: un’analisi critica, https://www. associazionemagistrati.it/media/79559/08_Tomasino.pdf, 2023. Accessed: 2023. [4] J. Montrose, Distinguishing cases and the limits of ratio decidendi, The Modern Law Review 19 (1956) 525–530. [5] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, 2019. arXiv:1908.10084. [6] W. Y. Mok, J. R. Mok, Legal machine-learning analysis: first steps towards a.i. assisted le- gal research, in: Proceedings of the 17th International Conference on Artificial Intelligence and Law, ICAIL ’19, Association for Computing Machinery, New York, NY, USA, 2019, p. 266–267. URL: http://doi.org/10.1145/3322640.3326737. doi:10.1145/3322640.3326737. [7] W. Hu, S. Zhao, Q. Zhao, H. Sun, X. Hu, R. Guo, Y. Li, Y. Cui, L. Ma, BERT_LF: a similar case retrieval method based on legal facts, Wireless Communications and Mobile Computing 2022 (2022) 1–9. URL: http://doi.org/10.1155/2022/2511147. doi:10.1155/2022/2511147. [8] G. De Martino, G. Pio, M. Ceci, Prilj: an efficient two-step method based on embedding and clustering for the identification of regularities in legal case judgments, Artificial Intelligence and Law 30 (2022) 359–390. [9] E. Leitner, G. Rehm, J. Moreno-Schneider, Fine-grained named entity recognition in legal documents, in: M. Acosta, P. Cudré-Mauroux, M. Maleshkova, T. Pellegrini, H. Sack, Y. Sure-Vetter (Eds.), Semantic Systems. The Power of AI and Knowledge Graphs, Springer International Publishing, Cham, 2019, pp. 272–287. [10] J. Rabelo, R. Goebel, M.-Y. Kim, Y. Kano, M. Yoshioka, K. Satoh, Overview and Discussion of the Competition on Legal Information Extraction/Entailment (COLIEE) 2021, The Review of Socionetwork Strategies 16 (2022) 111–133. URL: https://ideas.repec.org/a/spr/trosos/ v16y2022i1d10.1007_s12626-022-00105-z.html. doi:10.1007/s12626-022-00105- . [11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805. [12] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2898–2904. URL: https://aclanthology.org/2020.findings-emnlp.261. doi:10.18653/v1/ 2020.findings-emnlp.261. [13] D. Licari, G. Comandé, ITALIAN-LEGAL-BERT: A pre-trained transformer language model for italian law, in: Companion Proceedings of the 23rd International Conference on Knowledge Engineering and Knowledge Management, Bozen-Bolzano, Italy, September 26-29, 2022, 2022. URL: https://ceur-ws.org/Vol-3256/km4law3.pdf. [14] A. Tagarelli, A. Simeri, Unsupervised law article mining based on deep pre-trained language representation models with application to the italian civil code, Artificial Intelligence and Law 30 (2021) 417–473. URL: https://doi.org/10.1007%2Fs10506-021-09301-8. doi:10.1007/ s10506-021-09301-8. [15] I. Chalkidis, E. Fergadiotis, P. Malakasiotis, I. Androutsopoulos, Large-scale multi- label text classification on EU legislation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Lin- guistics, Florence, Italy, 2019, pp. 6314–6322. URL: https://aclanthology.org/P19-1636. doi:10.18653/v1/P19-1636. [16] V. Bellandi, S. Castano, S. Montanelli, D. Riva, et al., A service architecture for ai-based legal knowledge extraction, in: CEUR WORKSHOP PROCEEDINGS, volume 3478, CEUR Workshop Proceedings, 2023, pp. 110–119.