Improving the accessibility of EU laws: the Chat-EUR-Lex project

Manola Cherubini1,*,†, Francesco Romano1,*,†, Andrea Bolioli2,*,†, Lorenzo De Mattei2,*,†, and Mattia Sangermano2,*,†

1 Institute of Legal Informatics and Judicial Systems (IGSG-CNR), via dei Barucci 20, Florence, 50127, Italy
2 Aptus.AI, Largo Padre Renzo Spadoni 1, 56126 Pisa, Italy

Abstract
In this article we describe the results of an ongoing research project on the use of Chat-Based Large Language Models (Chat LLMs) and Retrieval Augmented Generation (RAG) for access to legal repositories. We are integrating Chat LLMs and RAG to access a dataset of legal acts in English and Italian (a subset of the EUR-Lex collection) and to interact with it through a chatbot. We present the state of the art, the objectives, the use cases, and the methodology used in the project, and then we discuss the preliminary results.

Keywords
Legal Informatics, Large Language Models (LLMs), Retrieval Augmented Generation (RAG)

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
† These authors contributed equally.
manola.cherubini@igsg.cnr.it (Manola Cherubini); francesco.romano@igsg.cnr.it (Francesco Romano); andrea.bolioli@aptus.ai (Andrea Bolioli); lorenzo@aptus.ai (Lorenzo De Mattei); mattia.sangermano@aptus.ai (Mattia Sangermano)
ORCID: 0000-0002-0242-6633 (Manola Cherubini); 0000-0001-5250-7733 (Francesco Romano); 0000-0003-1681-9435 (Andrea Bolioli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

In this article we describe the partial results of an ongoing research project on the use of Chat-Based Large Language Models (Chat LLMs) and Retrieval Augmented Generation (RAG) for access to normative repositories.

In the project, we are integrating Chat LLMs and RAG to access a dataset of legal documents (European legal acts taken from the EUR-Lex repository) and to allow the user to interact with it through a chatbot.

In the next sections, we present the state of the art (2. Related works), the objectives and the methodology used (3. Chat-EUR-Lex methodology), the results of a research survey (4. Research survey), and the system architecture (5. System architecture), and then we discuss the results presented in the previous sections (6. Discussion and conclusions).

2. Related works

As stated in [1] and many other sources, “Legal professionals rely on accurate and up-to-date information to make informed decisions, interpret laws, and provide legal counsel”. Hallucinations and nonsensical outputs from LLM-based systems are therefore not acceptable in the legal context. To the best of our knowledge, the first survey on the challenges faced by LLMs in the legal domain was presented in [2], although it focuses mainly on the Chinese language. In other domains, such as finance, a few LLMs have already been developed [3]. Large language models are also used in healthcare, where they are useful for processing and understanding medical text data, providing valuable insights, and supporting clinical decision-making [4].

LLMs pose interesting challenges to those who are experimenting with these technologies in the legal field, where the “complexities of legal language, nuanced interpretations, and the ever-evolving nature of legislation present unique challenges that require tailored solutions” [1].
There are many questions and fears about the actual use of these artificial intelligence tools, e.g., their opacity [5] and the possibility of hallucinations, but also “legal problems concerning intellectual property, data privacy, and bias and discrimination” [6]. For this reason, the European Union has decided to regulate the use of artificial intelligence in specific sectors, but also to adopt a regulation that provides a regulatory framework of reference only for high-risk AI systems [7]. Some experiments conducted on legal datasets show that LLMs can improve the performance of document page classification [8][9], the annotation of legal texts [10], the summarization of legal texts [11] [12], legal rule classification [13], legal statute identification from facts [14] and legal argument mining [15]. Other trials explore the ability of LLMs “to explain legal concepts from statutory provisions to legal professionals” [16] and to create “a register of obligations from various types of legislative and regulatory material” [17].

Other uses can also be mentioned, such as LLMs as legal tutors in the context of legal training [18] and in one of the most basic tasks required of lawyers, the so-called “statutory reasoning” [19]. Recently, legislative drafting experiments have been carried out with ChatGPT, particularly for “the comparison of legislation among jurisdictions and the synthesis of the best possible policy for the country based on this comparison” [20].

As is well known, generative AI models have been found to hallucinate, i.e., they can generate false or nonsensical statements [21] [22]. Two strategies for reducing this problem are fine-tuning and Retrieval-Augmented Generation (RAG) [23]. In both cases, we try to provide the LLM with the relevant information (according to a domain or a specific query). Fine-tuning involves additional training on a specific dataset, tailoring the model to certain tasks or domains [24] [25]. This improves accuracy but limits the model to knowledge up to the last fine-tuning. RAG combines a pre-trained model with a retrieval system that accesses current data, so that the model can give accurate responses on recent or specific topics. Its success hinges on the quality of the retrieved information and requires maintaining a large, updated database. Both methods enhance model performance in specific areas, with different trade-offs between up-to-date information and resource needs.

3. Chat-EUR-Lex methodology

In this section, we present the main problems faced and the methodologies used in the ongoing Chat-EUR-Lex project. Our objective is to create an AI-powered conversational interface that deals with complex legal texts (the regulations in English and Italian published in the EUR-Lex repository), can provide simplified explanations, and allows the user to conduct context-specific interaction. To present the methodologies used, we describe the activities performed:

• Legal and Ethics risk assessment. We performed a legal and ethics assessment. The prototype will be compliant with the GDPR and the EU AI Act, i.e., we will comply with the rules set by EU regulations.
• UX Research and Survey. We collected data and information from a sample of potential users through a questionnaire, both in Italian and in English. The objectives of the survey were to understand the needs of people using digital legal resources and their level of satisfaction; to identify users' needs and desires regarding chatbot interaction; and to learn about the fears related to the use of generative AI. User experience (UX) research involves studying how users interact with the current EUR-Lex system and identifying pain points and challenges.
• Data collection. The Chat-EUR-Lex dataset comprises a selection of in-force legal acts in English and Italian sourced from EUR-Lex, covering the period from January 1, 2014, to December 31, 2023. Specifically, it includes all historical texts preserved in the Celex 3 sector that remain unaltered over time, along with the most recent consolidated versions in the Celex 0 sector for acts that have undergone amendments. Corrigenda are omitted from this dataset. Additionally, the EUR-Lex documents that are not provided with XML or HTML data are excluded from the selection. Number of documents in English: 19062; documents in Italian: 18164.
• Semantic search engine setup. Semantic search must allow users to find relevant legal information even if they don't use precise legal terminology. This involves using Natural Language Processing (NLP) techniques, particularly neural embedding, such as the one introduced by (Lai et al. 2023).
• RAG-based Chat system development. RAG combines retrieval-based methods with generative language models to provide accurate and contextually relevant responses to user queries. The user can read both the generated answer and the relevant sources, i.e. the portions of regulations used to generate the answer.
• First version release (June 2024). The first version of the prototype is released to a selected group of users. This version should provide basic functionality and serve as a starting point for further improvements.
• Feedback collection and tuning. User feedback is actively collected and analysed. This feedback is used to identify areas for improvement and fine-tune both the chat system and the user interface. This iterative process continues to enhance the system's effectiveness and user satisfaction.

4. Research survey

In this section we present the results of the questionnaire distributed from December 28, 2023, to March 31, 2024, aimed at legal professionals, law researchers, public officials in the legal sector, compliance specialists, and other people interested in the use of Generative AI in the legal domain, in Italy and other European countries. The objectives of the questionnaire were to understand the needs of people using digital legal resources (EUR-Lex in particular) and their level of satisfaction; to identify users' needs and desires regarding chatbot interaction; and to learn about the fears related to the use of generative AI. The questionnaire was anonymous; the languages used were Italian and English. We distributed it online on websites and with targeted e-mail activity.

The questionnaire contained 22 questions: 4 questions for demographic information (age, gender, education, profession); 9 multiple choice questions; 6 open-ended questions; 2 yes/no questions, and 1 rating question. Regarding the topic of the use of LLMs for accessing European laws, the most important questions are:

“7) To search for legal documents, regulations and rulings, do you mainly use the EUR-Lex search engine, or do you use Google Search or something else?”
“17) In the legal domain, could a generative AI chatbot help search and interaction?”
“18) What kind of requests would you make to the chatbot? Write one or more example requests.”
“19) Do you know what generative AI is and/or are you a user of generative AI tools?”
“21) Do you have any concerns about the use of generative AI in the legal field?”

The following table (Table 1) presents the number of Submissions, the number of people that did not complete the questionnaire (Starts), and the number of people that viewed the questionnaire (Views) in Italian and English, as of March 30, 2024.

Table 1
Questionnaire results in Italy (language: Italian), other EU countries (language: English), and total results

LANG      VIEWS   STARTS   SUBMISSIONS
Italian     769      315           192
English     530      184           105
Total      1299      499           297

We report here some statistics on the Italian responses: 54% of the respondents are legal experts (law researchers, jurists, lawyers, compliance specialists, etc.), while 46% are not legal experts. 66% say that they have consulted the EUR-Lex repository at least once. When asked which tool they mainly use to search for legal documents and regulations, 48% answer mainly Google search, 37% mainly the EUR-Lex search engine, and 15% mainly other tools (we do not report here the answers on the other legal sources). 60.4% say that a generative AI chatbot could help search and interaction, 33.3% don’t know, and 6.2% say No. The question “Do you know what generative AI is and/or are you a user of generative AI tools?” is answered: 51% “Little”, 27% “Yes, I use them regularly”, 22% “Not at all” (these percentages concern responses in Italy only). Finally, 87% think generative AI must be regulated, 8% don’t know, and 5% answer No. For reasons of space, we do not report here the answers to the question “What kind of requests would you make to the chatbot?”.

In summary, these responses allow us to assess the level of knowledge of generative artificial intelligence among legal experts, to see if there are differences between legal experts and other people, to learn about their fears on these issues, and, above all, to collect the needs and requirements of potential users of the chatbot.

A detailed report containing the complete questionnaire, the aggregate results and a detailed analysis will be published in May 2024 on the GitHub project repository.

5. System architecture

The pipeline of the Chat-EUR-Lex prototype is divided into two main parts:

• An asynchronous batched pipeline which collects the documents from EUR-Lex and indexes them in a search engine.
• A synchronous pipeline that gets the users' queries, retrieves relevant contextual information and provides a response to the users.

The asynchronous batched pipeline comprises three main components:

1. A crawler that collects the data from EUR-Lex.
2. A chunker that splits the documents into smaller segments.
3. An embedding model that transforms the segments into dense vectors to be indexed in the vector DB.

The synchronous pipeline comprises two main components:

• A retriever that transforms the query into a vector, using the same embedding model used by the asynchronous pipeline, and searches the vector DB for similar contents.
• An LLM that receives both the query and the context, inserted in a prompt template, and produces a response to be provided to the users.

Each time the user submits a new query, the whole chat history is passed to the LLM until the maximum prompt length is reached; in that case, the older parts of the chat are truncated.

This process involves several parameters to be selected, such as:

• Chunking techniques and size.
• Embedding models.
• K-nearest neighbor search techniques.
• Prompt templates.
• LLM and its parameters.

In summary, the RAG approach is a blend of two key components: a retrieval system and a generator. The retrieval system scans through a database of documents to fetch the most relevant ones in response to a user query. The most recent solutions for retrieval systems employed in RAG rely on semantic search using embeddings.

The generator, on the other hand, uses these retrieved documents to generate a well-informed answer. This process is intended to yield responses that are both informative and contextually accurate. In the current project setup (April 2024), the gpt-4 model powers the generation of responses. For the creation of embeddings, we use text-embedding-ada-002 (https://platform.openai.com/docs/models/embeddings).

While our dataset contains about 37000 legal acts, the need to partition these laws for a granular retrieval process raises the total count of retrievable documents to about 371000 texts (“chunks”). This extensive partitioning provides a more detailed context for the RAG system, allowing for more accurate answer generation. On the other hand, the increased number of documents naturally presents a challenge for our retrieval process.

6. Discussion and conclusions

In this project, we are trying different combinations of the parameters mentioned above, using both open-source and closed-source models, to investigate the readiness of LLMs to build a system for legislative research.

We are performing two evaluation steps to compare different models and parameters:

1. Search engine evaluation: we are comparing different embedding models, chunking strategies and k-nearest neighbors search techniques to select the best combinations to retrieve good-quality results.
2. Response generator evaluation: having fixed the best combination for contextual information retrieval in step 1, we will compare the quality of the generated response using different prompt templates, LLMs and LLM parameters.

For the step 1 evaluation, we are creating a gold dataset with expert annotators and using standard search engine evaluation metrics such as the Mean Reciprocal Rank. For step 2, the different settings will be proposed to experts, who will ask the same questions and attribute scores to each response. Preliminary results have shown that when the context provided to gpt-4 by the retrieval system is consistent with the question asked, the generated answer is concise, comprehensible, and accurate. This consistency significantly reduces the problem of hallucination, wherein the model might generate false or nonsensical information.

The large number of chunks that can contribute to the generation of the answer naturally presents a challenge for our retrieval process. In this context, we are actively exploring strategies to improve the efficiency of this crucial component. One of the promising directions we are considering involves leveraging not only the semantic content of the normative sources but also the boundary information, such as metadata.
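As a rough illustration of the retrieval step and of the step 1 evaluation discussed in this paper, the following minimal Python sketch ranks toy chunks by cosine similarity, optionally filters them on a metadata field before ranking, and computes the Mean Reciprocal Rank against a small gold set. The vectors, chunk identifiers, the `lang` metadata field and all function names are hypothetical and chosen only for illustration; the project itself relies on text-embedding-ada-002 embeddings and a vector DB, neither of which is reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query_vec, index, k=3, lang=None):
    """Return the ids of the k chunks most similar to the query.

    `index` is a list of dicts with keys 'id', 'vector' and 'lang'.
    If `lang` is given, chunks are filtered on that metadata field first.
    """
    candidates = [c for c in index if lang is None or c["lang"] == lang]
    ranked = sorted(candidates,
                    key=lambda c: cosine(query_vec, c["vector"]),
                    reverse=True)
    return [c["id"] for c in ranked[:k]]


def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the first relevant chunk per query (0 if none is found)."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranked_ids, relevant in zip(rankings, gold):
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Toy index: two-dimensional vectors stand in for real embeddings (assumption).
INDEX = [
    {"id": "en-1", "vector": [1.0, 0.0], "lang": "en"},
    {"id": "en-2", "vector": [0.0, 1.0], "lang": "en"},
    {"id": "it-1", "vector": [0.9, 0.1], "lang": "it"},
]

# Toy queries and the gold-relevant chunk ids an annotator might assign to each.
QUERIES = [[1.0, 0.1], [0.2, 1.0]]
GOLD = [{"en-1"}, {"en-1"}]

rankings = [top_k(q, INDEX, k=2, lang="en") for q in QUERIES]
mrr = mean_reciprocal_rank(rankings, GOLD)
print(rankings, mrr)
```

Filtering on metadata before ranking shrinks the candidate set, which is one simple way the metadata idea could be tried; whether it helps on the real corpus of about 371000 chunks is an open question that only the gold-dataset evaluation can answer.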
The inclusion of metadata in our retrieval process could help the system home in on the most relevant documents, thereby optimizing the retrieval process and improving the overall performance of the chat system. In addition, an embedding model built specifically on legal data could be beneficial, as opposed to a generic embedder. Indeed, such a model could provide a more nuanced understanding of the legal texts, thus enhancing the retrieval process.

The evaluation of the preliminary results is promising: on simple tasks, i.e. simple queries, the system performs on par with human experts, as attested in similar research [26].

We must highlight the need for legal experts to make qualitative assessments, as well as quantitative evaluations, both due to the complexity of the legal domain and because the evaluation can be done in different ways depending on the use case, aim and user target. It does not seem possible to create a unique dataset to evaluate the quality of the results in an automatic way for QA tasks in the legal domain. Multiple points of view may be equally valid, and a unique ‘ground truth’ may not exist, as discussed for other tasks in Perspectivist Approaches to NLP [27].

A general problem for experiments with LLMs is “non-repeatability”: the experiments are not exactly reproducible because the systems are not deterministic; moreover, in proprietary commercial systems, we do not know the LLM’s parameters and the dataset used for training; and new versions and new LLMs are released quickly.

Here are some additional critical issues that we are addressing:

• text length limitations: currently, there's a limit on the length of text the LLM can handle effectively. This necessitates breaking down longer texts, which can be cumbersome;
• impact of short queries: the chatbot's accuracy and precision suffer when responding to very short or poorly defined queries. More detailed user queries lead to better results;
• single vs. multiple documents: the chatbot performs best when responding to queries that target information from a single document, rather than synthesizing information from multiple sources.

Acknowledgements

The Chat-EUR-Lex project is funded within the framework of the NGI Search project under grant agreement No 101069364. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or European Commission. Neither the European Union nor the granting authority can be held responsible for them.

References

[1] J. Cui, Z. Li, Y. Yan, B. Chen, L. Yuan, Chatlaw: Open-source legal large language model with integrated external knowledge bases, arXiv preprint arXiv:2306.16092 (2023).
[2] J. Lai, W. Gan, J. Wu, Z. Qi, P.S. Yu, Large Language Models in Law: A Survey, arXiv preprint arXiv:2312.03718 (2023).
[3] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, G. Mann, BloombergGPT: A Large Language Model for Finance, arXiv preprint arXiv:2303.17564v2 (2023).
[4] S. Reddy, Evaluating large language models for use in healthcare: A framework for translational value assessment, in volume 41 of Informatics in Medicine Unlocked, (2023). https://doi.org/10.1016/j.imu.2023.101304
[5] A. Contaldo, F. Campara, Intelligenza artificiale e Diritto. Dai sistemi esperti “classici” ai sistemi esperti “evoluti”: tecnologia e implementazione giuridica, in: G. Taddei Elmi, A. Contaldo (Eds.), Intelligenza artificiale. Algoritmi giuridici. Ius condendum o “fantadiritto”, Pacini editore, Pisa, 2020, p. 24.
[6] Z. Sun, A short survey of viewing large language models in legal aspect, arXiv preprint arXiv:2303.09136 (2023).
[7] G. Finocchiaro, Artificial intelligence. What are the rules? Il Mulino, Bologna, 2024.
[8] P. Fragkogiannis, M. Forster, G.E. Lee, D. Zhang, Context-Aware Classification of Legal Document Pages. In: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23), Association for Computing Machinery, NY, pp. 3285-3289, 2023, doi: 10.1145/3539618.3591839
[9] D. Trautmann, Large Language Model Prompt Chaining for Long Legal Document Classification, arXiv preprint arXiv:2308.04138 (2023).
[10] J. Savelka, K.D. Ashley, The unreasonable effectiveness of large language models in zero-shot semantic annotation of legal texts, Frontiers Artificial Intelligence, 6: 1-14, 2023, doi: 10.3389/frai.2023.1279794.
[11] M. Cherubini, F. Romano, A. Bolioli, N. De Francesco, I. Benedetto, The summarization of legal texts: an experiment with GPT-3. Rivista italiana di informatica e diritto, 5, 1: 191-204 (2023) https://doi.org/10.32091/RIID0103
[12] D. Datta, S. Soni, R. Mukherjee, S. Ghosh, MILDSum: A Novel Benchmark Dataset for Multilingual Summarization of Indian Legal Case Judgments, arXiv preprint arXiv:2310.18600v1, (2023) https://doi.org/10.48550/arXiv.2310.18600
[13] D. Liga, L. Robaldo, Fine-tuning GPT-3 for legal rule classification, in volume 51 of Computer Law & Security Review, (2023) doi: 10.1016/j.clsr.2023.105864.
[14] S. Paul, A. Mandal, P. Goyal, S. Ghosh, Pre-trained language models for the legal domain: a case study on Indian law. In: Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, (2023), pp. 187-196.
[15] A. Al Zubaer, M. Granitzer, J. Mitrović, Performance analysis of large language models in the domain of legal argument mining, Frontiers Artificial Intelligence, 6, (2023), doi: 10.3389/frai.2023.1278796.
[16] J. Savelka, K.D. Ashley, M.A. Gray, H. Westermann, H. Xu, Explaining Legal Concepts with Augmented Large Language Models (GPT-4), (2023) arXiv preprint arXiv:2306.09525v2.
[17] J. Ioannidis, J. Harper, M.S. Quah, D. Hunter, Gracenote.ai: Legal Generative AI for Regulatory Compliance. In: Proceedings of the Third International Workshop on Artificial Intelligence and Intelligent Assistance for Legal Professionals in the Digital Workplace (LegalAIIA 2023) co-located with (ICAIL 2023), CEUR-WS.org, Elsevier, pp. 20-31.
[18] D. Charlotin, Large Language Models and the Future of Law, SSRN https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4548258, 2023, Accessed 04 April, 2024
[19] A. Blair-Stanek, N. Holzenberger, B. Van Durme, Can GPT-3 Perform Statutory Reasoning? In: The Nineteenth International Conference on Artificial Intelligence and Law (ICAIL 2023), ACM, NY, USA, pp. 22-31, 2023, doi: 10.1145/3594536.3595163.
[20] G. Hill, The emerging artificial intelligence (AI) and national uniform legislation, in volume 97.5 of Australian Law Journal, (2023), pp. 303-306.
[21] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, T. Liu, A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232 (2023).
[22] S. M. Tonmoy, S.M. Zaman, V. Jain, A. Rani, V. Rawte, A. Chadha, A. Das, A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint arXiv:2401.01313 (2024).
[23] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, D. Kiela, Retrieval-augmented generation for knowledge-intensive NLP tasks, in volume 33 of Advances in Neural Information Processing Systems (2020), pp. 9459-9474.
[24] L. Xu, H. Xie, S.Z.J. Qin, X. Tao, F.L. Wang, Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment. arXiv preprint arXiv:2312.12148 (2023).
[25] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen, Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
[26] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. McKeown, T.B. Hashimoto, Benchmarking large language models for news summarization, in volume 12 of Transactions of the Association for Computational Linguistics, (2024), pp. 39-57
[27] F. Cabitza, A. Campagner, V. Basile, Toward a perspectivist turn in ground truthing for predictive computing, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, No. 6, 2023, pp. 6860-6868.

A. Online Resources

The GitHub project repository can be consulted at https://github.com/Aptus-AI/chat-eur-lex. The dataset has been published on https://huggingface.co/datasets/AptusAI/chat-eur-lex.