Improving the accessibility of EU laws: the Chat-EUR-Lex
                                project
                                Manola Cherubini1,*,†, Francesco Romano1,*,†, Andrea Bolioli2,*,†, Lorenzo De
                                Mattei2,*,†, and Mattia Sangermano2,*,†

                                1Institute of Legal Informatics and Judicial Systems (IGSG-CNR), via dei Barucci 20, Florence, 50127, Italy
                                2Aptus.AI, Largo Padre Renzo Spadoni 1, 56126 Pisa, Italy


                                                   Abstract
                                                   In this article we describe the results of an ongoing research project on the use of Chat-Based
                                                   Large Language Models (Chat LLMs) and Retrieval Augmented Generation (RAG) for the access
                                                   to legal repositories. We are integrating Chat LLMs and RAG to access a dataset of legal acts in
                                                   English and Italian (a subset of EUR-Lex collection), and interact through a chatbot. We present
                                                   the state of the art, the objectives, the use cases, the methodology used in the project, and then
                                                   we discuss the preliminary results.

                                                   Keywords
                                                   Legal Informatics, Large Language Models (LLMs), Retrieval Augmented Generation (RAG)


                                1. Introduction                                                     2. Related works
                                In this article we describe the partial results of an               As stated in [1] and many other sources, “Legal
                                ongoing research project on the use of Chat-Based                   professionals rely on accurate and up-to-date
                                Large Language Models (Chat LLMs) and Retrieval                     information to make informed decisions, interpret
                                Augmented Generation (RAG) for the access to                        laws, and provide legal counsel”. The phenomenon of
                                normative repositories.                                             hallucination and nonsensical outputs of systems
                                    In the project, we are integrating Chat LLMs and                based on LLMs is obviously not acceptable in the legal
                                RAG to access a dataset of legal documents (European                context. To the best of our knowledge, the first survey
                                legal acts taken from EUR-Lex repository) and to                    on the challenges faced by LLMs in the legal domain
                                allow the user to interact through a chatbot.                       was presented in [2], but mainly for the Chinese language.
                                    In the next sections, we will present the state of              In other domains, such as the financial one, a few
                                the art (2. Related works), the objectives and the                  LLMs have already been developed [3]. Large
                                methodology used (3. Chat-EUR-Lex methodology),                     language models are also used in healthcare where
                                the results of a research survey (4. Research survey),              LLMs are useful for processing and understanding
                                the system architecture (5. System architecture), and               medical text data, providing valuable insights, and
                                then we discuss the results presented in the previous               supporting clinical decision-making [4].
                                sections (6. Discussion and conclusions).                                LLMs are posing interesting challenges to those
                                                                                                    who are experimenting with these technologies in the
                                                                                                    legal field, where the “complexities of legal language,
                                                                                                    nuanced interpretations, and the ever-evolving
                                                                                                    nature of legislation present unique challenges that


                                Ital-IA 2024: 4th National Conference on Artificial Intelligence,       0000-0002-0242-6633 (Manola Cherubini); 0000-0001-5250-
                                organized by CINI, May 29-30, 2024, Naples, Italy                     7733 (Francesco Romano); 0000-0003-1681-9435 (Andrea
                                ∗ Corresponding author.                                               Bolioli)
                                † These authors contributed equally.                                              © 2024 Copyright for this paper by its authors. Use permitted under
                                                                                                                  Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                   manola.cherubini@igsg.cnr.it; francesco.romano@igsg.cnr.it;
                                andrea.bolioli@aptus.ai; lorenzo@aptus.ai;
                                mattia.sangermano@aptus.ai


CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073
require tailored solutions” [1]. There are many              conversational interface that deals with complex legal
questions and fears about the actual use of these            texts (the regulations in English and Italian published
artificial intelligence tools, e.g., their opacity [5] and   in the EUR-Lex repository), can provide simplified
the possibility of hallucinations, but also “legal           explanations, and allows the user to conduct context-
problems concerning intellectual property, data              specific interaction. To present the methodologies
privacy, and bias and discrimination” [6]. For this          used, we describe the activities performed:
reason, in the European Union it has been decided to              • Legal and Ethics risk assessment. We
regulate the use of artificial intelligence in specific                performed a legal and ethics assessment.
sectors, but also to adopt a regulation that provides                  The prototype will be compliant with GDPR
for a regulatory framework of reference only for high-                 and EU AI Act, i.e., we will comply with the
risk AI systems [7]. Some experiments conducted on                     rules set by EU regulations.
legal datasets show that LLMs can improve the                     • UX Research and Survey. We collected data
performance of document page classification [8][9],                    and information from a sample of potential
the annotation of legal texts [10], the summarization                  users through a questionnaire, both in
of legal texts [11] [12], the legal rule classification                Italian and in English. Objective of the
[13], the legal statute identification from facts [14]                 survey: understand the needs of people
and the mining of legal argument [15]. Other trials                    using digital legal resources, and their level
explore the ability of LLMs “to explain legal concepts                 of satisfaction; identify users' needs and
from statutory provisions to legal professionals” [16]                 desires regarding chatbot interaction; know
and to create “a register of obligations from various                  the fears related to the use of generative AI.
types of legislative and regulatory material” [17].                    User experience (UX) research involves
      Other uses can also be mentioned, such as LLMs                   studying how users interact with the current
as legal tutors in the context of legal training [18] and              EUR-Lex system and identifying pain points
in one of the most basic tasks required of lawyers, the                and challenges.
so-called “statutory reasoning” [19]. Recently,                   • Data collection. Chat-EUR-Lex dataset
legislative drafting experiments have been carried out                 comprises a selection of in-force legal acts in
with ChatGPT, particularly for “the comparison of                      English and Italian sourced from EUR-Lex,
legislation among jurisdictions and the synthesis of                   covering the period from January 1, 2014, to
the best possible policy for the country based on this                 December 31, 2023. Specifically, it includes
comparison” [20].                                                      all historical texts preserved in Celex 3
      As is known, generative AI models have been                      sector that remain unaltered over time,
found to hallucinate, i.e., they can generate false or                 along with the most recent consolidated
nonsensical statements [21] [22]. Two strategies for
                                                                       versions in Celex 0 sector for acts that have
reducing this problem are Fine-tuning and Retrieval-                   undergone amendments. Corrigenda are
Augmented Generation (RAG) [23]. In both cases, we                     omitted from this dataset. Additionally, the
try to provide the LLM with the relevant information                   EUR-Lex documents that are not provided
(according to a domain or a specific query). Fine-                     with XML or HTML data are excluded from
tuning involves additional training on a specific                      the selection. Number of documents in
dataset, tailoring the model to certain tasks or                       English: 19062; documents in Italian:
domains [24] [25]. This improves accuracy but limits                   18164.
the model to knowledge up to the last fine-tuning. RAG
                                                                  • Semantic search engine setup. Semantic
merges a pre-trained model with a retrieval system,
                                                                       search must allow users to find relevant
accessing current data for accurate responses on
                                                                       legal information even if they don't use
recent or specific topics. Its success hinges on the
                                                                       precise legal terminology. This involves
quality of retrieved information and requires
                                                                       using Natural Language Processing (NLP)
maintaining a large, updated database. Both methods
                                                                       techniques, particularly neural embedding,
enhance model performance in specific areas,
                                                                       such as the one introduced by (Lai et al.
balancing current information and resource needs.
                                                                       2023).
                                                                  • RAG-based Chat system development. RAG
3. Chat-EUR-Lex methodology                                            combines retrieval-based methods with
In this section, we present the main problems faced                    generative language models to provide
and the methodologies used in the ongoing Chat-EUR-                    accurate     and     contextually    relevant
Lex project. Our objective is to create an AI-powered                  responses to user queries. The user can read
         both the generated answer and the relevant       people that viewed the questionnaire (Views) in
         sources, i.e. the portions of regulations used   Italian and English, as of March 30, 2024.
         to generate the answer.
    •    First version release (June 2024). The first     Table 1
         version of the prototype is released to a        Questionnaire results in Italy (language: Italian),
         selected group of users. This version should     other EU countries (language: English), and total
         provide basic functionality and serve as a       results
         starting point for further improvements.
    •    Feedback collection and tuning. User             LANG        VIEWS    STARTS    SUBMISSIONS
         feedback is actively collected and analysed.     Italian       769       315            192
         This feedback is used to identify areas for      English       530       184            105
         improvement and fine-tune both the chat          Total        1299       499            297
         system and the user interface. This iterative
         process continues to enhance the system's
         effectiveness and user satisfaction.

                                                               We report here some statistics on the Italian
4. Research survey
                                                          responses: 54% of the respondents are legal experts
In this section we present the results of the             (law researchers, jurists, lawyers, compliance
questionnaire distributed from December 28, 2023, to      specialists, etc.), while 46% are not legal experts. 66%
March 31, 2024, aimed at legal professionals, law         say that they consulted the EUR-Lex repository at
researchers, public officials in the legal sector,        least once. When asked which tool they mainly use to
compliance specialists, and other people interested in    search for legal documents and regulations, 48%
the use of Generative AI in the legal domain, in Italy    answer mainly Google search, 37% mainly EUR-Lex
and other European countries. The objectives of the       search engine, 15% mainly other tools (we do not
questionnaire were to understand the needs of people      report here the answers on the other legal sources).
using digital legal resources (EUR-Lex in particular)     60.4% say that a generative AI chatbot could help
and their level of satisfaction; identify users' needs    search and interaction, 33.3% don’t know, 6.2% say
and desires regarding chatbot interaction; know the       No. The question “Do you know what generative AI is
fears related to the use of generative AI. The            and/or are you a user of generative AI tools?” is
questionnaire was anonymous; the languages used           answered: 51% “Little”, 27% say “Yes, I use them
were Italian and English. We distributed it online on     regularly”, 22% “Not at all” (remember that these
websites and with targeted e-mail activity.               percentages concern responses in Italy). Finally, 87%
    The questionnaire contained 22 questions: 4           think generative AI must be regulated, 8% don’t know,
questions for demographic information (age, gender,       5% answer No. For reasons of space, we do not report
education, profession); 9 multiple choice questions; 6    here the answers to the question “What kind of
open-ended questions; 2 yes/no questions, and 1           requests would you make to the chatbot?”.
rating question. Regarding the topic of the use of LLMs       In summary, these responses allow us to assess
for accessing European laws, the most important           the level of knowledge of legal experts in generative
questions are: “7) To search for legal documents,         artificial intelligence, to see if there are differences
regulations and rulings, do you mainly use the EUR-       between legal experts and other people, to know their
Lex search engine, or do you use Google Search or         fears on these issues, and, above all, to collect the
something else?”. “17) In the legal domain, could a       needs and requirements of potential users of the
generative AI chatbot help search and interaction?”.      chatbot.
“18) What kind of requests would you make to the              A detailed report containing the complete
chatbot? Write one or more example requests.”. “19)       questionnaire, the aggregate results and a detailed
Do you know what generative AI is and/or are you a        analysis will be published in May 2024 on the GitHub
user of generative AI tools?”. “21) Do you have any       project repository.
concerns about the use of generative AI in the legal
field?”.                                                  5. System architecture
    The following table (Table 1) presents the
numbers of Submissions, number of people that did         The pipeline of Chat-EUR-Lex prototype is divided
not complete the questionnaire (Starts), number of        into two main parts:
    •     An asynchronous batched pipeline which               While our dataset contains about 37000 legal
          collects and indexes the documents from          acts, the need for partitioning these laws for a
          EUR-Lex into a search engine.                    granular retrieval process amplifies the total count of
                                                           retrievable documents into about 371000 texts
     • A synchronous pipeline that gets the users'         (“chunks”). This extensive partitioning provides a
          queries, retrieves relevant contextual           more detailed context for the RAG system, allowing
          information and provides a response to the       for more accurate answer generation. On the other
          users.                                           hand, the increased number of documents naturally
    The asynchronous batched pipeline comprises            presents a challenge for our retrieval process.
three main components:
     1. A crawler that collects the data from EUR-
          Lex.
                                                           6. Discussion and conclusions
     2. A chunker that splits the documents into           In this project, we are trying different combinations of
          smaller segments.                                the mentioned parameters using both open-source
     3. An embedding model that transforms the             and closed-source models to investigate the readiness
          segments into dense vectors to be indexed in     of LLMs to build a system for legislative research.
          the vector DB.                                       We are performing two evaluation steps to
    The synchronous pipeline comprises two main            compare different models and parameters:
components:                                                     1. Search engine evaluation: we are comparing
    • A retriever that transforms the query into a                   different Embedding models, chunking
         vector using the same embedding models                      strategies and k-nearest neighbors search
         used by the asynchronous pipeline and looks                 techniques to select the best combinations
         into the vector DB for similar contents.                    to retrieve good-quality results.
    • An LLM that gets both the query and the                   2. Response generator evaluation: having fixed
         context inserted in a prompt template and                   the best combination for contextual
         produces a response to be provided to the
                                                                     information retrieval thanks to step 1
         users.
                                                                     evaluation, we will compare the quality of
    Each time the user submits a new query, the whole
chat history is passed to the LLM until the maximum                  the generated response using different
prompt length is reached; in that case, older chat parts             prompt templates, LLMs and LLM
are truncated.                                                       parameters.
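
The truncation rule for the chat history can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function name `build_prompt`, the template text, and the character-based length budget are all hypothetical (a real system would more likely count model tokens).

```python
from typing import List, Tuple

# Hypothetical prompt template; the project's actual template is not given in the text.
TEMPLATE = "Context:\n{context}\n\nConversation:\n{history}\n\nQuestion: {query}"


def build_prompt(query: str, context: str, history: List[Tuple[str, str]],
                 max_chars: int) -> str:
    """Fill the template, dropping the oldest chat turns until the prompt fits.

    Length is measured in characters for simplicity (an assumption);
    if even the prompt without history is too long, it is returned as is.
    """
    turns = list(history)  # oldest turn first
    while True:
        history_text = "\n".join(f"{role}: {text}" for role, text in turns)
        prompt = TEMPLATE.format(context=context, history=history_text, query=query)
        if len(prompt) <= max_chars or not turns:
            return prompt
        turns.pop(0)  # truncate the oldest turn first
```

Dropping whole turns from the oldest end keeps the most recent exchange intact, which is usually the part the new query refers to.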
    This process involves several parameters to be              For step 1 evaluation, we are creating a gold
selected, such as:                                              dataset using expert annotators, and we use standard
    • Chunking techniques and size.                             search engine evaluation metrics such as the
    • Embedding models.                                         Mean Reciprocal Rank. For step 2, evaluation of
    • K-nearest neighbor search techniques.                     different settings will be proposed to experts
    • Prompt templates.                                         who will ask the same questions and attribute
    • LLM and its parameters.                                   scores to each response. Preliminary results have
                                                                shown that when the context provided to gpt-4
     In summary, the RAG approach is a blend of two             by the retrieval system is consistent with the
key components: a retrieval system and a generator.             question asked, the generated answer is concise,
The retrieval system scans through a database of                comprehensible, and accurate. This consistency
documents to fetch the most relevant ones in                    significantly minimizes the problem of
response to a user query. The most recent solutions
                                                                hallucination, wherein the model might generate
for retrieval systems employed in RAGs rely on
semantic search utilizing embeddings.                           false or nonsensical information.
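
To make the retrieval step concrete, the sketch below ranks text chunks by cosine similarity to a query and returns the top k. It is illustrative only: the toy `embed` function (a word-count vector) stands in for a neural embedding model and, unlike one, only matches words shared by query and chunk. The names `embed`, `cosine`, `retrieve` and the sample chunks are assumptions, not part of the project.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy embedding: word-count vector. A real system would call a neural model."""
    return dict(Counter(text.lower().split()))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Made-up sample chunks, for illustration only.
chunks = [
    "personal data shall be processed lawfully",
    "the controller shall notify a personal data breach",
    "fishing quotas are set for each member state",
]
top = retrieve("personal data breach notification", chunks, k=2)
```

In a production setting the linear scan over all chunks would be replaced by a k-nearest neighbor index in the vector DB; the ranking criterion stays the same.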
     The generator, on the other hand, uses these                    The large number of chunks that can
retrieved documents to generate a well-informed                 contribute to the generation of the answer
answer. This process ensures that the system                    naturally presents a challenge for our retrieval
provides responses that are both informative and                process. In this context, we are actively exploring
contextually accurate. In the current project setup             strategies to improve the efficiency of this crucial
(April 2024), the gpt-4 model powers the generation             component. One of the promising directions we
of responses. For the creation of embeddings, we                are considering involves leveraging not only the
utilize                     text-embedding-ada-002              semantic content of the normative sources but
(https://platform.openai.com/docs/models/embedd
                                                                also the boundary information, such as metadata.
ings).
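
The Mean Reciprocal Rank mentioned for the search engine evaluation averages, over all queries, the reciprocal of the rank at which the first relevant result appears (counting 0 when no relevant result is retrieved). A minimal sketch follows; the function name, the dictionary-based format of the gold dataset, and the chunk identifiers are hypothetical.

```python
from typing import Dict, List, Set


def mean_reciprocal_rank(results: Dict[str, List[str]],
                         gold: Dict[str, Set[str]]) -> float:
    """Average of 1/rank of the first relevant result per query (0 if none is found).

    results: query -> ranked list of retrieved chunk ids (best first)
    gold:    query -> set of chunk ids judged relevant by annotators
    """
    if not results:
        return 0.0
    total = 0.0
    for query, ranked_ids in results.items():
        relevant = gold.get(query, set())
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(results)
```

For example, if the first relevant chunk appears at rank 1 for one query, at rank 2 for a second, and is missing for a third, the score is (1 + 1/2 + 0) / 3 = 0.5.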
    The inclusion of metadata in our retrieval process     Acknowledgements
could potentially imbue our system with the ability to
home in on the most relevant documents, thereby            Chat-EUR-Lex project is funded within the framework
optimizing the retrieval process and improving the         of the NGI Search project under grant agreement No
overall performance of the chat system. On the other       101069364. Views and opinions expressed are
hand, the utilization of a specific embedding model        however those of the author(s) only and do not
built on legal data could be beneficial, as opposed to a   necessarily reflect those of the European Union or
generic embedder. Indeed, this model could provide a       European Commission. Neither the European Union
more nuanced understanding of the legal texts, thus        nor the granting authority can be held responsible for
enhancing the retrieval process.                           them.
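
One possible way to use metadata, sketched here under assumptions, is to filter candidate chunks on metadata fields before ranking them by similarity. The `Chunk` dataclass, the field names `language` and `year`, and the precomputed `score` are hypothetical; the text does not specify which metadata the project will use or how.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    text: str
    language: str  # hypothetical metadata field, e.g. "en" or "it"
    year: int      # hypothetical metadata field: year of the act
    score: float   # similarity to the query, assumed to be computed elsewhere


def filter_then_rank(chunks: List[Chunk], language: Optional[str] = None,
                     min_year: Optional[int] = None, k: int = 3) -> List[Chunk]:
    """Keep chunks matching the metadata constraints, then return the top k by score."""
    kept = [
        c for c in chunks
        if (language is None or c.language == language)
        and (min_year is None or c.year >= min_year)
    ]
    return sorted(kept, key=lambda c: c.score, reverse=True)[:k]
```

Filtering first shrinks the candidate set, so a chunk with a high similarity score but the wrong language or period never reaches the generator.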
    The evaluation of the preliminary results is
promising: on simple tasks, i.e. simple queries, the       References
system performs on par with human experts, as
                                                           [1]   J. Cui, Z. Li, Y. Yan, B. Chen, L. Yuan, Chatlaw:
attested in similar research [26].
                                                                 Open-source legal large language model with
    We must highlight the need for legal experts to
                                                                 integrated external knowledge bases, arXiv
make qualitative assessments, as well as quantitative
                                                                 preprint arXiv:2306.16092, (2023).
evaluations, both due to the complexity of the legal
                                                           [2]   J. Lai, W. Gan, J. Wu, Z. Qi, P.S. Yu, Large Language
domain and because the evaluation can be done in
                                                                 Models in Law: A Survey. arXiv preprint
different ways depending on the use case, aim and
                                                                 arXiv:2312.03718. (2023).
user target. It does not seem possible to create a
                                                           [3]   S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze,
unique dataset to evaluate the quality of the results in
                                                                 S. Gehrmann, P. Kambadur, D. Rosenberg, G.
an automatic way for QA tasks in the legal domain.
                                                                 Mann, BloombergGPT: A Large Language Model
Multiple points of view may be equally valid, and a
                                                                 for Finance, arXiv preprint arXiv:2303.17564v2
unique ‘ground truth’ may not exist, as discussed for
                                                                 (2023).
other tasks in Perspectivist Approaches to NLP [27].
                                                           [4]   S. Reddy, Evaluating large language models for
    A general problem for experiments with LLMs is
                                                                 use in healthcare: A framework for translational
“non-repeatability”: the experiments are not exactly
                                                                 value assessment, in volume 41 of Informatics in
reproducible because the systems are not
                                                                 Medicine               Unlocked,               (2023).
deterministic; moreover, in proprietary commercial
                                                                 https://doi.org/10.1016/j.imu.2023.101304
systems, we do not know the LLM’s parameters and
                                                           [5]   A. Contaldo, F. Campara, Intelligenza artificiale e
the dataset used for training; new versions and new
                                                                 Diritto. Dai sistemi esperti “classici” ai sistemi
LLMs are released quickly.
                                                                 esperti “evoluti”: tecnologia e implementazione
    Here are some additional critical issues that we
                                                                 giuridica, in: G. Taddei Elmi, A. Contaldo (Eds.),
are addressing:
                                                                 Intelligenza artificiale. Algoritmi giuridici. Ius
     • text length limitations: currently, there's a             condendum o “fantadiritto”, Pacini editore, Pisa,
          limit on the length of text the LLM can handle
                                                                 2020, p. 24.
          effectively. This necessitates breaking down
                                                           [6]   Z. Sun, A short survey of viewing large language
          longer texts, which can be cumbersome;
                                                                 models in legal aspect, arXiv preprint
     • impact of short queries: the chatbot's                    arXiv:2303.09136 (2023).
          accuracy and precision suffer when               [7]   G. Finocchiaro, Artificial intelligence. What are
          responding to very short or poorly defined             the rules? Il Mulino, Bologna, 2024.
          queries. More detailed user queries lead to      [8]   P. Fragkogiannis, M. Forster, G.E. Lee, D. Zhang,
          better results;                                        Context-Aware Classification of Legal Document
     • single vs. multiple documents: the chatbot                Pages. In: Proceedings of the 46th International
          performs best when responding to queries               ACM SIGIR Conference on Research and
          that target information from a single                  Development in Information Retrieval (SIGIR
          document, rather than synthesizing                     '23), Association for Computing Machinery, NY,
          information from multiple sources.                     pp           3285-3289,              2023,        doi:
                                                                 10.1145/3539618.3591839
                                                           [9]   D. Trautmann, Large Language Model Prompt
                                                                 Chaining       for     Long        Legal    Document
                                                                 Classification,             arXiv             preprint
                                                                 arXiv:2308.04138.(2023).
[10] J. Savelka, K.D. Ashley, The unreasonable               [20] G. Hill, The emerging artificial intelligence (AI)
     effectiveness of large language models in zero-              and national uniform legislation, in volume 97.5
     shot semantic annotation of legal texts,                     of Australian Law Journal, (2023), pp. 303-306.
     Frontiers in Artificial Intelligence, 6: 1-14, 2023,    [21] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H.
     doi: 10.3389/frai.2023.1279794.                              Wang, T. Liu, A survey on hallucination in large
[11] M. Cherubini, F. Romano, A. Bolioli, N. De                   language models: Principles, taxonomy,
     Francesco, I. Benedetto, The summarization of                challenges, and open questions. arXiv preprint
     legal texts: an experiment with GPT-3. Rivista               arXiv:2311.05232 (2023).
     italiana di informatica e diritto, 5, 1: 191-204        [22] S. M. Tonmoy, S.M. Zaman, V. Jain, A. Rani, V.
     (2023) https://doi.org/10.32091/RIID0103                     Rawte, A. Chadha, A. Das, A comprehensive
[12] D. Datta, S. Soni, R. Mukherjee, S. Ghosh,                   survey of hallucination mitigation techniques in
     MILDSum: A Novel Benchmark Dataset for                       large language models. arXiv preprint
     Multilingual Summarization of Indian Legal Case              arXiv:2401.01313 (2024).
     Judgments, arXiv preprint arXiv:2310.18600v1,           [23] P. Lewis, E. Perez, A. Piktus, F. Petroni, V.
     (2023)                                                       Karpukhin, N. Goyal, D. Kiela, Retrieval-
     https://doi.org/10.48550/arXiv.2310.18600                    augmented generation for knowledge-intensive
[13] D. Liga, L. Robaldo, Fine-tuning GPT-3 for legal             NLP tasks, in volume 33 of Advances in Neural
     rule classification, in volume 51 of Computer                Information Processing Systems (2020), pp.
     Law & Security Review, (2023) doi:                           9459-9474.
     10.1016/j.clsr.2023.105864.                             [24] L. Xu, H. Xie, S.Z.J. Qin, X. Tao, F.L. Wang,
[14] S. Paul, A. Mandal, P. Goyal, S. Ghosh, Pre-trained          Parameter-efficient fine-tuning methods for
     language models for the legal domain: a case                 pretrained language models: A critical review
     study on Indian law. In: Proceedings of the                  and        assessment.         arXiv      preprint
     Nineteenth International Conference on                       arXiv:2312.12148 (2023).
     Artificial Intelligence and Law, (2023), pp 187-        [25] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S.
     196.                                                         Wang, W. Chen, Lora: Low-rank adaptation of
[15] A. Al Zubaer, M. Granitzer, J. Mitrović,                     large language models. arXiv preprint
     Performance analysis of large language models                arXiv:2106.09685 (2021).
     in the domain of legal argument mining,                 [26] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K.
     Frontiers in Artificial Intelligence, 6, (2023), doi:        McKeown, T.B. Hashimoto, Benchmarking large
     10.3389/frai.2023.1278796.                                   language models for news summarization, in
[16] J. Savelka, K.D. Ashley, M.A. Gray, H.                       volume 12 of Transactions of the Association for
     Westermann, H. Xu, Explaining Legal Concepts                 Computational Linguistics, (2024), pp. 39-57
     with Augmented Large Language Models (GPT-              [27] F. Cabitza, A. Campagner, V. Basile, Toward a
     4), (2023) arXiv preprint arXiv:2306.09525v2.                perspectivist turn in ground truthing for
[17] J. Ioannidis, J. Harper, M.S. Quah, D. Hunter,               predictive computing, in: Proceedings of the
     Gracenote.ai: Legal Generative AI for Regulatory             AAAI Conference on Artificial Intelligence,
     Compliance. In: Proceedings of the Third                     volume 37, No. 6, 2023, pp. 6860-6868.
     International       Workshop       on      Artificial
     Intelligence and Intelligent Assistance for Legal
     Professionals in the Digital Workplace                  A. Online Resources
     (LegalAIIA 2023) co-located with (ICAIL 2023),
     CEUR-WS.org, Elsevier, pp. 20-31.                       The GitHub project repository can be consulted at
[18] D. Charlotin, Large Language Models and the             https://github.com/Aptus-AI/chat-eur-lex. The
     Future of Law, SSRN,                                    dataset has been published on
     https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4548258,
     2023, accessed 4 April 2024.                            https://huggingface.co/datasets/AptusAI/chat-eur-lex.
[19] A. Blair-Stanek, N. Holzenberger, B. Van Durme,
     Can GPT-3 Perform Statutory Reasoning? In:
     The Nineteenth International Conference on
     Artificial Intelligence and Law (ICAIL 2023),
     ACM, NY, USA, pp. 22-31, 2023, doi:
     10.1145/3594536.3595163.