=Paper=
{{Paper
|id=Vol-3762/545
|storemode=property
|title=Improving the accessibility of EU laws: the Chat-EUR-Lex project
|pdfUrl=https://ceur-ws.org/Vol-3762/545.pdf
|volume=Vol-3762
|authors=Manola Cherubini,Francesco Romano,Andrea Bolioli,Lorenzo De Mattei,Mattia Sangermano
|dblpUrl=https://dblp.org/rec/conf/ital-ia/CherubiniRBMS24
}}
==Improving the accessibility of EU laws: the Chat-EUR-Lex project==
Improving the accessibility of EU laws: the Chat-EUR-Lex project
Manola Cherubini¹*†, Francesco Romano¹*†, Andrea Bolioli²*†, Lorenzo De Mattei²*†, and Mattia Sangermano²*†
¹ Institute of Legal Informatics and Judicial Systems (IGSG-CNR), via dei Barucci 20, Florence, 50127, Italy
² Aptus.AI, Largo Padre Renzo Spadoni 1, 56126 Pisa, Italy
Abstract
In this article we describe the results of an ongoing research project on the use of Chat-Based
Large Language Models (Chat LLMs) and Retrieval Augmented Generation (RAG) for the access
to legal repositories. We are integrating Chat LLMs and RAG to access a dataset of legal acts in
English and Italian (a subset of EUR-Lex collection), and interact through a chatbot. We present
the state of the art, the objectives, the use cases, the methodology used in the project, and then
we discuss the preliminary results.
Keywords
Legal Informatics, Large Language Models (LLMs), Retrieval Augmented Generation (RAG)
Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
∗ Corresponding author.
† These authors contributed equally.
manola.cherubini@igsg.cnr.it; francesco.romano@igsg.cnr.it; andrea.bolioli@aptus.ai; lorenzo@aptus.ai; mattia.sangermano@aptus.ai
0000-0002-0242-6633 (Manola Cherubini); 0000-0001-5250-7733 (Francesco Romano); 0000-0003-1681-9435 (Andrea Bolioli)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

In this article we describe the partial results of an ongoing research project on the use of Chat-Based Large Language Models (Chat LLMs) and Retrieval Augmented Generation (RAG) for access to normative repositories.

In the project, we are integrating Chat LLMs and RAG to access a dataset of legal documents (European legal acts taken from the EUR-Lex repository) and to allow the user to interact through a chatbot.

In the next sections, we present the state of the art (2. Related works), the objectives and the methodology used (3. Chat-EUR-Lex methodology), the results of a research survey (4. Research survey), the system architecture (5. System architecture), and then we discuss the results presented in the previous sections (6. Discussion and conclusions).

2. Related works

As stated in [1] and many other sources, “Legal professionals rely on accurate and up-to-date information to make informed decisions, interpret laws, and provide legal counsel”. Hallucinations and nonsensical outputs from LLM-based systems are not acceptable in the legal context. To the best of our knowledge, the first survey on the challenges faced by LLMs in the legal domain was presented in [2], but mainly for the Chinese language. In other domains, such as finance, a few LLMs have already been developed [3]. LLMs are also used in healthcare, where they are useful for processing and understanding medical text data, providing valuable insights, and supporting clinical decision-making [4].

LLMs are posing interesting challenges to those who are experimenting with these technologies in the legal field, where the “complexities of legal language, nuanced interpretations, and the ever-evolving nature of legislation present unique challenges that
require tailored solutions” [1]. There are many questions and fears about the actual use of these artificial intelligence tools, e.g., their opacity [5] and the possibility of hallucinations, but also “legal problems concerning intellectual property, data privacy, and bias and discrimination” [6]. For this reason, in the European Union it has been decided to regulate the use of artificial intelligence in specific sectors, but also to adopt a regulation that provides for a regulatory framework of reference only for high-risk AI systems [7]. Some experiments conducted on legal datasets show that LLMs can improve the performance of document page classification [8][9], the annotation of legal texts [10], the summarization of legal texts [11][12], legal rule classification [13], legal statute identification from facts [14] and the mining of legal arguments [15]. Other trials explore the ability of LLMs “to explain legal concepts from statutory provisions to legal professionals” [16] and to create “a register of obligations from various types of legislative and regulatory material” [17].

Other uses can also be mentioned, such as LLMs as legal tutors in the context of legal training [18] and in one of the most basic tasks required of lawyers, the so-called “statutory reasoning” [19]. Recently, legislative drafting experiments have been carried out with ChatGPT, particularly for “the comparison of legislation among jurisdictions and the synthesis of the best possible policy for the country based on this comparison” [20].

As is known, generative AI models have been found to hallucinate, i.e., they can generate false or nonsensical statements [21][22]. Two strategies for reducing this problem are fine-tuning and Retrieval-Augmented Generation (RAG) [23]. In both cases, we try to provide the LLM with the relevant information (according to a domain or a specific query). Fine-tuning involves additional training on a specific dataset, tailoring the model to certain tasks or domains [24][25]. This improves accuracy but limits the model to knowledge up to the last fine-tuning. RAG merges a pre-trained model with a retrieval system, accessing current data for accurate responses on recent or specific topics. Its success hinges on the quality of retrieved information and requires maintaining a large, updated database. Both methods enhance model performance in specific areas, balancing current information and resource needs.

3. Chat-EUR-Lex methodology

In this section, we present the main problems faced and the methodologies used in the ongoing Chat-EUR-Lex project. Our objective is to create an AI-powered conversational interface that deals with complex legal texts (the regulations in English and Italian published in the EUR-Lex repository), can provide simplified explanations, and allows the user to conduct context-specific interaction. To present the methodologies used, we describe the activities performed:
• Legal and Ethics risk assessment. We performed a legal and ethics assessment. The prototype will be compliant with the GDPR and the EU AI Act, i.e., we will comply with the rules set by EU regulations.
• UX Research and Survey. We collected data and information from a sample of potential users through a questionnaire, both in Italian and in English. Objective of the survey: understand the needs of people using digital legal resources, and their level of satisfaction; identify users' needs and desires regarding chatbot interaction; know the fears related to the use of generative AI. User experience (UX) research involves studying how users interact with the current EUR-Lex system and identifying pain points and challenges.
• Data collection. The Chat-EUR-Lex dataset comprises a selection of in-force legal acts in English and Italian sourced from EUR-Lex, covering the period from January 1, 2014, to December 31, 2023. Specifically, it includes all historical texts preserved in the Celex 3 sector that remain unaltered over time, along with the most recent consolidated versions in the Celex 0 sector for acts that have undergone amendments. Corrigenda are omitted from this dataset. Additionally, the EUR-Lex documents that are not provided with XML or HTML data are excluded from the selection. Number of documents in English: 19062; documents in Italian: 18164.
• Semantic search engine setup. Semantic search must allow users to find relevant legal information even if they don't use precise legal terminology. This involves using Natural Language Processing (NLP) techniques, particularly neural embedding, such as the one introduced by (Lai et al. 2023).
• RAG-based Chat system development. RAG combines retrieval-based methods with generative language models to provide accurate and contextually relevant responses to user queries. The user can read
both the generated answer and the relevant sources, i.e. the portions of regulations used to generate the answer.
• First version release (June 2024). The first version of the prototype is released to a selected group of users. This version should provide basic functionality and serve as a starting point for further improvements.
• Feedback collection and tuning. User feedback is actively collected and analysed. This feedback is used to identify areas for improvement and fine-tune both the chat system and the user interface. This iterative process continues to enhance the system's effectiveness and user satisfaction.

4. Research survey

In this section we present the results of the questionnaire distributed from December 28, 2023, to March 31, 2024, aimed at legal professionals, law researchers, public officials in the legal sector, compliance specialists, and other people interested in the use of Generative AI in the legal domain, in Italy and other European countries. The objectives of the questionnaire were to understand the needs of people using digital legal resources (EUR-Lex in particular) and their level of satisfaction; identify users' needs and desires regarding chatbot interaction; know the fears related to the use of generative AI. The questionnaire was anonymous; the languages used were Italian and English. We distributed it online on websites and with targeted e-mail activity.

The questionnaire contained 22 questions: 4 questions for demographic information (age, gender, education, profession); 9 multiple choice questions; 6 open-ended questions; 2 yes/no questions, and 1 rating question. Regarding the topic of the use of LLMs for accessing European laws, the most important questions are:
• “7) To search for legal documents, regulations and rulings, do you mainly use the EUR-Lex search engine, or do you use Google Search or something else?”
• “17) In the legal domain, could a generative AI chatbot help search and interaction?”
• “18) What kind of requests would you make to the chatbot? Write one or more example requests.”
• “19) Do you know what generative AI is and/or are you a user of generative AI tools?”
• “21) Do you have any concerns about the use of generative AI in the legal field?”

The following table (Table 1) presents the number of Submissions, the number of people that did not complete the questionnaire (Starts), and the number of people that viewed the questionnaire (Views) in Italian and English, as of March 30, 2024.

{| class="wikitable"
|+ Table 1: Questionnaire results in Italy (language: Italian), other EU countries (language: English), and total results
! LANG !! VIEWS !! STARTS !! SUBMISSIONS
|-
| Italian || 769 || 315 || 192
|-
| English || 530 || 184 || 105
|-
| Total || 1299 || 499 || 297
|}

We report here some statistics on the Italian responses: 54% of the respondents are legal experts (law researchers, jurists, lawyers, compliance specialists, etc.), while 46% are not legal experts. 66% say that they consulted the EUR-Lex repository at least once. When asked which tool they mainly use to search for legal documents and regulations, 48% answer mainly Google search, 37% mainly the EUR-Lex search engine, and 15% mainly other tools (we do not report here the answers on the other legal sources). 60.4% say that a generative AI chatbot could help search and interaction, 33.3% don’t know, and 6.2% say No. The question “Do you know what generative AI is and/or are you a user of generative AI tools?” is answered: 51% “Little”, 27% “Yes, I use them regularly”, 22% “Not at all” (remember that these percentages concern responses in Italy). Finally, 87% think generative AI must be regulated, 8% don’t know, and 5% answer No. For reasons of space, we do not report here the answers to the question “What kind of requests would you make to the chatbot?”.

In summary, these responses allow us to assess the level of knowledge of legal experts in generative artificial intelligence, to see if there are differences between legal experts and other people, to know their fears on these issues, and, above all, to collect the needs and requirements of potential users of the chatbot.

A detailed report containing the complete questionnaire, the aggregate results and a detailed analysis will be published in May 2024 on the GitHub project repository.

5. System architecture

The pipeline of the Chat-EUR-Lex prototype is divided into two main parts:
• An asynchronous batched pipeline which collects and indexes the documents from EUR-Lex into a search engine.
• A synchronous pipeline that gets the users' queries, retrieves relevant contextual information and provides a response to the users.

The asynchronous batched pipeline comprises three main components:
1. A crawler that collects the data from EUR-Lex.
2. A chunker that splits the documents into smaller segments.
3. An embedding model that transforms the segments into dense vectors to be indexed in the vector DB.

The synchronous pipeline comprises two main components:
• A retriever that transforms the query into a vector using the same embedding model used by the asynchronous pipeline and looks into the vector DB for similar contents.
• An LLM that gets both the query and the context inserted in a prompt template and produces a response to be provided to the users.

Each time the user submits a new query, the whole chat history is passed to the LLM until the maximum prompt length is reached; in that case, older chat parts are truncated.

This process involves several parameters to be selected, such as:
• Chunking techniques and size.
• Embedding models.
• K-nearest neighbor search techniques.
• Prompt templates.
• LLM and its parameters.

In summary, the RAG approach is a blend of two key components: a retrieval system and a generator. The retrieval system scans through a database of documents to fetch the most relevant ones in response to a user query. The most recent solutions for retrieval systems employed in RAGs rely on semantic search utilizing embeddings.

The generator, on the other hand, uses these retrieved documents to generate a well-informed answer. This process ensures that the system provides responses that are both informative and contextually accurate. In the current project setup (April 2024), the gpt-4 model powers the generation of responses. For the creation of embeddings, we utilize text-embedding-ada-002 (https://platform.openai.com/docs/models/embeddings).

While our dataset contains about 37000 legal acts, the need to partition these laws for a granular retrieval process raises the total count of retrievable documents to about 371000 texts (“chunks”). This extensive partitioning provides a more detailed context for the RAG system, allowing for more accurate answer generation. On the other hand, the increased number of documents naturally presents a challenge for our retrieval process.

6. Discussion and conclusions

In this project, we are trying different combinations of the mentioned parameters using both open-source and closed-source models to investigate the readiness of LLMs to build a system for legislative research.

We are performing two evaluation steps to compare different models and parameters:
1. Search engine evaluation: we are comparing different embedding models, chunking strategies and k-nearest neighbors search techniques to select the best combinations to retrieve good-quality results.
2. Response generator evaluation: having fixed the best combination for contextual information retrieval thanks to the step 1 evaluation, we will compare the quality of the generated response using different prompt templates, LLMs and LLM parameters.

For the step 1 evaluation, we are creating a gold dataset using expert annotators and using standard search engine evaluation metrics such as the Mean Reciprocal Rank. For step 2, the different settings will be proposed to experts, who will ask the same questions and attribute scores to each response. Preliminary results have shown that when the context provided to gpt-4 by the retrieval system is consistent with the question asked, the generated answer is concise, comprehensible, and accurate. This consistency significantly minimizes the problem of hallucination, wherein the model might generate false or nonsensical information.

The large number of chunks that can contribute to the generation of the answer naturally presents a challenge for our retrieval process. In this context, we are actively exploring strategies to improve the efficiency of this crucial component. One of the promising directions we are considering involves leveraging not only the semantic content of the normative sources but also the boundary information, such as metadata.
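
As a rough illustration of the retrieval step and of the Mean Reciprocal Rank metric mentioned in this paper, the following Python sketch ranks chunks by cosine similarity, optionally restricts the candidates with a metadata filter, and scores a ranking with MRR. The chunk fields (`id`, `vector`, `lang`), the toy vectors and the function names are assumptions made for this example only; they are not taken from the Chat-EUR-Lex code base, and a real system would query a vector DB rather than sort an in-memory list.

```python
import math
from typing import Any, Callable, Dict, List, Optional, Sequence, Set

# Hypothetical chunk record: the field names are illustrative assumptions,
# not the schema used by the Chat-EUR-Lex vector DB.
Chunk = Dict[str, Any]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: Sequence[float],
    chunks: List[Chunk],
    k: int = 3,
    metadata_filter: Optional[Callable[[Chunk], bool]] = None,
) -> List[str]:
    """Return the ids of the k chunks most similar to the query vector.

    When metadata_filter is given, only the chunks it accepts are ranked.
    """
    candidates = [c for c in chunks if metadata_filter is None or metadata_filter(c)]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return [c["id"] for c in ranked[:k]]


def mean_reciprocal_rank(rankings: List[List[str]], gold: List[Set[str]]) -> float:
    """Average over queries of 1/rank of the first relevant id (0 if none is retrieved)."""
    total = 0.0
    for ranked_ids, relevant in zip(rankings, gold):
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings) if rankings else 0.0


# Toy data (invented): two-dimensional vectors stand in for real embeddings.
CHUNKS: List[Chunk] = [
    {"id": "a", "vector": [1.0, 0.0], "lang": "en"},
    {"id": "b", "vector": [0.9, 0.1], "lang": "it"},
    {"id": "c", "vector": [0.0, 1.0], "lang": "en"},
]
QUERY = [0.8, 0.2]
GOLD = [{"a"}]  # the only chunk an annotator marked as relevant for QUERY

unfiltered = retrieve(QUERY, CHUNKS)
filtered = retrieve(QUERY, CHUNKS, metadata_filter=lambda c: c["lang"] == "en")
print(unfiltered, mean_reciprocal_rank([unfiltered], GOLD))
print(filtered, mean_reciprocal_rank([filtered], GOLD))
```

On this invented data, the language filter moves the relevant chunk from rank 2 to rank 1. This only illustrates the kind of effect metadata could have and says nothing about the behaviour of the actual system.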
The inclusion of metadata in our retrieval process could help our system home in on the most relevant documents, thereby optimizing the retrieval process and improving the overall performance of the chat system. On the other hand, the use of a specific embedding model built on legal data could be beneficial, as opposed to a generic embedder. Indeed, this model could provide a more nuanced understanding of the legal texts, thus enhancing the retrieval process.

The evaluation of the preliminary results is promising: on simple tasks, i.e. simple queries, the system performs on par with human experts, as attested in similar research [26].

We must highlight the need for legal experts to make qualitative assessments, as well as quantitative evaluations, both due to the complexity of the legal domain and because the evaluation can be done in different ways depending on the use case, aim and user target. It does not seem possible to create a unique dataset to evaluate the quality of the results in an automatic way for QA tasks in the legal domain. Multiple points of view may be equally valid, and a unique ‘ground truth’ may not exist, as discussed for other tasks in Perspectivist Approaches to NLP [27].

A general problem for experiments with LLMs is “non-repeatability”: the experiments are not exactly reproducible because the systems are not deterministic; moreover, in proprietary commercial systems, we do not know the LLM’s parameters and the dataset used for training; new versions and new LLMs are released quickly.

Here are some additional critical issues that we are addressing:
• text length limitations: currently, there's a limit on the length of text the LLM can handle effectively. This necessitates breaking down longer texts, which can be cumbersome;
• impact of short queries: the chatbot's accuracy and precision suffer when responding to very short or poorly defined queries. More detailed user queries lead to better results;
• single vs. multiple documents: the chatbot performs best when responding to queries that target information from a single document, rather than synthesizing information from multiple sources.

Acknowledgements

Chat-EUR-Lex project is funded within the framework of the NGI Search project under grant agreement No 101069364. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or European Commission. Neither the European Union nor the granting authority can be held responsible for them.

References

[1] J. Cui, Z. Li, Y. Yan, B. Chen, L. Yuan, Chatlaw: Open-source legal large language model with integrated external knowledge bases, arXiv preprint arXiv:2306.16092 (2023).
[2] J. Lai, W. Gan, J. Wu, Z. Qi, P.S. Yu, Large Language Models in Law: A Survey, arXiv preprint arXiv:2312.03718 (2023).
[3] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. Rosenberg, G. Mann, BloombergGPT: A Large Language Model for Finance, arXiv preprint arXiv:2303.17564v2 (2023).
[4] S. Reddy, Evaluating large language models for use in healthcare: A framework for translational value assessment, in volume 41 of Informatics in Medicine Unlocked (2023). https://doi.org/10.1016/j.imu.2023.101304
[5] A. Contaldo, F. Campara, Intelligenza artificiale e Diritto. Dai sistemi esperti “classici” ai sistemi esperti “evoluti”: tecnologia e implementazione giuridica, in: G. Taddei Elmi, A. Contaldo (Eds.), Intelligenza artificiale. Algoritmi giuridici. Ius condendum o “fantadiritto”, Pacini editore, Pisa, 2020, p. 24.
[6] Z. Sun, A short survey of viewing large language models in legal aspect, arXiv preprint arXiv:2303.09136 (2023).
[7] G. Finocchiaro, Artificial intelligence. What are the rules? Il Mulino, Bologna, 2024.
[8] P. Fragkogiannis, M. Forster, G.E. Lee, D. Zhang, Context-Aware Classification of Legal Document Pages. In: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23), Association for Computing Machinery, NY, pp. 3285-3289, 2023, doi: 10.1145/3539618.3591839.
[9] D. Trautmann, Large Language Model Prompt Chaining for Long Legal Document Classification, arXiv preprint arXiv:2308.04138 (2023).
[10] J. Savelka, K.D. Ashley, The unreasonable effectiveness of large language models in zero-shot semantic annotation of legal texts, Frontiers Artificial Intelligence, 6: 1-14, 2023, doi: 10.3389/frai.2023.1279794.
[11] M. Cherubini, F. Romano, A. Bolioli, N. De Francesco, I. Benedetto, The summarization of legal texts: an experiment with GPT-3. Rivista italiana di informatica e diritto, 5, 1: 191-204 (2023). https://doi.org/10.32091/RIID0103
[12] D. Datta, S. Soni, R. Mukherjee, S. Ghosh, MILDSum: A Novel Benchmark Dataset for Multilingual Summarization of Indian Legal Case Judgments, arXiv preprint arXiv:2310.18600v1 (2023). https://doi.org/10.48550/arXiv.2310.18600
[13] D. Liga, L. Robaldo, Fine-tuning GPT-3 for legal rule classification, in volume 51 of Computer Law & Security Review (2023), doi: 10.1016/j.clsr.2023.105864.
[14] S. Paul, A. Mandal, P. Goyal, S. Ghosh, Pre-trained language models for the legal domain: a case study on Indian law. In: Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law (2023), pp. 187-196.
[15] A. Al Zubaer, M. Granitzer, J. Mitrović, Performance analysis of large language models in the domain of legal argument mining, Frontiers Artificial Intelligence, 6 (2023), doi: 10.3389/frai.2023.1278796.
[16] J. Savelka, K.D. Ashley, M.A. Gray, H. Westermann, H. Xu, Explaining Legal Concepts with Augmented Large Language Models (GPT-4), arXiv preprint arXiv:2306.09525v2 (2023).
[17] J. Ioannidis, J. Harper, M.S. Quah, D. Hunter, Gracenote.ai: Legal Generative AI for Regulatory Compliance. In: Proceedings of the Third International Workshop on Artificial Intelligence and Intelligent Assistance for Legal Professionals in the Digital Workplace (LegalAIIA 2023) co-located with (ICAIL 2023), CEUR-WS.org, Elsevier, pp. 20-31.
[18] D. Charlotin, Large Language Models and the Future of Law, SSRN, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4548258, 2023. Accessed 4 April 2024.
[19] A. Blair-Stanek, N. Holzenberger, B. Van Durme, Can GPT-3 Perform Statutory Reasoning? In: The Nineteenth International Conference on Artificial Intelligence and Law (ICAIL 2023), ACM, NY, USA, pp. 22-31, 2023, doi: 10.1145/3594536.3595163.
[20] G. Hill, The emerging artificial intelligence (AI) and national uniform legislation, in volume 97.5 of Australian Law Journal (2023), pp. 303-306.
[21] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, T. Liu, A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, arXiv preprint arXiv:2311.05232 (2023).
[22] S. M. Tonmoy, S.M. Zaman, V. Jain, A. Rani, V. Rawte, A. Chadha, A. Das, A comprehensive survey of hallucination mitigation techniques in large language models, arXiv preprint arXiv:2401.01313 (2024).
[23] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, D. Kiela, Retrieval-augmented generation for knowledge-intensive NLP tasks, in volume 33 of Advances in Neural Information Processing Systems (2020), pp. 9459-9474.
[24] L. Xu, H. Xie, S.Z.J. Qin, X. Tao, F.L. Wang, Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment, arXiv preprint arXiv:2312.12148 (2023).
[25] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021).
[26] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. McKeown, T.B. Hashimoto, Benchmarking large language models for news summarization, in volume 12 of Transactions of the Association for Computational Linguistics (2024), pp. 39-57.
[27] F. Cabitza, A. Campagner, V. Basile, Toward a perspectivist turn in ground truthing for predictive computing, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, No. 6, 2023, pp. 6860-6868.

A. Online Resources

The GitHub project repository can be consulted at https://github.com/Aptus-AI/chat-eur-lex. The dataset has been published on https://huggingface.co/datasets/AptusAI/chat-eur-lex .