<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Adapting a Large Language Model to the Legal Domain: A Case Study in Italian</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Flavio</forename><surname>Valerio</surname></persName>
							<email>f.valerio6@studenti.uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<settlement>Bari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pierpaolo</forename><surname>Basile</surname></persName>
							<email>pierpaolo.basile@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<settlement>Bari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">AI2B srl</orgName>
								<orgName type="department" key="dep2">Spin-Off</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>Via E. Orabona</addrLine>
									<settlement>Bari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marco</forename><surname>De Gemmis</surname></persName>
							<email>marco.degemmis@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<settlement>Bari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">AI2B srl</orgName>
								<orgName type="department" key="dep2">Spin-Off</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<addrLine>Via E. Orabona</addrLine>
									<settlement>Bari</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Adapting a Large Language Model to the Legal Domain: A Case Study in Italian</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">0EC8C58CDDEC459B17E9D6F0E2E7D9AA</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:38+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Language Models</term>
					<term>Legal</term>
					<term>Artificial Intelligence</term>
					<term>Public Administration</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This work presents a methodology for adapting an open Large Language Model (LLM) to the Italian legal domain. We construct a legal document corpus from the Normattiva website and develop a custom scraper to ensure high-quality text extraction. The resulting corpus is used to adapt the Llama-3.1-8b model through continuous pre-training and Low-Rank Adaptation (LoRA). The adapted model's performance is evaluated by assessing its ability to complete sentences coherently within the new domain. Results demonstrate that the adapted model surpasses the original model across all metrics, considering various prompt lengths and different sizes of the training corpus.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Large Language Models (LLMs) have proven effective in understanding and generating text in several domains. However, the language of some domains is characterized by a specific structure or word usage. For example, the legal domain relies on precise language, nuanced interpretation of laws, and a vast body of evolving jurisprudence. Typical LLMs are trained on an extensive collection of documents from several domains, which can limit their capability to understand and generate text in a specific context, such as the legal domain. Moreover, some languages are less represented than others, and legal-domain data for a particular language is likely missing from the training collection. This is a critical issue since legal language strongly depends on the legislation of the specific country. Finally, using a closed LLM can be problematic in the public administration domain, so adapting an open LLM may be the only viable alternative.</p><p>Recent works have investigated the usage of LLMs in legal domains. In <ref type="bibr" target="#b1">[2]</ref>, the authors propose a few-shot entity relation extraction method in the legal domain based on large language models, without training the model on domain-dependent data. In <ref type="bibr" target="#b2">[3]</ref>, several LLMs are tested on a specific dataset related to numerical estimation in the legal domain. Similarly, <ref type="bibr" target="#b3">[4]</ref> evaluates ChatGPT's performance in the semantic annotation of legal texts, finding that, even in a zero-shot setting, the model provides promising results. Following the same idea, the authors in <ref type="bibr" target="#b4">[5]</ref> evaluate the performance of ChatGPT in the context of legal argument mining and underline the importance of formulating the correct prompt and its impact on overall performance. However, all previous works investigate closed LLMs and do not consider fine-tuning or adapting existing open LLMs to the legal domain.
Less recent works considered training BERT-like language models specific to the legal domain or fine-tuning BERT-like models on specific legal tasks <ref type="bibr" target="#b5">[6]</ref>. For Italian, too, a BERT model trained on Italian legal documents has been proposed <ref type="bibr" target="#b6">[7]</ref>. However, these works are outside the scope of this paper, since we focus on large language models.</p><p>To overcome these limitations, this work proposes adapting an open LLM (Meta Llama-3.1) to the Italian legal domain. To pursue this goal, we create a corpus of documents by collecting legal texts written in Italian. Then, using an adaptation strategy, we continue training the LLM on the corpus of collected documents. To evaluate the effectiveness of the proposed approach, we measure the quality and coherence of the generated text before and after the training.</p><p>The paper is structured as follows: Section 2 describes the construction process of the legal corpus, while Section 3 provides methodological details about the adaptation of the LLM to the legal domain. Section 4 describes the evaluation and discusses the results, and Section 5 closes the paper with final remarks and future research directions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Corpus creation</head><p>A suitable corpus of documents is necessary to improve the capabilities of an existing LLM in understanding and producing the language used in the Italian legal domain. Since no such corpus is publicly available, we create a new one. To this end, we develop a web crawler for the Normattiva<ref type="foot" target="#foot_0">1</ref> website. This Italian website is an essential resource for the consultation of current legislation, offering access to national laws, decrees, and regulations. In addition, it provides advanced search capabilities and multivigency consultation of acts, i.e., access to the versions in force at different times.</p><p>A crawler is software designed to browse web pages and gather information systematically. Its applications range from search engine indexing to data analysis and content updating. Designing an effective crawler requires a customized approach tailored to the site of interest. In our case, crawler development began with an in-depth study of the structure of the Normattiva site to understand how to organize the data collection process.</p><p>The Normattiva site has two types of pages relevant to our purpose: pages that serve as "containers of useful links" and pages containing the legislative texts of interest. The crawling requires three steps:</p><p>1. Collection of links: The first step is to identify a "main page" containing relevant links, such as articles of the Italian Constitution or legal acts. Once this page is identified, the links are collected.</p><p>To optimize the process, it is necessary to examine the page through inspection tools to identify the sections (e.g., divs) containing the links of interest. This approach avoids irrelevant sections of the page, thus reducing the time required for scraping and post-processing the data.
Without such a preliminary analysis, the crawler would have to take a more "raw" approach, analyzing the entire page and producing a less refined output, which would require further post-processing. 2. Text Capture: Pages containing legislative texts on Normattiva feature a sidebar on the left side that allows users to navigate between articles or legal acts via calls to a JavaScript function. This function dynamically modifies the displayed content, a feature that affects the choice of libraries and approaches taken in designing the crawler. Once again, the inspection tool is used to locate the specific div containing the relevant text. We do not download the entire page, as this would capture a great deal of irrelevant information and require post-processing, a step we prefer to avoid.</p></div>
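The targeted-div idea behind steps 1 and 2 can be sketched as follows. This is a minimal illustration using only Python's standard library (the actual crawler uses Selenium and Beautiful Soup, described later); the HTML snippet and the div class name "risultati" are purely hypothetical:

```python
from html.parser import HTMLParser

class DivLinkCollector(HTMLParser):
    """Collect hrefs that appear inside a target div, ignoring the rest of the page."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0   # nesting depth inside the target div (0 = outside)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Enter the target div, or a div nested inside it.
            if self.depth > 0 or attrs.get("class") == self.target_class:
                self.depth += 1
        elif tag == "a" and self.depth > 0 and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

# Hypothetical page: only links inside the "risultati" div are of interest.
page = """
<html><body>
  <div class="menu"><a href="/ignored">home</a></div>
  <div class="risultati">
    <a href="/atto/1">Act 1</a>
    <a href="/atto/2">Act 2</a>
  </div>
</body></html>
"""
collector = DivLinkCollector("risultati")
collector.feed(page)
print(collector.links)  # -> ['/atto/1', '/atto/2']
```

Restricting extraction to the identified div is what avoids the "raw" whole-page approach and the extra post-processing it would require.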
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Saving information:</head><p>The information extracted by the crawler is saved in the JSON Lines format.</p><p>Each line in the file contains a JSON object with the following fields: text, url, timestamp, and source.</p><p>The main libraries used to implement the crawler are Selenium<ref type="foot" target="#foot_1">2</ref> and Beautiful Soup<ref type="foot" target="#foot_2">3</ref>. Selenium is an advanced browser automation tool widely used in academia and industry to perform automated testing and manage complex web operations. Because of its versatility, Selenium supports a wide range of platforms, browsers, and programming languages, enabling the precise simulation of a real user's actions, such as selecting links, entering text, and interacting with dynamic elements through mouse clicks. An essential feature of Selenium is its ability to interact with JavaScript, a crucial aspect when handling dynamic websites. In developing the Normattiva crawler, Selenium is crucial for automating navigation and interaction with dynamic content, thus ensuring accurate and efficient capture of legislative data. For example, it allows clicking on links to move from one article to another, waiting for page elements to load properly before performing further actions. Beautiful Soup is a library for parsing and extracting data from HTML and XML documents. It is praised for its ease of use and its intuitive interface for navigating and manipulating the structure of HTML documents. Beautiful Soup makes it possible to identify and extract structured data from web pages accurately.
In our work, Beautiful Soup is used together with Selenium: Selenium handles the loading of and interaction with dynamic page elements, while Beautiful Soup parses the HTML source code to extract the relevant textual content.</p><p>The Normattiva crawler <ref type="foot" target="#foot_3">4</ref> enables the systematic and targeted collection of legislative documents from the website, providing a suitable corpus for fine-tuning and adapting an existing LLM to the legal domain. The tailored approach adopted, based on a preliminary analysis of the web page structure, ensures greater efficiency than more generic methods, reducing the workload required for cleaning and processing the collected data. The final corpus <ref type="foot" target="#foot_4">5</ref> contains 396,592 text passages extracted from the Normattiva website, for a total of about 108 million word occurrences <ref type="foot" target="#foot_5">6</ref> .</p></div>
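A minimal sketch of the JSON Lines saving step described above, assuming the four fields listed (text, url, timestamp, source); the record contents, URL, and file name are illustrative, not taken from the actual corpus:

```python
import json

# Hypothetical records mirroring the crawler's output schema.
records = [
    {
        "text": "Art. 1. La sovranita' appartiene al popolo...",
        "url": "https://www.normattiva.it/esempio",   # illustrative URL
        "timestamp": "2024-11-05T10:00:00",
        "source": "normattiva",
    },
]

# JSON Lines: one JSON object per line, UTF-8, no escaping of accents.
with open("corpus.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Reading back: parse each line independently.
with open("corpus.jsonl", encoding="utf-8") as f:
    corpus = [json.loads(line) for line in f]
```

The line-oriented layout lets very large corpora be streamed record by record during training without loading the whole file into memory.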
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Large Language Model Adaptation</head><p>LLMs have demonstrated remarkable capabilities across various natural language processing tasks and languages. However, their performance is often limited when applied to specialized domains with distinct terminology, unique stylistic features, or specific contextual knowledge not adequately covered in the training data. To bridge this gap, it is essential to adapt an LLM to new domains by exposing it to domain-specific data. A prominent approach to this adaptation is continuous pre-training on domain-specific corpora, enhanced by parameter-efficient techniques such as Low-Rank Adaptation (LoRA) <ref type="bibr" target="#b7">[8]</ref>. We have already successfully investigated this approach in adapting the BLOOM <ref type="bibr" target="#b8">[9]</ref>, LLaMA-2 <ref type="bibr" target="#b9">[10]</ref>, and LLaMA-3 <ref type="bibr" target="#b10">[11]</ref> models to the Italian language <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14]</ref>.</p><p>LoRA is designed to fine-tune pre-trained LLMs with minimal additional computational and memory overhead. The main idea behind LoRA is to introduce low-rank matrices into the architecture of the LLM during fine-tuning. These matrices capture task-specific or domain-specific adaptations while keeping most of the original model parameters frozen. This method is particularly advantageous when computational resources are limited or when there is a need to preserve the general knowledge embedded in the original LLM while introducing domain-specific knowledge. These characteristics are critical in our approach, since we can introduce new knowledge related to the new domain (legal) at a low computational cost.</p><p>Our methodology is based on continuous pre-training.
This process involves sequentially fine-tuning the LLM on a corpus of documents relevant to the target domain. The corpus is carefully curated to reflect the domain's linguistic patterns, terminologies, and contextual nuances. The process can be iterative, allowing the model to gradually adapt to the new domain while retaining its ability to perform well on general tasks. It also lets us update the model with new knowledge when, for example, new laws are added or removed. Moreover, LoRA is a Parameter-Efficient Fine-Tuning (PEFT) <ref type="bibr" target="#b14">[15]</ref> technique and therefore updates only a subset of the original parameters during training. This allows releasing only the weights modified during training, reducing the space needed to store the model. This approach makes it possible to adapt the original model to several domains by performing a separate LoRA training step for each and producing one adapter per domain. Each adapter stores only the weights modified in the corresponding training step; adapters are interchangeable and can be loaded on top of the original model.</p><p>In this work, we start from the LLaMA-3.1 8-billion-parameter model, selected so that a state-of-the-art model can be adapted using reasonable computing resources. In detail, the whole training process is performed on a single GPU <ref type="foot" target="#foot_6">7</ref> . To reduce the computational cost, we use the unsloth<ref type="foot" target="#foot_7">8</ref> library. We fine-tune the model using LoRA with rank = 16 and alpha = 32, applied to all linear layers, with a maximum sequence length of 2,048 tokens. The corpus built as described in Section 2 feeds the training with text samples, using a batch size of 16 and a gradient accumulation of 2 steps. The model is trained for one epoch due to the large number of examples. The output of the training process is an adapter that can be loaded on top of the LLaMA-3.1 model.
Finally, we evaluate the quality of the training process according to the experimental setting described in Section 4.</p><p>Results show that the fine-tuned models always outperform the base model on all metrics. We observe a slight decrease in BLEU when the training corpus size increases; however, the differences between Llama3.1-NA-100k and Llama3.1-NA are minimal, and perplexity shows the same behaviour. This is an interesting outcome, since it means we can adapt the model using a moderate number of documents. Moreover, we observe that the increase in performance is more evident when the prompt length is equal to 20. This behaviour is expected, since the text generated with a prompt length of 40 shares more tokens with the original text. Nevertheless, the results prove that the tuned models can improve generation performance even with a short prompt.</p><p>It is essential to highlight that we only measure the coherence of the generated text against the reference text in the test set. We do not check whether the generated text is correct and contains accurate information; this is out of the scope of our work. To use the model in a real scenario, it would be necessary to instruct the model on specific tasks through instruction tuning. The scope of our work is to provide a language model that generates text more coherent with a new domain. All the results prove the effectiveness of our methodology.</p><p>Table <ref type="table" target="#tab_1">2</ref> reports the results of each model on a set of standard benchmarks used to evaluate the ability of LLMs to solve several tasks. We consider the same set of benchmarks adopted by the Open Italian LLM leaderboard 11 . The involved benchmarks are:</p><p>• HellaSWAG is a dataset for studying grounded commonsense inference. It consists of 70k multiple-choice questions about grounded situations. Each question has four answer choices.
• The AI2 Reasoning Challenge (ARC) dataset is a multiple-choice question-answering dataset containing questions from science exams from grade 3 to grade 9. • MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pre-training by evaluating models exclusively in zero-shot and few-shot settings. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more.</p><p>The Italian benchmarks rely on machine-translated versions of the datasets above. The column Δ reports the difference in performance with respect to the original model Llama3.1. We observe a performance decrease, as expected, since we fine-tuned the model on new data from a different domain. However, if we consider the results in both Table <ref type="table" target="#tab_0">1</ref> and 2, we can conclude that the best choice is to fine-tune the model on 100k documents, since the generation performance on the test set is good and the performance drop with respect to the original model is about 8.5%. We plan to test the model on specific tasks related to the legal domain to better understand whether the fine-tuning can improve both the generation abilities and the model's ability to solve domain-specific problems. </p></div>
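To make the parameter-efficiency argument behind LoRA (Section 3) concrete, the following sketch counts the trainable parameters that a rank-16 adapter adds to a single linear layer. The rank is the value used in the paper; the hidden size of 4096 is an assumption for a Llama-3-class model, not a figure stated in the text:

```python
# LoRA replaces the update of a frozen weight matrix W (d_out x d_in)
# with two small trainable matrices: B (d_out x r) and A (r x d_in),
# so the adapted layer computes W x + B (A x).
def lora_params(d_in, d_out, r):
    """Trainable parameters added by a rank-r LoRA adapter to one linear layer."""
    return r * (d_in + d_out)

d = 4096          # assumed hidden size of a Llama-3-class model
r = 16            # LoRA rank used in the paper
full = d * d      # parameters of the frozen square weight matrix
lora = lora_params(d, d, r)

print(full)         # 16777216
print(lora)         # 131072
print(lora / full)  # 0.0078125 -> under 1% of the layer's parameters
```

This ratio is why only the adapter weights need to be stored and released, and why one base model can carry several interchangeable domain adapters.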
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions and Future Work</head><p>This work proposes a methodology for adapting an open LLM to the Italian legal domain. To achieve this goal, we build a corpus of legal documents extracted from the Normattiva website, creating an ad hoc scraper to ensure high-quality extracted text. The corpus is then exploited to adapt the Llama-3.1-8b model using continuous pre-training and LoRA. We also investigate different training corpus sizes. To evaluate effectiveness, we measure the adapted models' ability to complete sentences coherently in the new domain. Results prove that the adapted models outperform the original model on all metrics, considering different prompt lengths. In future work, we plan to extend the analysis to other open LLMs and to test the fine-tuned models on specific legal tasks.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Evaluation results considering different prompts length and training corpus size.</figDesc><table><row><cell>1-NA-100k Llama3.1-NA</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Performance of each model according to the Open Italian LLM leaderboard.</figDesc><table><row><cell>Model</cell><cell>hellaswag_it</cell><cell>arc_it</cell><cell>mmlu_it</cell><cell>avg</cell><cell>Δ (%)</cell></row><row><cell>Llama3.1</cell><cell>0.6256</cell><cell>0.4559</cell><cell>0.5593</cell><cell>0.5469</cell><cell>-</cell></row><row><cell>Llama3.1-NA-100k</cell><cell>0.5919</cell><cell>0.4166</cell><cell>0.4924</cell><cell>0.5003</cell><cell>8.53</cell></row><row><cell>Llama3.1-NA</cell><cell>0.5505</cell><cell>0.3807</cell><cell>0.4549</cell><cell>0.4620</cell><cell>15.52</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://www.normattiva.it/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://www.selenium.dev/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://pypi.org/project/beautifulsoup4/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">The crawler is available on GitHub: https://github.com/FValerio96/NormattivaCrawling/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">The corpus is available on Hugging Face: https://huggingface.co/datasets/swap-uniba/normattiva-dump.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">Words are counted considering sequences of alphanumeric characters. The exact number of tokens depends on the specific LLM tokenizer.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">NVIDIA RTX A6000 with 48GB of VRAM</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">https://unsloth.ai/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8">The model adapter is available on Hugging Face: https://huggingface.co/swap-uniba/llama3-it-pa-100k-adapter.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_9">The model adapter is available on Hugging Face: https://huggingface.co/swap-uniba/llama3-it-pa-300k-adapter.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We acknowledge the support of the PNRR project FAIR -Future AI Research (PE00000013), Spoke 6 -Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Evaluation</head><p>In this section, a comparative evaluation is conducted between the Llama-3.1 model and its fine-tuned version Llama-3.1-NA. The experiment examines the models' generative capabilities using a partial-prompt approach. Sentences from the test dataset are partially used as input (prompts) for the models, which then generate the full text from this initial portion. The results are stored in JSON files containing three distinct fields:</p><p>1. text: the original full sentence; 2. prompt: the initial portion of the text provided to the model; 3. generated: the complete output generated by the model from the prompt.</p><p>We build two fine-tuned models by training the model on two different portions of the corpus. The model Llama-3.1-NA-100k <ref type="foot" target="#foot_8">9</ref> is trained on 100,000 text passages randomly selected from the corpus, while the model Llama-3.1-NA <ref type="foot" target="#foot_9">10</ref> exploits the whole corpus.</p><p>We randomly select 1,000 further sentences for the evaluation. Each test sentence is tokenized using the model tokenizer, and we retain only the first k tokens of each sentence as the prompt. A sentence is removed if it exceeds the maximum input length of 2,048 tokens.</p><p>Finally, all models are used to complete the prompts. We refer to the base model using the label Llama3.1.</p><p>For the evaluation, the generated text is compared against the original text using four metrics:</p><p>1. BLEU (Bilingual Evaluation Understudy) is a metric that quantifies the similarity between the text generated by a model and a reference text, using the geometric mean of the precisions of the n-grams shared between the two texts. 2. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics that assess the quality of generated texts, particularly summaries, by comparing them with reference texts.
Common variants include ROUGE-N, which measures the correspondence of n-grams; ROUGE-L, which considers the longest common sub-sequences; and ROUGE-W, which assigns higher weight to consecutive matches. 3. BERTScore relies on pre-trained language models to assess the semantic similarity between the generated and reference texts, going beyond mere surface word matching. 4. Perplexity is one of the most common metrics for evaluating language models. It is defined as the exponentiated average negative log-likelihood of a sequence, in our case the sequence of generated tokens. Perplexity measures how well the model predicts the tokens of a corpus, and a low perplexity indicates a good model. The final perplexity is obtained by averaging the perplexity of each generated text in the test set; tokens occurring in the prompt are not considered in the computation.</p><p>The metrics are calculated for each model: Llama3.1, Llama3.1-NA-100k, and Llama3.1-NA. The results of the evaluation are reported in Table <ref type="table">1</ref>.</p></div>			</div>
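As a minimal illustration of the perplexity computation described in point 4, the following sketch computes the exponentiated average negative log-likelihood over generated tokens only, excluding prompt tokens as the text specifies; all log-probability values are invented for the example:

```python
import math

def perplexity(token_logprobs):
    """Exponentiated average negative log-likelihood of a token sequence."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probabilities for one generated text.
prompt_logprobs = [-0.2, -0.5]           # prompt tokens: excluded from the score
generated_logprobs = [-1.0, -2.0, -3.0]  # only generated tokens are scored

ppl = perplexity(generated_logprobs)     # exp(2.0), about 7.389

# Corpus-level score: average the per-text perplexities over the test set.
test_set = [[-1.0, -1.0], [-2.0, -2.0]]  # invented log-probs for two texts
avg_ppl = sum(perplexity(lp) for lp in test_set) / len(test_set)
```

A lower value means the model assigns higher probability to the reference continuation, which is why perplexity complements the overlap-based metrics above.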
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Preface to the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI)</title>
		<author>
			<persName><forename type="first">G</forename><surname>Bonetta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Hromei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Siciliani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Stranisci</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)</title>
				<meeting>the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A few-shot entity relation extraction method in the legal domain based on large language models</title>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yi</surname></persName>
		</author>
		<idno type="DOI">10.1145/3675417.3675513</idno>
		<ptr target="https://doi.org/10.1145/3675417.3675513" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Guangdong-Hong Kong-Macao Greater Bay Area International Conference on Digital Economy and Artificial Intelligence, DEAI &apos;24</title>
				<meeting>the 2024 Guangdong-Hong Kong-Macao Greater Bay Area International Conference on Digital Economy and Artificial Intelligence, DEAI &apos;24<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="580" to="586" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Optimizing numerical estimation and operational efficiency in the legal domain through large language models</title>
		<author>
			<persName><forename type="first">J.-H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Pacces</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2407.19041" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Unlocking practical applications in legal domain: Evaluation of GPT for zero-shot semantic annotation of legal texts</title>
		<author>
			<persName><forename type="first">J</forename><surname>Savelka</surname></persName>
		</author>
		<idno type="DOI">10.1145/3594536.3595161</idno>
		<ptr target="https://doi.org/10.1145/3594536.3595161" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law, ICAIL &apos;23</title>
				<meeting>the Nineteenth International Conference on Artificial Intelligence and Law, ICAIL &apos;23<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="447" to="451" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Performance analysis of large language models in the domain of legal argument mining</title>
		<author>
			<persName><forename type="first">A</forename><surname>Al Zubaer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Granitzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mitrović</surname></persName>
		</author>
		<idno type="DOI">10.3389/frai.2023.1278796</idno>
		<ptr target="https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2023.1278796" />
	</analytic>
	<monogr>
		<title level="j">Frontiers in Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Legal-bert: The muppets straight out of law school</title>
		<author>
			<persName><forename type="first">I</forename><surname>Chalkidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fergadiotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Malakasiotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Aletras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Androutsopoulos</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.02559</idno>
		<ptr target="https://arxiv.org/abs/2010.02559" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Italian-legal-bert: A pre-trained transformer language model for italian law</title>
		<author>
			<persName><forename type="first">D</forename><surname>Licari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Comandè</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">EKAW (Companion)</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">J</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wallis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Allen-Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2106.09685</idno>
		<title level="m">Lora: Low-rank adaptation of large language models</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Le Scao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Akiki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Pavlick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ilić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hesslow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Castagné</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Luccioni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Yvon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gallé</surname></persName>
		</author>
		<idno>hal-03850124f</idno>
		<title level="m">Bloom: A 176b-parameter open-access multilingual language model</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Albert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Almahairi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Babaei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bashlykov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bhargava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhosale</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.09288</idno>
		<title level="m">Llama 2: Open foundation and fine-tuned chat models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">The llama 3 herd of models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Dubey</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2407.21783</idno>
		<ptr target="https://arxiv.org/abs/2407.21783" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Musacchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Polignano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Siciliani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Fiameni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Semeraro</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.09993</idno>
		<title level="m">Llamantino: Llama 2 models for effective text generation in italian language</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Polignano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Semeraro</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2405.07101</idno>
		<title level="m">Advanced natural-based interaction for the italian language: Llamantino-3-anita</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Adapting bloom to a new language: A case study for the italian</title>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Siciliani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Musacchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Polignano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Semeraro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Italian Journal of Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E.-P</forename><surname>Lim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">K.-W</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Poria</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2304.01933</idno>
		<title level="m">Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
