<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Summarization using Instruction-tuned Large Language Models for Food Safety Regulations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guido Rocchietti</string-name>
          <email>guido.rocchietti@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cosimo Rulli</string-name>
          <email>cosimo.rulli@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Korbinian Randl</string-name>
          <email>korbinian.randl@dsv.su.se</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Ioana Muntean</string-name>
          <email>cristina.muntean@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Franco Maria Nardini</string-name>
          <email>francomaria.nardini@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raffaele</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jakub Janostik</string-name>
          <email>jakub.janostik@digicomply.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Agroknow</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Pisa</institution>
          ,
          <addr-line>Largo B. Pontecorvo, 3 56127 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Computer and Systems Sciences Stockholm University Postbox 7003</institution>
          ,
          <addr-line>SE-164 07 Kista</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Digicomply</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>IIR 24: Italian Information Retrieval Workshop</institution>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>ISTI-CNR</institution>
          ,
          <addr-line>Via G. Moruzzi 1, 56124 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>We design and implement a summarization pipeline for regulatory documents, focusing on two main objectives: creating two silver standard datasets using instruction-tuned large language models (LLMs) and finetuning smaller LLMs to perform summarization of regulatory text. In the first task, we employ state-of-the-art models, Cohere C4AI Command-R-4bit and Llama-3-8B, to generate summaries of regulatory documents. These generated summaries serve as ground-truth data for the second task, where we finetune three general-purpose LLMs to specialize in high-quality summary generation for specific documents while reducing the computational requirements. Specifically, we finetune two Google Flan-T5 models using datasets generated by Llama-3-8B and Cohere C4AI, and we create a quantized (4-bit) version of Google Gemma 2B based on summaries from Cohere C4AI. Additionally, we initiated a pilot activity involving legal experts from SGS-Digicomply to validate the effectiveness of our summarization pipeline.</p>
      </abstract>
      <kwd-group>
        <kwd>Summarization</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Finetuning</kwd>
        <kwd>Food Safety Regulations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>The legal industry is characterized by an overwhelming influx of textual data, encompassing
case law, statutes, regulations, legal opinions, and contracts. Navigating these vast amounts of
information can be laborious and time-consuming for legal professionals. Hence, there arises
the need for efficient and accurate summarization tools to improve productivity and facilitate
better decision-making.</p>
      <p>
        Recent advancements in Natural Language Processing (NLP), and particularly Large Language
Models (LLMs), have shown significant promise in automating text summarization tasks. The
introduction of the transformer architecture by Vaswani et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and models derived from it
have improved the quality of generated summaries [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Yet, summarizing legal texts presents
unique challenges compared to other domains. On the one hand, legal texts exhibit a high
level of syntactic and semantic complexity combined with a highly domain-specific vocabulary.
On the other hand, legal text summarization demands a high level of accuracy and comprehension,
as even minor errors in summarization can lead to significant misinterpretations.
      </p>
      <p>Instruction-tuned Large Language Models (ILLMs) such as ChatGPT (https://openai.com/chatgpt/) or Perplexity (https://www.perplexity.ai/) have
shown remarkable capabilities in various NLP tasks, including text summarization. However,
these models are often too large and computationally expensive for practical deployment. There is
a pressing need to develop methods that reduce model size without compromising performance,
especially for domain-specific applications like legal regulation summarization.</p>
      <p>The present research is conducted within the framework of the Extreme Food Risk Analytics (EFRA)
European project. The project’s main goal is to develop an AI-driven approach to help
promote food risk prevention. In particular, we address the challenges of summarizing regulatory
documents, offering insights that can be applied to other domain-specific applications. The
introduction of food safety regulations is a complex procedure involving several steps:
public authorities and regulators require an
integrated decision framework that allows an automatic evaluation of both the regulatory aspects
and the food- and risk-related ones. In this framework, our partner SGS-Digicomply (https://www.digicomply.com) plays a
crucial role. They are a company specialized in “regulatory compliance and risk prediction with
modern technology” with the leading software in the food safety market. For this research,
they provide us with their extensive set of regulatory data.</p>
      <p>In this paper, we develop and evaluate a method for summarizing regulatory food
safety-related documents using instruction-tuned LLMs. Since little annotated data is
available in this regard, we first create a dataset consisting of document-summary pairs. The
dataset captures the complexities and specificities of regulatory text summarization. To this
end, we employ powerful, yet expensive, ILLMs to generate silver standard summaries of
our collection of documents. This weak supervision method allows us to increase the amount
of data used for creating a finetuned summarization model, our second contribution. We then
employ this dataset as a training set for smaller LLMs that are finetuned on the previously
generated output of their larger counterparts. We aim to transfer the reliable knowledge
of billion-parameter LLMs into smaller models, drastically reducing the summarization cost at the price
of negligible degradation in the generated summaries. Finally, we provide a comprehensive
evaluation of our models using a dataset of regulatory documents provided by SGS-Digicomply.
Our results demonstrate the effectiveness of our approach in generating accurate and concise
summaries.</p>
      <p>The rest of the paper is organized as follows. In Section 2, we present the current state
of the art available in the literature. In Section 3, we explain our research methods and the
experimental setup. Finally, in Section 4, we present and comment on the results, followed by
the conclusions in Section 5.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>Text summarization has been a significant area of research within natural language processing
(NLP), with recent advancements driven by large language models (LLMs). This section reviews
the most relevant contributions in the field.</p>
      <p>
        Large language models have shown remarkable capabilities in generating coherent and
contextually relevant summaries. Zhang et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] explored the use of transformer-based
architectures for summarization tasks, highlighting the superior performance of LLMs in
handling long document contexts. Their work emphasizes the importance of model size and
finetuning in achieving high-quality summaries, which aligns with our use of the Llama-3-8B and
Cohere Command-R-4bit models for initial summary generation. Many other applications of
ILLMs and finetuned models can be found in the literature. For instance, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] shows that ILLMs
have excellent rewriting capabilities in the context of query rewriting. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposes a new
framework to adapt LLMs to different domains by injecting legal information during a continual
training stage. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] introduces Legal Electra, a language model specialized in the legal domain.
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] explore new techniques to summarize documents in a low-resource setting. The
first uses models such as BART [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and GPT-2 [11] to summarize long documents, while the
second investigates how to adapt models to the domain while keeping resource usage low.
      </p>
      <p>The concept of instruction-tuning, where models are finetuned with specific instructions to
perform a task, has proven effective in various NLP applications. Wei et al. [12] demonstrated
that instruction-tuning significantly enhances the performance of LLMs across multiple tasks,
including summarization. [13] offer a good survey of the main techniques in the Natural
Language Processing field, including summarization. Regarding model compression, [14]
discussed the effectiveness of quantization in reducing model size and improving inference
speed, which is critical for deploying models in resource-constrained environments.</p>
      <p>The application of LLMs to regulatory text summarization poses unique challenges due to the
complexity and specificity of regulatory documents. Our collaboration with SGS-Digicomply
provides a practical setting for evaluating our summarization pipeline. By involving legal
experts in the pilot phase, we ensure the generated summaries are concise and comply with
regulatory standards and legal requirements. This practical application highlights the
real-world relevance and effectiveness of our proposed methods. In summary, our work builds on
the advancements in instruction-tuned LLMs, finetuning, and model compression to develop
an energy-efficient summarization pipeline tailored to regulatory texts. This integration of
state-of-the-art techniques enhances the summarization quality and addresses the practical
constraints of deploying such models in resource-limited environments.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Experimental Setup</title>
      <p>This section presents the methodology used to conduct the current research. As indicated
in Section 1, our first objective is to generate a dataset of regulatory document summaries
exploiting the capabilities of ILLMs. Subsequently, we finetune several smaller models to learn
how to summarize regulatory documents, with the purpose of distilling the summarization
capabilities of ILLMs into more resource-efficient architectures.</p>
      <p>Data Collection. The primary dataset for this study consists of regulatory documents provided
by one of our industry partners, SGS-Digicomply. The dataset (SGS Dataset) they created for us,
which cannot be made public for copyright reasons, is a large collection of HTML regulatory
documents. It comprises items collected from websites identified by experts as pertinent to the
food industry. For each selected website, a strategy was devised to identify the most relevant
documents, which were then scraped using a proprietary framework built on top of the Scrapy
Python library (https://scrapy.org/). The source documents come in various formats, including HTML, PDF, and
Docx files. Each document undergoes processing and conversion into HTML and JSON formats
suitable for machine learning applications. The original language of each document is detected,
and non-English documents are translated into English.</p>
      <p>The SGS-Digicomply dataset is intended to serve as a comprehensive collection of documents
relevant to global markets, detailing the regulatory landscape. It includes government
publications, news articles, and scientific papers on legislative changes and food safety issues. For
this research, we focus exclusively on the subset of data related to food regulatory frameworks.
The version of the dataset used for this research comprises a total of 14,307 documents in 28
different languages. Most of these documents are in Italian, totaling 8,191, while English is the
second most represented language, with 4,034 documents. As stated before, each document in
a language different from English has a corresponding English version, which we use for
our experiments. All of these documents are provided with a summary: most have a
“scraped summary,” while 44 have a manual summary written by human experts. This set of
summaries constitutes part of our test set, and we use it to evaluate our summaries. Finally,
before the finetuning phase can be performed, two different datasets must be created.</p>
      <p>Data Preprocessing. We apply a simple pre-processing step to remove non-textual elements
(metadata, footnotes, and references), as shown in Figure 1. We employ the BeautifulSoup
(https://www.crummy.com/software/BeautifulSoup/) Python library to strip the HTML markup and keep only the textual content.</p>
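<p>The paper performs this step with BeautifulSoup; as a dependency-free illustration of the same cleaning idea, a minimal sketch using only the Python standard library (the set of tags treated as noise is our assumption):</p>

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of non-textual elements."""
    SKIP = {"script", "style"}  # illustrative noise tags, not the paper's exact list

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)

def clean_html(raw_html: str) -> str:
    """Strip markup and return whitespace-normalized plain text."""
    parser = TextExtractor()
    parser.feed(raw_html)
    return " ".join(" ".join(parser.parts).split())
```

In practice, BeautifulSoup's `get_text()` achieves the same result with less code.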
      <p>To deal with the GPU memory limit, we cannot feed the entire textual input to the ILLMs,
as this would cause an out-of-memory GPU error. For this purpose, we create two datasets
with different configurations of the same HTML content:
• The first dataset (SGS-Cut) is created using the first 40k characters of each cleaned document
while discarding the rest.
• The second dataset (SGS-Split) is created by splitting the cleaned HTML documents into
chunks of 30k characters each. In this way, we produce multiple training samples for each
text, notably increasing the total number of samples.</p>
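<p>The two configurations above can be sketched as follows (a minimal illustration; the character limits are those stated in the text, while the exact chunk-boundary handling is our assumption):</p>

```python
def make_sgs_cut(text: str, limit: int = 40_000) -> str:
    """SGS-Cut: keep only the first `limit` characters of a cleaned document."""
    return text[:limit]

def make_sgs_split(text: str, chunk_size: int = 30_000) -> list[str]:
    """SGS-Split: break a cleaned document into fixed-size character chunks,
    producing one training sample per chunk."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```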
      <p>These two datasets are then used as input for the selected ILLMs to generate a new dataset of
summaries that will be used for the finetuning phase.</p>
      <sec id="sec-4-1">
        <title>Summaries Generation</title>
        <p>We use the ILLMs to generate the summaries for each dataset; in
particular, we select Llama-3-8B-Instruct and CohereForAI/c4ai-command-r-v01-4bit, which
were among the top-performing open-source models on the HuggingFace Open LLM Leaderboard
(https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) at
the time of performing the present research. Llama is used on the SGS-Split dataset to generate
a summary for each chunk of a document, and Cohere on SGS-Cut to generate a
single summary per document. We provide the ILLMs with a prompt asking
them to summarize the regulatory documents, i.e., “I want you to summarize the following legal
document”, followed by the document itself. This results in the creation of two different datasets:
the Llama dataset, composed of a training set of 15,101 entries and validation and test sets of
1,888 entries each, and the Cohere dataset, consisting of a training set of 7,485
entries and validation and test sets of 935 entries each. Both datasets are then used to perform
the finetuning phase.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Model Finetuning and Evaluation</title>
        <p>Model Finetuning. We finetune several models from the HuggingFace repository. Models
can belong to different architectures, and each architecture requires a specific input format.
We rely on encoder-decoder and decoder-only transformers. The former takes clean text as
input and produces generated text as output. The latter creates a continuation of the input
text token by token; hence, we employ a separation token that indicates the end of the input,
marking the start of the summary. At training time, we provide the model with the properly
formatted input and, as a target, the summary of the corresponding document or chunk.</p>
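<p>As a sketch, the two input formats described above might look like this (the separator token shown is illustrative; the paper does not specify the exact token used):</p>

```python
SEP_TOKEN = "[SUMMARY]"  # illustrative separator, not the one used in the paper

def format_encoder_decoder(document: str, summary: str) -> dict:
    # Encoder-decoder models (e.g. Flan-T5) take the clean text as input
    # and are trained against the summary as the target sequence.
    return {"input": document, "target": summary}

def format_decoder_only(document: str, summary: str) -> str:
    # Decoder-only models continue the input token by token, so a separator
    # marks where the document ends and the summary begins.
    return f"{document}{SEP_TOKEN}{summary}"
```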
        <p>With our generated datasets in hand, the next step is to finetune several models to see
if we can replicate or even improve upon the performance of the ILLMs, but with lower
computational requirements. To pursue this goal, we experiment with three distinct finetuning
paths. First, we use the summaries generated by Llama-3-8B-Instruct to finetune
google/flan-t5-large, training the model to understand and replicate the style and precision
of Llama-3-8B-Instruct’s summaries.</p>
        <p>Similarly, we finetune another instance of google/flan-t5-large using the summaries produced
by CohereForAI/c4ai-command-r-v01-4bit. This allows us to compare the impact of different
summary sources on the same base model. Lastly, we finetune a 4-bit quantized version of
google/gemma-2b using the CohereForAI-generated summaries. The quantization significantly
reduces the model size and computational load, making it more efficient while aiming for
high-quality output. The finetuning process was conducted on an NVIDIA V-100 80GB GPU to
handle the large models and extensive datasets effectively.</p>
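<p>To illustrate why 4-bit quantization shrinks a model, here is a minimal symmetric round-to-nearest scheme over a weight vector. This is a generic sketch of the idea, not the specific quantization method used for Gemma:</p>

```python
def quantize_4bit(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric round-to-nearest quantization to the signed int4 range -7..7.
    Each weight is stored as a 4-bit integer plus one shared float scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # guard against all-zero input
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_4bit(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the quantized representation."""
    return [v * scale for v in q]
```

Each weight goes from 32 bits to 4 bits (plus a shared scale), roughly an 8x reduction in storage; the reconstruction error per weight is bounded by half the scale step.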
        <p>Newly generated summaries go through a final post-processing phase that eliminates the noise
and errors the models might produce. For instance, in some cases the generative models
reach the maximum number of tokens to generate and produce sentences that do not conclude. In
those cases, we simply eliminate the last generated sentence.</p>
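<p>This cleanup heuristic can be sketched as follows (a minimal illustration: sentence completeness is judged by final punctuation, which is our simplifying assumption):</p>

```python
def drop_unfinished_tail(summary: str) -> str:
    """If generation stopped mid-sentence (no final punctuation),
    drop the trailing fragment and keep only complete sentences."""
    text = summary.strip()
    if text and text[-1] in ".!?":
        return text  # already ends with a complete sentence
    # Cut right after the last sentence-ending punctuation mark, if any.
    last = max(text.rfind(c) for c in ".!?")
    return text[:last + 1] if last != -1 else text
```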
        <p>Evaluation Metrics. Summarization is not an easy task to evaluate. The most used metrics,
such as ROUGE, rely on lexical overlap to establish the similarity between the documents and
the summaries, which poorly estimates the semantic overlap between the two. To address
this problem, we incorporate neural evaluation metrics, such as BERTScore and the newly
released LongDocFactScore metric [15]. These metrics overcome the limits of the lexical-based
approach by assessing the factual accuracy and consistency of the summaries, ensuring
that the finetuned models not only generate concise summaries but also preserve the integrity
and essential facts of the original legal content.</p>
        <p>We list the metrics employed in our evaluation.</p>
        <p>• ROUGE-1 (R1) [16] measures the overlap of unigrams (single words) between the
summaries generated by our models and the reference summaries from the handmade dataset:</p>
        <p>ROUGE-1 (precision) = (number of overlapping words) / (total words in generated summary) (1)</p>
        <p>ROUGE-1 (recall) = (number of overlapping words) / (total words in reference summary) (2)</p>
        <p>• ROUGE-L (RL) [16] evaluates the longest common subsequence (LCS), i.e., the longest
sequence of words (not necessarily contiguous) present in both the generated summary
and the reference:</p>
        <p>ROUGE-L (precision) = (number of words in LCS) / (total words in generated summary) (3)</p>
        <p>ROUGE-L (recall) = (number of words in LCS) / (total words in reference summary) (4)</p>
        <p>• BERTScore [17] uses a model based on BERT to compare the similarity between pairs of
texts. It creates embeddings for both the automatically generated summary (x̂) and
the reference summary (x), then evaluates the similarity between these embeddings:</p>
        <p>BERTScore (precision) = (1/|x̂|) Σ_{x̂_j ∈ x̂} max_{x_i ∈ x} x_i⊤x̂_j,  BERTScore (recall) = (1/|x|) Σ_{x_i ∈ x} max_{x̂_j ∈ x̂} x_i⊤x̂_j (5)</p>
        <p>• LongDocFactScore (LDFS) [15]: Given the importance of factual accuracy in regulatory
documents, we incorporate LongDocFactScore. This recently developed metric assesses
both factual accuracy and consistency, ensuring that the summaries retain the key facts
and logical flow of the originals.</p>
        <p>For R1, RL, and BERTScore, we also compute the F1-score, as shown in Equation 6:</p>
        <p>F1 = 2 · (precision · recall) / (precision + recall) (6)</p>
        <p>The metrics we use for evaluation are the F1-measure for ROUGE-1 (F1@R1), ROUGE-L
(F1@RL), and BERTScore (F1@BS), plus the newly introduced LongDocFACTScore (LDFS).
Although ROUGE in its original formulation only represents recall, we argue that using F1 gives
a more comprehensive picture.</p>
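<p>To make the ROUGE-1 precision/recall/F1 combination concrete, here is a minimal pure-Python sketch (whitespace tokenization is a simplifying assumption; in practice, standard ROUGE implementations are used):</p>

```python
from collections import Counter

def rouge1_f1(generated: str, reference: str) -> float:
    """ROUGE-1 F1: overlap is the clipped unigram intersection (Equations 1-2),
    combined into an F1 score as in Equation 6."""
    gen, ref = generated.lower().split(), reference.lower().split()
    # Counter intersection clips each word's count to its minimum in both texts.
    overlap = sum((Counter(gen) & Counter(ref)).values())
    if not gen or not ref or overlap == 0:
        return 0.0
    precision = overlap / len(gen)   # Equation (1)
    recall = overlap / len(ref)      # Equation (2)
    return 2 * precision * recall / (precision + recall)  # Equation (6)
```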
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Results</title>
      <p>In this section, we present the results of the experiments after the finetuning phase. We compare
five different LLMs, both instruction-tuned and finetuned. Llama-3-8B and Cohere 4-bit are
the instruction-tuned ones, which we also use to generate the training datasets for the finetuning phase
(see Sec. 3). On the other hand, we use two finetuned versions of Flan-T5, one finetuned on
Cohere-generated data and one on Llama-generated data. Furthermore, we evaluate the performance of a 4-bit
quantized version of Gemma-2B finetuned on the data generated by Cohere.</p>
      <p>In Figures 2a and 2b, we report the average length of the available summaries for the two
datasets we use to evaluate. Figure 2a reports the boxplot indicating the length distribution of
the summaries considering the subset of 44 manual summaries. On the other hand, Figure 2b
reports the length distribution of the summaries in the test set provided by SGS-Digicomply.</p>
      <p>As we can observe, in both cases, the summaries generated by the finetuned models and the ILLMs are,
on average, longer than the reference ones, indicated by the English Summary label.</p>
      <p>In Table 1, we report the results of the evaluation phase when comparing the generated
summaries with the manual ones provided by SGS-Digicomply. In this case, we can observe
that the best-performing model is Flan T5, finetuned with the dataset generated using Cohere.</p>
      <p>In Table 2, we report the results of the chosen metrics calculated when comparing all of the
summaries, including the 44 manual ones provided with the dataset, with the content of the
HTML documents.</p>
      <p>[Figure 2: Distribution of Summary Lengths for (a) the subset of 44 manual summaries and (b) the SGS-Digicomply test set.]</p>
      <sec id="sec-5-1">
        <title>Silver Standard Fine Tuned</title>
      </sec>
      <sec id="sec-5-2">
        <title>Llama-3-8B Cohere 4bit Flan T5 Llama Flan T5 Cohere Gemma 2B 4bit</title>
        <p>As we can observe, the summaries generated with Llama-3-8B obtain the highest scores across
all the metrics, with a large gap over the manual ones. All generated summaries score higher
than the manually written ones. This is probably due to the length of the manual summaries,
which strongly influences metrics based on lexical overlap, and will be assessed in the next
iterations of this research.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Silver Standard Fine Tuned</title>
      </sec>
      <sec id="sec-5-4">
        <title>Llama-3-8B Cohere 4bit Flan T5 Llama Flan T5 Cohere Gemma 2B 4bit</title>
        <p>Finally, Tables 3 and 4 show the results obtained when evaluating the summaries on the
external test set provided by SGS-Digicomply. In the first table, we can see the results of the
generated summaries when compared with the content of the HTML document. In
line with the evaluation shown in Table 2, we observe that the highest scores are achieved by
the summaries generated using Llama-3-8B-Instruct, which appears to best capture
the content of the original HTML. The only exception is the LongDocFACTScore metric,
which indicates that Flan T5 trained on the Llama summaries best preserves
the facts of the original HTML.</p>
        <p>When we consider the scraped summaries contained in the SGS-Digicomply test set, we can
see that Flan T5 finetuned on the Cohere dataset achieves the best results for
ROUGE-L, BERTScore, and LongDocFACTScore, while Gemma 2B 4bit is the best-performing one
for ROUGE-1. Also in this case, we must remember that a higher metric value
does not necessarily imply that the generated summaries are better than those with lower scores,
as the current metrics for evaluating summarization retain little information regarding the
content of the summaries.</p>
      </sec>
      <sec id="sec-5-5">
        <title>Metric</title>
        <p>[Table values: 0.383, 0.285, 0.870, -3.032; 0.231, 0.172, 0.855, -6.470]</p>
        <p>In conclusion, we can state that the ILLMs and the subsequently finetuned models achieve
good-quality summarization capabilities according to the chosen metrics. Furthermore, when comparing
against the content of the HTML documents, Llama-3-8B is the best-performing model, in line with its
size in terms of parameters. It is interesting to note that Flan T5, finetuned on the summaries
generated by Llama, achieves results similar to those obtained by Llama while reducing the
number of parameters by a factor of approximately 10.2.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions</title>
      <p>This paper presents a new approach to the automatic summarization of regulatory documents
exploiting instruction-tuned LLMs and finetuning, conducted within the Extreme Food Risk Analytics
(EFRA) European project framework. Thanks to our collaboration with SGS-Digicomply, we
were provided with a large dataset of HTML documents containing legal text in the form of
regulations, news, and laws. In this research phase, we exploit the content of these documents,
appropriately cleaned of noisy HTML tags, to generate two summary datasets with
ILLMs, which we later use to finetune smaller LLMs. We created these two datasets using the instruction-tuned
version of Llama-3 with 8B parameters and CohereForAI/c4ai-command-r-v01-4bit, using two
approaches to feed these models. We then use the newly generated summaries as targets for
three distinct LLMs to teach them how to adequately summarize the regulatory documents. To
do so, we finetuned two versions of Flan T5, one on the summaries generated by Llama and
the other on those generated by the Cohere model. Finally, we finetuned a 4-bit quantized
version of the Google model Gemma with 2B parameters.</p>
      <p>As shown in Section 4, the results achieved when evaluating with standard metrics for the
summarization task are promising. Every model achieved better scores than
those calculated using the manually created summaries of the regulatory documents. At the same
time, the scores achieved by the finetuned models are comparable to, if not better than, those
achieved by the two ILLMs.</p>
      <p>This leaves us optimistic about future research steps. In the following research phase, we
plan to involve a pool of legal experts from our partner SGS-Digicomply to manually evaluate
and label the summaries generated by the different models on the test set they provided. With
the newly labeled data, we plan to apply techniques such as knowledge distillation to
finetune even better models while reducing their size. Simultaneously,
we plan to apply state-of-the-art quantization techniques to further reduce the size
of the models while maintaining good summarization quality.</p>
      <p>Acknowledgements.</p>
      <p>Funding for this research has been provided by the European Union’s Horizon Europe research
and innovation program EFRA (Grant Agreement Number 101093026). Views and opinions
expressed are those of the author(s) only and do not necessarily reflect those of the European
Union or European Commission-EU. Neither the European Union nor the granting authority
can be held responsible for them. ★★★★★ ★★ ★★★★★
(Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7871–7880. URL:
https://aclanthology.org/2020.acl-main.703. doi:10.18653/v1/2020.acl- main.703.
[11] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language
models are unsupervised multitask learners, 2019. URL: https://www.semanticscholar.
org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/
9405cc0d6169988371b2755e573cc28650d14dfe.
[12] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al.,
Chainof-thought prompting elicits reasoning in large language models, Advances in neural
information processing systems 35 (2022) 24824–24837.
[13] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz,
D. Roth, Recent advances in natural language processing via large pre-trained language
models: A survey, ACM Computing Surveys 56 (2023) 1–40.
[14] S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, M. W. Mahoney, K. Keutzer, Q-bert:
Hessian based ultra low precision quantization of bert, in: Proceedings of the AAAI
Conference on Artificial Intelligence, volume 34, 2020, pp. 8815–8821.
[15] J. A. Bishop, S. Ananiadou, Q. Xie, LongDocFACTScore: Evaluating the Factuality of Long
Document Abstractive Summarisation, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci,
S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on
Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and
ICCL, Torino, Italia, 2024, pp. 10777–10789. URL: https://aclanthology.org/2024.lrec-main.
941.
[16] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization
Branches Out, 2004, pp. 74–81.
[17] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text
generation with BERT, arXiv preprint arXiv:1904.09675 (2019).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>
          [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
          <article-title>Attention is all you need</article-title>
          , in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          , Curran Associates, Inc.,
          <year>2017</year>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>
          [2] A. See, P. J. Liu, C. D. Manning,
          <article-title>Get to the point: Summarization with pointer-generator networks</article-title>
          , in: R. Barzilay, M.-Y. Kan (Eds.),
          <source>Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Vancouver, Canada,
          <year>2017</year>
          , pp.
          <fpage>1073</fpage>
          –
          <lpage>1083</lpage>
          . URL: https://aclanthology.org/P17-1099. doi:10.18653/v1/P17-1099.
        </mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>
          [3] Y. Liu, M. Lapata,
          <article-title>Text summarization with pretrained encoders</article-title>
          , in:
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , Association for Computational Linguistics,
          <year>2019</year>
          , p.
          <fpage>3721</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
<mixed-citation>
          [4] H. Zhang, X. Liu, J. Zhang,
          <article-title>DiffuSum: Generation enhanced extractive summarization with diffusion</article-title>
          , in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL 2023</source>
          , Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>13089</fpage>
          –
          <lpage>13100</lpage>
          . URL: https://aclanthology.org/2023.findings-acl.828. doi:10.18653/v1/2023.findings-acl.828.
        </mixed-citation>
      </ref>
      <ref id="ref5">
<mixed-citation>
          [5] E. Galimzhanova, C. I. Muntean, F. M. Nardini, R. Perego, G. Rocchietti,
          <article-title>Rewriting Conversational Utterances with Instructed Large Language Models</article-title>
          , in:
          <source>2023 IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>56</fpage>
          –
          <lpage>63</lpage>
          . URL: https://ieeexplore.ieee.org/document/10350178. doi:10.1109/WI-IAT59888.2023.00014.
        </mixed-citation>
      </ref>
      <ref id="ref6">
<mixed-citation>
          [6] Q. Huang, M. Tao, C. Zhang, Z. An, C. Jiang, Z. Chen, Z. Wu, Y. Feng,
          <source>Lawyer LLaMA Technical Report</source>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2305.15062. doi:10.48550/arXiv.2305.15062, arXiv:2305.15062 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref7">
<mixed-citation>
          [7] W. Hua, Y. Zhang, Z. Chen, J. Li, M. Weber,
          <article-title>Mixed-domain Language Modeling for Processing Long Legal Documents</article-title>
          , in: D. Preoțiuc-Pietro, C. Goanta, I. Chalkidis, L. Barrett, G. Spanakis, N. Aletras (Eds.),
          <source>Proceedings of the Natural Legal Language Processing Workshop 2023</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>51</fpage>
          –
          <lpage>61</lpage>
          . URL: https://aclanthology.org/2023.nllp-1.7. doi:10.18653/v1/2023.nllp-1.7.
        </mixed-citation>
      </ref>
      <ref id="ref8">
<mixed-citation>
          [8] A. Bajaj, P. Dangati, K. Krishna, P. Ashok Kumar, R. Uppaal, B. Windsor, E. Brenner, D. Dotterrer, R. Das, A. McCallum,
          <article-title>Long Document Summarization in a Low Resource Setting using Pretrained Language Models</article-title>
          , in: J. Kabbara, H. Lin, A. Paullada, J. Vamvas (Eds.),
          <source>Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>71</fpage>
          –
          <lpage>80</lpage>
          . URL: https://aclanthology.org/2021.acl-srw.7. doi:10.18653/v1/2021.acl-srw.7.
        </mixed-citation>
      </ref>
      <ref id="ref9">
<mixed-citation>
          [9] T. Yu, Z. Liu, P. Fung,
          <article-title>AdaptSum: Towards Low-Resource Domain Adaptation for Abstractive Summarization</article-title>
          , in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.),
          <source>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>5892</fpage>
          –
          <lpage>5904</lpage>
          . URL: https://aclanthology.org/2021.naacl-main.471. doi:10.18653/v1/2021.naacl-main.471.
        </mixed-citation>
      </ref>
      <ref id="ref10">
<mixed-citation>
          [10] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer,
          <article-title>BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          , in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>7871</fpage>
          –
          <lpage>7880</lpage>
          . URL: https://aclanthology.org/2020.acl-main.703. doi:10.18653/v1/2020.acl-main.703.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>