<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Instruct Large Language Models for Public Administration Document Information Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Salvatore Carta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Giuliani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Manolo Manca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Piano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessia Pisu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandro Gabriele Tiddia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Cagliari</institution>
          ,
          <addr-line>via Ospedale 72, Cagliari, 09124</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>3536</volume>
      <fpage>6</fpage>
      <lpage>8</lpage>
      <abstract>
        <p>With the rapid digitization of institutions, there is an ever-increasing problem of effectively organizing and accessing information. Public Administrations (PAs) manage large volumes of disparate data from a variety of sources. Thus, these organizations would greatly benefit from AI, particularly Natural Language Processing solutions that help organize, structure, and search for information effectively. In the context of Italian PA, which we address in this paper, there are two main challenges: the lack of ontologies and the limited tools available for Italian information extraction. In this paper, we attempt to advance Information Extraction for Italian PAs by instructing a Large Language Model on a set of automatically labeled triplets of public tenders.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Public Administration</kwd>
        <kwd>Tenders</kwd>
        <kwd>Italian Open Information Extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The pervasive impact of Information and Communication Technologies (ICT) on our society over the past two decades is undeniable. This technological revolution has permeated every aspect of society. Such a revolution has also affected Public Administrations (PAs), radically transforming how these entities operate and interact with citizens. Digital technologies have enabled PAs to streamline processes, improve service access, and increase transparency. However, along with these opportunities, significant challenges also arise in terms of data management and internal organization. Public administrations handle vast amounts of sensitive and often disparate data from various sources. Lack of data standardization, information security, and citizen privacy are crucial issues to be addressed. In addition, data fragmentation among different systems and departments can inhibit effective information sharing and analysis. For the aforementioned reasons, PAs would benefit from technology solutions based on Machine Learning and, in particular, Natural Language Processing (NLP) to improve the organization of such fragmented information.</p>
      <p>However, there are two major challenges. The first is the lack of appropriate resources to adequately organize PA documents. Indeed, it is crucial to organize, access, understand, and utilize information with proper structures, such as knowledge graphs or ontologies, which represent a powerful solution in many domains, e.g., in online news platforms [1], health and life sciences [2], or cultural heritage [3]. In this context, Open Information Extraction (OIE) [4] represents a unique solution to structure and organize PA information. OIE systems usually adopt a domain-agnostic method and can extract entities and relationship triples (the main components of knowledge graphs) from any sentence written in natural language.</p>
      <p>The second challenge is that a predominant part of the research conducted on OIE is oriented toward the English language. While advancements in OIE have been notable, they often fail to encompass the complexities inherent in non-English languages. This linguistic bias significantly hinders the widespread applicability and effectiveness of OIE systems in multilingual contexts.</p>
      <p>In this paper, we aim to advance the research on Open Information Extraction applied to PA by testing and exploiting the potential offered by Large Language Models (LLMs). In particular, a proper LLM is instructed with an effective strategy, employing proper Italian PA data.</p>
      <p>The rest of the paper is structured as follows: Section 2 gives an overview of the state of the art; our methodology is detailed in Section 3, whereas the experiments are described in Section 4. Section 5 reports and discusses the results, and Section 6 ends the paper with the conclusions.</p>
    </sec>
    <sec id="sec-1a">
      <title>2. Related Work</title>
      <p>OIE methods aim to identify linguistic extraction patterns, either hand-crafted or automatically learned from the data [5]. Therefore, they are subdivided into rule-based or neural methods. The former include ClausIE [6], an OIE framework based on dependency parsing to detect clauses in an input sentence and subsequently extract propositions. REVERB [7] extracts tuples by isolating relation phrases that satisfy syntactic and lexical constraints. Similarly, TEXTRUNNER [8] first identifies a pair of noun phrases that are not too far apart, and then it applies a classifier to determine whether or not to extract a relationship. Further works rely on a proper strategy for combining different OIE tools for triplet generation and filtering [9]. A pioneering proposal regarding the more recent neural methods is the work of Stanovsky et al. [10], wherein OIE is treated as a sequence labeling problem, and an LSTM-transducer automatically extracts triplets. Zhan and Zhao [11] introduced a span model for n-ary Open Information Extraction. More recently, Kolluru et al. [12] introduced IMOJIE, a neural Open Information Extraction system that follows an iterative approach where the triplet extraction is conditioned on the previously retrieved triplets, with the aim of reducing redundancy.</p>
      <p>The methods above have been developed or tested specifically for English textual corpora. Regarding the Italian language, no significant research was conducted on Italian Open IE until the last decade. To date, only a few works have addressed such a challenge. Damiano et al. proposed ItalIE [13], a clause-based OIE system inspired by ClausIE, aimed at extracting n-ary coherent propositions from simple sentences. Sentences are analyzed to identify and categorize clauses based on seven predefined patterns specific to the Italian language. Guarasci et al. [14] presented an OIE method for Italian single-verb sentences based on Lexicon-Grammar tables. The system employs linguistic structures and patterns of verbal behavior to identify arguments, match patterns, and generate propositions, demonstrating effectiveness in generating syntactically and semantically valid propositions for the Italian language. Finally, [15] proposed OIE4PA, an Open IE framework that can identify facts from Public Administration documents. Leveraging the proposal of Siciliani et al. [15], in this work we propose an Instructed Large Language Model for Italian Open Information Extraction specialized in Public Administration documents.</p>
    </sec>
    <sec id="sec-1b">
      <title>3. Methodology</title>
      <p>We propose a novel model for automated Information Extraction for Italian PAs by instructing an LLM on a set of automatically labeled triplets of public tenders. To this end, we devise a proper strategy to train an LLM with a suitable set of triplets and instructions. The entire process is depicted in Figure 1. Our method involves two stages. In detail, the process first performs a step aimed at obtaining a correctly annotated set of triplets (Triplet Auto-Labeling), which is subsequently used to train the LLM (Instruction Tuning). Each step is described in the following.</p>
      <sec id="sec-1b-1">
        <title>3.1. Triplet Auto-Labeling</title>
        <p>The first step of our methodology is training a Sequence Classifier Language Model to identify meaningful triplets within the PA context. To accomplish this, we leveraged the dataset OIE4PA, consisting of a collection of triplets extracted from Italian tenders of the Apulia region [15]. In particular, each triplet is extracted with the WikiOIE framework [16]. Specifically, the dataset is organized into two sets: a labeled set ℒ, which contains a subset of 2000 binary triplets labeled by humans as valid or not, and an unlabeled set 𝒰 of 14,096 triplets, together with the original sentences. Then, at this stage, we exploited the ℒ set to properly train a classifier to distinguish between valid and invalid triplets. To do this, we treated the task as a sentence classification problem, concatenating triplets into a single sentence and separating subject, predicate, and object by a semicolon. To this end, we identified three suitable Language Models (LMs) for this task, namely Italian-bert, LegalBert [17], and BureauBERTo [18]. The former is a Bert base model [19] fine-tuned on an Italian corpus, the second is a version of Italian Bert fine-tuned on Italian civil law corpora, and the last is an UmBERTo model fine-tuned on PA, banking, and insurance corpora. Table 1 outlines the results obtained by these three Language Models on the triplet classification task. Finally, the most accurate trained classifier has been employed to label the triplets of the 𝒰 set, forming a new ℒ (Auto-Labeled) set, which in turn will be exploited to instruct the Large Language Model for the OIE task.</p>
      </sec>
      <sec id="sec-1b-2">
        <title>3.2. Instruction Tuning</title>
        <p>Instruction tuning is a strategy that involves guiding a language model through human-like instructions to improve its performance on a specific task. Unlike traditional methods that rely solely on large-scale training data, instruction tuning provides targeted guidance, allowing the model to adapt and refine its behaviour toward desired outcomes. Incorporating human-like instructions enhances the model’s understanding and improves its ability to generate contextually relevant responses. In summary, given a source text and task-specific instructions, the model is trained to create a sequence of tokens representing the desired output.</p>
        <p>To instruct an LLM to perform Open Information Extraction, we transformed the ℒ (Auto-Labeled) triplet set into an instruction dataset. In particular, each auto-labeled triplet</p>
      </sec>
    </sec>
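    <p>The construction of the instruction dataset can be sketched as follows. The record layout and helper names are our own illustration: the task instruction is the Italian prompt reported in the paper, and the response is left empty for invalid triplets.</p>
    <preformat>
```python
# Hypothetical sketch of building one instruction-tuning record from an
# auto-labeled triplet, following the Task Instruction / Input Text /
# Response template (field names are illustrative, not the paper's format).

INSTRUCTION = ("Trova quali triple semantiche esistono nel testo. "
               "Formatta l'output come [Soggetto;Predicato;Oggetto]")

def to_instruction_record(sentence, triplet, is_valid):
    """Pair the fixed task instruction with a source sentence; the
    response is the bracketed triplet when valid, else an empty string."""
    response = "[{};{};{}]".format(*triplet) if is_valid else ""
    return {"instruction": INSTRUCTION, "input": sentence, "response": response}

record = to_instruction_record(
    "Il bando prevede una scadenza.",
    ("il bando", "prevede", "una scadenza"),
    True,
)
print(record["response"])  # [il bando;prevede;una scadenza]
```
    </preformat>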
    <sec id="sec-2">
      <title>4. Experimental settings</title>
      <p>is used to train the instruction model following the template: Task Instruction, Input Text, and Response.</p>
      <sec id="sec-2-1">
        <title>3.2.1. Task instruction</title>
        <p>Task instructions provide a detailed statement on accomplishing the desired task and properly structuring the output. In detail, we formulated the following instruction to query the LLM: &lt;Trova quali triple semantiche esistono nel testo. Formatta l’output come [Soggetto;Predicato;Oggetto]&gt;. We formulate the instruction in Italian to make the model immediately understand that we are referring to the Italian language. The English translation of the instruction is: “Find which semantic triples exist in the text. Format the output as [Subject;Predicate;Object]”.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2.2. Input text</title>
        <p>The input text represents the sentences on which the LLM has to perform the task defined by the instructions. In detail, each sentence is the original text excerpt from which a triplet belonging to the dataset OIE4PA has been extracted.</p>
      </sec>
      <sec id="sec-2-3">
        <title>3.2.3. Response</title>
        <p>The response represents the desired output. In our case, the input sentence is transformed into an open triplet. We also specify that, to instruct the model to distinguish sentences where a triplet can be extracted from sentences where no useful triplets exist, we included the triplet as a response if it was labeled as valid by the classifier; otherwise, we leave an empty string.</p>
      </sec>
      <p>We adopted the Flan-T5 family [20] as the instruction model. Such a choice is motivated by two reasons: first, prior research [21] has demonstrated the potential of such models in Information Extraction tasks, eventually outperforming larger models such as LLama2 or similar, resulting in a perfect trade-off between speed of inference and prediction quality. The other main benefit is that Flan-T5 is a multi-language model, which is also suitable for tasks related to understanding Italian. We tested two different Flan-T5 sizes, flan-xxl (11b) and flan-xl (3b), adopting for both the OIE4PA dataset, relying on a split of 80% and 20% for training and test, respectively.</p>
      <p>For efficiency and hardware reasons, we fine-tuned the models by exploiting QLora with a 4-bit quantization, allowing faster training and saving GPU memory. All experiments were conducted on an Nvidia RTX A6000 GPU machine with 48 GB of VRAM. We train both models for one epoch, and we adopt the following QLora settings and hyperparameters: Lora-rank, Lora-alpha, Lora-dropout, learning rate, and batch size.</p>
      <sec id="sec-2-4">
        <title>4.1. Evaluation Metrics</title>
        <p>To properly apply such metrics for the triplet evaluation, we considered as true positive (TP) a non-empty triplet that matches the corresponding triplet in the ground truth (i.e., the triples belonging to the ℒ set), true negative (TN) a triplet returned as an empty string by the model and labeled as invalid in the ground truth, false positive (FP) a triplet that was labeled as invalid but retrieved by the model, and false negative (FN) when the model returned an empty string rather than a valid triplet.</p>
      </sec>
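      <p>The TP/TN/FP/FN counting described above can be sketched as follows. This is an illustrative scorer, not the authors’ evaluation code, in which an empty string stands for “no valid triplet” and a mismatched non-empty prediction is counted as a false negative.</p>
      <preformat>
```python
# Illustrative confusion-matrix scorer for triplet extraction, where ""
# means "no triplet": exact match = TP, both empty = TN, spurious
# non-empty output = FP, and a missed or wrong triplet = FN.

def score(predictions, gold):
    tp = tn = fp = fn = 0
    for pred, ref in zip(predictions, gold):
        if pred and pred == ref:
            tp += 1          # non-empty prediction matching the ground truth
        elif not pred and not ref:
            tn += 1          # model correctly returned an empty string
        elif pred and not ref:
            fp += 1          # triplet retrieved although labeled invalid
        else:
            fn += 1          # empty or wrong output for a valid triplet
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / max(len(gold), 1)
    return {"a": accuracy, "p": precision, "r": recall, "f1": f1}

metrics = score(["[a;b;c]", "", "[x;y;z]"], ["[a;b;c]", "", ""])
print(metrics["p"], metrics["r"])  # 0.5 1.0
```
      </preformat>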
      <p>In doing so, we can evaluate the performances in terms of classical confusion matrix metrics, i.e., accuracy (a), precision (p), recall (r), and F1 score (F1), whose formulae are: a = (TP + TN) / (TP + TN + FP + FN), p = TP / (TP + FP), r = TP / (TP + FN), and F1 = 2 · p · r / (p + r).</p>
      <sec id="sec-2-5">
        <title>5. Results</title>
        <p>Table 1 reports the comparison of three different Italian Bert models for the triplet classification task. In detail, the selected models are LegalBERT-ITA, BertBase-ITA, and BureauBERTo. The best model turns out to be BureauBERTo, probably due to the fact that it is the only model pre-trained on Public Administration corpora.</p>
      </sec>
      <sec id="sec-2-6">
        <title>6. Conclusions</title>
        <p>Considering the significant gap between the information extraction resources available for English and those for resource-constrained languages such as Italian, in this paper we explored an Instruction Tuning approach to perform Open Information Extraction on Italian public tenders. A proper LLM is instructed with an effective two-stage strategy, in which a language-model-based classifier is trained on a proper Italian PA dataset to obtain a set of correct triplets, which are then used to instruct a suitable LLM. The promising experiments have validated the assumptions pointed out in the paper and incentivize future developments aimed at creating new datasets and models capable of understanding and structuring technical texts in Italian in the form of semantic triplets.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>This work has been partially carried out thanks to the Ministerial Decree no. 351 of 9th April 2022, based on the NRRP, funded by the European Union - NextGenerationEU - Mission 4 “Education and Research”, Component 1 “Enhancement of the offer of educational services: from nurseries to universities” - Investment 4.1, which provided financial support for Leonardo Piano’s doctoral pathway.</p>
      <p>Also, Alessia Pisu acknowledges MUR and EU-FSE for financial support of the PON Research and Innovation 2014-2020 (D.M. 1061/2021).</p>
      <p>Furthermore, we acknowledge financial support under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.5 - Call for tender No. 3277 published on December 30, 2021 by the Italian Ministry of University and Research (MUR), funded by the European Union – NextGenerationEU. Project Code ECS0000038 – Project Title eINS Ecosystem of Innovation for Next Generation Sardinia – CUP F53C22000430001 - Grant Assignment Decree No. 1056 adopted on June 23, 2022 by the Italian Ministry of University and Research (MUR).</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>Bert triplet classification results in terms of accuracy (a), precision (p), recall (r), and F1 score (F1).</p>
        </caption>
        <table>
          <thead>
            <tr><th>Model</th><th>a</th><th>p</th><th>r</th><th>F1</th></tr>
          </thead>
          <tbody>
            <tr><td>LegalBERT-ITA</td><td>0.935</td><td>0.953</td><td>0.897</td><td>0.919</td></tr>
            <tr><td>BertBase-ITA</td><td>0.927</td><td>0.935</td><td>0.894</td><td>0.911</td></tr>
            <tr><td>BureauBERTo</td><td>0.945</td><td>0.963</td><td>0.901</td><td>0.932</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Table 2 outlines the results of the two fine-tuned Flan-T5 models on extracting triplets from procurement texts. Both model sizes show excellent results for all metrics; in particular, recall is significantly high, demonstrating that the models are quite effective in finding a large number of true positives (i.e., valid triplets). It is also worth noting that the values are higher for the model with the higher number of parameters. Therefore, the promising results support the thesis of leveraging Instruction Tuning to build strong Open Information Extraction models for Italian public administrations. To this end, we plan to create new datasets in the future to develop a new set of foundational models for information extraction in Italian, with a particular focus on PAs and other administrative entities.</p>
      <p>Model checkpoints: LegalBERT-ITA (https://huggingface.co/dlicari/Italian-Legal-BERT), BertBase-ITA (https://huggingface.co/dbmdz/bert-base-italian-uncased), and BureauBERTo (https://huggingface.co/colinglab/BureauBERTo).</p>
      <p>[18] … UmBERTo to the Italian bureaucratic language, in: Ital-IA, 2023. URL: https://api.semanticscholar.org/CorpusID:262088765.</p>
      <p>[19] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: North American Chapter of the Association for Computational Linguistics, 2019. URL: https://api.semanticscholar.org/CorpusID:52967399.</p>
      <p>[20] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models, arXiv preprint arXiv:2210.11416 (2022).</p>
      <p>[21] S. Wadhwa, S. Amir, B. C. Wallace, Revisiting relation extraction in the era of large language models, in: Proceedings of the Association for Computational Linguistics Meeting, 2023, pp. 15566–15589. URL: https://api.semanticscholar.org/CorpusID:258564662.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>