=Paper=
{{Paper
|id=Vol-3762/485
|storemode=property
|title=Instruct Large Language Models for Public Administration Document Information Extraction
|pdfUrl=https://ceur-ws.org/Vol-3762/485.pdf
|volume=Vol-3762
|authors=Salvatore Carta,Alessandro Giuliani,Marco Manolo Manca,Leonardo Piano,Alessia Pisu,Sandro Gabriele Tiddia
|dblpUrl=https://dblp.org/rec/conf/ital-ia/CartaGMPPT24
}}
==Instruct Large Language Models for Public Administration Document Information Extraction==
Salvatore Carta, Alessandro Giuliani* , Marco Manolo Manca, Leonardo Piano* , Alessia Pisu
and Sandro Gabriele Tiddia
Department of Mathematics and Computer Science, University of Cagliari, via Ospedale 72, Cagliari, 09124, Italy
Abstract

With the rapid digitization of institutions, there is an ever-increasing problem of effectively organizing and accessing information. Public Administrations (PAs) manage large volumes of disparate data from a variety of sources. Thus, these organizations would greatly benefit from AI, particularly Natural Language Processing solutions that help organize, structure, and search for information effectively. In the context of Italian PA, which we address in this paper, there are two main challenges: the lack of ontologies and the limited tools available for Italian information extraction. In this paper, we attempt to advance Information Extraction for Italian PAs by instructing a Large Language Model on a set of automatically labeled triplets of public tenders.

Keywords

Large Language Models, Public Administration, Tenders, Italian Open Information Extraction
1. Introduction

The pervasive impact of Information and Communication Technologies (ICT) on our society over the past two decades is undeniable. This technological revolution has permeated every aspect of society and has also affected Public Administrations (PAs), radically transforming how these entities operate and interact with citizens. Digital technologies have enabled PAs to streamline processes, improve service access, and increase transparency. However, along with these opportunities, significant challenges also arise in terms of data management and internal organization. Public administrations handle vast amounts of sensitive and often disparate data from various sources. Lack of data standardization, information security, and citizen privacy are crucial issues to be addressed. In addition, data fragmentation among different systems and departments can inhibit effective information sharing and analysis. For these reasons, PAs would benefit from technology solutions based on Machine Learning and, in particular, Natural Language Processing (NLP) to improve the organization of such fragmented information.

However, there are two major challenges. The first is the lack of appropriate resources to adequately organize PA documents. Indeed, it is crucial to organize, access, understand, and utilize information with proper structures, such as knowledge graphs or ontologies, which represent a powerful solution in many domains, e.g., online news platforms [1], health and life sciences [2], or cultural heritage [3]. In this context, Open Information Extraction (OIE) [4] represents a key solution to structure and organize PA information. OIE systems usually adopt a domain-agnostic method and can extract entity and relationship triples (the main components of knowledge graphs) from any sentence written in natural language.

The second challenge is that a predominant part of the research conducted on OIE is oriented toward the English language. While advancements in OIE have been notable, they often fail to encompass the complexities inherent in non-English languages. This linguistic bias significantly hinders the widespread applicability and effectiveness of OIE systems in multilingual contexts.

In this paper, we aim to advance the research on Open Information Extraction applied to PA by testing and exploiting the potential offered by Large Language Models (LLMs). In particular, an LLM is instructed with an effective strategy, employing suitable Italian PA data.

The rest of the paper is structured as follows: Section 2 gives an overview of the state of the art; our methodology is detailed in Section 3, whereas the experiments are described in Section 4. Section 5 reports and discusses the results, and Section 6 ends the paper with the conclusions.

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding authors.
salvatore@unica.it (S. Carta); alessandro.giuliani@unica.it (A. Giuliani); marcom.manca@unica.it (M. M. Manca); leonardo.piano@unica.it (L. Piano); alessia.pisu96@unica.it (A. Pisu); sandrog.tiddia@unica.it (S. G. Tiddia)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Related Works

The advent of Open Information Extraction (OIE) enabled the transcendence of domain-specific constraints inherent in conventional IE methodologies. OIE methods aim to identify linguistic extraction patterns, either hand-crafted or automatically learned from the data [5]; they are therefore subdivided into rule-based and neural methods. The former include ClausIE [6], an OIE framework based on dependency parsing that detects clauses in an input sentence and subsequently extracts propositions. REVERB [7] extracts tuples by isolating relation phrases that satisfy syntactic and lexical constraints. Similarly, TEXTRUNNER [8] first identifies a pair of noun phrases that are not too far apart, and then applies a classifier to determine whether or not to extract a relationship. Further works rely on a strategy for combining different OIE tools for triplet generation and filtering [9]. A pioneering proposal among the more recent neural methods is the work of Stanovsky et al. [10], wherein OIE is treated as a sequence labeling problem and an LSTM transducer automatically extracts triplets. Zhan and Zhao [11] introduced a span model for n-ary Open Information Extraction. More recently, Kolluru et al. [12] introduced IMOJIE, a neural Open Information Extraction system that follows an iterative approach in which triplet extraction is conditioned on the previously retrieved triplets, with the aim of reducing redundancy.

The methods above have been developed or tested specifically on English textual corpora. Regarding the Italian language, no significant research had been conducted on Italian Open IE until the last decade, and to date only a few works have addressed this challenge. Damiano et al. proposed ItalIE [13], a clause-based OIE system inspired by ClausIE and aimed at extracting n-ary coherent propositions from simple sentences: sentences are analyzed to identify and categorize clauses based on seven predefined patterns specific to the Italian language. Guarasci et al. [14] presented an OIE method for Italian single-verb sentences based on Lexicon-Grammar tables; the system employs linguistic structures and patterns of verbal behavior to identify arguments, match patterns, and generate propositions, demonstrating effectiveness in generating syntactically and semantically valid propositions for the Italian language. Finally, [15] proposed OIE4PA, an Open IE framework that can identify facts in Public Administration documents. Leveraging the proposal of Siciliani et al. [15], in this work we propose an Instructed Large Language Model for Italian Open Information Extraction specialized in Public Administration documents.

3. Methodology

We propose a novel model for automated Information Extraction for Italian PAs by instructing an LLM on a set of automatically labeled triplets of public tenders. To this end, we devise a suitable strategy to train an LLM with an appropriate set of triplets and instructions. The entire process is depicted in Figure 1.

Figure 1: Instructed model training.

Our method involves two stages: the process first performs a step aimed at obtaining a correctly annotated set of triplets (Triplet Auto-Labeling), which is subsequently used to train the LLM (Instruction Tuning). Each step is described in the following.

3.1. Triplet Auto-Labeling

The first step of our methodology is training a sequence-classification Language Model to identify meaningful triplets within the PA context. To accomplish this, we leveraged the OIE4PA dataset, consisting of a collection of triplets extracted from Italian tenders of the Apulia region [15]. In particular, each triplet is extracted with the WikiOIE framework [16]. Specifically, the dataset is organized into two sets: a labeled set L, which contains a subset of 2000 binary triplets labeled by humans as valid or not, and an unlabeled set U of 14,096 triplets, together with the original sentences. At this stage, we exploited the L set to train a classifier to distinguish between valid and invalid triplets. To do this, we treated the task as a sentence classification problem, concatenating each triplet into a single sentence and separating subject, predicate, and object by a semicolon. To this end, we identified three suitable Language Models (LMs) for this task, namely Italian-bert, LegalBert [17], and BureauBERTo [18]. The first is a BERT base model [19] fine-tuned on an Italian corpus, the second is a version of Italian BERT fine-tuned on Italian civil law corpora, and the last is an UmBERTo model fine-tuned on PA, banking, and insurance corpora. Table 1 outlines the results obtained by these three Language Models on the triplet classification task. Finally, the most accurate trained classifier has been employed to label the triplets of the U set, forming a new AL (Auto-Labeled) set, which in turn is exploited to instruct the Large Language Model for the OIE task.

3.2. Instruction Tuning

Instruction tuning is a strategy that involves guiding a language model through human-like instructions to improve its performance on a specific task. Unlike traditional methods that rely solely on large-scale training data, instruction tuning provides targeted guidance, allowing the model to adapt and refine its behaviour toward desired outcomes. Incorporating human-like instructions enhances the model's understanding and improves its ability to generate contextually relevant responses. In summary, given a source text and task-specific instructions, the model is trained to create a sequence of tokens representing the desired output.

To instruct an LLM to perform Open Information Extraction, we transformed the AL triplet set into an instruction dataset. In particular, each auto-labeled triplet is used to train the instruction model following the template: Task Instruction, Input Text, and Response.
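To make the data flow concrete, the following minimal sketch (our own illustration; the field, function, and variable names are assumptions, not the authors' code) shows how a triplet could be serialized into the semicolon-separated classifier input of the auto-labeling step, and how an auto-labeled triplet could be turned into a Task Instruction / Input Text / Response record, with an empty response when the triplet was judged invalid:

```python
# Hypothetical helpers illustrating the two data-preparation steps.

# English translation of the (Italian) task instruction used in the paper.
INSTRUCTION = ("Find which semantic triples exist in the text. "
               "Format the output as [Subject; Predicate; Object]")

def classifier_input(subject: str, predicate: str, obj: str) -> str:
    """Concatenate a triplet into a single sentence, separating subject,
    predicate, and object by a semicolon (Triplet Auto-Labeling step)."""
    return f"{subject}; {predicate}; {obj}"

def instruction_example(sentence: str, triplet: tuple, is_valid: bool) -> dict:
    """Build one instruction-tuning record. Invalid triplets get an empty
    response so the model learns to abstain on triplet-free sentences."""
    subject, predicate, obj = triplet
    response = f"[{subject}; {predicate}; {obj}]" if is_valid else ""
    return {"instruction": INSTRUCTION, "input": sentence, "response": response}

example = instruction_example(
    "Il Comune pubblica il bando di gara.",
    ("Il Comune", "pubblica", "il bando di gara"),
    is_valid=True,
)
```

The record layout (dictionary with instruction/input/response keys) is only one plausible encoding of the template; any format that the fine-tuning pipeline can consume would serve.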
3.2.1. Task instruction

Task instructions provide a detailed statement on how to accomplish the desired task and properly structure the output. In detail, we formulated a specific instruction to query the LLM. We formulate the instruction in Italian so that the model immediately understands that we are referring to the Italian language. The English translation of the instruction is: "Find which semantic triples exist in the text. Format the output as [Subject; Predicate; Object]".

3.2.2. Input text

The input text represents the sentence on which the LLM has to perform the task defined by the instructions. In detail, each sentence is the original text excerpt from which a triplet belonging to the OIE4PA dataset has been extracted.

3.2.3. Response

The response represents the desired output: in our case, the input sentence transformed into an open triplet. To instruct the model to distinguish sentences from which a triplet can be extracted from sentences where no useful triplet exists, we included the triplet as a response only if it was labeled as valid by the classifier; otherwise, we left an empty string.

4. Experimental settings

We adopted the Flan-T5 family [20] as the instruction model. Such a choice is motivated by two reasons. First, prior research [21] has demonstrated the potential of such models in Information Extraction tasks, eventually outperforming larger models such as LLama2, resulting in a good trade-off between inference speed and prediction quality. The other main benefit is that Flan-T5 is a multi-language model, which is also suitable for the Italian language. We fine-tuned two model sizes, T5-XL (3b) and T5-XXL, adopting for both the OIE4PA dataset with a split of 80% and 20% for training and test, respectively.

For efficiency and hardware reasons, we fine-tuned the models by exploiting QLora (https://github.com/artidoro/qlora) with 4-bit quantization, allowing faster training and saving GPU memory. All experiments were conducted on an Nvidia RTX A6000 GPU machine with 48 GB of VRAM. We train both models for one epoch, and we adopt the following QLora settings and hyperparameters:

Lora-rank: 16
Lora-alpha: 32
Lora-dropout: 0.05
Learning rate: 0.003
Batch size: 8
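A configuration along these lines could be expressed with the Hugging Face transformers and peft libraries; this is a minimal sketch under our own assumptions (the paper does not publish its training code, and the model identifier and target modules are illustrative):

```python
# Sketch of a 4-bit QLoRA setup with the hyperparameters listed above.
# Requires: transformers, peft, bitsandbytes. Not the authors' actual code.
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(load_in_4bit=True)  # 4-bit quantization

base = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xl",            # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

lora_config = LoraConfig(
    r=16,               # Lora-rank
    lora_alpha=32,      # Lora-alpha
    lora_dropout=0.05,  # Lora-dropout
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(base, lora_config)
# Training would then run for one epoch with learning rate 0.003
# and batch size 8, as reported above.
```

Only the small LoRA adapter matrices are trained while the 4-bit base weights stay frozen, which is what keeps the memory footprint within a single 48 GB GPU.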
4.1. Evaluation Metrics

To evaluate the extracted triplets, we considered as true positive (TP) a non-empty triplet that matches the corresponding triplet in the ground truth (i.e., the triples belonging to the AL set); as true negative (TN) a triplet returned as an empty string by the model and labeled as invalid in the ground truth; as false positive (FP) a triplet that was labeled as invalid but retrieved by the model; and as false negative (FN) the case in which the model returned an empty string rather than a valid triplet.

In doing so, we can evaluate the performance in terms of classical confusion-matrix metrics, i.e., accuracy (a), precision (p), recall (r), and F1 score (F1), whose formulae are:

a = (TP + TN) / (TP + TN + FP + FN)
p = TP / (TP + FP)
r = TP / (TP + FN)
F1 = 2 * p * r / (p + r)

5. Results

Table 1 reports the comparison of three different Italian BERT models on the triplet classification task. In detail, the selected models are LegalBERT-ITA (https://huggingface.co/dlicari/Italian-Legal-BERT), BertBase-ITA (https://huggingface.co/dbmdz/bert-base-italian-uncased), and BureauBERTo (https://huggingface.co/colinglab/BureauBERTo). The best model turns out to be BureauBERTo, probably because it is the only model pre-trained on Public Administration corpora.

Table 1: BERT triplet classification results in terms of accuracy (a), precision (p), recall (r), and F1 score (F1).

Model           a      p      r      F1
LegalBERT-ITA   0.935  0.953  0.897  0.919
BertBase-ITA    0.927  0.935  0.894  0.911
BureauBERTo     0.945  0.963  0.901  0.932

Table 2 outlines the results of the two fine-tuned Flan-T5 models on extracting triplets from procurement texts. Both model sizes show excellent results for all metrics; in particular, recall is significantly high, demonstrating that the models are quite effective in finding a large number of true positives (i.e., valid triplets). It is also worth noting that the values are higher for the model with the higher number of parameters. The promising results therefore support the thesis of leveraging Instruction Tuning to build strong Open Information Extraction models for Italian public administrations. To this end, we plan to create new datasets in the future to develop a new set of foundational models for information extraction in Italian, with a particular focus on PAs and other administrative entities.

Table 2: FLAN-OpenIE results on the OIE4PA dataset in terms of accuracy (a), precision (p), recall (r), and F1 score (F1).

Model   a     p     r     F1
T5-xl   0.78  0.74  0.97  0.84
T5-xxl  0.82  0.78  0.99  0.87

6. Conclusions

Considering the significant gap between the information extraction resources available for English and those for lower-resourced languages such as Italian, in this paper we explored an Instruction Tuning approach to perform Open Information Extraction on Italian public tenders. An LLM is instructed with an effective two-stage strategy, in which a language-model-based classifier is trained on a suitable Italian PA dataset to obtain a set of correct triplets, which are then used to instruct the LLM. The promising experiments have validated the assumptions pointed out in the paper and motivate future developments aimed at building new datasets and models capable of understanding and structuring technical texts in Italian in the form of semantic triplets.

Acknowledgments

This work has been partially carried out thanks to Ministerial Decree no. 351 of 9th April 2022, based on the NRRP, funded by the European Union - NextGenerationEU - Mission 4 "Education and Research", Component 1 "Enhancement of the offer of educational services: from nurseries to universities" - Investment 4.1, which provided financial support for Leonardo Piano's doctoral pathway.

Also, Alessia Pisu acknowledges MUR and EU-FSE for financial support of the PON Research and Innovation 2014-2020 (D.M. 1061/2021).

Furthermore, we acknowledge financial support under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.5 - Call for tender No. 3277 published on December 30, 2021 by the Italian Ministry of University and Research (MUR), funded by the European Union - NextGenerationEU. Project Code ECS0000038 - Project Title eINS Ecosystem of Innovation for Next Generation Sardinia - CUP F53C22000430001 - Grant Assignment Decree No. 1056 adopted on June 23, 2022 by the Italian Ministry of University and Research (MUR).
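As a concrete companion to the metric definitions in Section 4.1, the short helper below computes a, p, r, and F1 from confusion-matrix counts (a minimal sketch; the function and example counts are ours, not taken from the paper's evaluation code):

```python
# Accuracy, precision, recall, and F1 from confusion-matrix counts,
# following the formulas in Section 4.1 (illustrative helper).
def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    a = (tp + tn) / (tp + tn + fp + fn)   # accuracy
    p = tp / (tp + fp)                    # precision
    r = tp / (tp + fn)                    # recall
    f1 = 2 * p * r / (p + r)              # harmonic mean of p and r
    return {"a": a, "p": p, "r": r, "F1": f1}

# Hypothetical run: 8 valid triplets recovered, 1 invalid correctly left
# empty, 1 false alarm, no misses.
metrics = confusion_metrics(tp=8, tn=1, fp=1, fn=0)
```

The counts in the example are fabricated purely to exercise the formulas; the high recall with slightly lower precision mirrors the pattern visible in Table 2.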
References

[1] C. Rudnik, T. Ehrhart, O. Ferret, D. Teyssou, R. Troncy, X. Tannier, Searching news articles using an event knowledge graph leveraged by wikidata, in: S. Amer-Yahia, M. Mahdian, A. Goel, G. Houben, K. Lerman, J. J. McAuley, R. Baeza-Yates, L. Zia (Eds.), Companion of The 2019 World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, ACM, 2019, pp. 1232-1239.

[2] P. Ernst, C. Meng, A. Siu, G. Weikum, Knowlife: A knowledge graph for health and life sciences, in: 2014 IEEE 30th International Conference on Data Engineering, 2014, pp. 1254-1257. doi:10.1109/ICDE.2014.6816754.

[3] S. Carta, G. Fenu, A. Giuliani, M. M. Manca, M. Marras, L. Piano, A. S. Podda, L. Pompianu, S. G. Tiddia, Empowering digital transformation in tourism through intelligent methods for representation and exploitation of cultural heritage knowledge, volume 3536, 2023, pp. 83-91. URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85177612618&partnerID=40&md5=7e8334f126d9385a733fbfb0d1674f19.

[4] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, O. Etzioni, Open information extraction from the web, in: Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2007, pp. 2670-2676.

[5] C. Niklaus, M. Cetto, A. Freitas, S. Handschuh, A survey on open information extraction, in: E. M. Bender, L. Derczynski, P. Isabelle (Eds.), Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 3866-3878. URL: https://aclanthology.org/C18-1326.

[6] L. Del Corro, R. Gemulla, Clausie: clause-based open information extraction, in: Proceedings of the 22nd international conference on World Wide Web, 2013, pp. 355-366.

[7] A. Fader, S. Soderland, O. Etzioni, Identifying relations for open information extraction, in: Conference on Empirical Methods in Natural Language Processing, 2011.

[8] A. Yates, M. Banko, M. Broadhead, M. J. Cafarella, O. Etzioni, S. Soderland, Textrunner: Open information extraction on the web, in: North American Chapter of the Association for Computational Linguistics, 2007. URL: https://api.semanticscholar.org/CorpusID:1455080.

[9] S. Carta, P. Fariello, A. Giuliani, L. Piano, A. S. Podda, S. G. Tiddia, Sailgenie: Sailing expertise to knowledge graph through open information extraction, in: G. A. Tsihrintzis, C. Toro, S. A. Ríos, R. J. Howlett, L. C. Jain (Eds.), Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 27th International Conference KES-2023, Athens, Greece, 6-8 September 2023, volume 225 of Procedia Computer Science, Elsevier, 2023, pp. 2224-2233. URL: https://doi.org/10.1016/j.procs.2023.10.213. doi:10.1016/J.PROCS.2023.10.213.

[10] G. Stanovsky, J. Michael, L. Zettlemoyer, I. Dagan, Supervised open information extraction, in: North American Chapter of the Association for Computational Linguistics, 2018.

[11] J. Zhan, H. Zhao, Span model for open information extraction on accurate corpus, in: AAAI Conference on Artificial Intelligence, 2019. URL: https://api.semanticscholar.org/CorpusID:208138002.

[12] K. Kolluru, S. Aggarwal, V. Rathore, Mausam, S. Chakrabarti, Imojie: Iterative memory-based joint open information extraction, ArXiv abs/2005.08178 (2020). URL: https://api.semanticscholar.org/CorpusID:218674382.

[13] E. Damiano, A. Minutolo, M. Esposito, Open information extraction for italian sentences, in: 2018 32nd International Conference on Advanced Information Networking and Applications Workshops (WAINA), 2018, pp. 668-673. doi:10.1109/WAINA.2018.00165.

[14] R. Guarasci, E. Damiano, A. Minutolo, M. Esposito, G. De Pietro, Lexicon-grammar based open information extraction from natural language sentences in italian, Expert Systems with Applications 143 (2020) 112954. URL: https://www.sciencedirect.com/science/article/pii/S0957417419306724. doi:10.1016/j.eswa.2019.112954.

[15] L. Siciliani, E. Ghizzota, P. Basile, P. Lops, Oie4pa: open information extraction for the public administration, Journal of Intelligent Information Systems (2023) 1-22.

[16] L. Siciliani, P. Cassotti, P. Basile, M. de Gemmis, P. Lops, G. Semeraro, A. Moro, Extracting relations from italian wikipedia using self-training (2021).

[17] D. Licari, G. Comandè, ITALIAN-LEGAL-BERT: A Pre-trained Transformer Language Model for Italian Law, in: D. Symeonidou, R. Yu, D. Ceolin, M. Poveda-Villalón, D. Audrito, L. D. Caro, F. Grasso, R. Nai, E. Sulis, F. J. Ekaputra, O. Kutz, N. Troquard (Eds.), Companion Proceedings of the 23rd International Conference on Knowledge Engineering and Knowledge Management, volume 3256 of CEUR Workshop Proceedings, CEUR, Bozen-Bolzano, Italy, 2022. URL: https://ceur-ws.org/Vol-3256/#km4law3.

[18] S. Auriemma, M. Madeddu, M. Miliani, A. Bondielli, L. C. Passaro, A. Lenci, Bureauberto: adapting umberto to the italian bureaucratic language, in: Ital-IA, 2023. URL: https://api.semanticscholar.org/CorpusID:262088765.

[19] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: North American Chapter of the Association for Computational Linguistics, 2019. URL: https://api.semanticscholar.org/CorpusID:52967399.

[20] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models, arXiv preprint arXiv:2210.11416 (2022).

[21] S. Wadhwa, S. Amir, B. C. Wallace, Revisiting relation extraction in the era of large language models, Proceedings of the conference. Association for Computational Linguistics. Meeting 2023 (2023) 15566-15589. URL: https://api.semanticscholar.org/CorpusID:258564662.