Instruct Large Language Models for Public Administration Document Information Extraction

Salvatore Carta, Alessandro Giuliani*, Marco Manolo Manca, Leonardo Piano*, Alessia Pisu and Sandro Gabriele Tiddia

Department of Mathematics and Computer Science, University of Cagliari, via Ospedale 72, Cagliari, 09124, Italy

Abstract
With the rapid digitization of institutions, there is an ever-increasing problem of effectively organizing and accessing information. Public Administrations (PAs) manage large volumes of disparate data from a variety of sources. Thus, these organizations would greatly benefit from AI, particularly Natural Language Processing solutions that help organize, structure, and search for information effectively. In the context of the Italian PA, which we address in this paper, there are two main challenges: the lack of ontologies and the limited tools available for Italian information extraction. In this paper, we attempt to advance Information Extraction for Italian PAs by instructing a Large Language Model on a set of automatically labeled triplets of public tenders.

Keywords
Large Language Models, Public Administration, Tenders, Italian Open Information Extraction



Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding authors.
salvatore@unica.it (S. Carta); alessandro.giuliani@unica.it (A. Giuliani); marcom.manca@unica.it (M. M. Manca); leonardo.piano@unica.it (L. Piano); alessia.pisu96@unica.it (A. Pisu); sandrog.tiddia@unica.it (S. G. Tiddia)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The pervasive impact of Information and Communication Technologies (ICT) on our society over the past two decades is undeniable. This technological revolution has permeated every aspect of society. Such a revolution has also affected Public Administrations (PAs), radically transforming how these entities operate and interact with citizens. Digital technologies have enabled PAs to streamline processes, improve service access, and increase transparency. However, along with these opportunities, significant challenges also arise in terms of data management and internal organization. Public administrations handle vast amounts of sensitive and often disparate data from various sources. Lack of data standardization, information security, and citizen privacy are crucial issues to be addressed. In addition, data fragmentation among different systems and departments can inhibit effective information sharing and analysis. For the aforementioned reasons, PAs would benefit from technology solutions based on Machine Learning and, in particular, Natural Language Processing (NLP) to improve the organization of such fragmented information.

However, there are two major challenges. The first is the lack of appropriate resources to adequately organize PA documents. Indeed, it is crucial to organize, access, understand, and utilize information with proper structures, such as knowledge graphs or ontologies, which represent a powerful solution in many domains, e.g., in online news platforms [1], health and life sciences [2], or cultural heritage [3]. In this context, Open Information Extraction (OIE) [4] represents a unique solution to structure and organize PA information. OIE systems usually adopt a domain-agnostic method and can extract entities and relationship triples (the main components of knowledge graphs) from any sentence written in natural language.

The second challenge is that a predominant part of the research conducted on OIE is oriented toward the English language. While advancements in OIE have been notable, they often fail to encompass the complexities inherent in non-English languages. This linguistic bias significantly hinders the widespread applicability and effectiveness of OIE systems in multilingual contexts.

In this paper, we aim to advance the research on Open Information Extraction applied to PA by testing and exploiting the potential offered by Large Language Models (LLMs). In particular, a proper LLM is instructed with an effective strategy, employing proper Italian PA data.

The rest of the paper is structured as follows: Section 2 gives an overview of the state of the art; our methodology is detailed in Section 3, whereas the experiments are described in Section 4. Section 5 reports and discusses the results, and Section 6 ends the paper with the conclusions.

2. Related Works

The advent of Open Information Extraction (OIE) made it possible to transcend the domain-specific constraints inherent in conventional IE methodologies.




OIE methods aim to identify linguistic extraction patterns, either hand-crafted or automatically learned from the data [5]. Therefore, they are subdivided into rule-based or neural methods. The former include ClausIE [6], an OIE framework based on dependency parsing to detect clauses in an input sentence and subsequently extract propositions. REVERB [7] extracts tuples by isolating relation phrases that satisfy syntactic and lexical constraints. Similarly, TEXTRUNNER [8] first identifies a pair of noun phrases that are not too far apart, and then it applies a classifier to determine whether or not to extract a relationship. Further works rely on a proper strategy for combining different OIE tools for triplet generation and filtering [9]. A pioneering proposal regarding the more recent neural methods is the work of Stanovsky et al. [10], wherein OIE is treated as a sequence labeling problem and an LSTM transducer automatically extracts triplets. Zhan and Zhao [11] introduced a span model for n-ary Open Information Extraction. More recently, Kolluru et al. [12] introduced IMOJIE, a neural Open Information Extraction system that follows an iterative approach in which the triplet extraction is conditioned on the previously retrieved triplets, with the aim of reducing redundancy.

The methods above have been developed or tested specifically for English textual corpora. Regarding the Italian language, no significant research was conducted on Italian Open IE until the last decade. To date, only a few works have addressed such a challenge. Damiano et al. proposed ItalIE [13], a clause-based OIE system inspired by ClausIE aimed at extracting n-ary coherent propositions from simple sentences. Sentences are analyzed to identify and categorize clauses based on seven predefined patterns specific to the Italian language. Guarasci et al. [14] presented an OIE method for Italian single-verb sentences based on Lexicon-Grammar tables. The system employs linguistic structures and patterns of verbal behavior to identify arguments, match patterns, and generate propositions, demonstrating effectiveness in generating syntactically and semantically valid propositions for the Italian language. Finally, [15] proposed OIE4PA, an Open IE framework that can identify facts from Public Administration documents. Leveraging the proposal of Siciliani et al. [15], in this work we propose an Instructed Large Language Model for Italian Open Information Extraction specialized in Public Administration documents.

3. Methodology

We propose a novel model for automated Information Extraction for Italian PAs by instructing an LLM on a set of automatically labeled triplets of public tenders. To this end, we devise a proper strategy to train an LLM with a suitable set of triplets and instructions. The entire process is depicted in Figure 1.

Our method involves two stages. In detail, the process first performs a step aimed at obtaining a correctly annotated set of triplets (Triplet Auto-Labeling), which is subsequently used to train the LLM (Instruction Tuning). Each step is described in the following.

3.1. Triplet Auto-Labeling

The first step of our methodology is training a sequence-classification Language Model to identify meaningful triplets within the PA context. To accomplish this, we leveraged the OIE4PA dataset, consisting of a collection of triplets extracted from Italian tenders of the Apulia region [15]. In particular, each triplet is extracted with the WikiOIE framework [16]. Specifically, the dataset is organized into two sets: a labeled set ℒ, which contains a subset of 2000 binary triplets labeled by humans as valid or not, and an unlabeled set 𝒰 of 14,096 triplets, together with the original sentences. At this stage, we exploited the ℒ set to properly train a classifier to distinguish between valid and invalid triplets. To do this, we treated the task as a sentence classification problem, concatenating each triplet into a single sentence and separating subject, predicate, and object by a semicolon. To this end, we identified three suitable Language Models (LMs) for this task, namely Italian-bert, LegalBert [17], and BureauBERTo [18]. The former is a BERT base model [19] fine-tuned on an Italian corpus, the second is a fine-tuned version of Italian BERT on Italian civil law corpora, and the last is an UmBERTo model fine-tuned on PA, banking, and insurance corpora. Table 1 outlines the results obtained by these three Language Models on the triplet classification task. Finally, the most accurate trained classifier has been employed to label the triplets of the 𝒰 set, forming a new 𝒜ℒ (Auto-Labeled) set, which in turn will be exploited to instruct the Large Language Model for the OIE task.
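As a concrete illustration of this auto-labeling step, the following is a minimal sketch, not the authors' released code, of fine-tuning a BERT-style classifier on triplets concatenated as "subject; predicate; object" strings. The checkpoint name comes from the models cited above, while the CSV file name, column names, and training hyperparameters are our own assumptions.

```python
# Sketch of the triplet auto-labeling classifier (data layout and hyperparameters are placeholders).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "colinglab/BureauBERTo"  # best-performing checkpoint in Table 1

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical CSV with columns: subject, predicate, object, label (1 = valid, 0 = invalid).
dataset = load_dataset("csv", data_files={"train": "labeled_triplets.csv"})["train"]

def to_features(example):
    # Concatenate the triplet into one sentence, separating its parts with semicolons.
    text = f'{example["subject"]}; {example["predicate"]}; {example["object"]}'
    enc = tokenizer(text, truncation=True, padding="max_length", max_length=128)
    enc["labels"] = int(example["label"])
    return enc

encoded = dataset.map(to_features, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="triplet-classifier", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=encoded,
)
trainer.train()
```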
3.2. Instruction Tuning

Instruction tuning is a strategy that involves guiding a language model through human-like instructions to improve its performance on a specific task. Unlike traditional methods that rely solely on large-scale training data, instruction tuning provides targeted guidance, allowing the model to adapt and refine its behaviour toward desired outcomes. Incorporating human-like instructions enhances the model's understanding and improves its ability to generate contextually relevant responses. In summary, given a source text and task-specific instructions, the model is trained to generate a sequence of tokens representing the desired output.

To instruct an LLM to perform Open Information Extraction, we transformed the 𝒜ℒ triplet set into an instruction dataset.
Figure 1: Instructed model training.



In particular, each auto-labeled triplet is used to train the instruction model following the template: Task Instruction, Input Text, and Response.

3.2.1. Task instruction

Task instructions provide a detailed statement on how to accomplish the desired task and how to properly structure the output. In detail, we formulated an instruction in Italian to query the LLM; expressing it in Italian makes the model immediately understand that we are referring to the Italian language. The English translation of the instruction is: "Find which semantic triples exist in the text. Format the output as [Subject; Predicate; Object]".

3.2.2. Input text

The input text represents the sentence on which the LLM has to perform the task defined by the instructions. In detail, each sentence is the original text excerpt from which a triplet belonging to the OIE4PA dataset has been extracted.

3.2.3. Response

The response represents the desired output. In our case, the input sentence is transformed into an open triplet. To instruct the model to distinguish sentences where a triplet can be extracted from sentences where no useful triplets exist, we included the triplet as a response only if it was labeled as valid by the classifier; otherwise, we leave an empty string.
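To make the template concrete, the snippet below sketches, under our own assumptions about field names and data layout, how one instruction record could be assembled from an auto-labeled triplet; the instruction string is the English rendering of the Italian prompt described in Section 3.2.1, and the example sentence is invented.

```python
# Sketch of turning auto-labeled triplets into (instruction, input, response) records.
# Field names and the example sentence are hypothetical.
INSTRUCTION = ("Find which semantic triples exist in the text. "
               "Format the output as [Subject; Predicate; Object]")

def build_record(sentence, triplet, is_valid):
    """Build one training example; invalid triplets get an empty response so the
    model also learns when nothing should be extracted."""
    if is_valid and triplet is not None:
        subject, predicate, obj = triplet
        response = f"[{subject}; {predicate}; {obj}]"
    else:
        response = ""
    return {"instruction": INSTRUCTION, "input": sentence, "response": response}

# Hypothetical usage:
record = build_record(
    "Il Comune di Bari indice una gara per la fornitura di servizi informatici.",
    ("Il Comune di Bari", "indice", "una gara per la fornitura di servizi informatici"),
    is_valid=True,
)
# record["response"] == "[Il Comune di Bari; indice; una gara per la fornitura di servizi informatici]"
```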
4. Experimental settings

We adopted the Flan-T5 family [20] as an instruction model. Such a choice is motivated by two reasons: first, prior research [21] has demonstrated the potential of such models in Information Extraction tasks, even outperforming larger models such as LLaMA 2, and thus offering a favorable trade-off between inference speed and prediction quality. The other main benefit is that Flan-T5 is a multi-language model, which is also suitable for the Italian language. We fine-tuned two model sizes, Flan-T5-xl (3B parameters) and Flan-T5-xxl (11B parameters), adopting for both the OIE4PA dataset, relying on a split of 80% and 20% for training and test, respectively. For efficiency and hardware reasons, we fine-tuned the models by exploiting QLoRA (https://github.com/artidoro/qlora) with 4-bit quantization, allowing faster training and saving GPU memory. All experiments were conducted on an Nvidia RTX A6000 GPU machine with 48 GB of VRAM. We train both models for one epoch, and we adopt the following QLoRA settings and hyperparameters:

LoRA rank: 16
LoRA alpha: 32
LoRA dropout: 0.05
Learning rate: 0.003
Batch size: 8
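As a rough sketch of this setup, the snippet below shows how the reported QLoRA hyperparameters could be wired together with the Hugging Face transformers, peft, and bitsandbytes libraries; this is an assumption about the concrete tooling and target modules, not the authors' training script.

```python
# Sketch of the QLoRA setup with the hyperparameters reported above (assumed tooling).
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

BASE_MODEL = "google/flan-t5-xl"  # or "google/flan-t5-xxl"

# 4-bit quantization keeps the frozen base weights small enough for a single 48 GB GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter with the settings listed in Section 4; the reported learning rate (0.003)
# and batch size (8) would go into the trainer's arguments, not into this config.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],   # T5 attention projections (a common choice, assumed here)
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```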

4.1. Evaluation Metrics

To evaluate the extracted triplets, we considered as true positive (TP) a non-empty triplet that matches the corresponding triplet in the ground truth (i.e., the triples belonging to the 𝒜ℒ set), as true negative (TN) a triple returned as an empty string by the model and labeled as invalid in the ground truth, as false positive (FP) a triplet that was labeled as invalid but retrieved by the model, and as false negative (FN) a case in which the model returned an empty string rather than a valid triplet.
In doing so, we can evaluate the performance in terms of classical confusion-matrix metrics, i.e., accuracy (a), precision (p), recall (r), and F1 score (F1), whose formulae are:

a  = (TP + TN) / (TP + TN + FP + FN)
p  = TP / (TP + FP)
r  = TP / (TP + FN)
F1 = (2 * p * r) / (p + r)
5. Results

Table 1 reports the comparison of three different Italian BERT models on the triplet classification task. In detail, the selected models are LegalBERT-ITA (https://huggingface.co/dlicari/Italian-Legal-BERT), BertBase-ITA (https://huggingface.co/dbmdz/bert-base-italian-uncased), and BureauBERTo (https://huggingface.co/colinglab/BureauBERTo). The best model turns out to be BureauBERTo, probably because it is the only model pre-trained on Public Administration corpora.

Table 1
BERT triplet classification results in terms of accuracy (a), precision (p), recall (r), and F1 score (F1).

Model            a       p       r       F1
LegalBERT-ITA    0.935   0.953   0.897   0.919
BertBase-ITA     0.927   0.935   0.894   0.911
BureauBERTo      0.945   0.963   0.901   0.932

Table 2 outlines the results of the two fine-tuned Flan-T5 models on extracting triplets from procurement texts. Both model sizes show excellent results for all metrics; in particular, recall is significantly high, demonstrating that the models are quite effective in finding a large number of true positives (i.e., valid triplets). It is also worth noting that the values are higher for the model with the larger number of parameters. Therefore, the promising results support the thesis of leveraging Instruction Tuning to build strong Open Information Extraction models for Italian public administrations. To this end, we plan to create new datasets in the future to develop a new set of foundational models for information extraction in Italian, with a particular focus on PAs and other administrative entities.

Table 2
FLAN-OpenIE results on the OIE4PA dataset in terms of accuracy (a), precision (p), recall (r), and F1 score (F1).

Model     a      p      r      F1
T5-xl     0.78   0.74   0.97   0.84
T5-xxl    0.82   0.78   0.99   0.87
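As an illustration of how the instructed model can be queried on a new tender sentence, the sketch below assumes the fine-tuned checkpoint has been saved locally after merging the LoRA adapter into the base model; the checkpoint name, the prompt concatenation format, and the example sentence are hypothetical.

```python
# Sketch of querying the instructed model on a new sentence (illustrative only).
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CHECKPOINT = "flan-t5-openie-pa"  # placeholder for the fine-tuned (adapter-merged) checkpoint
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

instruction = ("Find which semantic triples exist in the text. "
               "Format the output as [Subject; Predicate; Object]")
# Hypothetical Italian tender sentence.
sentence = "L'aggiudicatario garantisce la copertura assicurativa per l'intera durata del contratto."

# The way instruction and input are concatenated is an assumption, not specified in the paper.
inputs = tokenizer(f"{instruction}\n{sentence}", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Expected output shape: "[Subject; Predicate; Object]", or an empty string when the
# sentence carries no valid triplet.
```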
6. Conclusions

Considering the significant gap between the information extraction resources available for English and those for resource-constrained languages such as Italian, in this paper we explored an Instruction Tuning approach to perform Open Information Extraction on Italian public tenders. A proper LLM is instructed with an effective two-stage strategy, in which a language-model-based classifier is trained on a proper Italian PA dataset to obtain a set of correct triplets, which are used to instruct a suitable LLM. The promising experiments have validated the assumptions pointed out in the paper and motivate future developments aimed at building new datasets and models capable of understanding and structuring technical texts in Italian in the form of semantic triplets.

Acknowledgments

This work has been partially carried out thanks to the Ministerial Decree no. 351 of 9th April 2022, based on the NRRP – funded by the European Union - NextGenerationEU - Mission 4 "Education and Research", Component 1 "Enhancement of the offer of educational services: from nurseries to universities" - Investment 4.1, which provided financial support for Leonardo Piano's doctoral pathway.

Also, Alessia Pisu acknowledges MUR and EU-FSE for financial support of the PON Research and Innovation 2014-2020 (D.M. 1061/2021).

Furthermore, we acknowledge financial support under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.5 - Call for tender No. 3277 published on December 30, 2021 by the Italian Ministry of University and Research (MUR), funded by the European Union – NextGenerationEU. Project Code ECS0000038 – Project Title eINS Ecosystem of Innovation for Next Generation Sardinia – CUP F53C22000430001 - Grant Assignment Decree No. 1056 adopted on June 23, 2022 by the Italian Ministry of University and Research (MUR).
References

[1] C. Rudnik, T. Ehrhart, O. Ferret, D. Teyssou, R. Troncy, X. Tannier, Searching news articles using an event knowledge graph leveraged by wikidata, in: S. Amer-Yahia, M. Mahdian, A. Goel, G. Houben, K. Lerman, J. J. McAuley, R. Baeza-Yates, L. Zia (Eds.), Companion of The 2019 World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, ACM, 2019, pp. 1232–1239.
[2] P. Ernst, C. Meng, A. Siu, G. Weikum, Knowlife: A knowledge graph for health and life sciences, in: 2014 IEEE 30th International Conference on Data Engineering, 2014, pp. 1254–1257. doi:10.1109/ICDE.2014.6816754.
[3] S. Carta, G. Fenu, A. Giuliani, M. M. Manca, M. Marras, L. Piano, A. S. Podda, L. Pompianu, S. G. Tiddia, Empowering digital transformation in tourism through intelligent methods for representation and exploitation of cultural heritage knowledge, volume 3536, 2023, pp. 83–91. URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85177612618&partnerID=40&md5=7e8334f126d9385a733fbfb0d1674f19.
[4] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, O. Etzioni, Open information extraction from the web, in: Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2007, pp. 2670–2676.
[5] C. Niklaus, M. Cetto, A. Freitas, S. Handschuh, A survey on open information extraction, in: E. M. Bender, L. Derczynski, P. Isabelle (Eds.), Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 3866–3878. URL: https://aclanthology.org/C18-1326.
[6] L. Del Corro, R. Gemulla, Clausie: clause-based open information extraction, in: Proceedings of the 22nd International Conference on World Wide Web, 2013, pp. 355–366.
[7] A. Fader, S. Soderland, O. Etzioni, Identifying relations for open information extraction, in: Conference on Empirical Methods in Natural Language Processing, 2011.
[8] A. Yates, M. Banko, M. Broadhead, M. J. Cafarella, O. Etzioni, S. Soderland, Textrunner: Open information extraction on the web, in: North American Chapter of the Association for Computational Linguistics, 2007. URL: https://api.semanticscholar.org/CorpusID:1455080.
[9] S. Carta, P. Fariello, A. Giuliani, L. Piano, A. S. Podda, S. G. Tiddia, Sailgenie: Sailing expertise to knowledge graph through open information extraction, in: G. A. Tsihrintzis, C. Toro, S. A. Ríos, R. J. Howlett, L. C. Jain (Eds.), Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 27th International Conference KES-2023, Athens, Greece, 6-8 September 2023, volume 225 of Procedia Computer Science, Elsevier, 2023, pp. 2224–2233. URL: https://doi.org/10.1016/j.procs.2023.10.213. doi:10.1016/J.PROCS.2023.10.213.
[10] G. Stanovsky, J. Michael, L. Zettlemoyer, I. Dagan, Supervised open information extraction, in: North American Chapter of the Association for Computational Linguistics, 2018.
[11] J. Zhan, H. Zhao, Span model for open information extraction on accurate corpus, in: AAAI Conference on Artificial Intelligence, 2019. URL: https://api.semanticscholar.org/CorpusID:208138002.
[12] K. Kolluru, S. Aggarwal, V. Rathore, Mausam, S. Chakrabarti, Imojie: Iterative memory-based joint open information extraction, ArXiv abs/2005.08178 (2020). URL: https://api.semanticscholar.org/CorpusID:218674382.
[13] E. Damiano, A. Minutolo, M. Esposito, Open information extraction for italian sentences, in: 2018 32nd International Conference on Advanced Information Networking and Applications Workshops (WAINA), 2018, pp. 668–673. doi:10.1109/WAINA.2018.00165.
[14] R. Guarasci, E. Damiano, A. Minutolo, M. Esposito, G. De Pietro, Lexicon-grammar based open information extraction from natural language sentences in italian, Expert Systems with Applications 143 (2020) 112954. URL: https://www.sciencedirect.com/science/article/pii/S0957417419306724. doi:10.1016/j.eswa.2019.112954.
[15] L. Siciliani, E. Ghizzota, P. Basile, P. Lops, Oie4pa: open information extraction for the public administration, Journal of Intelligent Information Systems (2023) 1–22.
[16] L. Siciliani, P. Cassotti, P. Basile, M. de Gemmis, P. Lops, G. Semeraro, A. Moro, Extracting relations from italian wikipedia using self-training (2021).
[17] D. Licari, G. Comandè, ITALIAN-LEGAL-BERT: A Pre-trained Transformer Language Model for Italian Law, in: D. Symeonidou, R. Yu, D. Ceolin, M. Poveda-Villalón, D. Audrito, L. D. Caro, F. Grasso, R. Nai, E. Sulis, F. J. Ekaputra, O. Kutz, N. Troquard (Eds.), Companion Proceedings of the 23rd International Conference on Knowledge Engineering and Knowledge Management, volume 3256 of CEUR Workshop Proceedings, CEUR, Bozen-Bolzano, Italy, 2022. URL: https://ceur-ws.org/Vol-3256/#km4law3. ISSN: 1613-0073.
[18] S. Auriemma, M. Madeddu, M. Miliani, A. Bondielli, L. C. Passaro, A. Lenci, Bureauberto: adapting
     umberto to the italian bureaucratic language, in:
     Ital-IA, 2023. URL: https://api.semanticscholar.org/
     CorpusID:262088765.
[19] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert:
     Pre-training of deep bidirectional transformers for
     language understanding, in: North American
     Chapter of the Association for Computational Lin-
     guistics, 2019. URL: https://api.semanticscholar.org/
     CorpusID:52967399.
[20] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay,
     W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma,
     et al., Scaling instruction-finetuned language mod-
     els, arXiv preprint arXiv:2210.11416 (2022).
[21] S. Wadhwa, S. Amir, B. C. Wallace, Revisiting re-
     lation extraction in the era of large language mod-
     els, Proceedings of the conference. Association
     for Computational Linguistics. Meeting 2023 (2023)
     15566–15589. URL: https://api.semanticscholar.org/
     CorpusID:258564662.