Instruct Large Language Models for Public Administration Document Information Extraction

Salvatore Carta, Alessandro Giuliani*, Marco Manolo Manca, Leonardo Piano*, Alessia Pisu and Sandro Gabriele Tiddia

Department of Mathematics and Computer Science, University of Cagliari, via Ospedale 72, Cagliari, 09124, Italy

Abstract
With the rapid digitization of institutions, effectively organizing and accessing information is an ever-growing problem. Public Administrations (PAs) manage large volumes of disparate data from a variety of sources. These organizations would therefore greatly benefit from AI, particularly Natural Language Processing solutions that help organize, structure, and search for information effectively. In the context of Italian PA, which we address in this paper, there are two main challenges: the lack of ontologies and the limited tools available for Italian information extraction. In this paper, we attempt to advance Information Extraction for Italian PAs by instructing a Large Language Model on a set of automatically labeled triplets of public tenders.

Keywords
Large Language Models, Public Administration, Tenders, Italian Open Information Extraction

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding authors.
salvatore@unica.it (S. Carta); alessandro.giuliani@unica.it (A. Giuliani); marcom.manca@unica.it (M. M. Manca); leonardo.piano@unica.it (L. Piano); alessia.pisu96@unica.it (A. Pisu); sandrog.tiddia@unica.it (S. G. Tiddia)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

The pervasive impact of Information and Communication Technologies (ICT) on our society over the past two decades is undeniable. This technological revolution has permeated every aspect of society, including Public Administrations (PAs), and has radically transformed how these entities operate and interact with citizens. Digital technologies have enabled PAs to streamline processes, improve service access, and increase transparency. However, along with these opportunities, significant challenges also arise in terms of data management and internal organization. Public administrations handle vast amounts of sensitive and often disparate data from various sources. Data standardization, information security, and citizen privacy are crucial issues to be addressed. In addition, data fragmentation among different systems and departments can inhibit effective information sharing and analysis. For these reasons, PAs would benefit from technology solutions based on Machine Learning and, in particular, Natural Language Processing (NLP) to improve the organization of such fragmented information.

However, there are two major challenges. The first is the lack of appropriate resources to adequately organize PA documents. Indeed, it is crucial to organize, access, understand, and utilize information with proper structures, such as knowledge graphs or ontologies, which represent a powerful solution in many domains, e.g., in online news platforms [1], health and life sciences [2], or cultural heritage [3]. In this context, Open Information Extraction (OIE) [4] represents a natural solution for structuring and organizing PA information. OIE systems usually adopt a domain-agnostic method and can extract entities and relationship triples (the main components of knowledge graphs) from any sentence written in natural language.

The second challenge is that most of the research conducted on OIE is oriented toward the English language. While advancements in OIE have been notable, they often fail to encompass the complexities inherent in non-English languages. This linguistic bias significantly hinders the widespread applicability and effectiveness of OIE systems in multilingual contexts.

In this paper, we aim to advance the research on Open Information Extraction applied to PA by testing and exploiting the potential offered by Large Language Models (LLMs). In particular, an LLM is instructed with a two-stage strategy that employs Italian PA data.

The rest of the paper is structured as follows: Section 2 gives an overview of the state of the art; our methodology is detailed in Section 3, whereas the experiments are described in Section 4. Section 5 reports and discusses the results, and Section 6 ends the paper with the conclusions.

2. Related Works

The advent of Open Information Extraction (OIE) made it possible to transcend the domain-specific constraints inherent in conventional IE methodologies. OIE methods aim to identify linguistic extraction patterns, either hand-crafted or automatically learned from the data [5]. Therefore, they are subdivided into rule-based and neural methods. The former include ClausIE [6], an OIE framework based on dependency parsing that detects clauses in an input sentence and subsequently extracts propositions. REVERB [7] extracts tuples by isolating relation phrases that satisfy syntactic and lexical constraints. Similarly, TEXTRUNNER [8] first identifies a pair of noun phrases that are not too far apart, and then applies a classifier to determine whether or not to extract a relationship. Further works rely on a strategy for combining different OIE tools for triplet generation and filtering [9]. A pioneering proposal among the more recent neural methods is the work of Stanovsky et al. [10], wherein OIE is treated as a sequence labeling problem, and an LSTM-transducer automatically extracts triplets. Zhan and Zhao [11] introduced a span model for n-ary Open Information Extraction. More recently, Kolluru et al. [12] introduced IMOJIE, a neural Open Information Extraction system that follows an iterative approach in which triplet extraction is conditioned on the previously retrieved triplets, with the aim of reducing redundancy.

The methods above have been developed or tested specifically on English text corpora. Regarding the Italian language, no significant research was conducted on Italian Open IE until the last decade. To date, only a few works have addressed this challenge. Damiano et al. proposed ItalIE [13], a clause-based OIE system inspired by ClausIE and aimed at extracting n-ary coherent propositions from simple sentences. Sentences are analyzed to identify and categorize clauses based on seven predefined patterns specific to the Italian language. Guarasci et al. [14] presented an OIE method for Italian single-verb sentences based on Lexicon-Grammar tables. The system employs linguistic structures and patterns of verbal behavior to identify arguments, match patterns, and generate propositions, demonstrating effectiveness in generating syntactically and semantically valid propositions for the Italian language. Finally, Siciliani et al. [15] proposed OIE4PA, an Open IE framework that can identify facts from Public Administration documents. Leveraging the proposal of Siciliani et al. [15], in this work we propose an instructed Large Language Model for Italian Open Information Extraction specialized in Public Administration documents.

3. Methodology

We propose a novel model for automated Information Extraction for Italian PAs by instructing an LLM on a set of automatically labeled triplets of public tenders. To this end, we devise a strategy to train an LLM with a suitable set of triplets and instructions. The entire process is depicted in Figure 1.

Figure 1: Instructed model training.

Our method involves two stages. The process first performs a step aimed at obtaining a correctly annotated set of triplets (Triplet Auto-Labeling), which is subsequently used to train the LLM (Instruction Tuning). Each step is described in the following.

3.1. Triplet Auto-Labeling

The first step of our methodology is training a Sequence Classifier Language Model to identify meaningful triplets within the PA context. To accomplish this, we leveraged the OIE4PA dataset, a collection of triplets extracted from Italian tenders of the Apulia region [15]. Each triplet is extracted with the WikiOIE framework [16]. The dataset is organized into two sets: a labeled set $\mathcal{L}$, which contains 2,000 binary triplets labeled by humans as valid or not, and an unlabeled set $\mathcal{U}$ of 14,096 triplets, together with the original sentences. At this stage, we exploited the $\mathcal{L}$ set to train a classifier to distinguish between valid and invalid triplets. To do this, we treated the task as a sentence classification problem, concatenating each triplet into a single sentence and separating subject, predicate, and object by a semicolon. We identified three suitable Language Models (LMs) for this task, namely Italian-BERT, LegalBERT [17], and BureauBERTo [18]. The first is a BERT base model [19] fine-tuned on an Italian corpus, the second is a fine-tuned version of Italian BERT on Italian civil law corpora, and the last is an UmBERTo model fine-tuned on PA, banking, and insurance corpora. Table 1 outlines the results obtained by these three Language Models on the triplet classification task. Finally, the most accurate trained classifier was employed to label the triplets of the $\mathcal{U}$ set, forming a new $\mathcal{AL}$ (Auto-Labeled) set, which in turn is exploited to instruct the Large Language Model for the OIE task.

3.2. Instruction Tuning

Instruction tuning is a strategy that involves guiding a language model through human-like instructions to improve its performance on a specific task. Unlike traditional methods that rely solely on large-scale training data, instruction tuning provides targeted guidance, allowing the model to adapt and refine its behaviour toward desired outcomes. Incorporating human-like instructions enhances the model's understanding and improves its ability to generate contextually relevant responses. In summary, given a source text and task-specific instructions, the model is trained to create a sequence of tokens representing the desired output.

To instruct an LLM to perform Open Information Extraction, we transformed the $\mathcal{AL}$ triplet set into an instruction dataset. In particular, each auto-labeled triplet is used to train the instruction model following the template: Task Instruction, Input Text, and Response.

3.2.1. Task instruction

Task instructions provide a detailed statement on how to accomplish the desired task and how to structure the output. We formulated the following instruction to query the LLM:

[...]

We formulate the instruction in Italian to make the model immediately understand that we are referring to the Italian language. The English translation of the instruction is: "Find which semantic triples exist in the text, Format output as [Subject; Predicate; Object]".

3.2.2. Input text

The input text is the sentence on which the LLM has to perform the task defined by the instruction. Each sentence is the original text excerpt from which a triplet belonging to the OIE4PA dataset has been extracted.

3.2.3. Response

The response represents the desired output. In our case, it is the open triplet extracted from the input sentence. To instruct the model to distinguish sentences from which a triplet can be extracted from sentences where no useful triplet exists, we included the triplet as a response only if it was labeled as valid by the classifier; otherwise, we leave an empty string.

4. Experimental settings

We adopted the Flan-T5 family [20] as the instruction model. This choice is motivated by two reasons. First, prior research [21] has demonstrated the potential of such models in Information Extraction tasks, where they can even outperform larger models such as Llama 2, resulting in a favorable trade-off between inference speed and prediction quality. Second, Flan-T5 is a multi-language model, which is also suitable [...] (3B) [...]. For both models we adopt the OIE4PA dataset, relying on a split of 80% and 20% for training and test, respectively. For efficiency and hardware reasons, we fine-tuned the models with QLoRA¹ using 4-bit quantization, which allows faster training and saves GPU memory. All the experiments were conducted on a machine with an NVIDIA RTX A6000 GPU with 48 GB of VRAM. We train both models for one epoch, and we adopt the following QLoRA settings and hyperparameters:

LoRA rank        16
LoRA alpha       32
LoRA dropout     0.05
Learning rate    0.003
Batch size       8

¹ https://github.com/artidoro/qlora

4.1. Evaluation Metrics

We evaluate the models with classical confusion-matrix metrics. To apply such metrics to triplet evaluation, we consider as a true positive (TP) a non-empty triplet that matches the corresponding triplet in the ground truth (i.e., the triplets belonging to the $\mathcal{AL}$ set), as a true negative (TN) a triplet returned as an empty string by the model and labeled as invalid in the ground truth, as a false positive (FP) a triplet that was labeled as invalid but retrieved by the model, and as a false negative (FN) a case in which the model returned an empty string rather than a valid triplet.

In doing so, we can evaluate the performance in terms of accuracy ($a$), precision ($p$), recall ($r$), and F1 score ($F_1$), whose formulae are:

\[ a = \frac{TP + TN}{TP + TN + FP + FN} \]

\[ p = \frac{TP}{TP + FP} \]

\[ r = \frac{TP}{TP + FN} \]

\[ F_1 = \frac{2 \cdot p \cdot r}{p + r} \]

5. Results

Table 1 reports the comparison of three different Italian BERT models on the triplet classification task. The selected models are LegalBERT-ITA², BertBase-ITA³, and BureauBERTo⁴. BureauBERTo is the best model on all four metrics, probably because it is the only model pre-trained on Public Administration corpora.

² https://huggingface.co/dlicari/Italian-Legal-BERT
³ https://huggingface.co/dbmdz/bert-base-italian-uncased
⁴ https://huggingface.co/colinglab/BureauBERTo

Table 1
BERT triplet classification results in terms of accuracy ($a$), precision ($p$), recall ($r$), and F1 score ($F_1$).

Model            a       p       r       F1
LegalBERT-ITA    0.935   0.953   0.897   0.919
BertBase-ITA     0.927   0.935   0.894   0.911
BureauBERTo      0.945   0.963   0.901   0.932

Table 2 outlines the results of the two fine-tuned Flan-T5 models on extracting triplets from procurement texts. Both model sizes reach an F1 score of at least 0.84; in particular, recall is very high (0.97 and 0.99), showing that the models rarely miss a valid triplet. All four metrics are also higher for the model with the larger number of parameters. Therefore, the promising results support the thesis of leveraging Instruction Tuning to build strong Open Information Extraction models for Italian public administrations. To this end, we plan to create new datasets in the future to develop a new set of foundational models for information extraction in Italian, with a particular focus on PAs and other administrative entities.

Table 2
FLAN-OpenIE results on the OIE4PA dataset in terms of accuracy ($a$), precision ($p$), recall ($r$), and F1 score ($F_1$).

Model     a      p      r      F1
T5-xl     0.78   0.74   0.97   0.84
T5-xxl    0.82   0.78   0.99   0.87

6. Conclusions

Considering the significant gap between the information extraction resources available for English and those for resource-constrained languages such as Italian, in this paper we explored an Instruction Tuning approach to perform Open Information Extraction on Italian public tenders. An LLM is instructed with a two-stage strategy, in which a language model-based classifier is trained on an Italian PA dataset to obtain a set of correct triplets, which are then used to instruct the LLM. The promising experiments have validated the assumptions pointed out in the paper and encourage future work on new datasets and models capable of understanding and structuring technical texts in Italian in the form of semantic triplets.

Acknowledgments

This work has been partially carried out thanks to the Ministerial Decree no. 351 of 9th April 2022, based on the NRRP – funded by the European Union - NextGenerationEU - Mission 4 “Education and Research”, Component 1 “Enhancement of the offer of educational services: from nurseries to universities” - Investment 4.1, which provided financial support for Leonardo Piano's doctoral pathway.

Also, Alessia Pisu acknowledges MUR and EU-FSE for financial support of the PON Research and Innovation 2014-2020 (D.M. 1061/2021).

Furthermore, we acknowledge financial support under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.5 - Call for tender No. 3277 published on December 30, 2021 by the Italian Ministry of University and Research (MUR) funded by the European Union – NextGenerationEU. Project Code ECS0000038 – Project Title eINS Ecosystem of Innovation for Next Generation Sardinia – CUP F53C22000430001 - Grant Assignment Decree No. 1056 adopted on June 23, 2022 by the Italian Ministry of University and Research (MUR).

References

[1] C. Rudnik, T. Ehrhart, O. Ferret, D. Teyssou, R. Troncy, X. Tannier, Searching news articles using an event knowledge graph leveraged by wikidata, in: S. Amer-Yahia, M. Mahdian, A. Goel, G. Houben, K. Lerman, J. J. McAuley, R. Baeza-Yates, L. Zia (Eds.), Companion of The 2019 World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, ACM, 2019, pp. 1232–1239.
[2] P. Ernst, C. Meng, A. Siu, G. Weikum, Knowlife: A knowledge graph for health and life sciences, in: 2014 IEEE 30th International Conference on Data Engineering, 2014, pp. 1254–1257. doi:10.1109/ICDE.2014.6816754.
[3] S. Carta, G. Fenu, A. Giuliani, M. M. Manca, M. Marras, L. Piano, A. S. Podda, L. Pompianu, S. G. Tiddia, Empowering digital transformation in tourism through intelligent methods for representation and exploitation of cultural heritage knowledge, volume 3536, 2023, pp. 83–91. URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85177612618&partnerID=40&md5=7e8334f126d9385a733fbfb0d1674f19.
[4] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, O. Etzioni, Open information extraction from the web, in: Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI'07, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2007, pp. 2670–2676.
[5] C. Niklaus, M. Cetto, A. Freitas, S. Handschuh, A survey on open information extraction, in: E. M. Bender, L. Derczynski, P. Isabelle (Eds.), Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 3866–3878. URL: https://aclanthology.org/C18-1326.
[6] L. Del Corro, R. Gemulla, Clausie: clause-based open information extraction, in: Proceedings of the 22nd international conference on World Wide Web, 2013, pp. 355–366.
[7] A. Fader, S. Soderland, O. Etzioni, Identifying relations for open information extraction, in: Conference on Empirical Methods in Natural Language Processing, 2011.
[8] A. Yates, M. Banko, M. Broadhead, M. J. Cafarella, O. Etzioni, S. Soderland, Textrunner: Open information extraction on the web, in: North American Chapter of the Association for Computational Linguistics, 2007. URL: https://api.semanticscholar.org/CorpusID:1455080.
[9] S. Carta, P. Fariello, A. Giuliani, L. Piano, A. S. Podda, S. G. Tiddia, Sailgenie: Sailing expertise to knowledge graph through open information extraction, in: G. A. Tsihrintzis, C. Toro, S. A. Ríos, R. J. Howlett, L. C. Jain (Eds.), Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 27th International Conference KES-2023, Athens, Greece, 6-8 September 2023, volume 225 of Procedia Computer Science, Elsevier, 2023, pp. 2224–2233. URL: https://doi.org/10.1016/j.procs.2023.10.213. doi:10.1016/J.PROCS.2023.10.213.
[10] G. Stanovsky, J. Michael, L. Zettlemoyer, I. Dagan, Supervised open information extraction, in: North American Chapter of the Association for Computational Linguistics, 2018.
[11] J. Zhan, H. Zhao, Span model for open information extraction on accurate corpus, in: AAAI Conference on Artificial Intelligence, 2019. URL: https://api.semanticscholar.org/CorpusID:208138002.
[12] K. Kolluru, S. Aggarwal, V. Rathore, Mausam, S. Chakrabarti, Imojie: Iterative memory-based joint open information extraction, ArXiv abs/2005.08178 (2020). URL: https://api.semanticscholar.org/CorpusID:218674382.
[13] E. Damiano, A. Minutolo, M. Esposito, Open information extraction for italian sentences, in: 2018 32nd International Conference on Advanced Information Networking and Applications Workshops (WAINA), 2018, pp. 668–673. doi:10.1109/WAINA.2018.00165.
[14] R. Guarasci, E. Damiano, A. Minutolo, M. Esposito, G. De Pietro, Lexicon-grammar based open information extraction from natural language sentences in italian, Expert Systems with Applications 143 (2020) 112954. URL: https://www.sciencedirect.com/science/article/pii/S0957417419306724. doi:https://doi.org/10.1016/j.eswa.2019.112954.
[15] L. Siciliani, E. Ghizzota, P. Basile, P. Lops, Oie4pa: open information extraction for the public administration, Journal of Intelligent Information Systems (2023) 1–22.
[16] L. Siciliani, P. Cassotti, P. Basile, M. de Gemmis, P. Lops, G. Semeraro, A. Moro, Extracting relations from italian wikipedia using self-training (2021).
[17] D. Licari, G. Comandè, ITALIAN-LEGAL-BERT: A Pre-trained Transformer Language Model for Italian Law, in: D. Symeonidou, R. Yu, D. Ceolin, M. Poveda-Villalón, D. Audrito, L. D. Caro, F. Grasso, R. Nai, E. Sulis, F. J. Ekaputra, O. Kutz, N. Troquard (Eds.), Companion Proceedings of the 23rd International Conference on Knowledge Engineering and Knowledge Management, volume 3256 of CEUR Workshop Proceedings, CEUR, Bozen-Bolzano, Italy, 2022. URL: https://ceur-ws.org/Vol-3256/#km4law3, ISSN: 1613-0073.
[18] S. Auriemma, M. Madeddu, M. Miliani, A. Bondielli, L. C. Passaro, A. Lenci, Bureauberto: adapting umberto to the italian bureaucratic language, in: Ital-IA, 2023. URL: https://api.semanticscholar.org/CorpusID:262088765.
[19] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: North American Chapter of the Association for Computational Linguistics, 2019. URL: https://api.semanticscholar.org/CorpusID:52967399.
[20] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models, arXiv preprint arXiv:2210.11416 (2022).
[21] S. Wadhwa, S. Amir, B. C. Wallace, Revisiting relation extraction in the era of large language models, Proceedings of the conference. Association for Computational Linguistics. Meeting 2023 (2023) 15566–15589. URL: https://api.semanticscholar.org/CorpusID:258564662.
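As a supplement to Section 4.1, the mapping from model outputs to TP, TN, FP, and FN and the four resulting metrics can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation script: the names `Counts`, `normalize`, `count_outcomes`, and `scores` are hypothetical, the case-insensitive matching rule is an assumption, and a non-empty prediction that differs from a valid ground-truth triplet is counted here as a false positive because the paper does not specify how that case is handled.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int = 0
    tn: int = 0
    fp: int = 0
    fn: int = 0


def normalize(triplet: str) -> str:
    """Lowercase a '[s; p; o]' string and trim whitespace around each part.

    Assumption: the paper does not state how two triplets are matched.
    """
    inner = triplet.strip().strip("[]")
    parts = [" ".join(part.split()).lower() for part in inner.split(";")]
    return ";".join(parts)


def count_outcomes(predictions, references) -> Counts:
    """Count outcomes; an empty reference means 'no valid triplet'."""
    counts = Counts()
    for pred, ref in zip(predictions, references):
        pred_n, ref_n = normalize(pred), normalize(ref)
        if ref_n == "":
            if pred_n == "":
                counts.tn += 1  # empty output, triplet labeled invalid
            else:
                counts.fp += 1  # triplet returned although labeled invalid
        elif pred_n == "":
            counts.fn += 1      # empty output instead of a valid triplet
        elif pred_n == ref_n:
            counts.tp += 1      # non-empty output matching the ground truth
        else:
            counts.fp += 1      # assumption: wrong triplet counted as FP
    return counts


def scores(c: Counts) -> dict:
    """Accuracy, precision, recall and F1 as defined in Section 4.1."""
    total = c.tp + c.tn + c.fp + c.fn
    a = (c.tp + c.tn) / total if total else 0.0
    p = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    r = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return {"a": a, "p": p, "r": r, "F1": f1}


# Toy data, invented for illustration only.
predictions = ["[a; b; c]", "", "[x; y; z]", "", "[a; b; d]"]
references = ["[A; b; c]", "", "", "[q; r; s]", "[a; b; c]"]
print(scores(count_outcomes(predictions, references)))
```

Because the ground truth is the auto-labeled set, these scores measure agreement with the classifier's labels rather than with human annotation.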
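As a supplement to Section 3.2, the conversion of an auto-labeled triplet into a Task Instruction, Input Text, and Response record might look like the sketch below. Several points are assumptions rather than the authors' code: the record field names, the helper names `serialize_triplet`, `build_record`, and `to_prompt`, and the way instruction and input are joined into a prompt are hypothetical; the Italian instruction is not reproduced in the text, so its English translation stands in as a placeholder; the example sentence is invented.

```python
# Placeholder: the paper uses an Italian instruction; only its English
# translation is available in the text.
INSTRUCTION = (
    "Find which semantic triples exist in the text, "
    "Format output as [Subject; Predicate; Object]"
)


def serialize_triplet(subject: str, predicate: str, obj: str) -> str:
    """Render a triplet in the output format requested by the instruction."""
    return f"[{subject}; {predicate}; {obj}]"


def build_record(sentence: str, triplet: tuple, is_valid: bool,
                 instruction: str = INSTRUCTION) -> dict:
    """Build one instruction-tuning record (field names are hypothetical).

    The response is the serialized triplet when the classifier labeled it
    as valid, and an empty string otherwise, as described in Section 3.2.3.
    """
    response = serialize_triplet(*triplet) if is_valid else ""
    return {"instruction": instruction, "input": sentence, "response": response}


def to_prompt(record: dict) -> str:
    """Join instruction and input text (the joining format is an assumption)."""
    return f"{record['instruction']}\n\n{record['input']}"


# Invented example sentence and triplet, for illustration only.
sentence = "Il Comune pubblica il bando."
triplet = ("Il Comune", "pubblica", "il bando")

valid_record = build_record(sentence, triplet, is_valid=True)
invalid_record = build_record(sentence, triplet, is_valid=False)
print(to_prompt(valid_record))
print(repr(valid_record["response"]), repr(invalid_record["response"]))
```

Keeping invalid triplets as records with an empty response is what lets the model learn to return nothing for sentences without a useful triplet.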