<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Instruct Large Language Models for Public Administration Document Information Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Salvatore Carta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Giuliani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Manolo Manca</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Piano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessia Pisu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandro Gabriele Tiddia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Cagliari</institution>
          ,
          <addr-line>via Ospedale 72, Cagliari, 09124</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>3536</volume>
      <fpage>6</fpage>
      <lpage>8</lpage>
      <abstract>
        <p>With the rapid digitization of institutions, there is an ever-increasing problem of effectively organizing and accessing information. Public Administrations (PAs) manage large volumes of disparate data from a variety of sources. Thus, these organizations would greatly benefit from AI, particularly Natural Language Processing solutions that help organize, structure, and search for information effectively. In the context of Italian PA, which we address in this paper, there are two main challenges: the lack of ontologies and the limited tools available for Italian information extraction. In this paper, we attempt to advance Information Extraction for Italian PAs by instructing a Large Language Model on a set of automatically labeled triplets of public tenders.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Public Administration</kwd>
        <kwd>Tenders</kwd>
        <kwd>Italian Open Information Extraction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The pervasive impact of Information and Communication Technologies (ICT) on our society over the past two decades is undeniable. This technological revolution has permeated every aspect of society. Such a revolution has also affected Public Administrations (PAs), radically transforming how these entities operate and interact with citizens. Digital technologies have enabled PAs to streamline processes, improve service access, and increase transparency. However, along with these opportunities, significant challenges also arise in terms of data management and internal organization. Public administrations handle vast amounts of sensitive and often disparate data from various sources. Lack of data standardization, information security, and citizen privacy are crucial issues to be addressed. In addition, data fragmentation among different systems and departments can inhibit effective information sharing and analysis. For the aforementioned reasons, PAs would benefit from technology solutions based on Machine Learning and, in particular, Natural Language Processing (NLP) to improve the organization of such fragmented information.</p>
      <p>However, there are two major challenges. The first is the lack of appropriate resources to adequately organize PA documents. Indeed, it is crucial to organize, access, understand, and utilize information with proper structures, such as knowledge graphs or ontologies, which represent a powerful solution in many domains, e.g., in online news platforms [1], health and life sciences [2], or cultural heritage [3]. In this context, Open Information Extraction (OIE) [4] represents a unique solution to structure and organize PA information. OIE systems usually adopt a domain-agnostic method and can extract entities and relationship triples (the main components of knowledge graphs) from any sentence written in natural language.</p>
      <p>The second challenge is that a predominant part of the research conducted on OIE is oriented toward the English language. While advancements in OIE have been notable, they often fail to encompass the complexities inherent in non-English languages. This linguistic bias significantly hinders the widespread applicability and effectiveness of OIE systems in multilingual contexts.</p>
      <p>In this paper, we aim to advance the research on Open Information Extraction applied to PA by testing and exploiting the potential offered by Large Language Models (LLMs). In particular, a proper LLM is instructed with an effective strategy, employing proper Italian PA data.</p>
      <p>The rest of the paper is structured as follows: Section 2 gives an overview of the state of the art; our methodology is detailed in Section 3, whereas the experiments are described in Section 4. Section 5 reports and discusses the results, and Section 6 ends the paper with the conclusions.</p>
    </sec>
    <sec id="sec-1a">
      <title>2. Related Work</title>
      <p>OIE methods aim to identify linguistic extraction patterns, either hand-crafted or automatically learned from the data [5]. Therefore, they are subdivided into rule-based or neural methods. The former include ClausIE [6], an OIE framework based on dependency parsing to detect clauses in an input sentence and subsequently extract propositions. REVERB [7] extracts tuples by isolating relation phrases that satisfy syntactic and lexical constraints. Similarly, TEXTRUNNER [8] first identifies a pair of noun phrases that are not too far apart, and then it applies a classifier to determine whether or not to extract a relationship. Further works rely on a proper strategy for combining different OIE tools for triplet generation and filtering [9]. A pioneering proposal regarding the more recent neural methods is the work of Stanovsky et al. [10], wherein OIE is treated as a sequence labeling problem, and an LSTM-transducer automatically extracts triplets. Zhan and Zhao [11] introduced a span model for n-ary Open Information Extraction. More recently, Kolluru et al. [12] introduced IMOJIE, a neural Open Information Extraction system that follows an iterative approach where the triplet extraction is conditioned on the previously retrieved triplets, with the aim of reducing redundancy.</p>
      <p>The methods above have been developed or tested specifically for English textual corpora. Regarding the Italian language, no significant research was conducted on Italian Open IE until the last decade. To date, only a few works have addressed such a challenge. Damiano et al. proposed ItalIE [13], a clause-based OIE system inspired by ClausIE, aimed at extracting n-ary coherent propositions from simple sentences. Sentences are analyzed to identify and categorize clauses based on seven predefined patterns specific to the Italian language. Guarasci et al. [14] presented an OIE method for Italian single-verb sentences based on Lexicon-Grammar tables. The system employs linguistic structures and patterns of verbal behavior to identify arguments, match patterns, and generate propositions, demonstrating effectiveness in generating syntactically and semantically valid propositions for the Italian language. Finally, [15] proposed OIE4PA, an Open IE framework that can identify facts from Public Administration documents. Leveraging the proposal of Siciliani et al. [15], in this work we propose an Instructed Large Language Model for Italian Open Information Extraction specialized in Public Administration documents.</p>
    </sec>
    <sec id="sec-1b">
      <title>3. Methodology</title>
      <p>We propose a novel model for automated Information Extraction for Italian PAs by instructing an LLM on a set of automatically labeled triplets of public tenders. To this end, we devise a proper strategy to train an LLM with a suitable set of triplets and instructions. The entire process is depicted in Figure 1. Our method involves two stages. In detail, the process first performs a step aimed at obtaining a correctly annotated set of triplets (Triplet Auto-Labeling), which is subsequently used to train the LLM (Instruction Tuning). Each step is described in the following.</p>
      <sec id="sec-1b-1">
        <title>3.1. Triplet Auto-Labeling</title>
        <p>The first step of our methodology is training a Sequence Classifier Language Model to identify meaningful triplets within the PA context. To accomplish this, we leveraged the dataset OIE4PA, consisting of a collection of triplets extracted from Italian tenders of the Apulia region [15]. In particular, each triplet is extracted with the WikiOIE framework [16]. Specifically, the dataset is organized into two sets: a labeled set ℒ, which contains a subset of 2000 binary triplets labeled by humans as valid or not, and an unlabeled set 𝒰 of 14,096 triplets, together with the original sentences. Then, at this stage, we exploited the ℒ set to properly train a classifier to distinguish between valid and invalid triplets. To do this, we treated the task as a sentence classification problem, concatenating triplets into a single sentence and separating subject, predicate, and object by a semicolon. To this end, we identified three suitable Language Models (LMs) for this task, namely Italian-bert, LegalBert [17], and BureauBERTo [18]. The former is a Bert base model [19] fine-tuned on an Italian corpus, the second is a version of Italian Bert fine-tuned on Italian civil law corpora, and the last is an UmBERTo model fine-tuned on PA, banking, and insurance corpora. Table 1 outlines the results obtained by these three Language Models on the triplet classification task. Finally, the most accurate trained classifier has been employed to label the triplets of the 𝒰 set, forming a new ℒ (Auto-Labeled) set, which in turn will be exploited to instruct the Large Language Model for the OIE task.</p>
      </sec>
      <sec id="sec-1b-2">
        <title>3.2. Instruction Tuning</title>
        <p>Instruction tuning is a strategy that involves guiding a language model through human-like instructions to improve its performance on a specific task. Unlike traditional methods that rely solely on large-scale training data, instruction tuning provides targeted guidance, allowing the model to adapt and refine its behaviour toward desired outcomes. Incorporating human-like instructions enhances the model’s understanding and improves its ability to generate contextually relevant responses. In summary, given a source text and task-specific instructions, the model is trained to create a sequence of tokens representing the desired output.</p>
        <p>To instruct an LLM to perform Open Information Extraction, we transformed the ℒ (Auto-Labeled) triplet set into an instruction dataset. In particular, each auto-labeled triplet</p>
      </sec>
    </sec>
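    <p>The construction of the instruction dataset can be sketched as follows. The record layout and helper names are our own illustration: the task instruction is the Italian prompt reported in the paper, and the response is left empty for invalid triplets.</p>
    <preformat>
```python
# Hypothetical sketch of building one instruction-tuning record from an
# auto-labeled triplet, following the Task Instruction / Input Text /
# Response template (field names are illustrative, not the paper's format).

INSTRUCTION = ("Trova quali triple semantiche esistono nel testo. "
               "Formatta l'output come [Soggetto;Predicato;Oggetto]")

def to_instruction_record(sentence, triplet, is_valid):
    """Pair the fixed task instruction with a source sentence; the
    response is the bracketed triplet when valid, else an empty string."""
    response = "[{};{};{}]".format(*triplet) if is_valid else ""
    return {"instruction": INSTRUCTION, "input": sentence, "response": response}

record = to_instruction_record(
    "Il bando prevede una scadenza.",
    ("il bando", "prevede", "una scadenza"),
    True,
)
print(record["response"])  # [il bando;prevede;una scadenza]
```
    </preformat>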
    <sec id="sec-2">
      <title>4. Experimental settings</title>
      <p>is used to train the instruction model following the template: Task Instruction, Input Text, and Response.</p>
      <sec id="sec-2-1">
        <title>3.2.1. Task instruction</title>
        <p>Task instructions provide a detailed statement on accomplishing the desired task and properly structuring the output. In detail, we formulated the following instruction to query the LLM: &lt;Trova quali triple semantiche esistono nel testo. Formatta l’output come [Soggetto;Predicato;Oggetto]&gt;. We formulate the instruction in Italian to make the model immediately understand that we are referring to the Italian language. The English translation of the instruction is: “Find which semantic triples exist in the text. Format the output as [Subject;Predicate;Object]”.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2.2. Input text</title>
        <p>The input text represents the sentences on which the LLM has to perform the task defined by the instructions. In detail, each sentence is the original text excerpt from which a triplet belonging to the dataset OIE4PA has been extracted.</p>
      </sec>
      <sec id="sec-2-3">
        <title>3.2.3. Response</title>
        <p>The response represents the desired output. In our case, the input sentence is transformed into an open triplet. We also specify that, to instruct the model to distinguish sentences where a triplet can be extracted from sentences where no useful triplets exist, we included the triplet as a response if it was labeled as valid by the classifier; otherwise, we leave an empty string.</p>
      </sec>
      <p>We adopted the Flan-T5 family [20] as the instruction model. Such a choice is motivated by two reasons: first, prior research [21] has demonstrated the potential of such models in Information Extraction tasks, eventually outperforming larger models such as LLama2 or similar, resulting in a perfect trade-off between speed of inference and prediction quality. The other main benefit is that Flan-T5 is a multi-language model, which is also suitable for tasks related to understanding Italian. We tested two different Flan-T5 sizes, flan-xxl (11b) and flan-xl (3b), adopting for both the OIE4PA dataset, relying on a split of 80% and 20% for training and test, respectively.</p>
      <p>For efficiency and hardware reasons, we fine-tuned the models by exploiting QLora with a 4-bit quantization, allowing faster training and saving GPU memory. All experiments were conducted on an Nvidia RTX A6000 GPU machine with 48 GB of VRAM. We train both models for one epoch, and we adopt the following QLora settings and hyperparameters: Lora-rank, Lora-alpha, Lora-dropout, learning rate, and batch size.</p>
      <sec id="sec-2-4">
        <title>4.1. Evaluation Metrics</title>
        <p>To properly apply such metrics for the triplet evaluation, we considered as true positive (TP) a non-empty triplet that matches the corresponding triplet in the ground truth (i.e., the triples belonging to the ℒ set), true negative (TN) a triplet returned as an empty string by the model and labeled as invalid in the ground truth, false positive (FP) a triplet that was labeled as invalid but retrieved by the model, and false negative (FN) when the model returned an empty string rather than a valid triplet.</p>
      </sec>
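      <p>The TP/TN/FP/FN counting described above can be sketched as follows. This is an illustrative scorer, not the authors’ evaluation code, in which an empty string stands for “no valid triplet” and a mismatched non-empty prediction is counted as a false negative.</p>
      <preformat>
```python
# Illustrative confusion-matrix scorer for triplet extraction, where ""
# means "no triplet": exact match = TP, both empty = TN, spurious
# non-empty output = FP, and a missed or wrong triplet = FN.

def score(predictions, gold):
    tp = tn = fp = fn = 0
    for pred, ref in zip(predictions, gold):
        if pred and pred == ref:
            tp += 1          # non-empty prediction matching the ground truth
        elif not pred and not ref:
            tn += 1          # model correctly returned an empty string
        elif pred and not ref:
            fp += 1          # triplet retrieved although labeled invalid
        else:
            fn += 1          # empty or wrong output for a valid triplet
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / max(len(gold), 1)
    return {"a": accuracy, "p": precision, "r": recall, "f1": f1}

metrics = score(["[a;b;c]", "", "[x;y;z]"], ["[a;b;c]", "", ""])
print(metrics["p"], metrics["r"])  # 0.5 1.0
```
      </preformat>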
      <p>In doing so, we can evaluate the performances in terms of classical confusion matrix metrics, i.e., accuracy (a), precision (p), recall (r), and F1 score (F1), whose formulae are: a = (TP + TN) / (TP + TN + FP + FN), p = TP / (TP + FP), r = TP / (TP + FN), and F1 = 2 · p · r / (p + r).</p>
      <sec id="sec-2-5">
        <title>5. Results</title>
        <p>Table 1 reports the comparison of three different Italian Bert models for the triplet classification task. In detail, the selected models are LegalBERT-ITA, BertBase-ITA, and BureauBERTo. The best model turns out to be BureauBERTo, probably due to the fact that it is the only model pre-trained on Public Administration corpora.</p>
      </sec>
      <sec id="sec-2-6">
        <title>6. Conclusions</title>
        <p>Considering the significant gap between the information extraction resources available for English and those for resource-constrained languages such as Italian, in this paper we explored an Instruction Tuning approach to perform Open Information Extraction on Italian public tenders. A proper LLM is instructed with an effective two-stage strategy, in which a language-model-based classifier is trained on a proper Italian PA dataset to obtain a set of correct triplets, which are then used to instruct a suitable LLM. The promising experiments have validated the assumptions pointed out in the paper and incentivize future developments aimed at creating new datasets and models capable of understanding and structuring technical texts in Italian in the form of semantic triplets.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>This work has been partially carried out thanks to the Ministerial Decree no. 351 of 9th April 2022, based on the NRRP, funded by the European Union - NextGenerationEU - Mission 4 “Education and Research”, Component 1 “Enhancement of the offer of educational services: from nurseries to universities” - Investment 4.1, which provided financial support for Leonardo Piano’s doctoral pathway.</p>
      <p>Also, Alessia Pisu acknowledges MUR and EU-FSE for financial support of the PON Research and Innovation 2014-2020 (D.M. 1061/2021).</p>
      <p>Furthermore, we acknowledge financial support under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.5 - Call for tender No. 3277 published on December 30, 2021 by the Italian Ministry of University and Research (MUR), funded by the European Union – NextGenerationEU. Project Code ECS0000038 – Project Title eINS Ecosystem of Innovation for Next Generation Sardinia – CUP F53C22000430001 - Grant Assignment Decree No. 1056 adopted on June 23, 2022 by the Italian Ministry of University and Research (MUR).</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>Bert triplet classification results in terms of accuracy (a), precision (p), recall (r), and F1 score (F1).</p>
        </caption>
        <table>
          <thead>
            <tr><th>Model</th><th>a</th><th>p</th><th>r</th><th>F1</th></tr>
          </thead>
          <tbody>
            <tr><td>LegalBERT-ITA</td><td>0.935</td><td>0.953</td><td>0.897</td><td>0.919</td></tr>
            <tr><td>BertBase-ITA</td><td>0.927</td><td>0.935</td><td>0.894</td><td>0.911</td></tr>
            <tr><td>BureauBERTo</td><td>0.945</td><td>0.963</td><td>0.901</td><td>0.932</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Table 2 outlines the results of the two fine-tuned Flan-T5 models on extracting triplets from procurement texts. Both model sizes show excellent results for all metrics; in particular, recall is significantly high, demonstrating that the models are quite effective in finding a large number of true positives (i.e., valid triplets). It is also worth noting that the values are higher for the model with the higher number of parameters. Therefore, the promising results support the thesis of leveraging Instruction Tuning to build strong Open Information Extraction models for Italian public administrations. To this end, we plan to create new datasets in the future to develop a new set of foundational models for information extraction in Italian, with a particular focus on PAs and other administrative entities.</p>
      <p>Model checkpoints: LegalBERT-ITA (https://huggingface.co/dlicari/Italian-Legal-BERT), BertBase-ITA (https://huggingface.co/dbmdz/bert-base-italian-uncased), and BureauBERTo (https://huggingface.co/colinglab/BureauBERTo).</p>
      <p>[18] … UmBERTo to the Italian bureaucratic language, in: Ital-IA, 2023. URL: https://api.semanticscholar.org/CorpusID:262088765.</p>
      <p>[19] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: North American Chapter of the Association for Computational Linguistics, 2019. URL: https://api.semanticscholar.org/CorpusID:52967399.</p>
      <p>[20] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models, arXiv preprint arXiv:2210.11416 (2022).</p>
      <p>[21] S. Wadhwa, S. Amir, B. C. Wallace, Revisiting relation extraction in the era of large language models, in: Proceedings of the Association for Computational Linguistics Meeting, 2023, pp. 15566–15589. URL: https://api.semanticscholar.org/CorpusID:258564662.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>