Evaluating Pre-Trained Transformers on Italian Administrative Texts

Serena Auriemma1, Martina Miliani1,2, Alessandro Bondielli3, Lucia C. Passaro3 and Alessandro Lenci1
1 Department of Philology, Literature, and Linguistics, University of Pisa
2 University for Foreigners of Siena
3 Department of Computer Science, University of Pisa

Abstract
In recent years, Transformer-based models have been widely used in NLP for various downstream tasks and in different domains. However, a language model explicitly built for the Italian administrative language is still lacking. Therefore, in this paper, we compare the performance of five different Transformer models, pre-trained on general-purpose texts, on two main tasks in the Italian administrative domain: Named Entity Recognition and multi-label document classification on Public Administration (PA) documents. We evaluate the performance of each model on both tasks to identify the best model in this particular domain. We also discuss the effect of model size and pre-training data on performance on domain data. Our evaluation identifies UmBERTo as the best-performing model, with an accuracy of 0.71 and an F1 score of 0.89 for multi-label document classification, and an F1 score of 0.87 for NER-PA.

Keywords
Natural Language Processing, Evaluation of Neural Language Models, Domain Language, Public Administration

AIxPA 2022: 1st Workshop on AI for Public Administration, December 2nd, 2022, Udine, IT
serena.auriemma@phd.unipi.it (S. Auriemma); m.miliani@studenti.unistrasi.it (M. Miliani); alessandro.bondielli@unipi.it (A. Bondielli); lucia.passaro@unipi.it (L. C. Passaro); alessandro.lenci@unipi.it (A. Lenci)
https://colinglab.humnet.unipi.it/people/miliani/ (M. Miliani); https://scholar.google.com/citations?user=zcXQk6YAAAAJ&hl=en (A. Bondielli); https://luciacpassaro.github.io/ (L. C. Passaro); https://people.unipi.it/alessandro_lenci/ (A. Lenci)
ORCID: 0000-0003-1124-9955 (M. Miliani); 0000-0001-5790-4308 (A. Lenci)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Today, language technologies are indispensable for facilitating interaction between citizens and the Public Administration (PA). Natural Language Processing (NLP) tools are a practical resource for managing the vast amount of data in the PA domain. They can be leveraged in many ways to extract the implicit information in administrative texts and transform it into structured data that are more easily accessible, manageable, shareable, and secure.

In recent years, pre-trained language models have gradually emerged as a new paradigm in Natural Language Processing. The main advantage of these models is that they are already trained on self-supervised tasks and can be fine-tuned for a variety of downstream tasks with a relatively small amount of data and training iterations. In addition, some of these models, such as BERT [1] and RoBERTa [2], have achieved significant improvements in several NLP tasks, and Transformers [3] lie at the core of many recent architectures [4]. Pre-trained language models are advantageous in contexts where labelled data are limited while large amounts of unlabelled data are available. This is the case for underrepresented languages or specific domains.
Indeed, several attempts have also been made to adapt these models to specific domains via additional pre-training on domain data, which improved their performance [5]. For example, BERT, one of the most widely used Transformer models, has several variants in different domains: MedBERT [6] and BioBERT [7] are further pre-trained on clinical and biomedical data, respectively, to solve medical-related tasks, such as biomedical NER or disease prediction; SciBERT [8] is fine-tuned on a suite of tasks including sequence tagging, sentence classification, and dependency parsing on data from a variety of scientific domains; Legal BERT [9] is pre-trained on English legislative documents, contracts, and trial records for legal text classification and sequence tagging.

Nevertheless, a language model specifically built for the Italian administrative domain is still lacking. Thus, we sought to explore the use of generic models pre-trained on general text data for a specialized domain, such as PA. Domain-agnostic pre-trained models are often used as a baseline against which to compare the performance of domain-adapted models [10, 7]; alternatively, their performance is usually compared with that of non-Transformer-based models, such as Conditional Random Fields (CRF), Convolutional Neural Networks (CNN), and Long Short-Term Memory networks (LSTM) [11, 12]. Very few approaches evaluate the performance of different Transformer models on domain data before choosing the one to adapt, and most of them concern the medical domain. For instance, Polignano et al. [13] created a hybrid NER model for analyzing textual medical diagnoses in Spanish, based on configurations of different Transformer-based models combined with a CRF, to identify the best-performing one for the task. For the same purpose, Khan et al. [14] carried out a comparison of nine Transformer-based models on a health mention classification task on English tweets.

Hence, we decided to compare the performance of different Transformer-based models on two main NLP downstream tasks, which we consider particularly congenial to the needs of PA: Named Entity Recognition and Document Classification.

Named Entity Recognition (NER) is the task of identifying and classifying entities mentioned in unstructured text into predefined categories, e.g., proper names of people, places, organizations, dates, etc. [15]. A NER system for PA, such as the one proposed in [16], includes additional classes that are particularly relevant to the administrative domain, such as references to other administrative documents, legislative references, and organizations explicitly related to PA. Identifying relevant entities in a document can facilitate intelligent access and search within documents by municipalities. Similarly, the document classification task associates each document with a label related to its content. It has extensive applications, including topic labeling, sentiment classification, and spam detection [17]. Since PA handles documents dealing with different aspects of the municipality, such as Environment, Urban Planning, and Public Education, automatically classifying documents according to their content can streamline the information retrieval process and simplify the management and protocolling of PA's document repository.

Transformer-based models have been shown to handle these two tasks with remarkable results [18, 19, 20, 21]. Moreover, various off-the-shelf models are available in Italian or have multilingual versions.
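As a concrete illustration of how such off-the-shelf checkpoints can be attached to the two task types considered here, the sketch below loads an Italian pre-trained encoder with a token-classification head (for NER) and with a multi-label sequence-classification head (for document classification) using the Hugging Face Transformers library. The checkpoint name and the label counts are illustrative placeholders, not necessarily the exact configuration used in our experiments.

```python
from transformers import (AutoTokenizer,
                          AutoModelForTokenClassification,
                          AutoModelForSequenceClassification)

# Any Italian or multilingual checkpoint can be plugged in here;
# "dbmdz/bert-base-italian-uncased" is the BERT-Ita model cited below.
checkpoint = "dbmdz/bert-base-italian-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Token-classification head for NER: one logit per token per entity tag
# (the tag-set size here is illustrative).
ner_model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=13)

# Sequence-classification head for multi-label topic classification:
# problem_type switches the loss to BCEWithLogitsLoss, so each of the
# 13 topic labels is predicted independently.
doc_model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=13,
    problem_type="multi_label_classification")
```

The same pattern applies to any of the models compared in this paper, since they all expose the same interface through the library.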
Thus, we decided to compare their performance on these two tasks in the domain of Italian administrative documents without any additional domain pre-training. Given the inherent differences between the administrative language style and the standard Italian on which these models were pre-trained, it is interesting to compare their performance on these two tasks to assess the capabilities of such models in the PA scenario.

Generally, Transformer-based models vary in size, number of parameters, pre-training objective, and pre-training data. Hence, selecting one of these models for a specific downstream task is a challenging decision. Therefore, we decided to assess their performance in the administrative domain to identify which model best suits each of these tasks in this particular domain.

The main contributions of this work are:
• a comparative evaluation of five Transformer-based models on two main downstream tasks, namely NER and multi-label document classification in the administrative domain;
• the creation of a new Italian dataset for multi-label document classification in the administrative domain, which we called ATTO;
• a comparison between Transformer-based models and a PA-specialized machine learning model on NER.

The rest of this paper is organized as follows. Section 2 provides a brief overview of the Transformer models we considered for comparison. Section 3 presents the experimental details: we describe the datasets, the parameter configuration used for fine-tuning the models on the NER-PA and multi-label document classification tasks, the baselines, and the metrics used for evaluation. Section 4 reports on the results and discusses them. Finally, Section 5 draws some conclusions and highlights future work.

2. Transformer Models

In this work we focus on Transformer-based models pre-trained on the Italian language or available in a multilingual version, to identify which model represents the best choice for handling administrative data, even in the absence of adaptive domain pre-training. We compare the performance on NER-PA and PA multi-label document classification of five Transformer-based language models: BERT-base-Ita,1 UmBERTo,2 Multilingual BERT,3 XLM-RoBERTa,4 and GePpeTto.5

1 https://huggingface.co/dbmdz/bert-base-italian-uncased
2 https://huggingface.co/Musixmatch/umberto-wikipedia-uncased-v1
3 https://huggingface.co/bert-base-multilingual-uncased
4 https://huggingface.co/xlm-roberta-base
5 https://huggingface.co/LorenzoDeMattei/GePpeTto

BERT [1] is a bidirectional Transformer model pre-trained on a multi-task objective (Masked Language Modeling and Next Sentence Prediction) over vast amounts of text, and it can be fine-tuned for various NLP downstream tasks. The Italian version of BERT, henceforth BERT-Ita, was pre-trained on a Wikipedia dump and various texts from the OPUS corpora collection,6 for a total corpus size of 13 GB of text.

UmBERTo is a RoBERTa-based language model trained on a subset of the OSCAR corpus7 containing about 70 GB of plain text. Compared to BERT, RoBERTa was trained longer, over more data, and with a larger training batch size. In addition, the training process was carried out with dynamic masking of tokens in place of the Next Sentence Prediction task [2]. This resulted in improved performance on the GLUE benchmark [22]. UmBERTo is an extension of RoBERTa that adds two features: the SentencePiece tokenizer and Whole Word Masking. SentencePiece is a language-independent tokenizer that generates sub-word units specific to the chosen vocabulary size and corpus language. With Whole Word Masking (WWM), if at least one of the sub-word tokens produced by the SentencePiece tokenizer is chosen for masking, the mask is applied to the entire word. Thus, only whole words are masked, never isolated sub-words.

6 https://opus.nlpl.eu
7 https://oscar-corpus.com
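The toy sketch below illustrates the WWM idea on SentencePiece-style sub-words (pieces starting with "▁" open a new word). It is a simplified illustration written for this paper, with a hypothetical whole_word_mask helper, and is not the actual UmBERTo pre-training code.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="<mask>"):
    """Toy Whole Word Masking: group SentencePiece-style sub-words into
    words, then mask each word either entirely or not at all."""
    # Group sub-word indices into words: a piece starting with '▁' opens a
    # new word, any other piece continues the previous one.
    words, current = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("▁") and current:
            words.append(current)
            current = []
        current.append(i)
    if current:
        words.append(current)

    masked = list(tokens)
    for word in words:
        if random.random() < mask_prob:
            for i in word:          # mask every piece of the chosen word
                masked[i] = mask_token
    return masked

# "determinazione dirigenziale" split into sub-words by a SentencePiece-like tokenizer
pieces = ["▁determina", "zione", "▁dirigen", "ziale"]
print(whole_word_mask(pieces, mask_prob=0.5))
```

With plain token-level masking, "zione" could be masked while "▁determina" stays visible; with WWM either both pieces are masked or neither is.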
The multilingual version of BERT (mBERT) [1] achieves language understanding by training on the MLM task with shared vocabulary and weights on Wikipedia texts from the top 104 languages. Each training sample is a monolingual document, and there are no explicitly designed cross-linguistic objectives or cross-linguistic data. Despite this, mBERT is successful in cross-linguistic generalisation [23].

Another multilingual encoder is XLM-RoBERTa. This model was pre-trained on a significantly larger amount of data, 2.5 TB of clean Common Crawl data in 100 different languages. Like mBERT, XLM-RoBERTa's pre-training task is solely monolingual MLM. It achieves state-of-the-art results on several multilingual benchmarks, including XNLI, MLQA, and NER, outperforming mBERT [24].

GePpeTto [25] is the first autoregressive language model for Italian. It is built using the GPT-2 architecture [26], which is a scale-up of GPT [27] with 10x the parameters and 10x the training data. GPT-2 is a unidirectional generative Transformer model trained on next-token prediction given all the previous words in a text. Its Italian version, GePpeTto, was trained on a collection of Wikipedia text and the ItWaC corpus [28], amounting to almost 13 GB.

The Transformer-based models we discussed differ with regard to pre-training objective, size, and number of parameters. Moreover, they are also pre-trained on corpora of different sizes, as highlighted in Table 1.

Table 1
Transformer-based language models characteristics.
Model Name   Based on       Model size   Corpus size   Multilingual
BERT-Ita     BERT-base      109M         13GB          No
mBERT        BERT-base      177M         -             Yes
UmBERTo      RoBERTa-base   110M         70GB          No
XLM-RoB.     RoBERTa-base   278M         2.5TB         Yes
GePpeTto     GPT-2          108M         13GB          No

3. Experimental settings

We aimed to compare the performance of five Transformer-based models on two different tasks related to the administrative domain. To this end, we fine-tuned these five models on two main downstream tasks in the administrative domain: NER-PA and multi-label document classification. More specifically, we chose the base-cased version of all the models and fine-tuned each model on two datasets, one for each task. We describe the datasets used in Section 3.1. We compared the models' performance in terms of precision, recall, F1 score, and accuracy, analyzing the effect of pre-training and architecture on each task, as described in Section 3.5.

3.1. Datasets

In order to fine-tune the models on the NER task for the administrative domain, we selected a corpus containing documents from the Italian Public Administration, i.e., the PA Corpus. The corpus is annotated with domain entity labels. As for the multi-label document classification task, we proposed a new dataset, named the ATTO corpus, specifically for this purpose. Unfortunately, the raw annotated dataset used in this paper cannot be released due to sensitive information being present in the data. However, the trained models (both for NER and document classification) are available via Hugging Face.8

8 https://huggingface.co/colinglab

3.1.1. PA Corpus
The PA Corpus was first proposed in [16] for building a Named Entity Recognizer, named INFORMed PA, specifically designed for the administrative domain. INFORMed PA extended the traditional NER classes [29] (i.e., Persons, Locations, Organizations, and Miscellaneous Entities) with other classes representing the administrative domain. As for the traditional classes, the dataset includes the class LOC, which is used for marking both geo-political entities (e.g., "Comune di Pisa") and locations (e.g., "via S. Maria 36, Pisa"); the class PER, which is used for identifying physical persons; and the class ORG, used for marking organizations such as companies. As for the PA-specific classes, it includes the following additional classes: ORG_PA, LAW, and ACT. ORG_PA is used for labelling organizations specifically related to PA (e.g., the Municipality Departments). LAW instead denotes legislative references. Finally, ACT marks references to other administrative documents. In particular, ACT is further divided into several sub-classes: the annotation distinguishes the act type (ACT_T), number (ACT_N), date (ACT_D), functional tokens (ACT_X), and unparsable tokens (ACT_U). For example, the act Delibera di Giunta Comunale numero 53 del 23/10/2016 is annotated as follows: Delibera di Giunta comunale (ACT_T) numero (ACT_X) 53 (ACT_N) del (ACT_X) 23/10/2016 (ACT_D), while the act DD/67/2012 is annotated as ACT_U.

The corpus contains 460 documents taken from the Albo Pretorio Nazionale, for a total of 724,623 tokens. The first 100 documents of the corpus were annotated with the aforementioned Named Entities (NEs) by two annotators, including one domain expert. Then, a CRF model was trained on these documents and used to automatically annotate new documents. Finally, two other annotators manually revised the entire output. The distribution of the different NEs in the corpus, as well as examples, is listed in Table 2.

Table 2
The distribution of Named Entities in the PA Corpus.
Tag      Freq.   Example
Law      5217    art. 183 comma 7 del D.Lgs. n. 267/2000
Loc      4498    Comune di Pisa
Per      3706    Mario Rossi
Org      3594    Consip Spa
Act      2240    Determina n. 4 del 12/02/2011
Org_pa   2074    Sezione Anagrafe

3.1.2. ATTO

The ATTO (Administrative Texts labeled by TOpic) corpus is a dataset of administrative texts labeled with one or more topics (e.g., Environment, Construction, Urban Planning, Social, etc.) pertaining to PA. The dataset was built upon a domain ontology created in the context of the project SEMPLICE.9 The ontology contains approximately 2,700 administrative domain terms that domain experts divided into 13 classes, each corresponding to a different sub-sector of the PA (e.g., Environment, Construction, Urban Planning, Social, etc.). We collected up to about 1,000 documents for each of the 13 topic classes. The collected data are a subset of a dataset including documents from several Italian municipalities annotated and indexed by topic. The complete list of considered topics, as well as their distribution, is shown in Table 3.

9 SEMantic instruments for PubLIc administrators and CitizEns: www.semplicepa.it

Table 3
The distribution of the topic labels in the ATTO corpus.
Topic                        Number of Documents
Environment                  999
Demographics                 716
Advocacy                     1433
Tenders and Contracts        3928
Trade and Business           210
Culture, tourism and sport   381
Constructions                1358
Personnel                    1625
Public Education             951
Information services         1613
Financial services           5380
Social                       1419
Urban Planning               1567
Total                        11019

The corpus was built by querying the whole PA document collection by topic. As each document was associated with more than one topic in the original dataset, we only selected unique documents (i.e., a document retrieved through two or more different topics was added only once to the ATTO dataset). We further filtered the documents to obtain high-quality data. Specifically, we removed documents containing OCR-level errors, which were identified by means of shallow matching rules, and we removed documents longer than 600 words. The final version of ATTO includes 11,019 documents.
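As an illustration of this filtering step, the sketch below applies a 600-word length cut-off and a simple OCR-noise heuristic. The regular expression and the noise threshold are hypothetical stand-ins written for this paper; the actual shallow matching rules used to build ATTO are not reproduced here.

```python
import re

MAX_WORDS = 600  # documents longer than this are discarded

# Hypothetical shallow rule: isolated runs of punctuation/symbols between
# spaces are treated as a symptom of OCR noise in scanned administrative acts.
OCR_NOISE = re.compile(r"\s[^\w\s]{2,}\s")

def keep_document(text: str, noise_ratio: float = 0.01) -> bool:
    """Return True if the document passes the length and OCR-noise filters."""
    words = text.split()
    if len(words) > MAX_WORDS:
        return False
    noise_hits = len(OCR_NOISE.findall(text))
    return noise_hits / max(len(words), 1) < noise_ratio

if __name__ == "__main__":
    clean = "Determina dirigenziale per l'affidamento del servizio di manutenzione."
    noisy = "Determ!na ~~ d#r. ;; del ^^ serv|zio ,, manutenz!one %% urgente"
    print(keep_document(clean), keep_document(noisy))  # True False
```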
3.2. Named Entity Recognition in the PA domain

In order to evaluate the performance of the chosen Transformer models on Named Entity Recognition for PA (NER-PA), we fine-tuned each model on the PA Corpus (see Sec. 3.1.1). First, the data were pre-processed to unify cross-sentence entities by merging sentences sharing a common entity. Then, we split the dataset into 70% training, 20% validation, and 10% test. In addition to this test set, we evaluated the models on a further test set of 25 documents from 25 different municipalities, following the original approach proposed in [16]. The goal of this further analysis is to test the performance of the models on the different templates and ways of referring to entities employed by the various municipalities.

As for the actual training, we fine-tuned each model for five epochs with a learning rate of 2e-5. Due to differences in model size and limited hardware availability, we varied the training batch size from model to model to best fit our available memory and obtain the best possible results for each model. The training was performed on a desktop computer equipped with an NVIDIA TITAN RTX graphics card. BERT-Ita, mBERT, and XLM-RoBERTa were fine-tuned with a batch size of 16, while GePpeTto and UmBERTo were fine-tuned with a batch size of 8 and 4, respectively. All models were taken from the Hugging Face Transformers library.10

10 https://huggingface.co/docs/Transformers/index

3.3. Multi-label document classification

As for the multi-label document classification on the ATTO dataset (see Sec. 3.1.2), the training process was straightforward. We chose to perform 5-fold cross-validation in order to ensure the reliability of the results in the absence of a held-out test set. We first performed a preliminary evaluation in order to find the best hyperparameters. By considering the per-epoch training and validation loss, we concluded that all the models could be trained for 10 epochs before starting to overfit. Thus, we fine-tuned our models for 10 epochs using a batch size of 16, a learning rate of 2e-5, and a maximum sequence length of 512. The final results for each model are obtained by averaging the performances over the training folds.
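For concreteness, the following sketch shows how a fine-tuning run with these hyperparameters can be set up with the Hugging Face Trainer API. The checkpoint name, the tag-set size, and the dataset objects are placeholders, and details such as aligning NER labels to sub-word tokens are omitted; it is a minimal sketch rather than our exact training script.

```python
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          Trainer, TrainingArguments)

checkpoint = "dbmdz/bert-base-italian-uncased"   # any of the compared models
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=21)                   # illustrative BIO tag-set size

# NER-PA setup: 5 epochs, learning rate 2e-5, batch size 16 for
# BERT-Ita/mBERT/XLM-RoBERTa (8 for GePpeTto, 4 for UmBERTo).
# The multi-label classification runs instead used 10 epochs, batch size 16,
# and a maximum sequence length of 512.
args = TrainingArguments(
    output_dir="ner-pa",
    num_train_epochs=5,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
)

train_dataset = None        # placeholder: tokenized 70% training split of the PA Corpus
validation_dataset = None   # placeholder: tokenized 20% validation split

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```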
3.4. Baselines

For what concerns NER, we used as a baseline the results obtained on the same datasets by INFORMed PA [16], a PA-specialized machine learning model based on the Stanford NER, which uses a Conditional Random Field (CRF) as its learning algorithm.

As the baseline for the multi-label document classification task, we implemented a Bidirectional LSTM (bi-LSTM) model. We constructed the model with one bi-LSTM layer with 256 neurons and a single dense output layer with 13 neurons, since we have 13 labels in the dataset. We chose the Sigmoid as the activation function and binary cross-entropy as the loss function. We also pre-processed the documents fed to the bi-LSTM by removing line breaks, which are typical of the layout of administrative acts. We fed the neural network with 100-dimensional vector representations of the first 250 tokens of each document, obtained using Word2Vec [30]. We performed 5-fold cross-validation on the ATTO corpus, as for all the other Transformer-based models, and we trained the bi-LSTM for ten epochs with a batch size of 128. The final results of the model are obtained by averaging the result of each training fold.

3.5. Evaluation Metrics

As the two considered tasks are fairly standard, we evaluated the performances on common metrics for classification. We used micro and macro averages of precision, recall, and F1-score for the NER-PA task. For the multi-label document classification, accuracy, precision, recall, and micro average F1-score were considered. Note that the micro average F1-score on multi-label classification corresponds to the harmonic mean between precision and recall, while accuracy refers to the proportion of documents in the dataset for which the predicted set of labels is identical to the true one (i.e., subset accuracy).
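A minimal sketch of how these metrics can be computed with scikit-learn, assuming the multi-label predictions have been binarized into a document-by-label indicator matrix; in this setting, scikit-learn's accuracy_score implements exactly the strict "identical label set" criterion described above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy example: 4 documents, 3 topic labels, binarized multi-label predictions.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 1],
                   [0, 1, 1],
                   [1, 0, 0],
                   [0, 0, 1]])

print("micro P ", precision_score(y_true, y_pred, average="micro"))
print("micro R ", recall_score(y_true, y_pred, average="micro"))
print("micro F1", f1_score(y_true, y_pred, average="micro"))
# Subset accuracy: a document counts as correct only if its whole label set matches.
print("accuracy", accuracy_score(y_true, y_pred))
```

For NER-PA, the same functions can be applied to the flattened token-level labels with average="micro" and average="macro".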
4. Results and Discussion

Table 4 summarizes the results of the models on the multi-label document classification task. Differences in performance are clearly minimal across all Transformer-based models, which significantly outperform the baseline with regard to F1-score. The best performing model was UmBERTo, with an F1-score of 0.89 and an accuracy of 0.71. BERT-Ita and XLM-RoBERTa both achieved an accuracy of 0.69 and an F1-score of 0.88, followed closely by GePpeTto, which achieved 0.67 accuracy and 0.87 F1-score. The worst performing model among the Transformers was mBERT, whose accuracy and F1-score were 0.65 and 0.86, respectively. Finally, the bi-LSTM model achieved an accuracy of 0.55 and an F1-score of 0.39, both far below the average scores obtained by the Transformers.

Table 4
Results for 5-fold cross validation on multi-label document classification.
Model Name     F1-score   Accuracy
BERT-Ita       0.881      0.692
mBERT          0.857      0.647
XLM-RoBERTa    0.883      0.692
UmBERTo        0.892      0.708
GePpeTto       0.870      0.671
Bi-LSTM        0.386      0.545

These findings indicate that the most suitable model for a multi-label classification task on administrative documents is UmBERTo. This is probably due to the fact that, with respect to the other monolingual Italian models (i.e., BERT-Ita and GePpeTto), UmBERTo is larger and was pre-trained on more data. Comparing BERT-Ita and GePpeTto, despite the two models having approximately the same number of parameters and the same amount of pre-training data, BERT-Ita appears to be a more suitable architecture for a multi-label classification task than the generative model GePpeTto. Regarding the multilingual models, our results confirm what has been reported in the literature for the general domain, i.e., monolingual models perform better than their multilingual counterparts [31]. Indeed, mBERT was the model with the lowest scores in both accuracy and F1-score, and, although XLM-RoBERTa is almost three times as large as BERT-Ita in terms of parameters, their performance was identical.

As highlighted in Table 5, which shows the precision, recall, and F1-score obtained by the models for each topic class, the ranking of the models is quite consistent across all classes: the best-performing model, UmBERTo, obtained the highest F1-score in most cases, followed by BERT-Ita and XLM-RoBERTa, GePpeTto, mBERT, and lastly the bi-LSTM model. Only two classes constitute an exception: Trade and Business and Culture, tourism and sport. In these cases, the best-performing models were GePpeTto and XLM-RoBERTa, although the scores are still lower than those of the other classes. This is probably related to the number of documents labeled with these classes in the dataset, as they have fewer instances than all the others (see Table 3). Regarding the baseline, we can observe that the bi-LSTM performs poorly on almost the whole dataset. Only in the Financial services class, the most represented in the corpus, does the bi-LSTM achieve an F1-score of 0.88, comparable to that of the Transformer models.

Table 6 provides the results of each model for the NER-PA task in terms of precision, recall, and F1-score. These results refer to the scores obtained on the PA Corpus test set. Table 7 reports the results obtained by the models on the additional dataset of 25 administrative documents from 25 different municipalities. As for the multi-label document classification, the best performing model was UmBERTo, with a micro average precision of 0.86, a recall of 0.89, and an F1-score of 0.87. XLM-RoBERTa obtained the same recall score as UmBERTo, but with lower precision and F1-score, of 0.83 and 0.86 respectively. BERT-Ita and mBERT achieved an equal precision of 0.83 and F1-score of 0.85, with BERT-Ita performing better than mBERT in terms of recall, 0.88 compared to 0.87. Again, the model that performed worst was GePpeTto, with a micro average precision of 0.69, recall of 0.80, and F1-score of 0.74, trailing the best model by about 0.13 F1-score points.

On the 25 documents dataset, the best-performing Transformer model was XLM-RoBERTa, scoring 0.80, 0.88, and 0.84 on micro average precision, recall, and F1-score, respectively. The second best performing model was UmBERTo, followed by BERT-Ita and mBERT. GePpeTto was the worst performing model, with 0.63 precision, 0.78 recall, and an F1-score of 0.70.

This performance comparison highlights that, for the token classification task, the amount of monolingual data used for pre-training appears to play a crucial role, together with the model's size. We observed that the best performing model on the PA Corpus was UmBERTo, which is pre-trained on a much larger sample of Italian text than all the other models. On the 25 documents sample, which includes higher variability in how Named Entities are mentioned in the text, the largest model in terms of parameters, namely XLM-RoBERTa, has a slight advantage over UmBERTo. Moreover, the results show that a rather small generative model, in comparison with current state-of-the-art ones such as GPT-3 [32], is the least suited for fine-tuning on a token classification task.
Table 5
Performance comparison of the models on the ATTO corpus.
Doc Label                    Metric   BERT-Ita   mBERT   XLM-RoB   UmBERTo   GePpeTto   Bi-LSTM
Environment                  P        0.853      0.798   0.861     0.865     0.844      0.400
                             R        0.782      0.734   0.810     0.839     0.759      0.043
                             F1       0.815      0.763   0.834     0.852     0.798      0.074
Advocacy                     P        0.919      0.929   0.922     0.949     0.904      0.849
                             R        0.868      0.833   0.851     0.866     0.869      0.402
                             F1       0.893      0.878   0.885     0.905     0.886      0.539
Tenders and Contracts        P        0.876      0.859   0.869     0.886     0.845      0.823
                             R        0.880      0.874   0.898     0.914     0.890      0.530
                             F1       0.878      0.867   0.883     0.900     0.867      0.638
Trade and Business           P        0.932      0.948   0.948     0.971     0.803      0.550
                             R        0.378      0.283   0.260     0.157     0.550      0.072
                             F1       0.532      0.432   0.403     0.268     0.652      0.125
Culture, tourism and sport   P        0.781      0.760   0.780     0.767     0.743      0.200
                             R        0.551      0.520   0.567     0.557     0.585      0.003
                             F1       0.645      0.615   0.655     0.645     0.654      0.005
Demographics                 P        0.937      0.910   0.932     0.939     0.918      0.874
                             R        0.909      0.868   0.889     0.930     0.876      0.234
                             F1       0.923      0.888   0.909     0.934     0.896      0.321
Constructions                P        0.862      0.861   0.863     0.878     0.842      0.790
                             R        0.821      0.800   0.856     0.871     0.816      0.314
                             F1       0.840      0.829   0.859     0.874     0.829      0.444
Personnel                    P        0.881      0.858   0.890     0.902     0.880      0.859
                             R        0.860      0.787   0.866     0.866     0.844      0.239
                             F1       0.870      0.821   0.878     0.883     0.861      0.363
Public Education             P        0.861      0.839   0.855     0.856     0.842      0.861
                             R        0.837      0.789   0.847     0.843     0.802      0.183
                             F1       0.849      0.812   0.851     0.849     0.821      0.239
Information services         P        0.874      0.867   0.866     0.878     0.885      0.922
                             R        0.822      0.806   0.844     0.844     0.799      0.433
                             F1       0.847      0.835   0.855     0.860     0.840      0.587
Financial services           P        0.951      0.940   0.949     0.954     0.950      0.908
                             R        0.965      0.953   0.966     0.968     0.954      0.858
                             F1       0.958      0.946   0.958     0.961     0.952      0.876
Social                       P        0.858      0.761   0.837     0.855     0.820      0.373
                             R        0.791      0.732   0.811     0.809     0.790      0.053
                             F1       0.823      0.746   0.824     0.831     0.804      0.085
Urban Planning               P        0.904      0.887   0.903     0.916     0.898      0.890
                             R        0.832      0.808   0.815     0.836     0.810      0.538
                             F1       0.867      0.845   0.856     0.874     0.851      0.669

Table 6
Performance comparison of the Transformer models and INFORMed PA on the PA corpus.
Model         Metric   ACT     LAW     LOC     ORG     ORG_PA   PER     MacAvg   MicAvg
BERT-Ita      P        0.893   0.831   0.764   0.763   0.784    0.888   0.849    0.832
              R        0.909   0.878   0.826   0.818   0.823    0.880   0.877    0.877
              F1       0.898   0.854   0.794   0.790   0.803    0.884   0.861    0.854
mBERT         P        0.918   0.810   0.769   0.727   0.774    0.889   0.855    0.829
              R        0.881   0.867   0.822   0.815   0.809    0.899   0.862    0.867
              F1       0.895   0.838   0.795   0.769   0.791    0.894   0.856    0.848
XLM-RoB.      P        0.891   0.821   0.787   0.755   0.754    0.908   0.848    0.834
              R        0.945   0.881   0.833   0.820   0.857    0.897   0.901    0.890
              F1       0.916   0.850   0.809   0.786   0.802    0.902   0.873    0.861
UmBERTo       P        0.916   0.846   0.808   0.795   0.785    0.908   0.872    0.858
              R        0.942   0.877   0.841   0.838   0.828    0.900   0.899    0.890
              F1       0.928   0.861   0.824   0.816   0.806    0.904   0.885    0.873
GePpeTto      P        0.833   0.641   0.640   0.574   0.579    0.776   0.738    0.694
              R        0.851   0.761   0.733   0.678   0.773    0.817   0.802    0.800
              F1       0.824   0.696   0.683   0.622   0.662    0.796   0.758    0.744
INFORMed PA   P        0.788   0.827   0.702   0.709   0.616    0.837   0.746    -
              R        0.891   0.842   0.740   0.689   0.777    0.878   0.803    -
              F1       0.836   0.834   0.720   0.698   0.686    0.857   0.772    -

If we look at the detailed scores for each class on the 25 documents dataset, we can also observe that the classes with which the Transformer models struggle the most are ORG_PA and ACT_U (see Tab. 7 and 8). As for ORG_PA, this may be due to the fact that the names of the organizations are closely tied to the individual Public Administration entity (e.g., Settore LL.PP. - Public Works Sector; Ufficio politiche abitative - Housing Policies Office).
Given that this dataset includes documents from 25 different municipalities, it is understandable that even the best-performing models, namely UmBERTo and XLM-RoBERTa, may encounter difficulty handling such variability in the nomenclature of PA organizations. On the other hand, UmBERTo and XLM-RoBERTa seem to handle the class ACT_U well; this class labels unparsable tokens, i.e., references to PA acts expressed as a single string whose template can vary, such as n.12/2013; 67/96; 57/2008. On the contrary, BERT-Ita, mBERT, and GePpeTto performed poorly on this class.

We also computed the macro average of precision, recall, and F1-score for each model and dataset to make our data comparable with the results obtained by INFORMed PA [16]. Recall that INFORMed PA is a NER system for the Italian PA based on the Stanford NER, using a Conditional Random Field (CRF) as learning algorithm. The results of INFORMed PA on the PA Corpus are shown in Table 6, while the results on the additional dataset of 25 documents are reported in Table 7.

By comparing the performance of our models with INFORMed PA on the PA Corpus, we observed that it performed almost on par with GePpeTto, the worst performing Transformer-based model, on all metrics: it obtained a precision of 0.74, a recall of 0.80, and an F1-score of 0.77. In this regard, the best-performing Transformer model, UmBERTo, outperformed INFORMed PA by a wide margin. Specifically, it scored 0.13 points higher in precision, 0.10 points higher in recall, and 0.12 points higher in F1-score. All the other models outperformed INFORMed PA by at least 0.11 points in precision, 0.06 in recall, and 0.09 in F1-score.

Table 7
Performance comparison of the Transformer models and INFORMed PA on the 25 documents dataset.
Model         Measure   ACT     LAW     LOC     ORG     ORG_PA   PER     MicAvg   MacAvg
BERT-Ita      P         0.876   0.804   0.692   0.557   0.598    0.907   0.790    0.794
              R         0.788   0.927   0.789   0.775   0.688    0.914   0.857    0.803
              F1        0.811   0.861   0.738   0.648   0.640    0.910   0.822    0.785
mBERT         P         0.861   0.828   0.676   0.551   0.504    0.880   0.780    0.774
              R         0.757   0.949   0.759   0.773   0.656    0.930   0.851    0.785
              F1        0.780   0.884   0.715   0.643   0.570    0.904   0.814    0.762
XLM-RoB.      P         0.876   0.807   0.720   0.563   0.579    0.909   0.796    0.796
              R         0.902   0.932   0.768   0.790   0.753    0.943   0.880    0.870
              F1        0.889   0.865   0.743   0.657   0.654    0.926   0.836    0.829
UmBERTo       P         0.877   0.836   0.665   0.579   0.538    0.911   0.796    0.792
              R         0.906   0.936   0.770   0.760   0.677    0.918   0.870    0.859
              F1        0.890   0.883   0.714   0.657   0.600    0.915   0.831    0.822
GePpeTto      P         0.597   0.628   0.532   0.447   0.386    0.737   0.630    0.572
              R         0.713   0.816   0.648   0.702   0.602    0.812   0.780    0.715
              F1        0.646   0.710   0.584   0.547   0.471    0.773   0.697    0.631
INFORMed PA   P         0.975   0.949   0.799   0.802   0.871    0.914   0.914    0.885
              R         0.848   0.962   0.691   0.769   0.796    0.869   0.836    0.822
              F1        0.907   0.955   0.741   0.785   0.832    0.891   0.873    0.852

Surprisingly, the results of INFORMed PA on the 25 documents dataset exceed those of the Transformer-based models in terms of both precision and F1-score. We speculate that the CRF model is better able to deal with classes whose instances show a high degree of linguistic regularity. In particular, the model seems to benefit from linguistic and shallow features such as word shape, n-grams, PoS tags, and the presence of complex terms. By performing a class-wise comparison between models, we can observe that Transformer-based ones are more prone to errors in extracting the Organization classes (both ORG and ORG_PA), LAW, and ACT, while remaining effective on the others, which are arguably less dependent on the domain.
In this case, in fact, their prediction capabilities are closer to those of INFORMed PA. These results show that language models based on the Transformer encoder architecture can guarantee higher performance than traditional tools such as INFORMed PA when adequately trained on domain data. Conversely, when there is some variability between the data on which the models are evaluated and the data used during fine-tuning, the performance of Transformers tends to suffer with respect to that of a CRF model.

This means that, to guarantee high performance on domain-specific linguistic data, such as administrative texts, and in domain-specific tasks, such as the recognition of Public Administration entities, domain adaptation of the model via additional pre-training on domain-specific data may prove to be necessary. In this way, the model should be better able to derive an adequate representation of domain-specific terms and cope with the high degree of variability in how entities and acts are mentioned in bureaucratic documents. Thus, one of our future goals will be to perform domain adaptation of the model that performed best in this comparison and to further extend the suite of tasks more closely related to the PA domain on which the model will be assessed.

Table 8
Precision, Recall and F1-score achieved by the models on the sub-classes of the ACT class on the 25 documents dataset.
Model       Measure   ACT_D   ACT_N   ACT_T   ACT_U   ACT_X
BERT-Ita    P         0.936   0.969   0.817   0.800   0.859
            R         0.903   0.939   0.884   0.320   0.890
            F1        0.919   0.954   0.849   0.457   0.874
mBERT       P         0.928   0.968   0.814   0.714   0.881
            R         0.912   0.929   0.868   0.200   0.877
            F1        0.920   0.948   0.840   0.312   0.879
XLM-RoB.    P         0.929   0.948   0.807   0.808   0.886
            R         0.920   0.929   0.901   0.840   0.922
            F1        0.924   0.939   0.852   0.824   0.904
UmBERTo     P         0.937   0.948   0.837   0.759   0.905
            R         0.897   0.902   0.911   0.880   0.938
            F1        0.916   0.925   0.873   0.815   0.921
GePpeTto    P         0.884   0.782   0.502   0.000   0.819
            R         0.922   0.951   0.823   0.000   0.871
            F1        0.903   0.858   0.624   0.000   0.844

5. Conclusion

In this paper, we compared the performance of various Transformer-based models on administrative texts. Our goal was to evaluate the performance of generic pre-trained models on administrative data and to identify the most suitable model for this type of task in this particular domain. To this end, we considered a multi-label document classification task and a Named Entity Recognition task on Public Administration texts.

Among the different kinds and sizes of Transformer models, UmBERTo was shown to be the best option for handling both text and token classification tasks. It achieved the highest accuracy of 0.71 and F1-score of 0.89 on the multi-label classification task, and the highest micro average precision, recall, and F1-score on the NER-PA task, of 0.86, 0.89, and 0.87, respectively.

While on multi-label document classification no clear differences emerged among the Transformer-based models, which outperformed the bi-LSTM model by a wide margin, we observed a much more significant variance in performance for the NER-PA task. This is especially true if we compare encoder-only models such as UmBERTo and BERT with decoder-only ones such as GePpeTto. Indeed, the latter performed worst in NER-PA, both in comparison with the other Transformer-based models and with INFORMed PA [16]. Furthermore, we have observed that in the case of high variability in the data, document structures, and the way in which entities are expressed, the best performances are obtained with the domain-adapted CRF model INFORMed PA.
These findings lead us to surmise that as a future direction, it may be crucial to adapt Transformer-based language models to this domain via additional pre-training steps in order to improve their performances. In addition, we also aim at extending the kind of tasks and the number of datasets related to the Italian administrative domain in order to have more data available for the training and the evaluation of the models. We also plan to anonymize these data to share them. Acknowledgments This research has been funded by the Project “ABI2LE (Ability to Learning)”, funded by Regione Toscana (POR Fesr 2014-2020). The project partners are University of Pisa, CoLing Lab and the companies 01Semplice s.r.l. (coordinator), IPKOM s.r.l., and Insurance OnLine S.p.A. References [1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/ N19-1423. doi:10.18653/v1/N19-1423. [2] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019). [3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polo- sukhin, Attention is all you need, Advances in neural information processing systems 30 (2017). [4] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, X. Huang, Pre-trained models for natural language processing: A survey, Science China Technological Sciences 63 (2020) 1872–1897. [5] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don’t stop pretraining: Adapt language models to domains and tasks, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8342–8360. URL: https://aclanthology.org/ 2020.acl-main.740. doi:10.18653/v1/2020.acl-main.740. [6] L. Rasmy, Y. Xiang, Z. Xie, C. Tao, D. Zhi, Med-bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction, NPJ digital medicine 4 (2021) 1–13. [7] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, Biobert: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2020) 1234–1240. [8] I. Beltagy, K. Lo, A. Cohan, Scibert: A pretrained language model for scientific text, arXiv preprint arXiv:1903.10676 (2019). [9] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, Legal-bert: The muppets straight out of law school, arXiv preprint arXiv:2010.02559 (2020). [10] P. Grouchy, S. Jain, M. Liu, K. Wang, M. Tian, N. Arora, H. Ngai, F. K. Khattak, E. Dolatabadi, S. A. Koçak, An experimental evaluation of transformer-based language models in the biomedical domain, arXiv preprint arXiv:2012.15419 (2020). [11] R. Joshi, A. Gupta, Performance comparison of simple transformer and res-cnn-bilstm for cyberbullying classification, arXiv preprint arXiv:2206.02206 (2022). [12] C. Lothritz, K. Allix, L. Veiber, J. Klein, T. F. D. A. 
Bissyande, Evaluating pretrained transformer-based models on the task of fine-grained named entity recognition, in: Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 3750–3760. [13] M. Polignano, M. de Gemmis, G. Semeraro, Comparing transformer-based ner approaches for analysing textual medical diagnoses., in: CLEF (Working Notes), 2021, pp. 818–833. [14] P. I. Khan, I. Razzak, A. Dengel, S. Ahmed, Performance comparison of transformer-based models on twitter health mention classification, IEEE Transactions on Computational Social Systems (2022) 1–10. doi:10.1109/TCSS.2022.3143768. [15] J. Li, A. Sun, J. Han, C. Li, A survey on deep learning for named entity recognition, IEEE Transactions on Knowledge and Data Engineering 34 (2020) 50–70. doi:10.1109/TKDE. 2020.2981314. [16] L. C. Passaro, A. Lenci, A. Gabbolini, Informed pa: A ner for the italian public administration domain, in: Fourth Italian Conference on Computational Linguistics CLiC-it, 2017, pp. 246–251. [17] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical attention networks for document classification, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, California, 2016, pp. 1480–1489. URL: https://aclanthology.org/N16-1174. doi:10.18653/v1/N16-1174. [18] C. Lothritz, K. Allix, L. Veiber, J. Klein, T. F. D. A. Bissyande, Evaluating pretrained transformer-based models on the task of fine-grained named entity recognition, in: Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 3750–3760. [19] R. Pappagari, P. Zelasko, J. Villalba, Y. Carmiel, N. Dehak, Hierarchical transformers for long document classification, in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2019, pp. 838–844. [20] Z. Shaheen, G. Wohlgenannt, E. Filtz, Large scale legal text classification using transformer models, arXiv preprint arXiv:2010.12871 (2020). [21] A. Adhikari, A. Ram, R. Tang, J. Lin, Docbert: Bert for document classification, arXiv preprint arXiv:1904.08398 (2019). [22] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, Glue: A multi-task benchmark and analysis platform for natural language understanding, arXiv preprint arXiv:1804.07461 (2018). [23] T. Pires, E. Schlinger, D. Garrette, How multilingual is multilingual BERT?, in: Pro- ceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4996–5001. URL: https://aclanthology.org/P19-1493. doi:10.18653/v1/P19-1493. [24] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019). [25] L. De Mattei, M. Cafagna, F. Dell’Orletta, M. Nissim, M. Guerini, Geppetto carves italian into a language model, arXiv preprint arXiv:2004.14253 (2020). [26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9. [27] A. Radford, T. Salimans, Gpt: Improving language understanding by generative pre- training. arxiv (2018). [28] M. Baroni, S. Bernardini, A. Ferraresi, E. 
Zanchetta, The wacky wide web: a collection of very large linguistically processed web-crawled corpora, Language resources and evaluation 43 (2009) 209–226. [29] E. F. Sang, F. De Meulder, Introduction to the conll-2003 shared task: Language-independent named entity recognition, arXiv preprint cs/0306050 (2003). [30] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in neural information processing systems 26 (2013). [31] M. Polignano, P. Basile, M. De Gemmis, G. Semeraro, V. Basile, Alberto: Italian bert language understanding model for nlp challenging tasks based on tweets, in: 6th Italian Conference on Computational Linguistics, CLiC-it 2019, volume 2481, CEUR, 2019, pp. 1–6. [32] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 1877–1901. URL: https://proceedings.neurips.cc/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.