Evaluating the Text-To-Text Framework for Topic and Style Classification of Italian Texts

Michele Papucci (1,2), Chiara De Nigris (1), Alessio Miaschi (3) and Felice Dell'Orletta (2,3)
1 Università di Pisa, Pisa
2 TALIA S.r.l., Pisa
3 Istituto di Linguistica Computazionale "A. Zampolli" (ILC-CNR), ItaliaNLP Lab, www.italianlp.it, Pisa

Abstract
In this paper, we propose an extensive evaluation of the first text-to-text Italian Neural Language Model (NLM), IT5 [1], in a classification scenario. In particular, we test the performance of IT5 on several tasks involving the classification of both the topic and the style of a set of Italian posts. We assess the model in two different configurations, single- and multi-task classification, and we compare it with a more traditional NLM based on the Transformer architecture (i.e. BERT). Moreover, we test its performance in a few-shot learning scenario. We also perform a qualitative investigation of the impact of label representations on the classification behavior of the IT5 model. Results show that IT5 achieves good results, although generally lower than those of the BERT model. Nevertheless, we observe a significant performance improvement of the text-to-text model in a multi-task classification scenario. Finally, we find that altering the representation of the labels mainly impacts the classification of the topic.

Keywords: transformers, text-to-text, t5, bert, topic classification, style classification

NL4AI 2022: Sixth Workshop on Natural Language for Artificial Intelligence, November 30, 2022, Udine, Italy [2]

1. Introduction and Motivation
Over the past few years, the text-to-text paradigm has become one of the most widely adopted approaches in the development of state-of-the-art Neural Language Models (NLMs) [3, 4, 5]. The basic idea of this paradigm, inspired by previous unifying frameworks for NLP tasks [6, 7, 8], is to cast every task as a text-to-text problem, i.e. taking text as input and producing new text as output. This unifying framework has proven to be a particularly effective transfer learning method, often outperforming previous models, e.g. BERT [9], in data-poor settings. Nevertheless, few works have proposed systematic evaluations of such models in different classification scenarios and in comparison with more traditional NLMs. Among these, [3] showed that T5 achieves performance comparable to, if not better than, previous state-of-the-art models on the most popular NLP benchmarks, e.g. GLUE [10] and SQuAD [11]. [12], instead, demonstrated that T5
outperforms BERT in a document ranking task, especially in a data-poor setting with limited training data. Inspecting the performance of six different NLMs on a sentiment analysis task, [13] found that T5 is the second best performing model, behind only XLNet [14]. Focusing on languages other than English, [1] compared the performance of their IT5 with other multilingual and Italian models, showing e.g. that IT5 base outperforms BERT on SQuAD-IT [15], the extractive question answering task for the Italian language. Similar results were obtained by [16] while measuring the performance of their Brazilian Portuguese T5 model (PTT5) against those obtained with BERTimbau, a BERT model pre-trained on the brWaC corpus [17]. Comparing the models on three different evaluation tasks for the Portuguese language (i.e. semantic similarity and entailment prediction [18] and NER [19]), they showed that PTT5 achieves performance competitive with BERTimbau, although the latter obtained slightly better results.
Building on these previous studies, in this work we propose an evaluation of the first text-to-text Transformer model developed for the Italian language, IT5 [1], on several classification tasks. More specifically, we performed our experiments in two different classification scenarios, single-task and multi-task, and we compared the performance of IT5 against that obtained with an Italian version of BERT. Furthermore, in order to verify the abilities of the model in a data-poor setting, we also tested its performance in a few-shot learning scenario. Finally, following the approach devised by [20], we performed a more in-depth analysis to test the impact of label representations on the classification behavior of the IT5 model.
The remainder of the paper is organized as follows: in Sec. 2 and 3 we introduce the dataset and the models used in our experiments, in Sec. 4 we describe the experimental setting, in Sec. 5 and 6 we discuss the obtained results, and in Sec. 7 we conclude the paper.
Contributions. In this paper we: i) propose an extensive evaluation of IT5 performance on three different classification tasks based on Italian sentences; ii) test the performance of the model in different scenarios (single- and multi-task classification) and compare it with that obtained with another Transformer especially suited for classification tasks; iii) study the behavior of the model in a data-poor setting by measuring its performance in a few-shot learning scenario; iv) verify the impact of label modification on IT5's performance.

2. Data
In order to perform our experiments, we relied on posts extracted from TAG-it [21], the profiling shared task presented at EVALITA 2020 [22]. The dataset, based on the corpus defined in [23], consists of more than 10,000 posts written in Italian and collected from different blogs. Each post is labeled with three different labels: age and gender of the writer, and topic.

Table 1: TAG-it labels description.
Attribute | Description | Values
Age | Age of the writer | Five ranges: 0-19, 20-29, 30-39, 40-49 and 50-100
Gender | Gender of the writer | M, F
Topic | Topic of the post | Eleven possible categories: ANIME, AUTO-MOTO, BIKES, CELEBRITIES, ENTERTAINMENT, NATURE, MEDICINE-AESTHETIC, METAL-DETECTING, SMOKE, SPORTS, TECHNOLOGY

Figure 1: Age distribution.
Figure 2: Gender distribution.
The details and statistics of the dataset are reported in Table 1 and Figures 1, 2 and 3. As can be noticed from the figures, the Age variable presents a quite balanced distribution among the five classes, especially for the three intervals between 30 and 100. For what concerns the Gender attribute, we can observe that the majority of posts were written by male users, thus determining a strongly unbalanced distribution of the two classes. The last variable, Topic, presents 11 labels, with 3 of them (ANIME, SPORTS and AUTO-MOTO) having more than 2,500 posts each.

Figure 3: Topic distribution.

In order to have enough data to fine-tune our pre-trained models, we decided to modify the original task as defined in [21]. Instead of predicting the three labels of a given collection of texts (multiple posts), we fine-tuned our models to predict age, gender and topic from each single post. Moreover, since a fair amount of the sentences were quite short, we removed those shorter than 10 tokens. At the end of this process, we obtained a dataset consisting of 13,553 posts as training set and 5,055 posts as test set.

3. Models
In what follows, we discuss in more detail the characteristics of the models used in our experiments.
IT5. We used the T5 base version pre-trained on the Italian language [1] (https://huggingface.co/gsarti/it5-base). In particular, the model was trained on the Italian sentences extracted from a cleaned version of the mC4 corpus [24], a multilingual version of the C4 corpus covering 107 languages. As discussed in [3], in order to compare different architectures (e.g. T5 and BERT), it would be ideal to analyze models with meaningful similarities, e.g. a similar number of parameters or a similar amount of computation needed to process an input-output sequence. Since a T5 with n layers has approximately the same number of parameters as a BERT with 2n layers, but roughly the same computational cost as an n-layer BERT, in order to achieve the fairest comparison between the two Transformers we decided to use the base version of IT5 (220M parameters).
BERT. In order to compare the performance of IT5 with that of another Transformer model commonly used in classification scenarios, we relied on a pre-trained Italian BERT. Specifically, we used the base cased BERT (12 layers, 110M parameters) developed by the MDZ Digital Library Team, available through the Hugging Face Transformers library [25] (https://huggingface.co/dbmdz/bert-base-italian-xxl-cased). The model was trained on Wikipedia and the OPUS corpus [26].

4. Experimental Setting
As introduced in Sec. 1, we performed our experiments in two different classification scenarios: i) single-task and ii) multi-task classification. For the single-task scenario, we fine-tuned both BERT and IT5 three times in order to create three different single-task sequence classification models, one for each variable. To perform fine-tuning with the BERT model, we converted the three target variables into numeric labels. For the IT5 model, on the other hand, the target variables were verbalized empirically as follows (both the length filter above and this mapping are sketched in code right after the list):
• Gender: values have been transformed into uomo and donna;
• Topic: values have been translated into Italian, written in lowercase and truncated to a single word (e.g. MEDICINE-AESTHETIC into medicina), thus resulting in the following list: anime, automobilismo, bici, sport, natura, metalli, medicina, celebrità, fumo, intrattenimento, tecnologia;
• Age: values have been left unchanged.
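The following is a minimal sketch of the preprocessing and verbalization steps just described, not the authors' actual code. It assumes the posts are available as a pandas DataFrame with hypothetical column names (text, topic, gender, age), and it uses whitespace splitting for the 10-token filter, since the paper does not state which tokenizer defines token length.

```python
import pandas as pd

# Hypothetical file and column names; the TAG-it distribution may differ.
df = pd.read_csv("tag_it_posts.csv")  # columns: text, topic, gender, age

# Remove posts shorter than 10 tokens (whitespace tokenization as a
# simple stand-in for whatever tokenizer the authors actually used).
df = df[df["text"].str.split().str.len() >= 10]

# Verbalize the target variables for IT5 as described in the list above.
GENDER_MAP = {"M": "uomo", "F": "donna"}
TOPIC_MAP = {
    "ANIME": "anime", "AUTO-MOTO": "automobilismo", "BIKES": "bici",
    "SPORTS": "sport", "NATURE": "natura", "METAL-DETECTING": "metalli",
    "MEDICINE-AESTHETIC": "medicina", "CELEBRITIES": "celebrità",
    "SMOKE": "fumo", "ENTERTAINMENT": "intrattenimento",
    "TECHNOLOGY": "tecnologia",
}

df["gender_label"] = df["gender"].map(GENDER_MAP)
df["topic_label"] = df["topic"].map(TOPIC_MAP)
df["age_label"] = df["age"]  # age ranges are left unchanged
```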
Moreover, following the Fixed-prompt LM tuning approach (see [27] for an overview), we added a prefix to each input when fine-tuning the IT5 model. This approach consists in providing a textual template that is then applied to every training and test example. Fixed-prompt LM tuning has already been successfully explored for text classification, allowing more efficient learning [28, 29, 30]. In our experiments, we used three different prefixes, one for each classification task: "Classifica argomento", "Classifica età" and "Classifica genere" (a sketch of the resulting fine-tuning examples is given at the end of this section).
Concerning the multi-task classification, each sentence has been presented three times during the training phase of the two models, each time with the appropriate label and, in the case of IT5, with the appropriate prefix.

Table 2: Macro and weighted average F-scores for all the models on the three classification variables. Results obtained with the multi-task models are also reported (MT BERT/IT5).
Model | Topic (Macro) | Topic (Weighted) | Age (Macro) | Age (Weighted) | Gender (Macro) | Gender (Weighted)
Dummy (S) | 0.09 | 0.17 | 0.20 | 0.22 | 0.50 | 0.68
Dummy (MF) | 0.04 | 0.10 | 0.09 | 0.14 | 0.44 | 0.69
BERT Random | 0.14 | 0.34 | 0.26 | 0.27 | 0.56 | 0.74
IT5 Random | 0.14 | 0.34 | 0.20 | 0.26 | 0.36 | 0.74
BERT | 0.50 | 0.64 | 0.32 | 0.33 | 0.76 | 0.84
IT5 | 0.19 | 0.41 | 0.16 | 0.22 | 0.31 | 0.70
MT BERT | 0.56 | 0.67 | 0.32 | 0.33 | 0.75 | 0.84
MT IT5 | 0.31 | 0.52 | 0.16 | 0.23 | 0.33 | 0.71

Few-Shot Learning. In order to evaluate the performance of IT5 also in a context with little available data, we carried out our classification experiments in a few-shot learning scenario. Specifically, we divided the original dataset into 5 equal subsets (1/5 = 2,710, 2/5 = 5,420, 3/5 = 8,130, 4/5 = 10,840, 5/5 = 13,554) and monitored the performance trend of both IT5 and BERT at increasing amounts of data samples: 0/5, 1/5, 2/5, 3/5, 4/5 and 5/5 of the TAG-it dataset.

4.1. Baseline and Evaluation
We relied on two different types of models as baselines. The first is based on two dummy classifiers: i) a most frequent classifier (Dummy (MF)), which always predicts the most frequent label for each input sequence, and ii) a stratified dummy classifier (Dummy (S)), which generates predictions respecting the class distribution of the training data. Moreover, in order to assess the impact of the pre-training phase of the two Transformer models, we also used an Italian BERT (BERT Random) and an IT5 model (IT5 Random) with randomly initialized weights. We used the F-Score (macro and weighted) as the evaluation metric for all experiments.
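As anticipated above, the sketch below assembles the prefixed (input, target) pairs used to fine-tune IT5 in the single- and multi-task configurations. It builds on the verbalized label columns of the earlier sketch; the "prefix: text" separator and the helper names are our own assumptions, as the paper does not report how prefix and post are joined.

```python
# Hypothetical helpers building (input, target) pairs for IT5 fine-tuning.
PREFIXES = {
    "topic": "Classifica argomento",
    "age": "Classifica età",
    "gender": "Classifica genere",
}

def single_task_examples(df, task):
    """(input, target) pairs for one task, with the fixed prompt as prefix."""
    return [
        (f"{PREFIXES[task]}: {text}", str(label))
        for text, label in zip(df["text"], df[f"{task}_label"])
    ]

def multi_task_examples(df):
    """Each post appears three times, once per task with the matching prefix."""
    pairs = []
    for task in PREFIXES:
        pairs.extend(single_task_examples(df, task))
    return pairs
```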
5. Results
Single-task. Classification results are reported in Table 2. As we can observe, the Transformer models outperformed the dummy baselines in almost all the classification tasks. The only exception concerns the performance of IT5 on the Age prediction task, for which the stratified dummy classifier obtained the same scores. It should be noted that the Age classification task appears to be the most complex one, regardless of the model taken into account: the best performing model (BERT) obtained only .11 points more than the baseline. The difficulty in predicting the age ranges could be due to the fact that the task requires more sophisticated information than the simple identification of textual clues. On the other hand, the classifiers that achieved the best results are those trained to predict the gender and the topic of each post. This result is in line with [21], where the authors suggested that textual clues seem to be more indicative of these dimensions than of age. Moreover, the higher scores obtained for the gender classification task could also reflect the fact that, differently from the other two, gender prediction was cast as a binary task.
When we look at the performance obtained by the randomly initialized BERT and IT5, we note that the latter achieved results close to those of the pre-trained models. Indeed, in some cases, e.g. IT5 on the Age and Gender prediction tasks, the Random model obtains better results. This seems to suggest that the pre-training phase of IT5 did not allow the model to encode enough useful information to improve its performance on the selected tasks. On the other hand, the pre-training phase had a strong impact on BERT performance, since the pre-trained model outperformed the Random one in all classification tasks.
If we focus instead on the differences between the two models, we can clearly notice that BERT performed best in all three configurations. In particular, IT5 achieved fairly reasonable results in comparison with BERT on the simpler tasks, such as Gender and Topic classification. For the Age prediction task, instead, we observed a performance drop, with a difference of .17 points in terms of weighted F-Score. A possible explanation for this behavior is that, differently from BERT, T5 has to produce the label by generating open text, thus making the prediction more complex from a computational point of view. In this regard, it is important to notice that for our experiments we relied on the base version of IT5, which, despite being bigger in terms of parameters than BERT base, is still considerably smaller than the best-performing model (T5-11B) presented in [3]. Moreover, it should be pointed out that in some cases IT5 generated labels that did not belong to those defined in Sec. 4, but which actually turned out to be more accurate than the original ones. This is the case, for instance, of a few posts labelled with fumo (en. smoke) that were instead predicted by IT5 with the label tabacco (en. tobacco). We inspect this behavior in more detail in Sec. 6. We also found that sometimes IT5 was not able to generate meaningful labels at all, but rather produced only punctuation marks or single letters. Nevertheless, we identified only a few isolated cases (fewer than 5 for Topic classification), which had no real impact on the overall performance of the model. We would also like to point out that the IT5 Random model does not generate unexpected labels as the pre-trained one does. This could be another explanation for its better performance in the two cases of Age and Gender classification.
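Since IT5 produces its labels as free text, test-time prediction amounts to decoding the generated string and comparing it verbatim with the gold verbalized label, which is how out-of-set generations such as tabacco end up counted as errors. The following is a minimal sketch of this step, with our own assumptions flagged in the comments (the paper does not report its decoding settings):

```python
import torch
from sklearn.metrics import f1_score
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# "gsarti/it5-base" is the pre-trained checkpoint from the footnote above; in
# practice the fine-tuned weights would be loaded here. Greedy decoding and the
# token budget are our assumptions, not reported settings.
tokenizer = AutoTokenizer.from_pretrained("gsarti/it5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("gsarti/it5-base")

def predict_labels(texts, prefix):
    """Generate one label string per prefixed input post."""
    preds = []
    for text in texts:
        enc = tokenizer(f"{prefix}: {text}", return_tensors="pt", truncation=True)
        with torch.no_grad():
            out = model.generate(**enc, max_new_tokens=8)
        preds.append(tokenizer.decode(out[0], skip_special_tokens=True).strip())
    return preds

# Generated strings are scored as-is against the gold verbalized labels, so an
# out-of-set generation (e.g. "tabacco" instead of "fumo") counts as an error:
# f1_score(gold, preds, average="macro"); f1_score(gold, preds, average="weighted")
```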
Multi-task. Observing the results obtained in the multi-task setting, we notice a significant increase in the performance of IT5. In fact, while BERT achieved a consistent boost only in the Topic prediction scenario, IT5's performance improves significantly in all classification tasks, with an average improvement of around .06 points (in terms of weighted F-Score) over single-task classification. This is particularly evident for Topic and Age classification, while the scores obtained for the Gender prediction task remain roughly the same. This result could suggest that, besides having more data for the fine-tuning phase, the IT5 model particularly benefits from learning multiple tasks at a time, thus improving its generalization abilities.

Few-Shot Learning. Figures 4, 5 and 6 report the results obtained in the few-shot learning classification scenario. As we can see, the trend is quite different between the two models. In fact, while BERT performance shows a fairly regular increase across the 5 fractions of the dataset, for the IT5 model we observe a quite constant improvement only on the Age prediction task. Interestingly, for Topic and Gender classification, IT5 makes correct predictions only after being exposed to 4/5 of the entire dataset. This behavior appears to be in line with what we already noticed during multi-task classification, namely that having more data available for the fine-tuning phase allows the model to perform better and, consequently, to obtain results closer to those of BERT. This seems to be further suggested by the fact that, unlike IT5, BERT obtains strong performance already from the early portions of the dataset but then tends to remain quite stable, showing an improvement of only a few points in the remaining portions. This is especially the case for the Gender prediction task, where the performance of the BERT model in predicting the correct labels is roughly the same (.84 in terms of weighted F-Score) even after seeing only 2/5 of the original dataset. Nevertheless, in the zero-shot case (0/5 of the data), both models are unable to correctly classify the posts occurring in the test sets of the three tasks.

Figure 4: Few-Shot Learning results for Topic classification.
Figure 5: Few-Shot Learning results for Age classification.
Figure 6: Few-Shot Learning results for Gender classification.

Table 3: Examples of IT5 predictions.
Sentence | Predicted Label | Correct Label
Che bell'acqua e che bei vitellini! Grande Pres.! | animali | celebrità
Perchè non l'alcool alimentare essendo neutro? | alcol | fumo
E costa pure meno terza miscela svizzera champagne eccellente! non vedo l'ora di tornare da two lions per altre miscele | bevande | fumo

6. Label Analysis
As described in [3], one of the issues of the text-to-text framework applied in a classification scenario is that the model could output text that does not correspond to any of the possible labels for the task. However, as we already observed in the previous section, in some cases IT5 was able to generate labels that were more appropriate than those originally defined for the task, thus suggesting generalization abilities. For instance, as we can observe from the examples in Table 3, the labels predicted for the three input posts are not among those expected for the Topic prediction task. Nevertheless, by looking at the posts, the labels predicted by IT5 might be considered more appropriate predictions. Inspired by such behavior, we decided to further investigate the generalization abilities of the IT5 model by measuring the impact of different labels on model performance. More specifically, we produced a shuffled version of each dataset by randomly replacing the labels with one another, as sketched below.
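One possible reading of this shuffling procedure in code is a consistent random permutation over the label set, applied to every example. The paper does not specify whether the replacement is constrained so that every label actually changes, so the derangement check below is our assumption.

```python
import random

def shuffle_labels(labels, seed=0):
    """Consistently remap every label value to a (different) random one.

    A sketch under our reading of the procedure: one fixed permutation of the
    label set, constrained to move every label, applied to the whole column.
    """
    rng = random.Random(seed)
    originals = sorted(set(labels))
    permuted = list(originals)
    # Reshuffle until no label maps to itself (a derangement).
    while any(a == b for a, b in zip(originals, permuted)):
        rng.shuffle(permuted)
    mapping = dict(zip(originals, permuted))
    return [mapping[label] for label in labels]
```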
Results are reported in Table 4.

Table 4: Macro and weighted F-scores for the classification tasks obtained with IT5 using correct and shuffled labels (IT5 shuffled).
Model | Topic (Macro) | Topic (Weighted) | Age (Macro) | Age (Weighted) | Gender (Macro) | Gender (Weighted)
IT5 | 0.19 | 0.41 | 0.16 | 0.22 | 0.31 | 0.70
IT5 shuffled | 0.07 | 0.17 | 0.11 | 0.17 | 0.29 | 0.69

As we can see, the most significant variations in model performance concern the Topic and Age classification tasks. In particular, we can observe a drastic performance drop for Topic, with a difference between the predictions on correct and shuffled labels of .24 points in terms of weighted F-Score. Moreover, it is interesting to note that the scores obtained with the shuffled labels are also lower than those obtained by the randomly initialized IT5 (0.17 vs. 0.34). This result seems to suggest that during the fine-tuning phase the IT5 model is indeed able to learn specific lexical correlations between the encoding of the input tokens and that of the labels, and that these correlations are no longer observable after the shuffling process. This is also corroborated by the fact that, when presented with shuffled data, the model stopped generating new and more specific labels for the input sequences. If we look instead at the results obtained with the Gender dataset, we can notice that shuffling the labels does not have a significant effect on the performance of the model. This is clear evidence that, unlike Topic, the Gender prediction task does not present a direct lexical connection between the input sequence and the label. As a result, the model tends to memorize the information available in the fine-tuning data rather than derive generalizations exploiting the knowledge learned during the pre-training phase.
Finally, inspired by the work of [20], we conducted a further analysis on the effect of the strings used to represent labels on model performance. In particular, we replaced the labels used for the Gender prediction task (i.e. uomo and donna) with the original tags defined in the TAG-it dataset, i.e. m and f. As shown in Table 5, modifying the label representation did not affect the performance of IT5, which obtained basically the same results in both configurations. This seems to confirm once again that, for tasks that do not show an explicit relationship between input samples and labels, the choice of the label representation largely does not affect model performance.

Table 5: Macro and weighted F-scores on the Gender prediction task using m/f and uomo/donna as target labels.
Labels | Macro | Weighted
m/f | 0.32 | 0.70
uomo/donna | 0.31 | 0.70

7. Conclusions
In this paper, we proposed an extensive evaluation of the first Italian text-to-text model, IT5, on different classification tasks based on Italian sentences. Specifically, we chose to exploit the TAG-it dataset in order to measure the performance of the model in different classification scenarios. First, we evaluated IT5 in a high-data setting, assessing its performance in single- and multi-task classification and comparing it with that obtained by fine-tuning an Italian version of BERT. Results showed that IT5 is able to achieve quite good results, especially in Topic and Gender classification, and that its performance increases significantly when fine-tuned in a multi-task manner. Nevertheless, we found that BERT outperformed IT5 in all classification tasks. Next, we tested the model in a low-data setting by measuring its performance in a few-shot learning scenario. Once again, IT5 achieved lower scores than BERT, which obtained satisfactory results even in a context with very little data available (e.g. 1/5 of the entire dataset).
A possible explanation of these results is that, given the high complexity of predicting the correct label by generating open text, it may be necessary to employ bigger text-to-text models to outperform models that are explicitly designed for solving classification tasks. Regardless of the classification scenario, we noticed that, especially for the Topic prediction task, IT5 occasionally generated labels that were not among those defined in the TAG-it dataset and that such labels often proved to be more indicative of the topic than the original ones. This result suggests that the model is indeed able to identify lexical clues indicative of the topic, although in some cases it does not associate them with the labels that were originally defined for the task. Finally, we investigated the impact of modifying the classification labels on IT5 performance. In particular, by randomly shuffling the values of the original labels, we found that the model achieved generally lower scores, and this is especially true for the classification of the topic. Nevertheless, experimenting with the Gender prediction task, we found that the choice of label representation does not significantly affect model performance.

References
[1] G. Sarti, M. Nissim, IT5: Large-scale text-to-text pretraining for Italian language understanding and generation, arXiv preprint arXiv:2203.03759 (2022). URL: https://arxiv.org/abs/2203.03759.
[2] D. Nozza, L. Passaro, M. Polignano, Preface to the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI), in: D. Nozza, L. C. Passaro, M. Polignano (Eds.), Proceedings of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI 2022) co-located with the 21st International Conference of the Italian Association for Artificial Intelligence (AI*IA 2022), November 30, 2022, CEUR-WS.org, 2022.
[3] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1-67.
[4] V. Sanh, A. Webson, C. Raffel, S. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. Le Scao, A. Raja, et al., Multitask prompted training enables zero-shot task generalization, in: The Tenth International Conference on Learning Representations, 2022.
[5] V. Aribandi, Y. Tay, T. Schuster, J. Rao, H. S. Zheng, S. V. Mehta, H. Zhuang, V. Q. Tran, D. Bahri, J. Ni, et al., ExT5: Towards extreme multi-task scaling for transfer learning, in: International Conference on Learning Representations, 2021.
[6] B. McCann, N. S. Keskar, C. Xiong, R. Socher, The natural language decathlon: Multitask learning as question answering, arXiv preprint arXiv:1806.08730 (2018).
[7] N. S. Keskar, B. McCann, C. Xiong, R. Socher, Unifying question answering, text classification, and regression via span extraction, arXiv preprint arXiv:1904.09286 (2019).
[8] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog (2019).
[9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[10] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: 7th International Conference on Learning Representations, ICLR 2019, 2019.
[11] P. Rajpurkar, J. Zhang, K. Lopyrev, P.
Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas, 2016, pp. 2383-2392. URL: https://aclanthology.org/D16-1264. doi:10.18653/v1/D16-1264.
[12] R. Nogueira, Z. Jiang, R. Pradeep, J. Lin, Document ranking with a pretrained sequence-to-sequence model, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 708-718. URL: https://aclanthology.org/2020.findings-emnlp.63. doi:10.18653/v1/2020.findings-emnlp.63.
[13] K. Pipalia, R. Bhadja, M. Shukla, Comparative analysis of different transformer based architectures used in sentiment analysis, in: 2020 9th International Conference System Modeling and Advancement in Research Trends (SMART), IEEE, 2020, pp. 411-415.
[14] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, Advances in Neural Information Processing Systems 32 (2019).
[15] D. Croce, A. Zelenanska, R. Basili, Neural learning for question answering in Italian, in: International Conference of the Italian Association for Artificial Intelligence, Springer, 2018, pp. 389-402.
[16] D. Carmo, M. Piau, I. Campiotti, R. Nogueira, R. Lotufo, PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data, arXiv preprint arXiv:2008.09144 (2020).
[17] J. A. Wagner Filho, R. Wilkens, M. Idiart, A. Villavicencio, The brWaC corpus: A new open resource for Brazilian Portuguese, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018. URL: https://aclanthology.org/L18-1686.
[18] L. Real, E. Fonseca, H. Gonçalo Oliveira, The ASSIN 2 shared task: A quick overview, in: International Conference on Computational Processing of the Portuguese Language, Springer, 2020, pp. 406-412.
[19] D. Santos, N. Seco, N. Cardoso, R. Vilela, HAREM: An advanced NER evaluation contest for Portuguese, in: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, 22-28 May 2006, 2006.
[20] X. Chen, J. Xu, A. Wang, Label representations in modeling classification as text generation, in: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop, 2020, pp. 160-164.
[21] A. Cimino, F. Dell'Orletta, M. Nissim, TAG-it: Topic, age and gender prediction, EVALITA (2020).
[22] V. Basile, M. Di Maro, D. Croce, L. Passaro, EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian, in: 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop, EVALITA 2020, volume 2765, CEUR-WS, 2020.
[23] A. Maslennikova, P. Labruna, A. Cimino, F. Dell'Orletta, Quanti anni hai? Age identification for Italian, in: CLiC-it, 2019.
[24] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C.
Raffel, mT5: A massively multilingual pre-trained text-to-text transformer, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 483-498. URL: https://aclanthology.org/2021.naacl-main.41. doi:10.18653/v1/2021.naacl-main.41.
[25] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38-45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.
[26] J. Tiedemann, L. Nygaard, The OPUS corpus - parallel and free: http://logos.uio.no/opus, in: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), European Language Resources Association (ELRA), Lisbon, Portugal, 2004. URL: http://www.lrec-conf.org/proceedings/lrec2004/pdf/320.pdf.
[27] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, arXiv preprint arXiv:2107.13586 (2021).
[28] T. Schick, H. Schütze, Few-shot text generation with pattern-exploiting training, arXiv preprint arXiv:2012.11926 (2020).
[29] T. Schick, H. Schütze, Exploiting cloze-questions for few-shot text classification and natural language inference, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 255-269. URL: https://aclanthology.org/2021.eacl-main.20. doi:10.18653/v1/2021.eacl-main.20.
[30] T. Gao, A. Fisch, D. Chen, Making pre-trained language models better few-shot learners, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 3816-3830. URL: https://aclanthology.org/2021.acl-long.295. doi:10.18653/v1/2021.acl-long.295.