<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Sixth Workshop on Natural Language for Artificial Intelligence, November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Is EVALITA Done? On the Impact of Prompting on the Italian NLP Evaluation Campaign.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valerio Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Turin</institution>
          ,
          <addr-line>C.so Svizzera 185, 10147</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>30</volume>
      <issue>2022</issue>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Prompt-based learning is a recent paradigm in NLP that leverages large pre-trained language models to perform a variety of tasks. With this technique, it is possible to build classifiers that do not need training data (zero-shot). In this paper, we assess the status of prompt-based learning applied to several text classification tasks in the Italian language. The results indicate that the performance gap towards current supervised methods is still relevant. However, the difference in performance between pre-trained models, together with the ability of prompt-based classifiers to operate in a zero-shot fashion, opens a discussion regarding the next generation of evaluation campaigns for NLP.</p>
      </abstract>
      <kwd-group>
        <kwd>Prompt-based learning</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Benchmarking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Many languages other than English have their own NLP evaluation campaigns, such as GermEval4 for German or IberLEF (previously known
as IberEval)5 for Spanish and other Iberian languages.</p>
      <p>
        EVALITA is the “periodic evaluation campaign of Natural Language Processing (NLP) and
speech tools for the Italian language”6. Started in 2007, EVALITA was held seven times in
2007, 2009, 2011, 2014, 2016, 2018, and 2020, and its eighth edition is scheduled for 2023. The
retrospective article by Passaro et al.[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] describes a healthy community, reflected by a growing
number of shared tasks proposed at each edition, culminating with the 14 tasks at EVALITA
2020 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. At the same time, more interestingly for this paper, the number of classification tasks
has consistently grown over the years. This phenomenon became apparent in the 2018
edition of EVALITA [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where a single system was submitted to four different tasks (ABSITA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
GxG [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], HaSpeeDe [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and IronITA [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) and ranked first in most of the individual subtasks [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
This system achieved very high results on all the tasks by leveraging multi-task
learning. While this advancement was rightly praised, it also spurred a discussion about the
format of the shared tasks organized at EVALITA: if many tasks follow the same format
(text classification), then the evaluation campaign may be shifting its focus towards learning
models, with less regard for the underlying language phenomena.
      </p>
      <p>
        The latest edition of EVALITA in 2020 confirmed this trend, with at least four “pure” text
classification tasks (AMI [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], SARDISTANCE [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], HaSpeeDe 2 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and TAG-it [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]) and
a few more where classification plays an important role (DANKMEMES [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and
ATE_ABSITA [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]).
      </p>
      <p>In this paper, we revisit a number of tasks from the past editions of EVALITA in the light of
the newest technologies available for NLP. We focus on classification tasks (Section 4), although
in principle the experiment could be extended to other forms of inference over textual data. In
particular, we consider the recently proposed paradigm of prompt-based learning (Section 2),
which makes use of large pre-trained language models (Section 3) to perform classification
in a zero-shot fashion. With the right combination of parameters, prompt-based zero-shot
classifiers often perform surprisingly well, raising important questions about the
future of evaluation in NLP:</p>
      <p>R1: Is supervised learning becoming obsolete in NLP, along with the need for
training data?
If pre-trained language models can provide acceptable predictions without training data, in
particular superior to those of classical, pre-neural machine learning models, then perhaps the
baseline methods typically associated with shared tasks should be rethought.</p>
      <p>R2: Should zero-shot methods become the new baseline for NLP tasks?
The rest of this paper presents an experiment where a number of language models are used in
combination with prompt-based learning and tested against benchmarks provided by EVALITA,
in order to answer these questions.</p>
      <sec id="sec-1-1">
        <p>4 https://germeval.github.io/</p>
        <p>5 https://sites.google.com/view/iberlef2022</p>
        <p>6 https://www.evalita.it</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>
        Prompt-based learning [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] is a recent paradigm which gained enormous traction in the NLP
community, applied, among other tasks, to zero-shot classification. In a nutshell, prompt-based
classification makes use of large pre-trained language models to map labels to handcrafted or
automatically derived natural language expressions. The plausibility of the instance to classify,
augmented with the prompt, determines the label without the need for further training or
fine-tuning. Prompting for NLP is an active area of research. Solutions have been proposed for
automatically inducing prompts [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], to improve the learning process, e.g. with calibration [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ],
and to adapt the method to few-shot learning [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>In this paper, we propose an experiment of classification with prompts and pre-trained
models with purposely simplistic characteristics. For each binary classification task, we create
exactly two verbalizations, one for each label. The template for the verbalizations is fixed and
it belongs to one of two types, namely text classification and author profiling. Furthermore,
the templates provide exactly one slot, which is filled with exactly one word. Table 1 illustrates
the verbalizations associated with each label in our experiments. The verbalizations are manually
crafted, without any effort to optimize them or to tune any parameters.</p>
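      <p>The decision rule just described can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's code: the function names, and the pluggable mask_filler_prob scoring function standing in for a masked language model, are assumptions.</p>
      <preformat>
```python
# Zero-shot prompt-based classification, reduced to its decision rule:
# append the template to the text and pick the label whose filler word
# the masked language model finds most plausible at the [mask] position.

def classify(text, template, fillers, mask_filler_prob):
    prompt = f"{text} {template}"   # e.g. "... Questa frase è [mask]"
    scores = {label: mask_filler_prob(prompt, filler)
              for label, filler in fillers.items()}
    return max(scores, key=scores.get)

# Binary irony task, with exactly one filler word per label as in Table 1.
irony_fillers = {"ironic": "ironica", "not ironic": "normale"}
```
      </preformat>
      <p>In the actual experiments, the scoring function is a pre-trained model queried at the [mask] position; here it is deliberately left abstract.</p>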
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Templates and verbalizations (positive and negative fillers) associated with each label.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Label</th><th>Template</th><th>Positive filler</th><th>Negative filler</th></tr>
          </thead>
          <tbody>
            <tr><td>irony</td><td rowspan="7">Questa frase è [mask] (EN) This sentence is [mask]</td><td>ironica (EN) ironic</td><td>normale (EN) normal</td></tr>
            <tr><td>hate</td><td>offensiva (EN) offensive</td><td>normale (EN) normal</td></tr>
            <tr><td>subjective</td><td>soggettiva (EN) subjective</td><td>oggettiva (EN) objective</td></tr>
            <tr><td>positive</td><td>positiva (EN) positive</td><td>normale (EN) normal</td></tr>
            <tr><td>negative</td><td>negativa (EN) negative</td><td>normale (EN) normal</td></tr>
            <tr><td>misogyny</td><td>misogina (EN) misogynous</td><td>normale (EN) normal</td></tr>
            <tr><td>aggressiveness</td><td>aggressiva (EN) aggressive</td><td>normale (EN) normal</td></tr>
            <tr><td>man/woman</td><td>L’autore di questa frase è [mask] (EN) The author of this sentence is [mask]</td><td>uomo (EN) man</td><td>donna (EN) woman</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        The experiment is implemented with OpenPrompt [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], a Python library that streamlines the
process of creating templates and verbalizers, up to the prediction of labels on textual data.7
      </p>
      <sec id="sec-2-1">
        <p>7 https://github.com/thunlp/OpenPrompt</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Models</title>
      <p>
        The classification power of prompt-based learning is only as good as the pre-trained model
that serves as the basis for the classification algorithm. In this section, we briefly describe
the three models used in the experiments presented in this paper. The models are based on
Bidirectional Encoder Representations from Transformers [21, BERT], a popular and
highperforming language model based on transformers [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>Two of the models used in this paper are monolingual and have been created specifically to
encode the properties of the Italian language. The third model is multilingual, i.e., trained on
text from multiple languages.</p>
      <sec id="sec-3-1">
        <title>3.1. AlBERTo</title>
        <p>
          The first neural language model that has been proposed for the Italian language is called
AlBERTo [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. AlBERTo is based on BERT and trained on a collection of 200 million posts from
Twitter from the corpus TWITA [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. The hyperparameter setting of AlBERTo mimics the original BERT
base model for English, with 12 hidden layers, 768-dimensional embeddings, and 12 attention
heads, for a total of 110 million parameters. AlBERTo is available from the Huggingface model
repository8 with the identifier:
m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0
        </p>
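        <p>As a sanity check of the figure above, the parameter count of a BERT-base-style architecture can be reproduced with back-of-the-envelope arithmetic. The vocabulary and maximum-position sizes below are the standard English BERT-base values, used here only as assumptions (AlBERTo has its own tokenizer).</p>
        <preformat>
```python
# Approximate parameter count of a BERT-base-style model:
# 12 layers, hidden size 768, feed-forward size 3072.
hidden, layers, ffn, vocab, max_pos = 768, 12, 3072, 30522, 512

embeddings = (vocab + max_pos + 2) * hidden      # word + position + segment tables
attention  = 4 * (hidden * hidden + hidden)      # Q, K, V and output projections
ffn_block  = 2 * hidden * ffn + ffn + hidden     # two linear layers with biases
per_layer  = attention + ffn_block + 4 * hidden  # plus two layer norms
total = embeddings + layers * per_layer          # roughly 109 million
```
        </preformat>
        <p>The result, about 109 million parameters, is consistent with the roughly 110 million reported for this family of models.</p>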
      </sec>
      <sec id="sec-3-2">
        <title>3.2. MDZ Italian BERT</title>
        <p>
          The MDZ Digital Library team at the Bavarian State Library published a set of BERT and
ELECTRA [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] models trained on a Wikipedia dump, the OPUS corpora collection [26], and the
Italian part of the OSCAR corpus [27], for a total of about 13 billion tokens. The architecture
of the network is for the most part the same as AlBERTo: 12 hidden layers, 768-dimensional
embeddings, and 12 attention heads. The Italian BERT model used for the experiments in this
paper is available from the Huggingface model repository with the identifier:
dbmdz/bert-base-italian-xxl-uncased
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Multilingual BERT</title>
        <p>
          The multilingual BERT, in its cased and uncased variants, is one of the first models released
together with the BERT architecture itself [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. It is trained on text in 102 languages from
Wikipedia with a masked language model goal. Although it has been surpassed in performance
for many NLP tasks, Multilingual BERT has been widely adopted, also because pre-trained
language models for languages other than English are often unavailable or smaller than their
English counterparts. The Multilingual BERT model used for the experiments in this paper is
available from the Huggingface model repository with the identifier:
bert-base-multilingual-uncased
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Tasks</title>
      <p>
        Six shared tasks have been selected from the past three editions of EVALITA, one from EVALITA
2016 [28], four from EVALITA 2018 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and one from EVALITA 2020 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. All tasks are
classification tasks, and more specifically binary classification tasks, i.e., where the label to predict for
each textual instance can have one of two possible values. Table 2 summarizes the tasks selected
for the experiments presented in this paper.
      </p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption>
          <p>Tasks and labels selected for the experiments.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Task</th><th>Label</th></tr>
          </thead>
          <tbody>
            <tr><td>IronITA</td><td>irony</td></tr>
            <tr><td>HaSpeeDe (TW)</td><td>hate</td></tr>
            <tr><td>HaSpeeDe (FB)</td><td>hate</td></tr>
            <tr><td>HaSpeeDe 2</td><td>hate</td></tr>
            <tr><td rowspan="2">AMI</td><td>misogyny</td></tr>
            <tr><td>aggressiveness</td></tr>
            <tr><td rowspan="4">SENTIPOLC</td><td>subjective</td></tr>
            <tr><td>positive</td></tr>
            <tr><td>negative</td></tr>
            <tr><td>irony</td></tr>
            <tr><td>GxG (CH)</td><td>man/woman</td></tr>
            <tr><td>GxG (DI)</td><td>man/woman</td></tr>
            <tr><td>GxG (JO)</td><td>man/woman</td></tr>
            <tr><td>GxG (TW)</td><td>man/woman</td></tr>
            <tr><td>GxG (YT)</td><td>man/woman</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>For all the shared tasks, we downloaded the test set textual data and labels from the European
Language Grid9 (ELG) [29]. The ELG is a recently proposed platform for Language Technology
in Europe, funded by the Horizon 2020 scheme. The EVALITA4ELG project [30]10 aims to create an open and
shared linguistic benchmark for Italian on a large set of representative tasks: it integrated a large
number of datasets and other resources, including pre-trained
models and systems, from all editions of EVALITA to date into the ELG. It is therefore sufficient
to register an account on the platform, and the data can be accessed programmatically with the
official ELG Python library.</p>
      <sec id="sec-4-1">
        <title>4.1. IronITA</title>
        <p>The EVALITA 2018 Task on Irony Detection in Italian Tweets [8, IronITA] is a shared task
focused on the automatic detection of irony in Italian tweets. The shared task is articulated in
two subtasks with increasing levels of granularity. The first subtask is a binary classification of
tweets into ironic vs. non-ironic. The second subtask adds the level of sarcasm to the classification,
conditioned on the presence of irony in the tweets. For the experiments on this task, we only
consider the first subtask.</p>
        <sec id="sec-4-1-1">
          <p>9 https://live.european-language-grid.eu/</p>
          <p>10 https://live.european-language-grid.eu/meta-forum-2022/project-expo/evalita4elg</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. HaSpeeDe and HaSpeeDe 2</title>
        <p>
          Hate Speech Detection (HaSpeeDe) is a classification task that was run twice, at EVALITA
2018 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and 2020 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], with a similar scheme but an updated dataset from one edition to the
other. The task focuses on the classification of hateful, aggressive, and offensive content in
social media data from Twitter and Facebook. The first edition of HaSpeeDe features a binary
classification task (hate vs. not hate) and a cross-domain subtask. In this paper, we used the test
set of the first two subtasks, i.e., binary classification of hate on Twitter (TW) and Facebook
(FB). HaSpeeDe 2 proposed a couple of additional subtasks, namely stereotype detection and
the identification of nominal utterances linked to hateful content. For the purpose of this paper,
we only used the data and labels from the main subtask of HaSpeeDe 2.
        </p>
      </sec>
      <sec id="sec-4-ami">
        <title>4.3. AMI</title>
        <p>
          The Automatic Misogyny Identification shared task at EVALITA 2020 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] proposes a benchmark
for the classification of misogynistic and aggressive content towards women in Italian tweets.
The main task is a double binary classification, where systems are asked to label tweets with
two independent labels: misogynous vs. not misogynous and aggressive vs. not aggressive.
Furthermore, the second subtask of AMI introduces a synthetic dataset to measure the fairness
of misogyny classification models. In this paper, we only used the binary classification data
from the first subtask of AMI (misogyny and aggressiveness).
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.4. SENTIPOLC</title>
        <p>The Sentiment Polarity Classification task (SENTIPOLC) was organized at EVALITA 2014 [31]
and 2016 [32], with the second edition including the data used for the previous one plus a new
test set. The task is focused on sentiment analysis of Italian tweets, with three classification
tasks: subjectivity, polarity, and irony. The main task, the classification of polarity, is cast as a double
binary classification task, where systems must produce two independent labels for the positive and
negative sentiment found in the text. In this way, the SENTIPOLC annotation scheme is able
to encode positive and negative sentiment, as well as neutral (both the positive and negative
labels are absent) and mixed sentiment (both labels are present). For
the experiments in this paper, we use the test sets of all four binary classification tasks of
SENTIPOLC 2016.</p>
      </sec>
      <sec id="sec-4-gxg">
        <title>4.5. GxG</title>
        <p>The Cross-Genre Gender Prediction task [6, GxG] was organized at EVALITA 2018. The shared
task falls in the area of author profiling, asking participant systems to predict
whether the author of a short text is a man or a woman. The texts come from five different
sources: Twitter (TW), YouTube (YT), children's writings (CH), newspapers (JO, for journalism),
and personal diaries (DI). GxG places an emphasis on cross-dataset prediction, where a model
is trained on data from one domain (or source, in this case) and predictions are made on
data from a different one. For this paper, we use the five sets independently, since no training is
involved in our experiment. In this binary classification task, there is no natural negative or
positive label; therefore, we impose the arbitrary mapping man = negative label, woman = positive
label.</p>
      </sec>
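      <p>The two label encodings described in this section (the SENTIPOLC double binary scheme and the arbitrary GxG mapping) are compact enough to spell out in code; the helper names below are illustrative, not taken from the paper's implementation.</p>
      <preformat>
```python
# SENTIPOLC: two independent binary labels encode four sentiment classes.
def decode_sentipolc(positive, negative):
    if positive and negative:
        return "mixed"
    if positive:
        return "positive"
    if negative:
        return "negative"
    return "neutral"

# GxG: no natural polarity, so an arbitrary mapping is imposed.
GXG_LABELS = {"man": 0, "woman": 1}  # man = negative, woman = positive
```
      </preformat>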
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In this section, we present the results of the experiment of prompt-based classification on
EVALITA tasks. The results are presented separately for each task, because evaluation metrics
may vary from one task to another: accuracy, F1-score of the positive class, and
macro-averaged F1-score are used. Moreover, we present the results, in Tables 4–7, along with the
baseline(s) and best systems according to the reports of the individual tasks.</p>
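      <p>For reference, the three metrics can be computed as follows for binary labels (0/1). This is a plain restatement of the standard definitions, not code from the evaluation campaigns.</p>
      <preformat>
```python
def accuracy(gold, pred):
    # Fraction of instances where the prediction matches the gold label.
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1(gold, pred, cls=1):
    # F1-score of one class (cls=1 gives the F1 of the positive class).
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def macro_f1(gold, pred):
    # Unweighted mean of the per-class F1-scores.
    return (f1(gold, pred, 0) + f1(gold, pred, 1)) / 2
```
      </preformat>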
      <table-wrap id="tab-res-irony-ami">
        <caption>
          <p>Results on IronITA task A and AMI task A: the three prompt-based classifiers, the task baselines, and the best participating system.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Task</th><th>System</th><th>Score</th></tr>
          </thead>
          <tbody>
            <tr><td rowspan="6">IronITA task A</td><td>Prompt-based</td><td>.419</td></tr>
            <tr><td>Prompt-based</td><td>.469</td></tr>
            <tr><td>Prompt-based</td><td>.573</td></tr>
            <tr><td>Baseline (most frequent class)</td><td>.334</td></tr>
            <tr><td>Baseline (random)</td><td>.505</td></tr>
            <tr><td>Best system (ItaliaNLP)</td><td>.731</td></tr>
            <tr><td rowspan="5">AMI task A</td><td>Prompt-based</td><td/></tr>
            <tr><td>Prompt-based</td><td/></tr>
            <tr><td>Prompt-based</td><td/></tr>
            <tr><td>Baseline (most frequent class)</td><td/></tr>
            <tr><td>Best system (jigsaw)</td><td/></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The results of this experiment show that prompt-based classification (at least, this simplified
version of it) usually beats trivial baselines, but otherwise underperforms with respect to
supervised models on benchmarks for the Italian language. This is expected, since the method
is fully zero-shot. The results on GxG, the only task related to author profiling, are closer to the
best performing systems of the shared task, indicating an expressive power of the language
models beyond the standing meaning of the text. Interestingly, the results vary widely between
pre-trained language models, with none of the three models being clearly superior to the others
across tasks.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and Conclusion</title>
      <p>(Results on the GxG subsets CH, DI, JO, TW, and YT: for each subset, the three prompt-based classifiers are compared with the best participating system, ItaliaNLP for CH, DI, TW, and YT, and UniOR for JO.)</p>
      <p>Betteridge’s law of headlines11 states that “any headline that ends in a question mark can be
answered by the word no”. This paper is no exception: the answer to the question Is EVALITA
done? is certainly no. The prompt-based systems presented in this paper are far from
the classification performance of their supervised counterparts on the EVALITA benchmarks.
This result is in stark contrast to the results reported on English benchmarks12. Moreover, the
performance of the two Italian models and of the multilingual model tested in this paper is
unstable, with some models apparently more fit to certain tasks than others, raising the question of
whether the subpar performance is due to the method or to the underlying language-specific
pre-trained models. However, the results of the prompt-based models could be undermined by
the lack of optimization of verbalizers and templates. There is certainly space for improvement,
which was not the main focus of this paper, including an analysis of the disagreement between
verbalizers, and of the actual output of the prompt-based models.</p>
      <p>11 https://web.archive.org/web/20090226202006/http://www.technovia.co.uk/2009/02/techcrunch-irresponsible-journalism.html</p>
      <p>It is worth noting that this new technology allows us to create zero-shot classifiers for
rather abstract language classification problems. Recent literature indicates that a few
training instances (few-shot learning) are often sufficient to greatly increase the performance of
prompt-based classifiers [33]. Considering that the experiments in this paper make use only of
the most basic elements of prompt-based classification, this paradigm should be regarded as
a new frontier, not only for the advancement of text classification methodology, but also for
its evaluation. Supervised learning in NLP is perhaps not on its way to obsolescence (R1), but
the growing literature on zero-shot classification indicates at least that there is a new player
on the field. Would it make sense to organize a shared task, as part of an evaluation campaign
like EVALITA, where no training data is provided at all (R2)? The first results presented in this
paper seem to indicate that this is the case, paving the way for evaluation campaigns focused
on zero-shot learning for NLP.</p>
      <p>12 https://github.com/thunlp/OpenPrompt/tree/main/results/</p>
      <p>[25] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, in: International Conference on Learning Representations, 2020, pp. 1–18. URL: https://openreview.net/forum?id=r1xMH1BtvB.</p>
      <p>[26] J. Tiedemann, L. Nygaard, The OPUS corpus - parallel and free: http://logos.uio.no/opus, in: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), European Language Resources Association (ELRA), Lisbon, Portugal, 2004, pp. 1183–1186. URL: http://www.lrec-conf.org/proceedings/lrec2004/pdf/320.pdf.</p>
      <p>[27] J. Abadji, P. J. O. Suárez, L. Romary, B. Sagot, Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus, in: H. Lüngen, M. Kupietz, P. Bański, A. Barbaresi, S. Clematide, I. Pisetta (Eds.), Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-9) 2021, Limerick, 12 July 2021 (Online-Event), Leibniz-Institut für Deutsche Sprache, Mannheim, 2021, pp. 1–9. URL: https://nbn-resolving.org/urn:nbn:de:bsz:mh39-104688. doi:10.14618/ids-pub-10468.</p>
      <p>[28] P. Basile, F. Cutugno, M. Nissim, V. Patti, R. Sprugnoli, et al., EVALITA 2016: Overview of the 5th evaluation campaign of natural language processing and speech tools for Italian, in: 3rd Italian Conference on Computational Linguistics, CLiC-it 2016, and 5th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, EVALITA 2016, volume 1749, CEUR-WS, 2016, pp. 1–4.</p>
      <p>[29] G. Rehm, M. Berger, E. Elsholz, S. Hegele, F. Kintzel, K. Marheinecke, S. Piperidis, M. Deligiannis, D. Galanis, K. Gkirtzou, P. Labropoulou, K. Bontcheva, D. Jones, I. Roberts, J. Hajič, J. Hamrlová, L. Kačena, K. Choukri, V. Arranz, A. Vasiļjevs, O. Anvari, A. Lagzdiņš, J. Melņika, G. Backfried, E. Dikici, M. Janosik, K. Prinz, C. Prinz, S. Stampler, D. Thomas-Aniola, J. M. Gómez-Pérez, A. Garcia Silva, C. Berrío, U. Germann, S. Renals, O. Klejch, European Language Grid: An overview, in: Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 3366–3380. URL: https://aclanthology.org/2020.lrec-1.413.</p>
      <p>[30] V. Basile, C. Bosco, M. Fell, V. Patti, R. Varvara, Italian NLP for everyone: Resources and models from EVALITA to the European Language Grid, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 174–180. URL: https://aclanthology.org/2022.lrec-1.19.</p>
      <p>[31] V. Basile, A. Bolioli, V. Patti, P. Rosso, M. Nissim, Overview of the Evalita 2014 SENTIment POLarity Classification task, 2014, pp. 50–57.</p>
      <p>[32] F. Barbieri, V. Basile, D. Croce, M. Nissim, N. Novielli, V. Patti, Overview of the Evalita 2016 sentiment polarity classification task, in: P. Basile, A. Corazza, F. Cutugno, S. Montemagni, M. Nissim, V. Patti, G. Semeraro, R. Sprugnoli (Eds.), Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) &amp; Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016, volume 1749 of CEUR Workshop Proceedings, CEUR-WS.org, 2016, pp. 1–11. URL: http://ceur-ws.org/Vol-1749/paper_026.pdf.</p>
      <p>[33] T. Schick, H. Schütze, Exploiting cloze-questions for few-shot text classification and natural language inference, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 255–269. URL: https://aclanthology.org/2021.eacl-main.20. doi:10.18653/v1/2021.eacl-main.20.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          , Preface to the
          <source>Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI)</source>
          , in: D.
          <string-name>
            <surname>Nozza</surname>
            ,
            <given-names>L. C.</given-names>
          </string-name>
          <string-name>
            <surname>Passaro</surname>
          </string-name>
          , M. Polignano (Eds.),
          <source>Proceedings of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI</source>
          <year>2022</year>
          )
          <article-title>co-located with 21th International Conference of the Italian Association for Artificial Intelligence (AI*IA</article-title>
          <year>2022</year>
          ), November 30,
          <year>2022</year>
          , CEUR-WS.org,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Di</given-names>
            <surname>Maro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <article-title>Lessons learned from evalita 2020 and thirteen years of evaluation of italian language technology</article-title>
          ,
          <source>IJCoL. Italian Journal of Computational Linguistics</source>
          <volume>6</volume>
          (
          <year>2020</year>
          )
          <fpage>79</fpage>
          -
          <lpage>102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Maro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <surname>EVALITA</surname>
          </string-name>
          <year>2020</year>
          :
          <article-title>Overview of the 7th evaluation campaign of natural language processing and speech tools for italian</article-title>
          , in: V.
          <string-name>
            <surname>Basile</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Croce</surname>
            ,
            <given-names>M. D.</given-names>
          </string-name>
          <string-name>
            <surname>Maro</surname>
          </string-name>
          , L. C. Passaro (Eds.),
          <source>Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2020</year>
          ),
          <article-title>Online event</article-title>
          ,
          <year>December 17th</year>
          ,
          <year>2020</year>
          , volume
          <volume>2765</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . URL: http://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>2765</volume>
          /overview.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>T.</given-names> <surname>Caselli</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Novielli</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Patti</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Rosso</surname></string-name>,
          <article-title>EVALITA 2018: Overview on the 6th evaluation campaign of natural language processing and speech tools for Italian</article-title>,
          in: T. Caselli, N. Novielli, V. Patti, P. Rosso (Eds.),
          <source>Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy, December 12-13, 2018</source>,
          volume <volume>2263</volume> of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2018</year>, pp. <fpage>1</fpage>-<lpage>6</lpage>.
          URL: http://ceur-ws.org/Vol-2263/paper001.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>P.</given-names> <surname>Basile</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Croce</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Basile</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Polignano</surname></string-name>,
          <article-title>Overview of the EVALITA 2018 aspect-based sentiment analysis task (ABSITA)</article-title>,
          in: T. Caselli, N. Novielli, V. Patti, P. Rosso (Eds.),
          <source>Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)</source>,
          volume <volume>2263</volume>, CEUR Workshop Proceedings (CEUR-WS.org), Torino,
          <year>2018</year>, pp. <fpage>10</fpage>-<lpage>16</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>F.</given-names> <surname>Dell'Orletta</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Nissim</surname></string-name>,
          <article-title>Overview of the EVALITA 2018 cross-genre gender prediction (GxG) task</article-title>,
          in: T. Caselli, N. Novielli, V. Patti, P. Rosso (Eds.),
          <source>Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)</source>,
          volume <volume>2263</volume>, CEUR Workshop Proceedings (CEUR-WS.org), Torino,
          <year>2018</year>, pp. <fpage>1</fpage>-<lpage>9</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>C.</given-names> <surname>Bosco</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Dell'Orletta</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Poletto</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Sanguinetti</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Tesconi</surname></string-name>,
          <article-title>Overview of the EVALITA 2018 hate speech detection task</article-title>,
          in: T. Caselli, N. Novielli, V. Patti, P. Rosso (Eds.),
          <source>Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)</source>,
          volume <volume>2263</volume>, CEUR Workshop Proceedings (CEUR-WS.org), Torino,
          <year>2018</year>, pp. <fpage>1</fpage>-<lpage>9</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>A. T.</given-names> <surname>Cignarella</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Frenda</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Basile</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Bosco</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Patti</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Rosso</surname></string-name>,
          <article-title>Overview of the EVALITA 2018 task on Irony Detection in Italian Tweets (IRONITA)</article-title>,
          in: T. Caselli, N. Novielli, V. Patti, P. Rosso (Eds.),
          <source>Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)</source>,
          volume <volume>2263</volume>, CEUR Workshop Proceedings (CEUR-WS.org), Torino,
          <year>2018</year>, pp. <fpage>1</fpage>-<lpage>9</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>A.</given-names> <surname>Cimino</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>De Mattei</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Dell'Orletta</surname></string-name>,
          <article-title>Multi-task learning in deep neural networks at EVALITA 2018</article-title>,
          in: T. Caselli, N. Novielli, V. Patti, P. Rosso (Eds.),
          <source>Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy, December 12-13, 2018</source>,
          volume <volume>2263</volume> of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2018</year>, pp. <fpage>1</fpage>-<lpage>10</lpage>.
          URL: http://ceur-ws.org/Vol-2263/paper013.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>E.</given-names> <surname>Fersini</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Nozza</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Rosso</surname></string-name>,
          <article-title>AMI@EVALITA2020: Automatic misogyny identification</article-title>,
          in: V. Basile, D. Croce, M. Di Maro, L. C. Passaro (Eds.),
          <source>Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020)</source>,
          CEUR Workshop Proceedings (CEUR-WS.org), Online,
          <year>2020</year>, pp. <fpage>1</fpage>-<lpage>8</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>A. T.</given-names> <surname>Cignarella</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lai</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Bosco</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Patti</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Rosso</surname></string-name>,
          <article-title>SardiStance@EVALITA2020: Overview of the Task on Stance Detection in Italian Tweets</article-title>,
          in: V. Basile, D. Croce, M. Di Maro, L. C. Passaro (Eds.),
          <source>Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020)</source>,
          CEUR Workshop Proceedings (CEUR-WS.org), Online,
          <year>2020</year>, pp. <fpage>1</fpage>-<lpage>10</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>M.</given-names> <surname>Sanguinetti</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Comandini</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Di Nuovo</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Frenda</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Stranisci</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Bosco</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Caselli</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Patti</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Russo</surname></string-name>,
          <article-title>HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task</article-title>,
          in: V. Basile, D. Croce, M. Di Maro, L. C. Passaro (Eds.),
          <source>Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020)</source>,
          CEUR Workshop Proceedings (CEUR-WS.org), Online,
          <year>2020</year>, pp. <fpage>1</fpage>-<lpage>9</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>A.</given-names> <surname>Cimino</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Dell'Orletta</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Nissim</surname></string-name>,
          <article-title>TAG-it@EVALITA2020: Overview of the topic, age, and gender prediction task for Italian</article-title>,
          in: V. Basile, D. Croce, M. Di Maro, L. C. Passaro (Eds.),
          <source>Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020)</source>,
          CEUR Workshop Proceedings (CEUR-WS.org), Online,
          <year>2020</year>, pp. <fpage>1</fpage>-<lpage>9</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>M.</given-names> <surname>Miliani</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Giorgi</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Rama</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Anselmi</surname></string-name>,
          <string-name><given-names>G. E.</given-names> <surname>Lebani</surname></string-name>,
          <article-title>DANKMEMES@EVALITA2020: The memeing of life: memes, multimodality and politics</article-title>,
          in: V. Basile, D. Croce, M. Di Maro, L. C. Passaro (Eds.),
          <source>Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020)</source>,
          CEUR Workshop Proceedings (CEUR-WS.org), Online,
          <year>2020</year>, pp. <fpage>1</fpage>-<lpage>9</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>L.</given-names> <surname>De Mattei</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>De Martino</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Iovine</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Miaschi</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Polignano</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Rambelli</surname></string-name>,
          <article-title>ATE_ABSITA@EVALITA2020: Overview of the aspect term extraction and aspect-based sentiment analysis task</article-title>,
          in: V. Basile, D. Croce, M. Di Maro, L. C. Passaro (Eds.),
          <source>Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020)</source>,
          CEUR Workshop Proceedings (CEUR-WS.org), Online,
          <year>2020</year>, pp. <fpage>1</fpage>-<lpage>8</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>P.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Yuan</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Fu</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Jiang</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Hayashi</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Neubig</surname></string-name>,
          <article-title>Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing</article-title>,
          <source>ACM Computing Surveys (CSUR)</source>
          (<year>2022</year>).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><given-names>G.</given-names> <surname>Cui</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Hu</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Ding</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Huang</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>,
          <article-title>Prototypical verbalizer for prompt-based few-shot tuning</article-title>,
          in:
          <source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>,
          Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>, pp. <fpage>7014</fpage>-<lpage>7024</lpage>.
          URL: https://aclanthology.org/2022.acl-long.483. doi:10.18653/v1/2022.acl-long.483.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><given-names>Z.</given-names> <surname>Zhao</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Wallace</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Feng</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Klein</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Singh</surname></string-name>,
          <article-title>Calibrate before use: Improving few-shot performance of language models</article-title>,
          in: M. Meila, T. Zhang (Eds.),
          <source>Proceedings of the 38th International Conference on Machine Learning</source>,
          volume <volume>139</volume> of Proceedings of Machine Learning Research, PMLR,
          <year>2021</year>, pp. <fpage>12697</fpage>-<lpage>12706</lpage>.
          URL: https://proceedings.mlr.press/v139/zhao21c.html.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name><given-names>T.</given-names> <surname>Le Scao</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Rush</surname></string-name>,
          <article-title>How many data points is a prompt worth?</article-title>,
          in:
          <source>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>,
          Association for Computational Linguistics, Online,
          <year>2021</year>, pp. <fpage>2627</fpage>-<lpage>2636</lpage>.
          URL: https://aclanthology.org/2021.naacl-main.208. doi:10.18653/v1/2021.naacl-main.208.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name><given-names>N.</given-names> <surname>Ding</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Hu</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Zhao</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Zheng</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Sun</surname></string-name>,
          <article-title>OpenPrompt: An open-source framework for prompt-learning</article-title>,
          in:
          <source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>,
          Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>, pp. <fpage>105</fpage>-<lpage>113</lpage>.
          URL: https://aclanthology.org/2022.acl-demo.10. doi:10.18653/v1/2022.acl-demo.10.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name><given-names>J.</given-names> <surname>Devlin</surname></string-name>,
          <string-name><given-names>M.-W.</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name>,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>,
          in:
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</source>,
          Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>, pp. <fpage>4171</fpage>-<lpage>4186</lpage>.
          URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name><given-names>A.</given-names> <surname>Vaswani</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Shazeer</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Parmar</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Uszkoreit</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Jones</surname></string-name>,
          <string-name><given-names>A. N.</given-names> <surname>Gomez</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Kaiser</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Polosukhin</surname></string-name>,
          <article-title>Attention is all you need</article-title>,
          in:
          <source>NIPS'17</source>,
          Curran Associates Inc., Red Hook, NY, USA,
          <year>2017</year>, pp. <fpage>6000</fpage>-<lpage>6010</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name><given-names>M.</given-names> <surname>Polignano</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Basile</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>de Gemmis</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Semeraro</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Basile</surname></string-name>,
          <article-title>AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets</article-title>,
          in: R. Bernardi, R. Navigli, G. Semeraro (Eds.),
          <source>Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019</source>,
          volume <volume>2481</volume> of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2019</year>, pp. <fpage>1</fpage>-<lpage>6</lpage>.
          URL: http://ceur-ws.org/Vol-2481/paper57.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name><given-names>V.</given-names> <surname>Basile</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lai</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Sanguinetti</surname></string-name>,
          <article-title>Long-term social media data collection at the University of Turin</article-title>,
          in: E. Cabrio, A. Mazzei, F. Tamburini (Eds.),
          <source>Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Torino, Italy, December 10-12, 2018</source>,
          volume <volume>2253</volume> of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2018</year>, pp. <fpage>1</fpage>-<lpage>6</lpage>.
          URL: http://ceur-ws.org/Vol-2253/paper48.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name><given-names>K.</given-names> <surname>Clark</surname></string-name>,
          <string-name><given-names>M.-T.</given-names> <surname>Luong</surname></string-name>,
          <string-name><given-names>Q. V.</given-names> <surname>Le</surname></string-name>,
          <string-name><given-names>C. D.</given-names> <surname>Manning</surname></string-name>,
          <article-title>ELECTRA: Pre-training text encoders as discriminators rather than generators</article-title>,
          in:
          <source>International Conference on Learning Representations (ICLR)</source>,
          <year>2020</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>