Is EVALITA Done?
On the Impact of Prompting on the Italian NLP
Evaluation Campaign.
Valerio Basile1
1 University of Turin, C.so Svizzera 185, 10147, Italy


                                         Abstract
                                         Prompt-based learning is a recent paradigm in NLP that leverages large pre-trained language models
                                         to perform a variety of tasks. With this technique, it is possible to build classifiers that do not need
                                         training data (zero-shot). In this paper, we assess the status of prompt-based learning applied to several
                                         text classification tasks in the Italian language. The results indicate that the performance gap with respect to
                                         current supervised methods is still relevant. However, the differences in performance among pre-trained
                                         models and the ability of prompt-based classifiers to operate in a zero-shot fashion open a
                                         discussion regarding the next generation of evaluation campaigns for NLP.

                                         Keywords
                                         Prompt-based learning, Text Classification, Benchmarking




1. Introduction
Shared tasks and evaluation campaigns are a pillar of the research in Natural Language Pro-
cessing. The constant effort by the community to organize, maintain, and update shared tasks
allows researchers to test their models and algorithms in systematic ways, compare the perfor-
mance fairly, and apply them to new languages and domains. An important byproduct of the
organization of a shared task is typically novel data, which gets distributed across the research
community.
   Perhaps the best known, long-running evaluation campaign in the field of Natural Language
Processing is SemEval1 . Originating in 1998, this initiative was at first called SensEval and
focused on semantic-related tasks. Over the years, the campaign evolved to include a large
variety of shared tasks in NLP. Some evaluation campaigns are focused on specific tasks or
research areas, such as PAN2 for digital text forensics and stylometry. Alternatively, shared
tasks are sometimes organized in a standalone fashion, or linked to an event such as a workshop,
like Threat, Aggression and Cyberbullying (TRAC)3 . Finally, several research communities
gravitating around a specific geographic area or interested in a specific language organize their

NL4AI 2022: Sixth Workshop on Natural Language for Artificial Intelligence, November 30, 2022, Udine, Italy [1]
valerio.basile@unito.it (V. Basile)
ORCID: 0000-0001-8110-6832 (V. Basile)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073




1 https://semeval.github.io/
2 http://pan.webis.de/
3 https://sites.google.com/view/trac2022/
own NLP evaluation campaigns, such as GermEval4 for German or IberLEF (previously known
as IberEval)5 for Spanish and other Iberian languages.
   EVALITA is the “periodic evaluation campaign of Natural Language Processing (NLP) and
speech tools for the Italian language”6 . Started in 2007, EVALITA was held seven times in
2007, 2009, 2011, 2014, 2016, 2018, and 2020, and its eighth edition is scheduled for 2023. The
retrospective article by Passaro et al. [2] describes a healthy community, reflected by a growing
number of shared tasks proposed at each edition, culminating with the 14 tasks at EVALITA
2020 [3]. At the same time, more interestingly for this paper, the number of classification tasks
has consistently grown over the years. This phenomenon became apparent in the 2018
edition of EVALITA [4], where a single system was submitted to four different tasks (ABSITA [5],
GxG [6], HaSpeeDe [7], and IronITA [8]) and ranked first in most of the individual subtasks [9].
This system was able to achieve very high results on all the tasks by leveraging multi-task
learning. While this advancement was rightly praised, it also spurred the discussion about the
format of the shared tasks organized at EVALITA, i.e., if many tasks follow the same format
(text classification), then the evaluation campaign may be shifting its focus towards learning
models, with less regard for the underlying language phenomena.
   The latest edition of EVALITA in 2020 confirmed this trend, with at least four “pure” text
classification tasks (AMI [10], SARDISTANCE [11], HaSpeeDe 2 [12], and TAG-it [13]) and
a few more where classification plays an important role (DANKMEMES [14] and
ATE_ABSITA [15]).
   In this paper, we revisit a number of tasks from the past editions of EVALITA in the light of
the newest technologies available for NLP. We focus on classification tasks (Section 4), although
in principle the experiment could be extended to other forms of inference over textual data. In
particular, we consider the recently proposed paradigm of prompt-based learning (Section 2),
which makes use of large pre-trained language models (Section 3) to perform classification
in a zero-shot fashion. With the right combination of parameters, prompt-based zero-shot
classifiers often perform surprisingly well, therefore raising important questions about the
future of the evaluation in NLP:

       R1: Is supervised learning becoming obsolete in NLP, along with the need for
       training data?

If pre-trained language models can provide acceptable predictions without training data, in
particular superior to those of classical, pre-neural machine learning models, then perhaps the
baseline methods typically associated with shared tasks should be rethought.

       R2: Should zero-shot methods become the new baseline for NLP tasks?

The rest of this paper presents an experiment where a number of language models are used in
combination with prompt-based learning and tested against benchmarks provided by EVALITA,
in order to answer these questions.

4 https://germeval.github.io/
5 https://sites.google.com/view/iberlef2022
6 https://www.evalita.it
2. Methodology
Prompt-based learning [16] is a recent paradigm which gained enormous traction in the NLP
community, applied, among other tasks, to zero-shot classification. In a nutshell, prompt-based
classification makes use of large pre-trained language models to map labels to handcrafted or
automatically derived natural language expressions. The plausibility of the instance to classify,
augmented with the prompt, determines the label without the need for further training or
fine-tuning. Prompting for NLP is an active area of research. Solutions have been proposed for
automatically inducing prompts [17], for improving the learning process, e.g., with calibration [18],
and for adapting the method to few-shot learning [19].
   In this paper, we propose an experiment of classification with prompts and pre-trained
models with purposely simplistic characteristics. For each binary classification task, we create
exactly two verbalizations, one for each label. The template for the verbalizations is fixed and
it belongs to one of two types, namely text classification and author profiling. Furthermore,
the templates provide exactly one slot which is filled with exactly one word. Table 1 illustrates
the verbalizations associated with each label in our experiments. The verbalizations are manually
crafted, without any effort to optimize them or to tune any parameters.

    Label             Template                                        Positive filler              Negative filler
    irony                                                             ironica (EN ironic)          normale (EN normal)
    hate                                                              offensiva (EN offensive)     normale (EN normal)
    subjective                                                        soggettiva (EN subjective)   oggettiva (EN objective)
    positive          Questa frase è [mask]                           positiva (EN positive)       normale (EN normal)
    negative          (EN This sentence is [mask])                    negativa (EN negative)       normale (EN normal)
    misogyny                                                          misogina (EN misogynous)     normale (EN normal)
    aggressiveness                                                    aggressiva (EN aggressive)   normale (EN normal)
    man/woman         L’autore di questa frase è [mask]               uomo (EN man)                donna (EN woman)
                      (EN The author of this sentence is [mask])
Table 1
Verbalizations associated with binary labels.

  The experiment is implemented with OpenPrompt [20], a Python library that streamlines the
process of creating templates and verbalizers, up to the prediction of labels on textual data.7
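   To make the setup concrete, the following is a minimal sketch of the zero-shot pipeline described above, written against the OpenPrompt tutorial API (load_plm, ManualTemplate, ManualVerbalizer, PromptForClassification). The example sentence, the label names, and the choice of the MDZ Italian BERT checkpoint are illustrative assumptions, not taken from the actual experimental code.

    from openprompt.plms import load_plm
    from openprompt.prompts import ManualTemplate, ManualVerbalizer
    from openprompt.data_utils import InputExample
    from openprompt import PromptForClassification, PromptDataLoader
    import torch

    # Load a pre-trained masked language model (here: the MDZ Italian BERT checkpoint).
    plm, tokenizer, model_config, WrapperClass = load_plm(
        "bert", "dbmdz/bert-base-italian-xxl-uncased")

    # Fixed template with exactly one [mask] slot (the "text classification" type).
    template = ManualTemplate(
        text='{"placeholder":"text_a"} Questa frase è {"mask"}',
        tokenizer=tokenizer)

    # One single-word verbalization per label (here: the irony task of Table 1).
    verbalizer = ManualVerbalizer(
        classes=["non-ironic", "ironic"],
        label_words={"non-ironic": ["normale"], "ironic": ["ironica"]},
        tokenizer=tokenizer)

    # Zero-shot classifier: the pre-trained model is used as-is, with no training or fine-tuning.
    model = PromptForClassification(plm=plm, template=template, verbalizer=verbalizer)
    model.eval()

    dataset = [InputExample(guid=0, text_a="Che bella giornata, piove da tre ore.")]
    loader = PromptDataLoader(dataset=dataset, template=template, tokenizer=tokenizer,
                              tokenizer_wrapper_class=WrapperClass)

    with torch.no_grad():
        for batch in loader:
            logits = model(batch)                # one score per label
            print(torch.argmax(logits, dim=-1))  # 1 = ironic, 0 = non-ironic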




7 https://github.com/thunlp/OpenPrompt
3. Models
The classification power of prompt-based learning is only as good as the pre-trained model
that serves as the basis for the classification algorithm. In this section, we briefly describe
the three models used in the experiments presented in this paper. The models are based on
Bidirectional Encoder Representations from Transformers [21, BERT], a popular and high-
performing language model based on transformers [22].
  Two of the models used in this paper are monolingual and have been created specifically to
encode the properties of the Italian language. The third model is multilingual, i.e., trained on
text from multiple languages.

3.1. AlBERTo
The first neural language model that has been proposed for the Italian language is called
AlBERTo [23]. AlBERTo is based on BERT and trained on a collection of 200 million Twitter posts
from the TWITA corpus [24]. The hyperparameter setting of AlBERTo mimics the original English
base model, with 12 hidden layers, 768-dimensional embeddings, and 12 attention
heads, for a total of 110 million parameters. AlBERTo is available from the Huggingface model
repository8 with the identifier:
m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0


3.2. MDZ Italian BERT
The MDZ Digital Library team at the Bavarian State Library published a set of BERT and
ELECTRA [25] models trained on a Wikipedia dump, the OPUS corpora collection [26], and the
Italian part of the OSCAR corpus [27] for a total of about 13 million tokens. The architecture
of the network is for the most part the same as AlBERTo: 12 hidden layers, 768-dimensional
embeddings, and 12 attention heads. The Italian BERT model used for the experiments in this
paper is available from the Huggingface model repository with the identifier:
dbmdz/bert-base-italian-xxl-uncased


3.3. Multilingual BERT
The multilingual BERT, in its cased and uncased variants, is one of the first models released
together with the BERT architecture itself [21]. It is trained on text in 102 languages from
Wikipedia with a masked language modeling objective. Although it has been surpassed in performance
for many NLP tasks, Multilingual BERT has been widely adopted, also because pre-trained
language models for languages other than English are often unavailable or smaller than their
English counterparts. The Multilingual BERT model used for the experiments in this paper is
available from the Huggingface model repository with the identifier:
bert-base-multilingual-uncased
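
   As a rough illustration of what these models do with the prompts of Section 2, independently of OpenPrompt, the following sketch queries the three checkpoints through the Hugging Face transformers fill-mask pipeline. The example sentence is made up and the snippet is not part of the experimental pipeline.

    from transformers import pipeline

    # The three pre-trained models used in this paper, by Hugging Face identifier.
    MODELS = [
        "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0",
        "dbmdz/bert-base-italian-xxl-uncased",
        "bert-base-multilingual-uncased",
    ]

    for name in MODELS:
        fill = pipeline("fill-mask", model=name)
        # Masked language modeling: rank candidate words for the [MASK] slot.
        for prediction in fill("Questa frase è [MASK].")[:3]:
            print(name, prediction["token_str"], round(prediction["score"], 3))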




8 https://huggingface.co/models
4. Tasks
Six shared tasks have been selected from the past three editions of EVALITA, one from EVALITA
2016 [28], four from EVALITA 2018 [4], and one from EVALITA 2020 [3]. All tasks are classifi-
cation tasks, and more specifically binary classification tasks, i.e., where the label to predict for
each textual instance can have one of two possible values. Table 2 summarizes the tasks selected
for the experiments presented in this paper and statistics on their size and label distribution.

                  Task                 Label              Pos. labels    Neg. labels      Total
                  IronITA              irony                      435            437        872
                  HaSpeeDe (TW)        hate                       324            676       1000
                  HaSpeeDe (FB)        hate                       677            323       1000
                  HaSpeeDe 2           hate                       622            641       1263
                                       misogyny                   500            500       1000
                  AMI
                                       aggressiveness             176            824       1000
                                       subjective                1305            695       2000
                                       positive                   352           1648       2000
                  SENTIPOLC
                                       negative                   770           1230       2000
                                       irony                      235           1765       2000
                  GxG (CH)             man/woman                  100            100        200
                  GxG (DI)             man/woman                   37             37         74
                  GxG (JO)             man/woman                  100            100        200
                  GxG (TW)             man/woman                 3000           3000       6000
                  GxG (YT)             man/woman                 2200           2200       4400
Table 2
The six EVALITA shared tasks used as benchmarks in this paper and the distribution of the labels in
their test sets.

   For all the shared tasks, we downloaded the test set textual data and labels from the European
Language Grid9 (ELG) [29]. The ELG is a recently proposed platform for Language Technology
in Europe funded by the Horizon 2020 scheme. The EVALITA4ELG project [30]10 , whose main goal
is to create an open and shared linguistic benchmark for Italian on a large set of representative
tasks, integrated a large number of datasets and other resources, including pre-trained
models and systems, from all editions of EVALITA to date into the ELG. It is therefore sufficient
to register an account on the platform and the data can be accessed programmatically with the
official ELG Python library.
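   A possible way to retrieve one of the test sets programmatically is sketched below. The class and method names (Corpus.from_id, download) are assumptions based on the ELG Python SDK documentation, and the numeric corpus ID is a placeholder; both should be verified against the ELG catalogue.

    # Hedged sketch: download of an EVALITA test set from the European Language Grid.
    # Requires an ELG account and `pip install elg`; the corpus ID below is a placeholder.
    from elg import Corpus

    CORPUS_ID = 1234  # placeholder: look up the actual ID in the ELG catalogue

    corpus = Corpus.from_id(CORPUS_ID)       # fetch the catalogue entry
    corpus.download(folder="data/evalita")   # save the distribution files locally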

4.1. IronITA
The EVALITA 2018 Task on Irony Detection in Italian Tweets [8, IronITA] is a shared task
focused on the automatic detection of irony in Italian tweets. The shared task is articulated in
two subtasks with increasing level of granularity. The first subtask is a binary classification of
tweets into ironic vs. non-ironic. The second task adds the level of sarcasm to the classification,

9 https://live.european-language-grid.eu/
10 https://live.european-language-grid.eu/meta-forum-2022/project-expo/evalita4elg
conditioned on the presence of irony in the tweets. For the experiments of this task, we only
consider the first subtask.

4.2. HaSpeeDe and HaSpeeDe 2
Hate Speech Detection (HaSpeeDe) is a classification task that was run twice, at EVALITA
2018 [7] and 2020 [12], with a similar scheme but an updated dataset from one edition to the
other. The task focuses on the classification of hateful, aggressive, and offensive content in
social media data from Twitter and Facebook. The first edition of HaSpeeDe features a binary
classification task (hate vs. not hate) and a cross-domain subtask. In this paper, we used the test
set of the first two subtasks, i.e., binary classification of hate on Twitter (TW) and Facebook
(FB). HaSpeeDe 2 proposed a couple of additional subtasks, namely stereotype detection and
the identification of nominal utterances linked to hateful content. For the purpose of this paper,
we only used the data and labels from the main subtask of HaSpeeDe 2.

4.3. AMI
The Automatic Misogyny Identification shared task at EVALITA 2020 [10] proposes a benchmark
for the classification of misogynistic and aggressive content towards women in Italian tweets.
The main task is a double binary classification where systems are asked to label tweets with
two independent labels: misogynous vs. not misogynous and aggressive vs. not aggressive.
Furthermore, the second subtask of AMI introduces a synthetic dataset to measure the fairness
of misogyny classification models. In this paper, we only used the binary classification data
from the first subtask of AMI (misogyny and aggressiveness).

4.4. SENTIPOLC
The Sentiment Polarity Classification task (SENTIPOLC) was organized at EVALITA 2014 [31]
and 2016 [32], with the second edition including the data used for the previous one plus a new
test set. The task is focused on sentiment analysis on Italian tweets, with three classification
tasks: subjectivity, polarity, and irony. The main task, classification of polarity, is cast as a double
binary classification task, where systems must produce two independent labels for positive and
negative sentiment found in the text. In this way, the SENTIPOLC annotation scheme is able
to encode positive and negative sentiment, as well as neutral (both the positive and negative
labels are absent) and mixed sentiment (both the positive and negative labels are present). For
the experiments in this paper, we use the test sets of all four binary classification tasks of
SENTIPOLC 2016.

4.5. GxG
The Cross-Genre Gender Prediction task [6, GxG] was organized at EVALITA 2018. The shared
task falls in the area of author profiling, in particular asking participating systems to predict
whether the author of a short text is a man or a woman. The texts come from five different
sources: Twitter (TW), YouTube (YT), children writings (CH), newspapers (JO, for journalism),
and personal diaries (DI). GxG places an emphasis on cross-dataset prediction, where a model
is trained on a set of data from one domain (or source, in this case) and predictions are made on
data from a different one. For this paper, we use the five sets independently, since no training is
involved in our experiment. In this binary classification task, there are no natural negative and
positive labels, therefore we impose an arbitrary mapping: man = negative label; woman = positive
label.


5. Results
In this section, we present the results of the experiment of prompt-based classification on
EVALITA tasks. The results are presented separately for each task, because evaluation metrics
may vary from one task to another — accuracy, F1-score of the positive class, and macro-
averaged F1-score are used. Moreover, we present the results, in Tables 3–7, along with the
baseline(s) and best systems according to the reports of the individual tasks.
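These measures correspond to standard scikit-learn metrics; for clarity, a minimal sketch with made-up gold labels and predictions is given below.

    from sklearn.metrics import accuracy_score, f1_score

    # Illustrative gold labels and predictions for a binary task (1 = positive class).
    gold = [1, 0, 1, 1, 0, 0, 1, 0]
    pred = [1, 0, 0, 1, 0, 1, 1, 0]

    print(accuracy_score(gold, pred))             # accuracy (used for GxG, Table 7)
    print(f1_score(gold, pred))                   # F1-score of the positive class
    print(f1_score(gold, pred, average="macro"))  # macro-averaged F1 (most other tasks)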

    Task              System                                 Score
    IronITA task A    Prompt-based (AlBERTo)                  .419
                      Prompt-based (Italian BERT)             .469
                      Prompt-based (Multilingual BERT)        .573
                      Baseline (most frequent class)          .334
                      Baseline (random)                       .505
                      Best system (ItaliaNLP)                 .731
Table 3
Results on IronITA (irony detection) in terms of macro-averaged F1-score.


    Task                 System                                 Score
    HaSpeeDe-FB          Prompt-based (AlBERTo)                  .534
                         Prompt-based (Italian BERT)             .613
                         Prompt-based (Multilingual BERT)        .505
                         Baseline (most frequent class)          .244
                         Best system (ItaliaNLP)                 .828
    HaSpeeDe-TW          Prompt-based (AlBERTo)                  .625
                         Prompt-based (Italian BERT)             .590
                         Prompt-based (Multilingual BERT)        .507
                         Baseline (most frequent class)          .403
                         Best system (ItaliaNLP)                 .799
    HaSpeeDe 2 task A    Prompt-based (AlBERTo)                  .526
                         Prompt-based (Italian BERT)             .583
                         Prompt-based (Multilingual BERT)        .537
                         Baseline (most frequent class)          .336
                         Baseline (Support Vector Machine)       .721
                         Best system (TheNorth)                  .808
Table 4
Results on the two editions of HaSpeeDe (hate speech detection) in terms of macro-averaged F1-score.
    Task          System                                 Score
    AMI task A    Prompt-based (AlBERTo)                  .573
                  Prompt-based (Italian BERT)             .509
                  Prompt-based (Multilingual BERT)        .422
                  Baseline (most frequent class)          .665
                  Best system (jigsaw)                    .741
Table 5
Results on AMI (misogyny identification) in terms of the average of the macro-averaged F1-scores of
the two classes misogyny and aggressiveness.

    Task                              System                                 Score
    SENTIPOLC task 1: subjectivity    Prompt-based (AlBERTo)                  .374
                                      Prompt-based (Italian BERT)             .501
                                      Prompt-based (Multilingual BERT)        .443
                                      Baseline (most frequent class)          .394
                                      Best system (Unitor)                    .744
    SENTIPOLC task 2: polarity        Prompt-based (AlBERTo)                  .470
                                      Prompt-based (Italian BERT)             .498
                                      Prompt-based (Multilingual BERT)        .476
                                      Baseline (most frequent class)          .416
                                      Best system (UniPI)                     .663
    SENTIPOLC task 3: irony           Prompt-based (AlBERTo)                  .374
                                      Prompt-based (Italian BERT)             .400
                                      Prompt-based (Multilingual BERT)        .412
                                      Baseline (most frequent class)          .468
                                      Best system (tweet2check)               .541
Table 6
Results on SENTIPOLC (sentiment analysis) in terms of macro-averaged F1-score for tasks 1 and 3, and
the average of the macro-averaged F1-scores of the two classes positive and negative for task 2.


   The results of this experiment show that prompt-based classification (at least, this simplified
version of it) usually beats trivial baselines, but otherwise underperforms with respect to
supervised models on benchmarks for the Italian language. This is expected, since the method
is fully zero-shot. The results on GxG, the only task related to author profiling, are closer to the
best performing systems of the shared task, indicating an expressive power of the language
models beyond the standing meaning of the text. Interestingly, the results vary widely between
pre-trained language models, with none of the three models being clearly superior to the others
across tasks.


6. Discussion and Conclusion
Betteridge’s law of headlines11 states that “any headline that ends in a question mark can be
answered by the word no”. This paper is no exception: the answer to the question Is EVALITA
11 https://web.archive.org/web/20090226202006/http://www.technovia.co.uk/2009/02/techcrunch-irresponsible-journalism.html
    Task      System                                 Score
    GxG CH    Prompt-based (AlBERTo)                  .550
              Prompt-based (Italian BERT)             .570
              Prompt-based (Multilingual BERT)        .595
              Best system (ItaliaNLP)                 .640
    GxG DI    Prompt-based (AlBERTo)                  .581
              Prompt-based (Italian BERT)             .554
              Prompt-based (Multilingual BERT)        .527
              Best system (ItaliaNLP)                 .676
    GxG JO    Prompt-based (AlBERTo)                  .560
              Prompt-based (Italian BERT)             .565
              Prompt-based (Multilingual BERT)        .545
              Best system (UniOR)                     .585
    GxG TW    Prompt-based (AlBERTo)                  .542
              Prompt-based (Italian BERT)             .577
              Prompt-based (Multilingual BERT)        .529
              Best system (ItaliaNLP)                 .595
    GxG YT    Prompt-based (AlBERTo)                  .510
              Prompt-based (Italian BERT)             .536
              Prompt-based (Multilingual BERT)        .483
              Best system (ItaliaNLP)                 .555
Table 7
Results on GxG (gender prediction) in terms of accuracy.


done? is certainly no. The prompt-based systems presented in this paper are far from
the classification performance of their supervised counterparts on the EVALITA benchmarks.
This result is in stark contrast to results reported on English benchmarks12 . Moreover, the
performance of the two Italian models and the multilingual model tested in this paper are
unstable, with some models apparently more fit to certain tasks than others, raising the question
whether the subpar performance is due to the method or the underlying language-specific
pre-trained models. However, the results of the prompt-based models could be undermined by
the lack of optimization of verbalizers and templates. There is certainly space for improvement,
which was not the main focus of this paper, including an analysis of the disagreement between
verbalizers, and of the actual output of the prompt-based models.
   It is worth noting that this new technology allows us to create zero-shot classifiers for
rather abstract language classification problems. Recent literature indicates that a few
training instances (few-shot learning) are often sufficient to greatly increase the performance of
prompt-based classifiers [33]. Considering that the experiments in this paper make use only of
the most basic elements of prompt-based classification, this paradigm should be regarded as
a new frontier, not only for the advancement of text classification methodology, but also for
its evaluation. Supervised learning in NLP is perhaps not on its way to obsolescence (R1), but
the growing literature on zero-shot classification indicates at least that there is a new player
on the field. Would it make sense to organize a shared task as part of an evaluation campaign
like EVALITA where training data is not provided at all (R2)? The first results presented in this
12 https://github.com/thunlp/OpenPrompt/tree/main/results/
paper seem to indicate that this is the case, paving the way for evaluation campaigns focused
on zero-shot learning for NLP.


References
 [1] D. Nozza, L. Passaro, M. Polignano, Preface to the Sixth Workshop on Natural Language
     for Artificial Intelligence (NL4AI), in: D. Nozza, L. C. Passaro, M. Polignano (Eds.), Pro-
     ceedings of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI
     2022) co-located with 21th International Conference of the Italian Association for Artificial
     Intelligence (AI*IA 2022), November 30, 2022, CEUR-WS.org, 2022.
 [2] L. C. Passaro, M. Di Maro, V. Basile, D. Croce, Lessons learned from evalita 2020 and
     thirteen years of evaluation of italian language technology, IJCoL. Italian Journal of
     Computational Linguistics 6 (2020) 79–102.
 [3] V. Basile, D. Croce, M. D. Maro, L. C. Passaro, EVALITA 2020: Overview of the 7th
     evaluation campaign of natural language processing and speech tools for italian, in:
     V. Basile, D. Croce, M. D. Maro, L. C. Passaro (Eds.), Proceedings of the Seventh Evaluation
     Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop
     (EVALITA 2020), Online event, December 17th, 2020, volume 2765 of CEUR Workshop
     Proceedings, CEUR-WS.org, 2020, pp. 1–7. URL: http://ceur-ws.org/Vol-2765/overview.pdf.
 [4] T. Caselli, N. Novielli, V. Patti, P. Rosso, Evalita 2018: Overview on the 6th evaluation
     campaign of natural language processing and speech tools for italian, in: T. Caselli,
     N. Novielli, V. Patti, P. Rosso (Eds.), Proceedings of the Sixth Evaluation Campaign of
     Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA
     2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it
     2018), Turin, Italy, December 12-13, 2018, volume 2263 of CEUR Workshop Proceedings,
     CEUR-WS.org, 2018, pp. 1–6. URL: http://ceur-ws.org/Vol-2263/paper001.pdf.
 [5] P. Basile, D. Croce, V. Basile, M. Polignano, Overview of the evalita 2018 aspect-based
     sentiment analysis task (absita), in: T. Caselli, N. Novielli, V. Patti, P. Rosso (Eds.), Proceed-
     ings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools
     for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference
     on Computational Linguistics (CLiC-it 2018), volume 2263, CEUR Workshop Proceedings
     (CEUR-WS. org), Torino, 2018, pp. 10–16.
 [6] F. Dell’Orletta, M. Nissim, Overview of the Evalita 2018 cross-genre gender prediction
     (GxG) task, in: T. Caselli, N. Novielli, V. Patti, P. Rosso (Eds.), Proceedings of the Sixth
     Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final
     Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational
     Linguistics (CLiC-it 2018), volume 2263, CEUR Workshop Proceedings (CEUR-WS. org),
     Torino, 2018, pp. 1–9.
 [7] C. Bosco, F. Dell’Orletta, F. Poletto, M. Sanguinetti, M. Tesconi, Overview of the EVALITA
     2018 hate speech detection task, in: T. Caselli, N. Novielli, V. Patti, P. Rosso (Eds.), Proceed-
     ings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools
     for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference
     on Computational Linguistics (CLiC-it 2018), volume 2263, CEUR Workshop Proceedings
     (CEUR-WS. org), Torino, 2018, pp. 1–9.
 [8] A. T. Cignarella, S. Frenda, V. Basile, C. Bosco, V. Patti, P. Rosso, Overview of the Evalita
     2018 task on Irony Detection in Italian Tweets (IRONITA), in: T. Caselli, N. Novielli,
     V. Patti, P. Rosso (Eds.), Proceedings of the Sixth Evaluation Campaign of Natural Language
     Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with
     the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), volume 2263,
     CEUR Workshop Proceedings (CEUR-WS. org), Torino, 2018, pp. 1–9.
 [9] A. Cimino, L. D. Mattei, F. Dell’Orletta, Multi-task learning in deep neural networks
     at EVALITA 2018, in: T. Caselli, N. Novielli, V. Patti, P. Rosso (Eds.), Proceedings of
     the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for
     Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on
     Computational Linguistics (CLiC-it 2018), Turin, Italy, December 12-13, 2018, volume 2263
     of CEUR Workshop Proceedings, CEUR-WS.org, 2018, pp. 1–10. URL: http://ceur-ws.org/
     Vol-2263/paper013.pdf.
[10] E. Fersini, D. Nozza, P. Rosso, AMI@EVALITA2020: Automatic misogyny identification,
     in: V. Basile, D. Croce, M. Di Maro, L. C. Passaro (Eds.), Proceedings of the 7th evaluation
     campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020),
     CEUR Workshop Proceedings (CEUR-WS. org), Online, 2020, pp. 1–8.
[11] A. T. Cignarella, M. Lai, C. Bosco, V. Patti, P. Rosso, SardiStance@EVALITA2020: Overview
     of the Task on Stance Detection in Italian Tweets, in: V. Basile, D. Croce, M. Di Maro,
     L. C. Passaro (Eds.), Proceedings of the 7th evaluation campaign of Natural Language
     Processing and Speech tools for Italian (EVALITA 2020), CEUR Workshop Proceedings
     (CEUR-WS. org), Online, 2020, pp. 1–10.
[12] M. Sanguinetti, G. Comandini, E. Di Nuovo, S. Frenda, M. Stranisci, C. Bosco, T. Caselli,
     V. Patti, I. Russo, HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech
     Detection Task, in: V. Basile, D. Croce, M. Di Maro, L. C. Passaro (Eds.), Proceedings of
     the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian
     (EVALITA 2020), CEUR Workshop Proceedings (CEUR-WS. org), Online, 2020, pp. 1–9.
[13] A. Cimino, F. Dell’Orletta, M. Nissim, TAG-it@EVALITA2020: Overview of the topic, age,
     and gender prediction task for italian, in: V. Basile, D. Croce, M. Di Maro, L. C. Passaro
     (Eds.), Proceedings of the 7th evaluation campaign of Natural Language Processing and
     Speech tools for Italian (EVALITA 2020), CEUR Workshop Proceedings (CEUR-WS. org),
     Online, 2020, pp. 1–9.
[14] M. Miliani, G. Giorgi, I. Rama, G. Anselmi, G. E. Lebani, DANKMEMES@EVALITA2020:
     The memeing of life: memes, multimodality and politics, in: V. Basile, D. Croce, M. Di Maro,
     L. C. Passaro (Eds.), Proceedings of the 7th evaluation campaign of Natural Language
     Processing and Speech tools for Italian (EVALITA 2020), CEUR Workshop Proceedings
     (CEUR-WS. org), Online, 2020, pp. 1–9.
[15] L. De Mattei, G. De Martino, A. Iovine, A. Miaschi, M. Polignano, G. Rambelli,
     ATE_ABSITA@EVALITA2020: Overview of the aspect term extraction and aspect-based
     sentiment analysis task, in: V. Basile, D. Croce, M. Di Maro, L. C. Passaro (Eds.), Proceed-
     ings of the 7th evaluation campaign of Natural Language Processing and Speech tools for
     Italian (EVALITA 2020), CEUR Workshop Proceedings (CEUR-WS. org), Online, 2020, pp.
     1–8.
[16] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A
     systematic survey of prompting methods in natural language processing, ACM Computing
     Surveys (CSUR) (2022).
[17] G. Cui, S. Hu, N. Ding, L. Huang, Z. Liu, Prototypical verbalizer for prompt-based few-shot
     tuning, in: Proceedings of the 60th Annual Meeting of the Association for Computa-
     tional Linguistics (Volume 1: Long Papers), Association for Computational Linguistics,
     Dublin, Ireland, 2022, pp. 7014–7024. URL: https://aclanthology.org/2022.acl-long.483.
     doi:10.18653/v1/2022.acl-long.483.
[18] Z. Zhao, E. Wallace, S. Feng, D. Klein, S. Singh, Calibrate before use: Improving few-shot
     performance of language models, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th
     International Conference on Machine Learning, volume 139 of Proceedings of Machine
     Learning Research, PMLR, 2021, pp. 12697–12706. URL: https://proceedings.mlr.press/v139/
     zhao21c.html.
[19] T. Le Scao, A. Rush, How many data points is a prompt worth?, in: Proceedings of the
     2021 Conference of the North American Chapter of the Association for Computational
     Linguistics: Human Language Technologies, Association for Computational Linguistics,
     Online, 2021, pp. 2627–2636. URL: https://aclanthology.org/2021.naacl-main.208. doi:10.
     18653/v1/2021.naacl-main.208.
[20] N. Ding, S. Hu, W. Zhao, Y. Chen, Z. Liu, H. Zheng, M. Sun, OpenPrompt: An open-
     source framework for prompt-learning, in: Proceedings of the 60th Annual Meeting of
     the Association for Computational Linguistics: System Demonstrations, Association for
     Computational Linguistics, Dublin, Ireland, 2022, pp. 105–113. URL: https://aclanthology.
     org/2022.acl-demo.10. doi:10.18653/v1/2022.acl-demo.10.
[21] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, in: Proceedings of the 2019 Conference of
     the North American Chapter of the Association for Computational Linguistics: Human
     Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
     Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/
     N19-1423. doi:10.18653/v1/N19-1423.
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polo-
     sukhin, Attention is all you need, in: NIPS’17, Curran Associates Inc., Red Hook, NY, USA,
     2017, p. 6000–6010.
[23] M. Polignano, P. Basile, M. de Gemmis, G. Semeraro, V. Basile, Alberto: Italian BERT
     language understanding model for NLP challenging tasks based on tweets, in: R. Bernardi,
     R. Navigli, G. Semeraro (Eds.), Proceedings of the Sixth Italian Conference on Compu-
     tational Linguistics, Bari, Italy, November 13-15, 2019, volume 2481 of CEUR Workshop
     Proceedings, CEUR-WS.org, 2019, pp. 1–6. URL: http://ceur-ws.org/Vol-2481/paper57.pdf.
[24] V. Basile, M. Lai, M. Sanguinetti, Long-term social media data collection at the university
     of turin, in: E. Cabrio, A. Mazzei, F. Tamburini (Eds.), Proceedings of the Fifth Italian
     Conference on Computational Linguistics (CLiC-it 2018), Torino, Italy, December 10-12,
     2018, volume 2253 of CEUR Workshop Proceedings, CEUR-WS.org, 2018, pp. 1–6. URL:
     http://ceur-ws.org/Vol-2253/paper48.pdf.
[25] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, Electra: Pre-training text encoders as
     discriminators rather than generators, in: International Conference on Learning Represen-
     tations, 2020, pp. 1 – 18. URL: https://openreview.net/forum?id=r1xMH1BtvB.
[26] J. Tiedemann, L. Nygaard, The OPUS corpus - parallel and free: http://logos.uio.no/opus, in:
     Proceedings of the Fourth International Conference on Language Resources and Evaluation
     (LREC’04), European Language Resources Association (ELRA), Lisbon, Portugal, 2004, pp.
     1183–1186. URL: http://www.lrec-conf.org/proceedings/lrec2004/pdf/320.pdf.
[27] J. Abadji, P. J. O. Suárez, L. Romary, B. Sagot, Ungoliant: An optimized pipeline for the
     generation of a very large-scale multilingual web corpus, in: H. Lüngen, M. Kupietz,
     P. Bański, A. Barbaresi, S. Clematide, I. Pisetta (Eds.), Proceedings of the Workshop on
     Challenges in the Management of Large Corpora (CMLC-9) 2021. Limerick, 12 July 2021
     (Online-Event), Leibniz-Institut für Deutsche Sprache, Mannheim, 2021, pp. 1 – 9. URL:
     https://nbn-resolving.org/urn:nbn:de:bsz:mh39-104688. doi:10.14618/ids-pub-10468.
[28] P. Basile, F. Cutugno, M. Nissim, V. Patti, R. Sprugnoli, et al., Evalita 2016: Overview of
     the 5th evaluation campaign of natural language processing and speech tools for italian,
     in: 3rd Italian Conference on Computational Linguistics, CLiC-it 2016 and 5th Evaluation
     Campaign of Natural Language Processing and Speech Tools for Italian, EVALITA 2016,
     volume 1749, CEUR-WS, 2016, pp. 1–4.
[29] G. Rehm, M. Berger, E. Elsholz, S. Hegele, F. Kintzel, K. Marheinecke, S. Piperidis, M. Deli-
     giannis, D. Galanis, K. Gkirtzou, P. Labropoulou, K. Bontcheva, D. Jones, I. Roberts, J. Ha-
jič, J. Hamrlová, L. Kačena, K. Choukri, V. Arranz, A. Vasiļjevs, O. Anvari, A. Lagzdiņš,
J. Meļņika, G. Backfried, E. Dikici, M. Janosik, K. Prinz, C. Prinz, S. Stampler, D. Thomas-
     Aniola, J. M. Gómez-Pérez, A. Garcia Silva, C. Berrío, U. Germann, S. Renals, O. Klejch,
     European language grid: An overview, in: Proceedings of the Twelfth Language Resources
     and Evaluation Conference, European Language Resources Association, Marseille, France,
     2020, pp. 3366–3380. URL: https://aclanthology.org/2020.lrec-1.413.
[30] V. Basile, C. Bosco, M. Fell, V. Patti, R. Varvara, Italian NLP for everyone: Resources and
     models from EVALITA to the European language grid, in: Proceedings of the Thirteenth
     Language Resources and Evaluation Conference, European Language Resources Associa-
     tion, Marseille, France, 2022, pp. 174–180. URL: https://aclanthology.org/2022.lrec-1.19.
[31] V. Basile, A. Bolioli, V. Patti, P. Rosso, M. Nissim, Overview of the evalita 2014 sentiment po-
     larity classification task, Overview of the Evalita 2014 SENTIment POLarity Classification
     Task (2014) 50–57.
[32] F. Barbieri, V. Basile, D. Croce, M. Nissim, N. Novielli, V. Patti, Overview of the evalita
     2016 sentiment polarity classification task, in: P. Basile, A. Corazza, F. Cutugno, S. Monte-
     magni, M. Nissim, V. Patti, G. Semeraro, R. Sprugnoli (Eds.), Proceedings of Third Italian
     Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign
     of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA
     2016), Napoli, Italy, December 5-7, 2016, volume 1749 of CEUR Workshop Proceedings,
     CEUR-WS.org, 2016, pp. 1–11. URL: http://ceur-ws.org/Vol-1749/paper_026.pdf.
[33] T. Schick, H. Schütze, Exploiting cloze-questions for few-shot text classification and natural
     language inference, in: Proceedings of the 16th Conference of the European Chapter of the
     Association for Computational Linguistics: Main Volume, Association for Computational
     Linguistics, Online, 2021, pp. 255–269. URL: https://aclanthology.org/2021.eacl-main.20.
     doi:10.18653/v1/2021.eacl-main.20.