<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Lost in Labels: An Ongoing Quest to Optimize Text-to-Text Label Selection for Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michele Papucci</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Miaschi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ItaliaNLP Lab, CNR, Istituto di Linguistica Computazionale 'A.Zampolli'</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TALIA s.r.l.</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università di Pisa</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we present an evaluation of the influence of label selection on the performance of a Sequence-to-Sequence Transformer model in a classification task. Our study investigates whether the choice of words used to represent the classification categories affects the model's performance, and whether there exists a relationship between the model's performance and the selected words. To this end, we fine-tuned an Italian T5 model on topic classification using various labels. Our results indicate that different label choices can significantly impact the model's performance. That being said, we did not find a clear answer as to how these choices affect the model's performance, highlighting the need for further research on optimizing label selection.</p>
      </abstract>
      <kwd-group>
<kwd>encoder-decoder</kwd>
        <kwd>label selection</kwd>
        <kwd>topic classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Background</title>
<p>In recent years, the Sequence-to-Sequence paradigm has emerged as a highly popular approach in building cutting-edge Transformer-based Language Models [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. This paradigm draws inspiration from earlier unified frameworks for Natural Language Processing (NLP) tasks [
        <xref ref-type="bibr" rid="ref4">4, 5, 6</xref>
        ], treating each task as a text-to-text transformation. In other words, it involves taking text as input and generating new text as output.</p>
      <p>This unifying framework has proven to be a particularly effective transfer learning method, often outperforming previous models, e.g. BERT [7], in data-poor settings. Furthermore, the recent application and refinement of prompt-based tuning techniques for pre-trained Large Language Models (LLMs) have made this paradigm even more powerful, especially in few-shot and zero-shot learning scenarios [8].</p>
      <p>In such a scenario, several studies have focused on defining methods for the formulation of prompts and the definition of verbalizers, i.e. mapping techniques between model-predicted words and task labels. As for the latter, the vast majority of studies have concentrated on devising automatic or semi-automatic approaches to create verbalizers that can be applied especially in zero- or few-shot configurations [9, 10, 11]. For instance, [12] proposed Petal, an approach for automatically finding the best words-label mapping by maximizing the likelihood of the training data. [13] instead developed ProtoVerb, a prototypical verbalizer that learns class prototypes from training data to build verbalizers automatically.</p>
      <p>Nevertheless, few works have focused on investigating more deeply and systematically the effect that the choice of strings used to represent one (or more) labels has on model performance. Among these, [14] designed different label representations (e.g. canonical task labels, task-unrelated antonyms) and tested their impact with the T5 model on four classification tasks, showing that the performance was generally unaffected by the choice of label representation. Similarly, experimenting with the gender prediction task from the TAG-IT dataset [15], [16] noticed that while modifying the label representations did not affect the performance of the IT5 model [17], shuffling them for the topic classification task led to worse results.</p>
      <p>In this work, we present an evaluation of the impact of label selection on the performance of a Sequence-to-Sequence Model in a classification task. Specifically, we address the following research questions: i) Do the words used to represent the classification categories influence the model's performance? ii) Are there any relationships between classification categories and the words used to represent them that we can exploit to do label selection? To investigate these questions, we conducted a series of experiments by fine-tuning the Italian version of the T5 model [17] on the topic classification task [15] using various labels. In particular, we defined different sets of labels and examined the model's performance for each of these sets. Additionally, we conducted an in-depth qualitative analysis to inspect which labels contribute most significantly to the improvement or decline in classification results and why that might be the case.</p>
      <p>CLiC-it 2023: 9th Italian Conference on Computational Linguistics, November 30 - December 2, 2023, Venice, IT. © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <p>The remainder of the paper is organized as follows:
in Sec. 2 we present our approach, introducing the data
and the model we used (Sec. 2.1 and Sec. 2.2) and the
experimental setting (Sec. 2.3). In Sec. 3 we discuss the
obtained results and in Sec. 4 we conclude the paper.</p>
<p>Contributions. In this paper we: i) propose an evaluation of the influence that label selection has on the performance of a Text-to-Text Transformer model for classification; ii) investigate how the words used to represent the classification categories, in a multi-class classification task, impact task performance both globally and at class level; iii) investigate the existence of a relationship between classification categories and selected labels, and how this connection can be leveraged to improve label selection.</p>
      <sec id="sec-1-0">
        <title>2. Our Approach</title>
        <p>In this section, we first define the data and the model used to perform our experiments. Then, we detail the experimental setting we devised to select the tested labels and fine-tune the T5 model.</p>
        <sec id="sec-1-0-1">
          <title>2.1. Data</title>
          <p>We relied on posts extracted from TAG-it [15], the profiling shared task presented at EVALITA 2020 [18]. The dataset, based on the corpus defined in [19], consists of more than 18,000 posts written in Italian and collected from different blogs. Each post is labelled with three different labels: the age and gender of the writer, and the topic.</p>
          <p>In order to experiment with various possible combinations of labels, we decided to focus only on the Topic classification task. Moreover, to have enough data to fine-tune the model, we modified the original task as defined in [15]: instead of predicting the label of a given collection of texts (multiple posts), we fine-tuned our model to predict the topic of each single post. Finally, since a fair amount of sentences were quite short, we removed those shorter than 10 tokens. At the end of this process, we obtained a dataset consisting of 13,553 posts as training set and 5,055 posts as test set. The distribution of posts according to each label is reported in Table 1.</p>
        </sec>
      </sec>
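<p>The preprocessing step described above (keeping only posts of at least 10 tokens) can be sketched as follows; whitespace tokenization and the sample posts are illustrative assumptions, not the paper's actual pipeline:</p>

```python
# A minimal sketch (assumed details) of the preprocessing in Sec. 2.1: keep a
# post only if it is at least 10 tokens long. Whitespace tokenization and the
# sample posts are illustrative placeholders.
posts = [
    "breve post",  # shorter than 10 tokens, dropped
    "questo è un post di esempio abbastanza lungo per essere tenuto",
]
kept = [p for p in posts if len(p.split()) >= 10]
print(len(kept))  # → 1
```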
      <sec id="sec-1-1">
<title>2.2. Model</title>
        <p>We used the T5 base version pre-trained on the Italian language, i.e. IT5 [17]<sup>1</sup>. In particular, the model was trained on the Italian sentences extracted from a cleaned version of the mC4 corpus [20], a multilingual version of the C4 corpus covering 107 languages.</p>
<sec id="sec-1-1-1">
          <title>2.3. Experimental Setting</title>
          <p>As already introduced in Sec. 1, to investigate the influence of label selection on the model performance, we fine-tuned the IT5 model using different combinations of strings to represent the original classification categories. We will refer to the set of the original categories as C. We first translated the categories (as seen in Table 1) into Italian (e.g. Celebrities into celebrità)<sup>2</sup>. Then, for each category c in C we created a set composed of 100 string representations: 10 were selected from synonyms and words related to the original category (including the aforementioned translated ones), while the remaining 90 were randomly chosen from the most frequent nouns in the itWaC corpus [21]. Let L_c = {l_0, l_1, ..., l_99} be the set of labels for the category c, and l_i be the i-th label in the set. Then, for each category c we ranked its corresponding set of labels L_c in descending order of similarity:</p>
          <p>sim(c, l_0) ≥ sim(c, l_1) ≥ ... ≥ sim(c, l_99)</p>
          <p>where sim(c, l_i) is the cosine similarity between the average embedding of the subtokens of c and that of l_i, both extracted from the last encoding layer of the IT5 model.</p>
          <p>Given the previously defined sets L_c, which contain the elements ranked by similarity, we created 100 sets of labels S_i (where i ranges from 0 to 99). Each set is defined as S_i = {l_i^c0, l_i^c1, ..., l_i^c10}, where l_i^cj is the i-th ranked label for category c_j. As a consequence, S_0 contains the labels that achieved the highest cosine similarity with the original categories, while S_99 is the set containing those with the lowest cosine similarities. An overview of our setting is shown in Figure 1.</p>
          <p>We then fine-tuned IT5 on each ranked set of representations S_i. Each model was trained for 10 epochs, using the f-score as the evaluation metric.</p>
        </sec>
      </sec>
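<p>The ranking step described above can be sketched as follows; the random vectors stand in for the averaged IT5 last-encoder-layer subtoken embeddings, and all names are illustrative:</p>

```python
# A minimal sketch (assumed implementation, toy data) of the ranking step:
# candidate labels are ordered by the cosine similarity between their
# embedding and the category embedding.
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_labels(category_emb, label_embs):
    # Indices of candidate labels, sorted by descending similarity to the
    # category, i.e. sim(c, l_0) >= sim(c, l_1) >= ...
    sims = [cosine(category_emb, e) for e in label_embs]
    return sorted(range(len(label_embs)), key=lambda i: sims[i], reverse=True)

rng = np.random.default_rng(0)
category_emb = rng.normal(size=8)
label_embs = [rng.normal(size=8) for _ in range(100)]  # 100 candidate strings

ranking = rank_labels(category_emb, label_embs)
sims_ranked = [cosine(category_emb, label_embs[i]) for i in ranking]
# The i-th ranked label of each category then goes into the set S_i.
assert all(a >= b for a, b in zip(sims_ranked, sims_ranked[1:]))
```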
    </sec>
    <sec id="sec-2">
      <title>3. Results</title>
<p>Overall results Figure 2 summarizes the results obtained by the T5 models fine-tuned on the topic classification task according to the 100 different sets of labels (S_i).</p>
      <sec id="sec-2-1">
<p><sup>1</sup> https://huggingface.co/gsarti/it5-base</p>
        <p><sup>2</sup> List of translated labels: anime, automobilismo, bicicletta, sport, natura, metal detector, medicina, celebrità, fumo, intrattenimento and tecnologia.</p>
<p>At first glance, we can readily observe that the choice of words used to represent the classification categories has a considerable impact on the model's average performance. Indeed, we can see that the classification scores vary significantly, ranging from a minimum of 0.54 (rank 75) to a maximum of 0.65 (rank 86). Additionally, it is worth noting that the model trained with S_0, which contains the original translated labels, achieved an f-score of 0.63. This result indicates that simply using the original labels directly still provides competitive performance. However, the significant fluctuations in the classification scores among the different sets S_i suggest that certain labels may still offer better performance than the original ones, while others may introduce noise or ambiguity, resulting in sub-optimal outcomes.</p>
        <p>Interestingly, these findings appear to diverge from
previous studies [14, 16], where the role of label
representation was underestimated. While being a task-dependent
issue, the role of label representation seems to have a
large impact on model performance, especially for lower
frequency labels, going as far as making certain labels
range from being completely unpredictable to reaching
satisfactory performances.</p>
<p>That being said, despite the differences in terms of weighted f-scores, there does not seem to be a clear correlation between the model's performance and the degree of "semantic" distance between the chosen labels and the original ones (represented by the rank i of the representation set). In fact, as the cosine similarity between the selected representations and the original ones decreases (from rank 0 to rank 99), there is no apparent trend in the f-score values.</p>
      </sec>
      <sec id="sec-2-2">
<title>Per-label results</title>
        <p>In order to gain a more precise insight into the impact of the tested labels, Figure 3 illustrates the variation of the f-scores obtained with the 100 different sets of labels (S_i) for each individual category. Firstly, we can observe that the average results can vary significantly depending on the category under consideration. For instance, IT5 shows promising average performance in classifying posts related to Anime, Sports or Auto-Moto, while encountering difficulties in identifying posts annotated with the topics Bikes and Technology. This is possibly due to the fact that the posts belonging to the former categories are the most frequent in the entire dataset. Particularly noteworthy is the fact that, across almost all tested ranks, the model failed to correctly identify any posts related to Technology. This issue is likely attributable to the limited representation of this category within the dataset, further compounded by the original dataset configuration having more examples in the test set than in the training set (51 and 85 samples in the training and test sets respectively).</p>
        <p>Analyzing the variation of results based on the labels used for representing the categories, we observe, in line with Figure 1, that the choice of the label often has a significant impact on the model's performance. While some labels exhibit relatively stable results with minor variations across different representations, such as Anime, Bikes, Sports and Auto-Moto, there are other instances where the selected labels lead to remarkable fluctuations in the model's performance. Notably, this behaviour emerges especially in the identification of posts related to Nature, Metal-Detecting, Medicine-Aesthetics and Entertainment. For these categories, IT5's classification performance can change drastically depending on the specific label. In some cases, the model manages to achieve quite good results, accurately classifying posts with a high f-score. However, in other instances, it struggles significantly, making erroneous classifications for the majority of cases. For instance, in the case of Medicine-Aesthetics, the f-score reaches a maximum of 0.71 when the label is represented by the term acuto, but the model fails to correctly classify any instance (f-score = 0) when the label is represented as proprio. This highlights how the choice of the label can significantly impact IT5's classification performance across different topics and, therefore, suggests the importance of exploring optimized selection strategies to maximize the model performance.</p>
        <p>To obtain a more comprehensive qualitative perspective of these findings, we include in Figure 4 the top and bottom 10 representations that maximized/minimized the f-score values for the four aforementioned categories. As we can observe, among the four considered categories, only one (Medicine-Aesthetics) contains the original label, i.e. the one with cosine similarity equal to 1 (medicina), in the top 10 representations. For the other categories, the absence of the original label seems to suggest that the chosen word for the label, which should be the closest one to the reference topic, may not be the one that maximizes the results. When analyzing individual words, it becomes evident that not all the words contributing to the model's best performance belong exclusively to the domain of the considered category. Surprisingly, words such as cinema and sitcom, seemingly related to the Entertainment domain, are among those that most negatively impact the model's f-scores. Nevertheless, Medicine-Aesthetics shows an exception, with several words aligned with the category's domain, e.g. benessere, medicina, dottoressa and sensibilità. Lastly, it is worth noticing that the performance drop is mostly label-dependent, and there is a significant difference between the most- and least-performing representations for the four categories. In fact, while Nature and Metal-Detecting exhibit a relatively modest decrease (around .20 f-score points), Medicine-Aesthetics and Entertainment display a far more pronounced difference in performance.</p>
        <sec id="sec-2-2-1">
          <title>3.1. Correlating Model Performance and Tested Representations</title>
          <p>Semantic Similarity Initially, we aimed to ascertain whether there is a correlation between the words that are more/less semantically similar to the original categories and the performance of IT5. To achieve this, we computed the Spearman correlation between the T5 model's performance and the cosine similarity values calculated to construct the 100 sets for each label. The results of these correlations are presented in Table 2<sup>3</sup>. As observed, 6 out of the 11 classification categories exhibit statistically significant correlations. Among these, only one correlation is positive (Entertainment), while the others show negative correlation values. This outcome is quite unexpected, as it seemingly implies that the improvement in the model's performance is linked to a decrease in semantic similarity. However, it is crucial to emphasize that the correlation values are not particularly high and, thus, we cannot draw any conclusion from these results. Moreover, it is important to consider that while cosine similarity can serve as a useful measure of similarity between embeddings, it may not encompass the entire semantic space.</p>
          <p>Table 2: Spearman correlations between f-scores and label similarities (cosine similarity) for each category. Statistically significant correlations are marked with *.</p>
          <p><sup>3</sup> In Appendix A we also report the scatterplots showing the relationship between f-scores and cosine similarity values for these labels.</p>
        </sec>
      </sec>
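<p>The rank-based correlation used throughout this section can be sketched as follows; the implementation below is a plain-numpy stand-in for the usual scipy.stats.spearmanr, and all data are synthetic:</p>

```python
# A plain-numpy stand-in (assumed, synthetic data) for the Spearman rank
# correlation used in Sec. 3.1; scipy.stats.spearmanr returns the same rho
# together with a p-value.
import numpy as np

def spearman(x, y):
    # Spearman's rho is the Pearson correlation of the rank-transformed
    # values; argsort of argsort yields ranks for distinct values.
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

rng = np.random.default_rng(0)
cosine_sims = np.sort(rng.uniform(0.2, 1.0, size=100))[::-1]  # rank 0 = most similar
f_scores = rng.uniform(0.54, 0.65, size=100)                  # toy per-rank f-scores

rho = spearman(cosine_sims, f_scores)
assert 1.0 >= rho >= -1.0
```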
      <sec id="sec-2-3">
<p>Having analyzed the model's performance and assessed the impact of the words used to represent the categories on the classification results, we decided to explore the existence of any relationship between the model's performance and the tested representations.</p>
        <p>Internal Similarity Since the similarity between the selected labels within each set could potentially impact the model's performance, we conducted an additional test to investigate whether higher semantic similarity among the representations within a set could negatively affect the performance of IT5. To achieve this, we computed the "inner similarity" of each set, defined as the average cosine similarity of all possible distinct label combinations<sup>4</sup>. Subsequently, we computed the Spearman correlation between each set's "inner similarity" and the f-scores obtained by the model fine-tuned with it. Although the values of the "inner similarity" vary considerably across the sets (ranging from a similarity of 0.69 for rank 0 to 0.38 for rank 100), we did not find a statistically significant correlation with the model's performance (Spearman = 0.01, p-value = 0.90). These results suggest that, despite the sets exhibiting considerable variation in terms of inner similarity, the similarity between the representations did not plainly affect the model's performance.</p>
<p>Representations Frequencies Finally, since the aforementioned results demonstrated that different labels have an impact on the model's performance, we decided to investigate whether this impact could be somehow related to the frequency of these representations within the model's training dataset. To this end, we computed the absolute frequency of each label used in our experiments (11 labels per 100 sets, totalling 1,100 words) within the Italian version of the mC4 corpus, i.e. the corpus on which IT5 was trained. Subsequently, we calculated the correlation between the scores obtained by IT5 for each label of each set S_i and the corresponding frequencies of each label found in the mC4 corpus. Among the 11 categories present in the dataset, only one showed a statistically significant correlation, Smoke, with a Spearman correlation value of -0.25<sup>5</sup>. This result suggests that, at least for this particular category, a decrease in the label's frequency in the training corpus corresponds to an increase in the model's performance. However, the fact that only one representation exhibits a significant correlation, and that this correlation is not particularly high, once again prevents us from drawing any conclusive findings.</p>
        <p>Thus, it underscores the need to explore other strategies for label selection in the future.</p>
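<p>The frequency-counting step can be sketched as follows; the corpus and label lists below are toy stand-ins for the Italian portion of mC4 and the 1,100 tested labels:</p>

```python
# A toy sketch of the frequency analysis: absolute label frequencies counted
# in a corpus (the paper uses the Italian portion of mC4; the corpus and
# labels here are illustrative).
from collections import Counter

corpus_tokens = ("medicina benessere sport anime sport natura "
                 "sport medicina fumo tecnologia sport").split()
freqs = Counter(corpus_tokens)

labels = ["sport", "medicina", "natura", "fumo"]
label_freqs = [freqs[l] for l in labels]
print(dict(zip(labels, label_freqs)))  # → {'sport': 4, 'medicina': 2, 'natura': 1, 'fumo': 1}
```

These per-label counts are what would then be correlated (e.g. with Spearman's rho) against the per-label f-scores.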
      </sec>
</sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>This work has been supported by the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <sec id="sec-4-1">
<p>In this work, we presented an evaluation of the impact of label selection on the performance of a Sequence-to-Sequence Model in a classification task. By fine-tuning the Italian version of the T5 model on a topic classification task, we explored various sets of labels and examined their influence on the model's performance.</p>
        <p>Our results indicate that the choice of words used to represent the classification categories can have a significant impact on the model's performance. While some labels led to competitive results, others resulted in sub-optimal outcomes, with noteworthy variations in the classification scores. This finding diverges from previous studies that suggested label representations had little impact on model performance.</p>
        <p>Interestingly, the correlation between the model's performance and the degree of "semantic" distance between the chosen labels and the original ones was not clear. While some categories exhibited statistically significant correlations, these were either positive or negative, indicating that higher or lower semantic similarity did not consistently lead to better performance.</p>
        <p>In conclusion, our findings suggest that the choice of the label is not a trivial matter and can have a significant impact on the performance of Sequence-to-Sequence Models in classification tasks. To maximize performance, it is essential to explore optimized label selection techniques, carefully tailored to the specific task and dataset.</p>
        <p>Future research could focus on developing more sophisticated methods for label selection, taking into account not only semantic similarity but also other relevant factors. Additionally, it would be valuable to investigate the generalizability of these findings across other languages and models, in order to gain a more comprehensive understanding of the influence of label selection on different NLP tasks.</p>
      </sec>
      <sec id="sec-4-2">
<title>References</title>
        <p><sup>4</sup> As defined in Sec. 2.3, a label is represented as the average embedding of each subtoken in the string. <sup>5</sup> The table with all the correlations is reported in Appendix B.</p>
        <p>[5] N. S. Keskar, B. McCann, C. Xiong, R. Socher, Unifying question answering, text classification, and regression via span extraction, arXiv preprint arXiv:1904.09286 (2019).</p>
        <p>[6] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners (2019).</p>
        <p>[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</p>
        <p>[8] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models, arXiv preprint arXiv:2210.11416 (2022).</p>
        <p>[9] C. Song, F. Cai, J. Zheng, W. Chen, Z. Pan, Metric sentiment learning for label representation, in: Proceedings of the 30th ACM International Conference on Information and Knowledge Management, CIKM '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 1703–1712. URL: https://doi.org/10.1145/3459637.3482369. doi:10.1145/3459637.3482369.</p>
        <p>[10] W. Jiang, Y. Zhang, J. Kwok, Effective structured prompting by meta-learning and representative verbalizer, in: International Conference on Machine Learning, PMLR, 2023, pp. 15186–15199.</p>
        <p>[11] K. Ji, Y. Lian, J. Gao, B. Wang, Hierarchical verbalizer for few-shot hierarchical text classification, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 2918–2933. URL: https://aclanthology.org/2023.acl-long.164.</p>
        <p>[12] T. Schick, H. Schmid, H. Schütze, Automatically identifying words that can serve as labels for few-shot text classification, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 5569–5578. URL: https://aclanthology.org/2020.coling-main.488. doi:10.18653/v1/2020.coling-main.488.</p>
        <p>[13] G. Cui, S. Hu, N. Ding, L. Huang, Z. Liu, Prototypical verbalizer for prompt-based few-shot tuning, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 7014–7024. URL: https://aclanthology.org/2022.acl-long.483. doi:10.18653/v1/2022.acl-long.483.</p>
        <p>[14] X. Chen, J. Xu, A. Wang, Label representations in modeling classification as text generation, in: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop, Association for Computational Linguistics, Suzhou, China, 2020, pp. 160–164. URL: https://aclanthology.org/2020.aacl-srw.23.</p>
        <p>[15] A. Cimino, F. Dell'Orletta, M. Nissim, TAG-it @ EVALITA 2020: Overview of the topic, age, and gender prediction task for Italian, Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (2020).</p>
        <p>[16] M. Papucci, C. De Nigris, A. Miaschi, F. Dell'Orletta, Evaluating text-to-text framework for topic and style classification of Italian texts, in: Proceedings of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI 2022) co-located with the 21st International Conference of the Italian Association for Artificial Intelligence (AI*IA 2022), 2022.</p>
        <p>[17] G. Sarti, M. Nissim, IT5: Large-scale text-to-text pretraining for Italian language understanding and generation, arXiv preprint arXiv:2203.03759 (2022). URL: https://arxiv.org/abs/2203.03759.</p>
        <p>[18] V. Basile, M. Di Maro, D. Croce, L. Passaro, EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian, in: 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop, EVALITA 2020, volume 2765, CEUR-WS, 2020.</p>
        <p>[19] A. Maslennikova, P. Labruna, A. Cimino, F. Dell'Orletta, Quanti anni hai? Age identification for Italian, in: CLiC-it, 2019.</p>
        <p>[20] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A massively multilingual pre-trained text-to-text transformer, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 483–498. URL: https://aclanthology.org/2021.naacl-main.41. doi:10.18653/v1/2021.naacl-main.41.</p>
        <p>[21] M. Baroni, S. Bernardini, A. Ferraresi, E. Zanchetta, The WaCky wide web: a collection of very large linguistically processed web-crawled corpora, Language Resources and Evaluation 43 (2009) 209–226.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>A. Appendix A</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
<surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          .,
          <source>J. Mach. Learn. Res</source>
          .
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Webson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
<surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sutawika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Alyafeai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chafin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stiegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Le</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raja</surname>
          </string-name>
          , et al.,
          <article-title>Multitask prompted training enables zero-shot task generalization</article-title>
          ,
          <source>in: The Tenth International Conference on Learning Representations</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Aribandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Q.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          , et al.,
          <article-title>Ext5: Towards extreme multi-task scaling for transfer learning</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. S.</given-names>
            <surname>Keskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
<article-title>The natural language decathlon: Multitask learning as question answering</article-title>
          , arXiv preprint arXiv:1806.08730 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>