<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Lost in Labels: An Ongoing Quest to Optimize Text-to-Text Label Selection for Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michele Papucci</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Miaschi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ItaliaNLP Lab, CNR, Istituto di Linguistica Computazionale 'A.Zampolli'</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TALIA s.r.l.</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università di Pisa</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we present an evaluation of the influence of label selection on the performance of a Sequence-to-Sequence Transformer model in a classification task. Our study investigates whether the choice of words used to represent the classification categories affects the model's performance, and whether there exists a relationship between the model's performance and the selected words. To this end, we fine-tuned an Italian T5 model on topic classification using various labels. Our results indicate that different label choices can significantly impact the model's performance. That being said, we did not find a clear answer as to how these choices affect the model's performance, highlighting the need for further research on optimizing label selection.</p>
      </abstract>
      <kwd-group>
<kwd>encoder-decoder</kwd>
        <kwd>label selection</kwd>
        <kwd>topic classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Background</title>
<p>In recent years, the Sequence-to-Sequence paradigm has emerged as a highly popular approach in building cutting-edge Transformer-based Language Models [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. This paradigm draws inspiration from earlier unified frameworks for Natural Language Processing (NLP) tasks [
        <xref ref-type="bibr" rid="ref4">4, 5, 6</xref>
        ], treating each task as a text-to-text transformation. In other words, it involves taking text as input and generating new text as output.</p>
      <p>This unifying framework has proven to be a particularly effective transfer learning method, often outperforming previous models, e.g. BERT [7], in data-poor settings. Furthermore, the recent application and refinement of prompt-based tuning techniques for pre-trained Large Language Models (LLMs) have made this paradigm even more powerful, especially in few-shot and zero-shot learning scenarios [8].</p>
      <p>In such a scenario, several studies have focused on defining methods for the formulation of prompts and the definition of verbalizers, i.e. mapping techniques between model-predicted words and task labels. As for the latter, the vast majority of studies have concentrated on devising automatic or semi-automatic approaches to create verbalizers that can be applied especially in zero- or few-shot configurations [9, 10, 11]. For instance, [12] proposed Petal, an approach for automatically finding the best words-label mapping by maximizing the likelihood of the training data. [13] instead developed ProtoVerb, a prototypical verbalizer that learns class prototypes from training data to build verbalizers automatically.</p>
      <p>Nevertheless, few works have focused on investigating more deeply and systematically the effect that the choice of strings used to represent one (or more) labels has on model performance. Among these, [14] designed different label representations (e.g. canonical task labels, task-unrelated antonyms) and tested their impact with the T5 model on four classification tasks, showing that the performance was generally unaffected by the choice of label representation. Similarly, experimenting with the gender prediction task from the TAG-IT dataset [15], [16] noticed that while modifying the label representations did not affect the performance of the IT5 model [17], shuffling them for the topic classification task led to worse results.</p>
      <p>In this work, we present an evaluation of the impact of label selection on the performance of a Sequence-to-Sequence Model in a classification task. Specifically, we address the following research questions: i) Do the words used to represent the classification categories influence the model's performance? ii) Are there any relationships between classification categories and the words used to represent them that we can exploit to do label selection? To investigate these questions, we conducted a series of experiments by fine-tuning the Italian version of the T5 model [17] on the topic classification task [15] using various labels. In particular, we defined different sets of labels and examined the model's performance for each of these sets. Additionally, we conducted an in-depth qualitative analysis to inspect which labels contribute most significantly to the improvement or decline in classification results and why that might be the case.</p>
      <p>CLiC-it 2023: 9th Italian Conference on Computational Linguistics, November 30 - December 2, 2023, Venice, IT. © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <p>The remainder of the paper is organized as follows:
in Sec. 2 we present our approach, introducing the data
and the model we used (Sec. 2.1 and Sec. 2.2) and the
experimental setting (Sec. 2.3). In Sec. 3 we discuss the
obtained results and in Sec. 4 we conclude the paper.</p>
<p>Contributions. In this paper we: i) propose an evaluation of the influence that label selection has on the performance of a Text-to-Text Transformer model for classification; ii) investigate how the words used to represent the classification categories, in a multi-class classification task, impact task performance both globally and at class level; iii) investigate the existence of a relationship between classification categories and selected labels, and how this connection can be leveraged to improve label selection.</p>
      <sec id="sec-1-0">
        <title>2. Our Approach</title>
        <p>In this section, we first define the data and the model used to perform our experiments. Then, we detail the experimental setting we devised to select the tested labels and fine-tune the T5 model.</p>
        <sec id="sec-1-0-1">
          <title>2.1. Data</title>
          <p>We relied on posts extracted from TAG-it [15], the profiling shared task presented at EVALITA 2020 [18]. The dataset, based on the corpus defined in [19], consists of more than 18,000 posts written in Italian and collected from different blogs. Each post is labelled with three different labels: the age and gender of the writer, and the topic.</p>
          <p>In order to experiment with various possible combinations of labels, we decided to focus only on the Topic classification task. Moreover, to have enough data to fine-tune the model, we modified the original task as defined in [15]: instead of predicting the label of a given collection of texts (multiple posts), we fine-tuned our model to predict the topic of each single post. Finally, since a fair amount of sentences were quite short, we removed those shorter than 10 tokens. At the end of this process, we obtained a dataset consisting of 13,553 posts as training set and 5,055 posts as test set. The distribution of posts according to each label is reported in Table 1.</p>
        </sec>
      </sec>
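<p>The preprocessing step described above (keeping only posts of at least 10 tokens) can be sketched as follows; whitespace tokenization and the sample posts are illustrative assumptions, not the paper's actual pipeline:</p>

```python
# A minimal sketch (assumed details) of the preprocessing in Sec. 2.1: keep a
# post only if it is at least 10 tokens long. Whitespace tokenization and the
# sample posts are illustrative placeholders.
posts = [
    "breve post",  # shorter than 10 tokens, dropped
    "questo è un post di esempio abbastanza lungo per essere tenuto",
]
kept = [p for p in posts if len(p.split()) >= 10]
print(len(kept))  # → 1
```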
      <sec id="sec-1-1">
<title>2.2. Model</title>
        <p>We used the T5 base version pre-trained on the Italian language, i.e. IT5 [17]<sup>1</sup>. In particular, the model was trained on the Italian sentences extracted from a cleaned version of the mC4 corpus [20], a multilingual version of the C4 corpus covering 107 languages.</p>
<sec id="sec-1-1-1">
          <title>2.3. Experimental Setting</title>
          <p>As already introduced in Sec. 1, to investigate the influence of label selection on the model performance, we fine-tuned the IT5 model using different combinations of strings to represent the original classification categories. We will refer to the set of the original categories as C. We first translated the categories (as seen in Table 1) into Italian (e.g. Celebrities into celebrità)<sup>2</sup>. Then, for each category c in C we created a set composed of 100 string representations: 10 were selected from synonyms and words related to the original category (including the aforementioned translated ones), while the remaining 90 were randomly chosen from the most frequent nouns in the itWaC corpus [21]. Let L_c = {l_0, l_1, ..., l_99} be the set of labels for the category c, and l_i be the i-th label in the set. Then, for each category c we ranked its corresponding set of labels L_c in descending order of similarity:</p>
          <p>sim(c, l_0) ≥ sim(c, l_1) ≥ ... ≥ sim(c, l_99)</p>
          <p>where sim(c, l_i) is the cosine similarity between the average embedding of the subtokens of c and that of l_i, both extracted from the last encoding layer of the IT5 model.</p>
          <p>Given the previously defined sets L_c, which contain the elements ranked by similarity, we created 100 sets of labels S_i (where i ranges from 0 to 99). Each set is defined as S_i = {l_i^c0, l_i^c1, ..., l_i^c10}, where l_i^cj is the i-th ranked label for category c_j. As a consequence, S_0 contains the labels that achieved the highest cosine similarity with the original categories, while S_99 is the set containing those with the lowest cosine similarities. An overview of our setting is shown in Figure 1.</p>
          <p>We then fine-tuned IT5 on each ranked set of representations S_i. Each model was trained for 10 epochs, using the f-score as the evaluation metric.</p>
        </sec>
      </sec>
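<p>The ranking step described above can be sketched as follows; the random vectors stand in for the averaged IT5 last-encoder-layer subtoken embeddings, and all names are illustrative:</p>

```python
# A minimal sketch (assumed implementation, toy data) of the ranking step:
# candidate labels are ordered by the cosine similarity between their
# embedding and the category embedding.
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_labels(category_emb, label_embs):
    # Indices of candidate labels, sorted by descending similarity to the
    # category, i.e. sim(c, l_0) >= sim(c, l_1) >= ...
    sims = [cosine(category_emb, e) for e in label_embs]
    return sorted(range(len(label_embs)), key=lambda i: sims[i], reverse=True)

rng = np.random.default_rng(0)
category_emb = rng.normal(size=8)
label_embs = [rng.normal(size=8) for _ in range(100)]  # 100 candidate strings

ranking = rank_labels(category_emb, label_embs)
sims_ranked = [cosine(category_emb, label_embs[i]) for i in ranking]
# The i-th ranked label of each category then goes into the set S_i.
assert all(a >= b for a, b in zip(sims_ranked, sims_ranked[1:]))
```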
    </sec>
    <sec id="sec-2">
      <title>3. Results</title>
<p>Overall results Figure 2 summarizes the results obtained by the T5 models fine-tuned on the topic classification task according to the 100 different sets of labels (S_i).</p>
      <sec id="sec-2-1">
<p><sup>1</sup> https://huggingface.co/gsarti/it5-base</p>
        <p><sup>2</sup> List of translated labels: anime, automobilismo, bicicletta, sport, natura, metal detector, medicina, celebrità, fumo, intrattenimento and tecnologia.</p>
<p>At first glance, we can readily observe that the choice of words used to represent the classification categories has a considerable impact on the model's average performance. Indeed, we can see that the classification scores vary significantly, ranging from a minimum of 0.54 (rank 75) to a maximum of 0.65 (rank 86). Additionally, it is worth noting that the model trained with S_0, which contains the original translated labels, achieved an f-score of 0.63. This result indicates that simply using the original labels directly still provides competitive performance. However, the significant fluctuations in the classification scores among the different sets S_i suggest that certain labels may still offer better performance than the original ones, while others may introduce noise or ambiguity, resulting in sub-optimal outcomes.</p>
        <p>Interestingly, these findings appear to diverge from
previous studies [14, 16], where the role of label
representation was underestimated. While being a task-dependent
issue, the role of label representation seems to have a
large impact on model performance, especially for lower
frequency labels, going as far as making certain labels
range from being completely unpredictable to reaching
satisfactory performances.</p>
<p>That being said, despite the differences in terms of weighted f-scores, there does not seem to be a clear correlation between the model's performance and the degree of "semantic" distance between the chosen labels and the original ones (represented by the rank i of the representation set). In fact, as the cosine similarity between the selected representations and the original ones decreases (from rank 0 to rank 99), there is no apparent trend in the f-score values.</p>
      </sec>
      <sec id="sec-2-2">
<title>Per-label results</title>
        <p>In order to gain a more precise insight into the impact of the tested labels, Figure 3 illustrates the variation of the f-scores obtained with the 100 different sets of labels (S_i) for each individual category. Firstly, we can observe that the average results can vary significantly depending on the category under consideration. For instance, IT5 shows promising average performance in classifying posts related to Anime, Sports or Auto-Moto, while encountering difficulties in identifying posts annotated with the topics Bikes and Technology. This is possibly due to the fact that the posts belonging to the former categories are the most frequent in the entire dataset. Particularly noteworthy is the fact that, across almost all tested ranks, the model failed to correctly identify any posts related to Technology. This issue is likely attributable to the limited representation of this category within the dataset, further compounded by the original dataset configuration having more examples in the test set than in the training set (51 and 85 samples in the training and test sets respectively).</p>
        <p>Analyzing the variation of results based on the labels used for representing the categories, we observe, in line with Figure 1, that the choice of the label often has a significant impact on the model's performance. While some labels exhibit relatively stable results with minor variations across different representations, such as Anime, Bikes, Sports and Auto-Moto, there are other instances where the selected labels lead to remarkable fluctuations in the model's performance. Notably, this behaviour emerges especially in the identification of posts related to Nature, Metal-Detecting, Medicine-Aesthetics and Entertainment. For these categories, IT5's classification performance can change drastically depending on the specific label. In some cases, the model manages to achieve quite good results, accurately classifying posts with a high f-score. However, in other instances, it struggles significantly, making erroneous classifications for the majority of cases. For instance, in the case of Medicine-Aesthetics, the f-score reaches a maximum of 0.71 when the label is represented by the term acuto, but the model fails to correctly classify any instance (f-score = 0) when the label is represented as proprio. This highlights how the choice of the label can significantly impact IT5's classification performance across different topics and, therefore, suggests the importance of exploring optimized selection strategies to maximize the model performance.</p>
        <p>To obtain a more comprehensive qualitative perspective of these findings, we include in Figure 4 the top and bottom 10 representations that maximized/minimized the f-score values for the four aforementioned categories. As we can observe, among the four considered categories, only one (Medicine-Aesthetics) contains the original label, i.e. the one with cosine similarity equal to 1 (medicina), in the top 10 representations. For the other categories, the absence of the original label seems to suggest that the chosen word for the label, which should be the closest one to the reference topic, may not be the one that maximizes the results. When analyzing individual words, it becomes evident that not all the words contributing to the model's best performance belong exclusively to the domain of the considered category. Surprisingly, words such as cinema and sitcom, seemingly related to the Entertainment domain, are among those that most negatively impact the model's f-scores. Nevertheless, Medicine-Aesthetics shows an exception, with several words aligned with the category's domain, e.g. benessere, medicina, dottoressa and sensibilità. Lastly, it is worth noticing that the performance drop is mostly label-dependent, and there is a significant difference between the most- and least-performing representations for the four categories. In fact, while Nature and Metal-Detecting exhibit a relatively modest decrease (around .20 f-score points), Medicine-Aesthetics and Entertainment display a far more pronounced difference in performance.</p>
        <sec id="sec-2-2-1">
          <title>3.1. Correlating Model Performance and Tested Representations</title>
          <p>Semantic Similarity Initially, we aimed to ascertain whether there is a correlation between the words that are more/less semantically similar to the original categories and the performance of IT5. To achieve this, we computed the Spearman correlation between the T5 model's performance and the cosine similarity values calculated to construct the 100 sets for each label. The results of these correlations are presented in Table 2<sup>3</sup>. As observed, 6 out of the 11 classification categories exhibit statistically significant correlations. Among these, only one correlation is positive (Entertainment), while the others show negative correlation values. This outcome is quite unexpected, as it seemingly implies that the improvement in the model's performance is linked to a decrease in semantic similarity. However, it is crucial to emphasize that the correlation values are not particularly high and, thus, we cannot draw any conclusion from these results. Moreover, it is important to consider that while cosine similarity can serve as a useful measure of similarity between embeddings, it may not encompass the entire semantic space.</p>
          <p>Table 2: Spearman correlations between f-scores and label similarities (cosine similarity) for each category. Statistically significant correlations are marked with *.</p>
          <p><sup>3</sup> In Appendix A we also report the scatterplots showing the relationship between f-scores and cosine similarity values for these labels.</p>
        </sec>
      </sec>
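<p>The rank-based correlation used throughout this section can be sketched as follows; the implementation below is a plain-numpy stand-in for the usual scipy.stats.spearmanr, and all data are synthetic:</p>

```python
# A plain-numpy stand-in (assumed, synthetic data) for the Spearman rank
# correlation used in Sec. 3.1; scipy.stats.spearmanr returns the same rho
# together with a p-value.
import numpy as np

def spearman(x, y):
    # Spearman's rho is the Pearson correlation of the rank-transformed
    # values; argsort of argsort yields ranks for distinct values.
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

rng = np.random.default_rng(0)
cosine_sims = np.sort(rng.uniform(0.2, 1.0, size=100))[::-1]  # rank 0 = most similar
f_scores = rng.uniform(0.54, 0.65, size=100)                  # toy per-rank f-scores

rho = spearman(cosine_sims, f_scores)
assert 1.0 >= rho >= -1.0
```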
      <sec id="sec-2-3">
<p>Having analyzed the model's performance and assessed the impact of the words used to represent the categories on the classification results, we decided to explore the existence of any relationship between the model's performance and the tested representations.</p>
        <p>Internal Similarity Since the similarity between the selected labels within each set could potentially impact the model's performance, we conducted an additional test to investigate whether higher semantic similarity among the representations within a set could negatively affect the performance of IT5. To achieve this, we computed the "inner similarity" of each set, defined as the average cosine similarity of all possible distinct label combinations<sup>4</sup>. Subsequently, we computed the Spearman correlation between each set's "inner similarity" and the f-scores obtained by the model fine-tuned with it. Although the values of the "inner similarity" vary considerably across the sets (ranging from a similarity of 0.69 for rank 0 to 0.38 for rank 100), we did not find a statistically significant correlation with the model's performance (Spearman = 0.01, p-value = 0.90). These results suggest that, despite the sets exhibiting considerable variation in terms of inner similarity, the similarity between the representations did not plainly affect the model's performance.</p>
<p>Representations Frequencies Finally, since the aforementioned results demonstrated that different labels have an impact on the model's performance, we decided to investigate whether this impact could be somehow related to the frequency of these representations within the model's training dataset. To this end, we computed the absolute frequency of each label used in our experiments (11 labels per 100 sets, totalling 1,100 words) within the Italian version of the mC4 corpus, i.e. the corpus on which IT5 was trained. Subsequently, we calculated the correlation between the scores obtained by IT5 for each label of each set S_i and the corresponding frequencies of each label found in the mC4 corpus. Among the 11 categories present in the dataset, only one showed a statistically significant correlation, Smoke, with a Spearman correlation value of -0.25<sup>5</sup>. This result suggests that, at least for this particular category, a decrease in the label's frequency in the training corpus corresponds to an increase in the model's performance. However, the fact that only one representation exhibits a significant correlation, and that this correlation is not particularly high, once again prevents us from drawing any conclusive findings.</p>
        <p>Thus, it underscores the need to explore other strategies for label selection in the future.</p>
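<p>The frequency-counting step can be sketched as follows; the corpus and label lists below are toy stand-ins for the Italian portion of mC4 and the 1,100 tested labels:</p>

```python
# A toy sketch of the frequency analysis: absolute label frequencies counted
# in a corpus (the paper uses the Italian portion of mC4; the corpus and
# labels here are illustrative).
from collections import Counter

corpus_tokens = ("medicina benessere sport anime sport natura "
                 "sport medicina fumo tecnologia sport").split()
freqs = Counter(corpus_tokens)

labels = ["sport", "medicina", "natura", "fumo"]
label_freqs = [freqs[l] for l in labels]
print(dict(zip(labels, label_freqs)))  # → {'sport': 4, 'medicina': 2, 'natura': 1, 'fumo': 1}
```

These per-label counts are what would then be correlated (e.g. with Spearman's rho) against the per-label f-scores.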
      </sec>
</sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>This work has been supported by the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by the NextGenerationEU.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <sec id="sec-4-1">
<p>In this work, we presented an evaluation of the impact of label selection on the performance of a Sequence-to-Sequence Model in a classification task. By fine-tuning the Italian version of the T5 model on a topic classification task, we explored various sets of labels and examined their influence on the model's performance.</p>
        <p>Our results indicate that the choice of words used to represent the classification categories can have a significant impact on the model's performance. While some labels led to competitive results, others resulted in sub-optimal outcomes, with noteworthy variations in the classification scores. This finding diverges from previous studies that suggested label representations had little impact on model performance.</p>
        <p>Interestingly, the correlation between the model's performance and the degree of "semantic" distance between the chosen labels and the original ones was not clear. While some categories exhibited statistically significant correlations, these were either positive or negative, indicating that higher or lower semantic similarity did not consistently lead to better performance.</p>
        <p>In conclusion, our findings suggest that the choice of the label is not a trivial matter and can have a significant impact on the performance of Sequence-to-Sequence Models in classification tasks. To maximize performance, it is essential to explore optimized label selection techniques, carefully tailored to the specific task and dataset.</p>
        <p>Future research could focus on developing more sophisticated methods for label selection, taking into account not only semantic similarity but also other relevant factors. Additionally, it would be valuable to investigate the generalizability of these findings across other languages and models, in order to gain a more comprehensive understanding of the influence of label selection on different NLP tasks.</p>
      </sec>
      <sec id="sec-4-2">
<title>References</title>
        <p><sup>4</sup> As defined in Sec. 2.3, a label is represented as the average embedding of each subtoken in the string. <sup>5</sup> The table with all the correlations is reported in Appendix B.</p>
        <p>[5] N. S. Keskar, B. McCann, C. Xiong, R. Socher, Unifying question answering, text classification, and regression via span extraction, arXiv preprint arXiv:1904.09286 (2019).</p>
        <p>[6] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners (2019).</p>
        <p>[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</p>
        <p>[8] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models, arXiv preprint arXiv:2210.11416 (2022).</p>
        <p>[9] C. Song, F. Cai, J. Zheng, W. Chen, Z. Pan, Metric sentiment learning for label representation, in: Proceedings of the 30th ACM International Conference on Information and Knowledge Management, CIKM '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 1703–1712. URL: https://doi.org/10.1145/3459637.3482369. doi:10.1145/3459637.3482369.</p>
        <p>[10] W. Jiang, Y. Zhang, J. Kwok, Effective structured prompting by meta-learning and representative verbalizer, in: International Conference on Machine Learning, PMLR, 2023, pp. 15186–15199.</p>
        <p>[11] K. Ji, Y. Lian, J. Gao, B. Wang, Hierarchical verbalizer for few-shot hierarchical text classification, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 2918–2933. URL: https://aclanthology.org/2023.acl-long.164.</p>
        <p>[12] T. Schick, H. Schmid, H. Schütze, Automatically identifying words that can serve as labels for few-shot text classification, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 5569–5578. URL: https://aclanthology.org/2020.coling-main.488. doi:10.18653/v1/2020.coling-main.488.</p>
        <p>[13] G. Cui, S. Hu, N. Ding, L. Huang, Z. Liu, Prototypical verbalizer for prompt-based few-shot tuning, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 7014–7024. URL: https://aclanthology.org/2022.acl-long.483. doi:10.18653/v1/2022.acl-long.483.</p>
        <p>[14] X. Chen, J. Xu, A. Wang, Label representations in modeling classification as text generation, in: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop, Association for Computational Linguistics, Suzhou, China, 2020, pp. 160–164. URL: https://aclanthology.org/2020.aacl-srw.23.</p>
        <p>[15] A. Cimino, F. Dell'Orletta, M. Nissim, TAG-it @ EVALITA 2020: Overview of the topic, age, and gender prediction task for Italian, Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (2020).</p>
        <p>[16] M. Papucci, C. De Nigris, A. Miaschi, F. Dell'Orletta, Evaluating text-to-text framework for topic and style classification of Italian texts, in: Proceedings of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI 2022) co-located with the 21st International Conference of the Italian Association for Artificial Intelligence (AI*IA 2022), 2022.</p>
        <p>[17] G. Sarti, M. Nissim, IT5: Large-scale text-to-text pretraining for Italian language understanding and generation, arXiv preprint arXiv:2203.03759 (2022). URL: https://arxiv.org/abs/2203.03759.</p>
        <p>[18] V. Basile, M. Di Maro, D. Croce, L. Passaro, EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian, in: 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop, EVALITA 2020, volume 2765, CEUR-WS, 2020.</p>
        <p>[19] A. Maslennikova, P. Labruna, A. Cimino, F. Dell'Orletta, Quanti anni hai? Age identification for Italian, in: CLiC-it, 2019.</p>
        <p>[20] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A massively multilingual pre-trained text-to-text transformer, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 483–498. URL: https://aclanthology.org/2021.naacl-main.41. doi:10.18653/v1/2021.naacl-main.41.</p>
        <p>[21] M. Baroni, S. Bernardini, A. Ferraresi, E. Zanchetta, The WaCky wide web: a collection of very large linguistically processed web-crawled corpora, Language Resources and Evaluation 43 (2009) 209–226.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>A. Appendix A</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
<surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          .,
          <source>J. Mach. Learn. Res</source>
          .
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Webson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
<surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sutawika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Alyafeai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chafin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stiegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Le</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raja</surname>
          </string-name>
          , et al.,
          <article-title>Multitask prompted training enables zero-shot task generalization</article-title>
          ,
          <source>in: The Tenth International Conference on Learning Representations</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Aribandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Q.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          , et al.,
          <article-title>Ext5: Towards extreme multi-task scaling for transfer learning</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. S.</given-names>
            <surname>Keskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
<article-title>The natural language decathlon: Multitask learning as question answering</article-title>
          , arXiv preprint arXiv:1806.08730 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>