<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Eighth Workshop on Natural Language for Artificial Intelligence, November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Fantastic Labels and Where to Find Them: Attention-Based Label Selection for Text-to-Text Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michele Papucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Miaschi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Talia S.R.L.</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ItaliaNLP Lab, Istituto di Linguistica Computazionale “Antonio Zampolli” (CNR-ILC)</institution>
          ,
          <addr-line>Pisa</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università di Pisa</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2</volume>
      <fpage>6</fpage>
      <lpage>27</lpage>
      <abstract>
<p>Generative language models, particularly those adopting text-to-text frameworks, have shown significant success in NLP tasks. While much research has focused on input representations via prompting techniques, less attention has been given to optimizing output representations. Previous studies found inconsistent effects of label representations on model performance in classification tasks using these models. In this work, we introduce a novel method for selecting well-performing label representations by leveraging the attention mechanisms of Transformer models. We used an Italian T5 model fine-tuned on a topic classification task, trained on posts extracted from online forums and categorized into 11 classes, to evaluate different label representation selection strategies. We employed a context-mixing score called Value Zeroing to assess each token's impact and to select possible representations from the training set. Our results include a detailed qualitative analysis to identify which label choices most significantly affect classification outcomes, suggesting that using our approach to select label representations can enhance performance.</p>
      </abstract>
      <kwd-group>
        <kwd>label selection</kwd>
        <kwd>label representations</kwd>
        <kwd>encoder-decoder</kwd>
        <kwd>topic classification</kwd>
        <kwd>attention mechanism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Background</title>
      <p>
        In recent years, generative language models have become increasingly prevalent for solving a wide
range of NLP tasks. Among these models, the text-to-text paradigm has demonstrated significant success
across numerous applications [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. The text-to-text paradigm creates a unifying framework where
each task is transformed to accommodate a textual input and output, resulting in a single abstraction
capable of handling any task. Recently, the adoption and refinement of pre-trained Large Language
Models (LLMs) have made this paradigm popular even in zero- or few-shot settings [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In these
scenarios, most of the studies have focused on prompting techniques or verbalizers, i.e., how to better
represent the input for the model, by specifying instructions or tasks. Few works have instead focused
on how to better represent the output of the models. Among these, [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] designed different kinds of label
representations and tested their impact on the T5 model on four classification tasks, showing that for
most of these tasks, the performance was unaffected by the representations. Similarly, [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] showed that
modifying the textual representation of the labels in a binary classification task (i.e. gender prediction)
does not change the performance of the IT5 model [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. On the contrary, shuffling the labels for a topic
classification task leads to worse performance. By training several IT5 models with different label
representations, [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] found that the textual representation of the label had a big impact on the model’s
discriminatory abilities for the same task of topic classification, especially for lower frequency classes.
Nevertheless, an in-depth analysis focused on identifying correlations between model performance and
several properties of the textual representations (e.g. the cosine distance between the encodings of the
representation and the original label name, the frequencies of the representations) yielded no significant
insights on how to better choose these representations in order to maximize model performance.
      </p>
      <p>
        Starting from these premises, in this work we propose a novel methodology for selecting label
representations in a text-to-text classification scenario exploiting the potential of the attention mechanism
of Transformer models. In fact, previous work showed that attention can be successfully employed
in several scenarios, such as in the automatic identification of keyphrases from documents [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ],
ontology alignment [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], document ranking [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] or semantic similarity [14]. Our purpose is to
understand whether it is possible to define an automated approach for identifying a well-performing set of
candidate labels in a classification task relying on a text-to-text model.
      </p>
      <p>To investigate this, we conducted our experiments by fine-tuning the IT5 model on the topic
classification task [15] using various label representations. Specifically, we tested different approaches for
selecting candidate labels relying on Value Zeroing [16], a context-mixing score based on the attention
mechanism aimed at quantifying the contribution each context token has in determining the final
representation of a target token. Moreover, we performed a thorough qualitative analysis to determine
which labels have the most substantial impact on the improvement or decline of classification results.
Contributions In this paper we: i) present a novel technique for label representation selection based
on the attention mechanism of Transformer models. We tested three different configurations and found
that one shows promising results in finding the best possible representations to maximize performance;
ii) show an in-depth qualitative analysis of the chosen representations, with the intent to find usable
correlations to improve the performance of our label representation selection technique.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Our Approach</title>
      <p>When employing a text-to-text model for classification tasks, the class names must be represented as
specific sequences of tokens (hereafter label representations) that the model outputs to assign an
input to a particular class. We aim to find a set of suitable label representations that maximize the
model’s performance.</p>
      <p>To do so, we hypothesize that we can use the attention mechanism of the model to find suitable
representations for each class inside the training set of the target task. Particularly, we look at which
tokens were the most salient for building the vectorial representations of important tokens in the post
using Value Zeroing. We tested three different ways to select the important tokens inside the posts. First,
we tried looking at the tokens that were used to build the representation of the End-of-Sentence special
character of T5 &lt;/s&gt; (EOS). Then we also tried to append class-related tokens to the end of the posts.
The idea was to inject class-related words into the posts to see which original tokens from the posts
were useful in building them:
• In the Appended Label method, we define t as the translation of the original class names1,
e.g. the post “Che giornata indimenticabile... è passato proprio tanto tempo!&lt;/s&gt;” from category
SPORTS, becomes: “Che giornata indimenticabile... è passato proprio tanto tempo! Sport&lt;/s&gt;” ;
• In the Appended Label with Prompt method, we provide the model additional context, by
defining t as: La frase precedente appartiene alla categoria (English translation: The previous post
belongs to the category of ) followed by the translated original class name, e.g. the post “Che
giornata indimenticabile... è passato proprio tanto tempo!&lt;/s&gt;” from category SPORTS, becomes:
“Che giornata indimenticabile... è passato proprio tanto tempo! La frase precedente appartiene alla
categoria Sport&lt;/s&gt;”</p>
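      <p>As a minimal illustrative sketch (helper names such as build_variants and PROMPT are assumptions, not code from our pipeline), the three input variants described above can be built as follows:</p>
      <preformat>
# Illustrative sketch: building the three inputs whose tokens are later scored
# with Value Zeroing. Names are assumptions, not the exact code of the paper.
PROMPT = "La frase precedente appartiene alla categoria"
EOS = "&lt;/s&gt;"  # T5 end-of-sequence marker

def build_variants(post: str, translated_label: str) -> dict:
    """Return the post as used by the EOS, Appended Label and
    Appended Label with Prompt selection methods."""
    return {
        "eos": post + EOS,
        "appended_label": f"{post} {translated_label}{EOS}",
        "appended_label_with_prompt": f"{post} {PROMPT} {translated_label}{EOS}",
    }

# Example from the paper, category SPORTS:
variants = build_variants(
    "Che giornata indimenticabile... è passato proprio tanto tempo!", "Sport"
)
      </preformat>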
      <p>Formally, let p ∈ D be one of the training posts in the dataset D. Each post p is tagged with one of
the classes c ∈ C, where C is the set of the possible topics. The posts are tokenized using the provided
IT5-trained tokenizer T. For each post p, we inject a series of tokens t tokenized with T. The
objective is to study which tokens from the original post p are more salient for the model to build the
representation of the tokens in t.
1List of translated labels: anime, automobilismo, bicicletta, sport, natura, metal detector, medicina, celebrità, fumo, intrattenimento
and tecnologia.</p>
      <p>As explained before, the difference between the three methods is how t is defined: in the EOS method
t = &lt;/s&gt;, in the Appended Label method t is equal to the appended and translated class name, and
in the Appended Label with Prompt method t is equal to the predefined prompt completed with the
translated class name.</p>
      <p>After injecting t in each p ∈ D, we pass each post in inference through our modified implementation
of IT5, whose Encoder is able to calculate the Value-Zeroing matrix (see Section 3.1). Then, we define
as a candidate label representation r the token w ∈ p that obtained the highest Value-Zeroing score
with respect to the tokens in t:</p>
      <p>r = argmax_{w ∈ p} ValueZeroing(w, t)</p>
      <p>By doing so we obtain, for each post, the most important token whose embedding vector is used to
construct the representation of t2. After doing this for the whole dataset, we obtain, for each category
c ∈ C, a list of representations R_c. R_c contains n tuples, with n equal to the number of posts in the dataset
D tagged with c. Each tuple is composed of the candidate label representation r and the Value-Zeroing
score v it obtained with respect to t: R_c = [(r_1, v_1), ..., (r_n, v_n)].</p>
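      <p>The following is a minimal sketch of this selection step (function and variable names are illustrative). It assumes the Value-Zeroing matrix C is indexed as C[i, j], i.e. how much token i depends on token j, and that the n original tokens of p are followed by the injected tokens of t:</p>
      <preformat>
import numpy as np

def select_candidate(C: np.ndarray, tokens: list, n_original: int):
    """Pick the original-post token with the highest Value-Zeroing score
    with respect to the injected tokens t (the candidate representation r)."""
    # Rows: injected tokens of t; columns: original tokens of p.
    contributions = C[n_original:, :n_original]
    # Best score each original token obtained for any injected token
    # (other aggregations over t are possible).
    best_per_token = contributions.max(axis=0)
    j = int(best_per_token.argmax())
    return tokens[j], float(best_per_token[j])  # (r, Value-Zeroing score v)
      </preformat>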
      <p>Since some of these representations may be duplicates (i.e. the same representation r has been chosen
from multiple posts), we decided to aggregate those representations in a way that rewards their higher
frequency count. We aggregate all the tuples that have the same representation r and sum together
their Value-Zeroing scores v, creating a single element in the R_c list. After doing these aggregation
steps for all categories c ∈ C, we have, for each category c, a set of representations R_c that we
sort based on the v value of the tuples in descending order, obtaining a ranked Representation Set.</p>
      <p>Finally, we define a set of representations S_k, called the Representation Set of rank k, where, for each
category c, we take the k-th ranked representation r in R_c. E.g. in the set S_0, for each category, we have
the best-ranked representation, while in the set S_10, for each category, we have the representation that
ranked 11th.</p>
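      <p>The aggregation and ranking steps can be sketched as follows (names are illustrative; candidates[c] corresponds to the list R_c of (r, v) tuples collected for category c):</p>
      <preformat>
from collections import defaultdict

def build_representation_sets(candidates: dict, k_max: int) -> list:
    """candidates: {category: [(representation r, value_zeroing_score v), ...]}.
    Returns the Representation Sets S_0 ... S_{k_max-1}."""
    ranked = {}
    for category, pairs in candidates.items():
        scores = defaultdict(float)
        for representation, score in pairs:
            scores[representation] += score  # duplicates reward frequency
        ranked[category] = sorted(scores, key=scores.get, reverse=True)
    # S_k contains, for every category, its (k+1)-th best representation.
    return [
        {category: reps[k] for category, reps in ranked.items() if len(reps) > k}
        for k in range(k_max)
    ]
      </preformat>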
      <p>An overview of our approach is illustrated in Figure 1.
2Since Transformers’ tokenizers split words into multiple subtokens, to obtain the full word we reconstruct it by reconnecting
all the tokens that are part of the word the token with the highest Value-Zeroing score belongs to. The Value-Zeroing score
we consider for the full word is the one of the token that was selected. We decided to avoid aggregating the scores of the full
word in any way, because that could reward or punish multi-token words.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setting</title>
      <p>We tested our approach on a topic classification task by training our models on forum posts
categorized into 11 classes. We tested all three previously presented selection methods and evaluated
their performances: we used as the target output the first ten best-ranked Representation Sets
S_0, S_1, ..., S_9 for all three methods, training ten models for each, for a total of 30 trained models. Then,
having assessed that the most promising strategy was the EOS method, we trained 100 models using
the Representation Sets S_0, ..., S_99 extracted with the EOS method to study the effectiveness of this
approach.</p>
      <p>In the following sections, we detail how the Value Zeroing technique works (Sec. 3.1) and present the
data, the model and the evaluation methods used in our experiments (Sec. 3.2 and 3.3).</p>
      <sec id="sec-3-1">
        <title>3.1. Value Zeroing</title>
        <p>Value Zeroing [16] draws inspiration from traditional interpretability techniques, where the influence
of a feature (in this case, a token representation) on the model’s output is extracted by removing that
feature from the input, i.e. feature importance methods [17]. Since deleting a word from a sentence,
without changing its semantics, is either challenging or impossible, the method opts to eliminate
it during the Attention computation of the considered layer, by zeroing its value vector, i.e. setting each
element in the vector to 0. Inside the Self-Attention layer of a Transformer, for each Attention head h,
the input vector x_i for the i-th token in the sequence is transformed into three distinct vectors through
the use of different sets of weights: the Query vector q_i^h, the Key vector k_i^h and the Value vector v_i^h. The
context vector z_i^h for the i-th token of each Attention head is generated as a weighted sum over the
Value vectors:</p>
        <p>z_i^h = Σ_{j=1}^{n} α_{i,j}^h v_j^h    (1)</p>
        <p>where α_{i,j}^h is the raw Attention weight assigned to the j-th token, computed as a Softmax-normalized
dot product between the corresponding Query and Key vectors. In Value Zeroing, Equation 1 is changed
by replacing the Value vector associated to the j-th token with a zero vector, v_j^h ← 0, ∀h ∈ H, while the context
vector for the i-th token is being computed. This provides a new representation x_i^{¬j} that has excluded the j-th token.
By comparing the original representation x_i with this new one, usually by means of a pairwise distance
metric, we obtain a measure of how much the output representation is affected by the exclusion of the j-th token. In
our experiments, we chose the cosine distance as the distance metric:</p>
        <p>C_{i,j} = cos_dist(x_i^{¬j}, x_i)    (2)</p>
        <p>Computing Equation 2 for each pair of tokens i, j generates a Value-Zeroing Matrix C where the value of
cell C_{i,j} indicates the degree to which the i-th token is dependent on the j-th token to form its
contextualized vectorial representation.</p>
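        <p>A minimal sketch of how the matrix C of Equation 2 could be assembled is given below; the helper encode_with_value_zeroed, which re-runs the encoder with the Value vector of a given token zeroed in every head, is a hypothetical placeholder for the modified encoder described in the next paragraph.</p>
        <preformat>
import numpy as np
from scipy.spatial.distance import cosine

def value_zeroing_matrix(tokens, encode, encode_with_value_zeroed):
    """Assemble C[i, j] (Equation 2): how much the contextualized representation
    of token i changes when the Value vector of token j is zeroed."""
    X = np.asarray(encode(tokens))                 # (n, d) original encoder outputs
    n = X.shape[0]
    C = np.zeros((n, n))
    for j in range(n):
        X_no_j = np.asarray(encode_with_value_zeroed(tokens, j))  # v_j set to 0
        for i in range(n):
            C[i, j] = cosine(X[i], X_no_j[i])      # cosine distance of Equation 2
    return C
        </preformat>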
        <p>For our experiments, we modified the implementation of T5 in the Python transformers library3
such that the model’s encoder can calculate the Value-Zeroing Matrix C. In particular, we look at the
section of the matrix C[n:n+m, 0:n], where n is the number of tokens in the original sentence and m is
the number of tokens that compose the appended sequence t (see Section 2 for how we chose
these tokens). This section of C illustrates how each original token in the sentence contributes to the
vectorial representation of the appended tokens.
3The modified class is T5ForConditionalGeneration available in https://github.com/huggingface/transformers/blob/main/src/
transformers/models/t5/modeling_t5.py. To do so, we adapted the original Value Zeroing implementation for the BERT
transformer modelling class: https://github.com/hmohebbi/ValueZeroing.</p>
      </sec>
      <sec>
        <title>3.2. Data</title>
        <p>We relied on posts extracted from TAG-IT [15], the profiling shared task presented at EVALITA 2020
[18]. The dataset, based on the corpus defined in [19], consists of more than 18,000 posts written in
Italian and collected from different blogs. Each post is labeled with three different labels: age (binned
into 5 classes) and gender (male or female) of the writer, and topic (11 classes).</p>
        <p>[Table 1. Distribution of posts per category: Anime, Auto-Moto, Bikes, Celebrities, Entertainment,
Medicine-Aesthetics, Metal-Detecting, Nature, Smoke, Sports, Technology, All.]</p>
        <p>
          Since previous works have shown that tasks that are solved through the use of lexical and semantic
information benefit the most from a well-chosen label representation [
          <xref ref-type="bibr" rid="ref7 ref9">7, 9</xref>
          ], we have decided to focus
only on the Topic classification task. Moreover, to have comparable results with previous studies, we
used the same dataset configuration used in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. This setting is different from how the original task
was defined in [15]: instead of predicting the label of a given collection of texts (multiple posts), we
fine-tuned our model to predict the topic of each single post and, since a fair amount of posts was
quite short, we removed the posts shorter than 10 tokens. At the end of this process, we obtained a
dataset consisting of 13,553 posts as the training set and 5,055 posts as the test set. The distribution of
posts according to each label is reported in Table 1.
        </p>
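        <p>A minimal sketch of this filtering step is shown below; the tokenize function is an assumption (in practice it could be whitespace splitting or the IT5 tokenizer):</p>
        <preformat>
def filter_short_posts(posts, tokenize, min_tokens=10):
    """Keep only posts with at least min_tokens tokens, as described above."""
    return [post for post in posts if len(tokenize(post)) >= min_tokens]
        </preformat>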
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Model and Evaluation</title>
        <p>We used the T5 base version pre-trained on the Italian language, i.e. IT54. In particular, the model was
trained on the Italian sentences extracted from a cleaned version of the mC4 corpus [20], a multilingual
version of the C4 corpus including 107 languages.</p>
        <p>Models’ performances on the topic classification task were computed using the F-Score on the test
set. To evaluate the capability of our selection method to find suitable labels, we trained up to 100
models with 100 different Representation Sets. Each of these sets was composed of representations
chosen by our method and was ranked based on its prediction, from the set predicted as the best (Rank
0) to the set predicted to be the worst (Rank 99). We then calculated the Spearman correlation between
the sets' ranking and the F-score obtained using each set. If our method can reliably predict the best
representation to maximize performance, we expect a correlation between the ranking and the model
performances. We used the traditional approach of using translated class names for classification as our
baseline.</p>
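        <p>As an illustrative sketch (names are assumptions), this correlation check can be carried out with scipy; ranks are the Representation Set ranks and f_scores the F-scores of the corresponding fine-tuned models:</p>
        <preformat>
from scipy.stats import spearmanr

def rank_score_correlation(ranks, f_scores):
    """Spearman correlation between Representation Set rank and model F-score.
    A negative rho means better-ranked sets tend to yield higher F-scores."""
    rho, p_value = spearmanr(ranks, f_scores)
    return rho, p_value

# e.g. rho, p = rank_score_correlation(list(range(100)), f_scores_of_100_models)
        </preformat>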
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>As a first step, we evaluated the first ten Representation Sets S_0, ..., S_9 from each of the three tested
methods to assess their potential in predicting the most effective representations. Figure 2 reports
the scatter-plots showing the F-scores obtained on the test set by each model according to the 10
Representation Sets. As we can notice, the first two methods, Appended Label and Appended Label
with Prompt, don't show any particularly interesting trends. The first one has a slightly negative
coefficient and a Spearman correlation of 0.03 with a p-value of 0.934. With such a low correlation
value and high p-value we can't reject the null hypothesis and the obtained trend is probably random.
The same can be said for the second method too, where we have a slightly positive trend, with a
Spearman correlation of 0.151 with a p-value of 0.67. On the contrary, the third method shows a more
pronounced negative trend (ρ = −0.552), i.e. as the rank increases the performance of the
models tends to decrease. Although the correlation is not statistically significant at the standard cut-off
threshold (p-value = 0.098), we decided to use the EOS method for testing with a total of 100 representation
sets. Before proceeding, we removed from the original dataset the posts belonging to TECHNOLOGY.
This was done since for this class we extracted only 23 sets, due to the small number of samples in the
training set. After removing this class, we evaluated the method with the rest of the categories, training
100 models with the first 100 ranked Representation Sets.</p>
      <p>Correlation results are reported in Figure 3, while Table 2 shows the performances of the models
obtained with the representation set of rank 0, the best performing model (ranked 20th), and the
worst performing model (ranked 95th) along with the baseline (an IT5 model trained with the original class
labels translated into Italian). As we can see, the negative trend between models' performance and
Representation Sets observed previously can still be noted, although less pronounced (ρ =
−0.314, p-value = 0.001). In terms of classification scores, we obtained a difference of 0.05 in terms
of F-score between the best-performing model obtained by rank 20 (0.68), and the worst-performing one
obtained by rank 95 (0.63). Although general conclusions about the method cannot be drawn, it appears
that, in this setting, selecting labels from the training set using Attention attribution techniques, such
as Value Zeroing, effectively identifies keywords with meaningful semantic connections that IT5 can
leverage to achieve higher performance.</p>
      <p>Interestingly, the lowest-performing model using the EOS method achieved the same F-score (0.63) as
the baseline method, i.e. the standard approach of using translated class names. From this perspective,
the EOS method demonstrates superiority over the standard approach: in fact, the model trained on
the Representation Set ranked 0, identified by the EOS method as the best set, achieved an F-score of
0.656. While this is not the highest score produced by the EOS method, it still outperforms the standard
approach. A possible explanation of the effectiveness of the EOS method could be that, for building the
&lt;/s&gt; character, the Encoder of the model uses particularly informative words that we can leverage if
used as label representation. The role of the EOS character and other similar characters that are used
for modeling purposes, like the [CLS] character in BERT-like models, is to be used as input for the final
Language Modeling classifier. That pushes the model during the pre-training phase to learn to construct
a representation of such a token that summarizes all the relevant information in the sentence that is
needed to complete the language modeling task [21]. So, by taking the highest Value-Zeroing score
for constructing &lt;/s&gt;, we find tokens that are usually very contextually informative to the language
modelling task and contain a lot of useful information. It’s likely, then, that this information is also
useful during the fine-tuning phase, to construct that lexical connection between input sentences and
output classes. Moreover, when using the other two techniques, we focus on injected tokens that are
often appended without sufficient context to justify their presence at the end of the post. Appending
tokens to the end of the post may change the semantics of the sequence too much. The first
method, which simply appends the label to the end of the post, often creates scenarios where the word
appears to be out of place. The same applies to the second method, but thanks to the prompt, this effect
is less noticeable. This effect may also be the reason why the first method is the worst performing one,
while the Appended Label with Prompt method achieves F-Scores almost as high as the EOS one but
without showing any useful correlation between the chosen representation and the model F-Score, thus
not being usable as a Label Representation Selection method.</p>
      <p>
        Figure 4 shows the variation in F-Scores obtained for each class. As we can observe, there is a quite
low degree of variance between the classes, in contrast with the results obtained in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which used
the same dataset and task, but represented the classes using 10 human-selected representations and 90
randomly selected ones. This is especially pronounced for the lower frequency categories, where high
F-scores also correspond to lower variance. For instance, in contrast to the results reported by [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
where the MEDICINE-AESTHETICS class exhibited numerous outliers with F-scores dropping to as
low as 0, our selection method does not encounter such extreme variations. Even when accounting for
outliers, the performance of the class remains relatively stable, with F-scores that are acceptable even
in the worst-case scenario. A similar trend is observed for the ENTERTAINMENT class.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Qualitative Representations Analysis</title>
        <p>
          To have a deeper understanding of the effectiveness of the approach, we performed a more qualitative
analysis to determine which labels have the most substantial impact on the improvement or decline of
the classification accuracy. Table 3 reports the representations for each class obtained with the best and
worst performing models. As we can see, and in line with previous work [
          <xref ref-type="bibr" rid="ref6 ref9">6, 9</xref>
          ], it would seem there
are no clear patterns that could justify why certain words work better than others. Focusing on the
best-performing set, the only two words that are somehow related to their class seem to be schedina
for SPORTS, referring to the betting ticket used to bet for sports games, and the proper noun ilaria for
CELEBRITIES. For the worst representation, the only representation that can fit in its corresponding
domain (ENTERTAINMENT) is again a proper noun: dragonette, which is the name of a Canadian band.
Another interesting case is piaciuto,provato, which was treated as a single word by the IT5 tokenizer
owing to the missing space after the comma. While our aggregation method for tokens selected multiple
times also rewarded frequency, among the best-performing representations only four had been chosen as
the most salient token in the text multiple times: schedina (3 times), troverai (2 times), premuto (2 times),
gippi (2 times). This could mean that we should re-evaluate how important frequency is, and maybe
change the aggregation method to something that doesn’t reward the frequency as much.
        </p>
        <p>To better understand the role of the representations' frequencies in the training set, we computed both
the raw frequency of each representation in the whole dataset and the TF-IDF of the representations.
We then calculated the Spearman rank correlation of the frequencies, the TF-IDFs, and the number of
subtokens of the representations against the obtained F-score and the Representation Set rank the
representations are in. The TF-IDF has been calculated by considering all the documents of a single
category as a single document, and the documents' length has been calculated as the total number of
tokens (document lengths are reported in Appendix A; a sketch of this computation is given after this
paragraph). As we can observe (Table 4), representation frequency does not correlate with the obtained
F-score for any class. This, again, confirms that the absolute frequency of a certain term in the training
set doesn't seem to have any positive or negative effect on the ability of the model to use such a
representation for its classes. However, by using
the aggregation method that rewards frequency mentioned in Sec. 2, we can see that for some classes
the more frequent a word is, the more likely it is to be placed at a better rank (Rank x Frequency column
in Table 4). In particular, for two classes (SPORTS and AUTO-MOTO) more frequent representations
had a higher chance to be placed in the best ranks. This could mean that the most informative words
are frequently the same in these particular categories. Focusing on the TF-IDF correlations, we can
notice two negative statistically significant correlations with the F-score: SPORTS and ANIME. This is
probably due to the fact that in-domain words that don't appear as often in the other categories had a
slightly positive impact on performances. Moreover, we noticed that these two categories are also those
for which our model utilized several domain-specific words. In fact, the first ten ranked representations
for the two categories are mostly domain-specific:
• for SPORTS: campionato, gol, pareggio, centrocampo, milan, juventus, atalanta, tifosi, trequartista
and derby;
• for ANIME: streaming/download, grafio, manga, pokémon, pokemon, ko, morso, pokèmon, cmq,
drago5.</p>
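        <p>A minimal sketch of the per-category TF-IDF computation mentioned above (each category's posts concatenated into one document; names are illustrative):</p>
        <preformat>
import math
from collections import Counter

def category_tfidf(category_tokens: dict) -> dict:
    """category_tokens: {category: [token, ...]} with all posts of a category
    concatenated into a single document. Returns {category: {token: tf-idf}}."""
    n_docs = len(category_tokens)
    doc_freq = Counter()
    for tokens in category_tokens.values():
        doc_freq.update(set(tokens))                  # document frequency per token
    tfidf = {}
    for category, tokens in category_tokens.items():
        counts = Counter(tokens)
        length = len(tokens)                          # document length = total tokens
        tfidf[category] = {
            tok: (cnt / length) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        }
    return tfidf
        </preformat>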
        <p>Again, the correlation between the TF-IDF and the Representations’ rank was to be expected, based
on the method we’ve used to aggregate the representations. We’ve also empirically noticed that the
method we’ve used to extract the representations was keen on choosing domain-specific, low-frequency
words. This is why it often chooses typos and similar misspelled words. This is probably
because low-frequency words are usually full words with much more semantic value in them, and by
being domain-specific they carry high contextual information, useful for constructing the other tokens’
representations. This could explain why the TF-IDF, which is a metric that is specifically built to find
such words, correlates so highly and significantly with the extracted words’ ranks.</p>
        <p>Since Transformer models tokenize text by splitting them into subwords, we also tried to understand
whether there is any correlation between both the F-score and the Representation rank with the number
of subwords of the representations. From our results, we can see that subword length doesn’t seem to
affect the model’s performance, nor does our selection technique seem to prefer words that are split
into more or less subwords. The only two exceptions are AUTO-MOTO, where a higher number of
subwords leads to a decrease in performances, and SPORTS, where our model seemed to place words
with a higher subword number in lower places in the ranking system.</p>
        <p>Finally, we investigated the impact of the Part-of-Speech (PoS) associated with the representations,
both globally (See Figure 5) and on a per-class basis (Class-based distribution are reported in Appendix
B). The PoS are extracted from an Italian Word Form Vocabulary developed by the Institute for
Computational Linguistics (ILC) of the National Research Council of Italy (CNR), which contains all the word
forms and their possible PoS of the Italian language. As we can see from Figure 5, the most frequent
PoS are UNKNOWN, VERB, NOUN, and ADJ. The class UNKNOWN contains the words that are not
found in the Word Form Vocabulary, and these usually consist of typos, English words, proper nouns,
etc., and are examined in more detail for each category below. The categories with the highest number
of UNKNOWNs are ENTERTAINMENT, CELEBRITIES, ANIME, and SPORTS:
5morso, grafio and ko are all domain-specific words in the settings of the popular anime, cartoon and video-game Pokémon,
with the first two being moves and the latter being a specific status.</p>
        <p>• in ENTERTAINMENT, the majority of the UNKNOWNs are typos (e.g. cioe instead of cioè),
abbreviations (e.g. nnt instead of niente), words with an elongated final vowel
(e.g. iniziaaaaaa instead of inizia) or English words (e.g. wish);
• in CELEBRITIES, the majority are proper nouns (e.g. alessia, mirco, federica, etc.) and typos;
• in ANIME, the majority are proper nouns of video games or tv shows characters (e.g. pokémon or
charmender) or Japanese words (e.g. manga);
• in SPORTS, the majority are proper nouns of soccer teams or players (e.g. milan, juventus, higuain,
etc.) or match names composed by multiple teams or nation names (e.g. italia-uruguay or
brasile-olanda) that our system didn’t split since they didn’t contain any spaces.</p>
        <p>We also noted that for BIKES, NATURE, and AUTO-MOTO more VERBs are chosen instead of
NOUNs, while for METAL-DETECTING, SPORTS, and SMOKE the contrary holds. That being said, all the
Parts-of-Speech seem reasonably distributed and it seems that no particular one is preferred by the
method when choosing representations from the training set.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>
        In this work, we presented a novel technique for reliably choosing label representations in
text-to-text classification scenarios. This novel technique, based upon an Attention attribution technique
called Value Zeroing, provides a set of labels used to represent the class names for a text-to-text model.
We tested the approach on a Topic Classification task using IT5, an Italian pre-trained T5 model, by
training 100 different models with 100 sets of representations chosen this way. We found that choosing
representations with Value Zeroing and ranking them based on their scores leads to a useful correlation with
the trained models' scores. Moreover, we noticed that choosing representations this way leads to better
average performances and lower variance in performance, against both human- and randomly-chosen
representations [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Compared to the standard approach of using the class names directly as their
representation (in this case, by also translating them to Italian) our method performed better, and even
the worst-performing Representation set matched the standard approach.
      </p>
      <p>We also conducted an in-depth analysis to understand whether either the performance of the model or
our rankings were related to some simple statistics (frequency, TF-IDF, and the number of subtokens of
the representations). Results showed some statistically significant correlations, especially when focusing
on the TF-IDFs of the representations. We also found no interesting trend among the Parts-of-Speech
of the representations chosen this way. While NOUNs and VERBs were the most popular, there weren't
any interesting findings, and some distributions suggest that the chosen representations are usually
low-frequency in-domain words for that class.</p>
      <p>In conclusion, our findings highlight again that the choice of label representations isn’t trivial and
has an important impact on text-to-text classification performances, and our technique seems to be a
way to find a good solution for the label representation selection task.</p>
      <p>Future research should focus on applying this technique to different kinds of tasks, primarily on
those tasks where lexical and semantic clues from the text are essential in solving the task. Also, other
aggregation methods should be tested, reducing the impact of the selection frequency, which showed
not to be an important factor in the fine-tuned models’ performances.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <title>This work has been supported by:</title>
      </sec>
      <sec id="sec-6-2">
        <title>FAIR - Future AI Research (PE00000013) project under the NRRP MUR program funded by the NextGenerationEU. TEAMING-UP - Teaming up with Social Artificial Agents project under the PRIN grant no. 20177FX2A7 funded by the Italian Ministry of University and Research.</title>
        <p>of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 1688–1700. URL:
https://aclanthology.org/2023.acl-long.94. doi:10.18653/v1/2023.acl-long.94.
[14] H. Yamagiwa, S. Yokoi, H. Shimodaira, Improving word mover’s distance by leveraging
selfattention matrix, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for
Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp.
11160–11183. URL: https://aclanthology.org/2023.findings-emnlp.746. doi: 10.18653/v1/2023.
findings-emnlp.746.
[15] Cimino, Dell’Orletta, Nissim, Tag-it – topic, age and gender prediction, EVALITA (2020).
[16] H. Mohebbi, W. Zuidema, G. Chrupała, A. Alishahi, Quantifying context mixing in transformers, in:
A. Vlachos, I. Augenstein (Eds.), Proceedings of the 17th Conference of the European Chapter of the
Association for Computational Linguistics, Association for Computational Linguistics, Dubrovnik,
Croatia, 2023, pp. 3378–3400. URL: https://aclanthology.org/2023.eacl-main.245. doi:10.18653/
v1/2023.eacl-main.245.
[17] S. Mishra, S. Dutta, J. Long, D. Magazzeni, A survey on the robustness of feature importance and
counterfactual explanations, 2023. URL: https://arxiv.org/abs/2111.00358. arXiv:2111.00358.
[18] V. Basile, M. Di Maro, D. Croce, L. Passaro, Evalita 2020: Overview of the 7th evaluation campaign
of natural language processing and speech tools for italian, in: 7th Evaluation Campaign of Natural
Language Processing and Speech Tools for Italian. Final Workshop, EVALITA 2020, volume 2765,
CEUR-ws, 2020.
[19] A. Maslennikova, P. Labruna, A. Cimino, F. Dell’Orletta, Quanti anni hai? age identification for
italian., in: CLiC-it, 2019.
[20] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Rafel, mT5: A
massively multilingual pre-trained text-to-text transformer, in: Proceedings of the 2021 Conference
of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Association for Computational Linguistics, Online, 2021, pp. 483–498. URL: https:
//aclanthology.org/2021.naacl-main.41. doi:10.18653/v1/2021.naacl-main.41.
[21] K. Clark, U. Khandelwal, O. Levy, C. D. Manning, What does BERT look at? an analysis of BERT’s
attention, in: T. Linzen, G. Chrupała, Y. Belinkov, D. Hupkes (Eds.), Proceedings of the 2019 ACL
Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for
Computational Linguistics, Florence, Italy, 2019, pp. 276–286.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>A. Documents Size for TF-IDFs</title>
    </sec>
    <sec id="sec-8">
      <title>B. Distribution of Parts-of-Speech per Category of the extracted representation</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bonetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Hromei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Siciliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Stranisci</surname>
          </string-name>
          , Preface to the
          <source>Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI)</source>
          ,
          <source>in: Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI</source>
          <year>2024</year>
          )
          <article-title>co-located with 23th International Conference of the Italian Association for Artificial Intelligence (AI*IA</article-title>
          <year>2024</year>
          ),
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Aribandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Q.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          , et al.,
          <article-title>Ext5: Towards extreme multi-task scaling for transfer learning</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          .,
          <source>J. Mach. Learn. Res</source>
          .
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Webson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sutawika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Alyafeai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chafin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stiegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Le</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Raja</surname>
          </string-name>
          , et al.,
          <article-title>Multitask prompted training enables zero-shot task generalization</article-title>
          ,
          <source>in: The Tenth International Conference on Learning Representations</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brahma</surname>
          </string-name>
          , et al.,
          <article-title>Scaling instruction-finetuned language models</article-title>
          ,
          <source>arXiv preprint arXiv:2210.11416</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Label representations in modeling classification as text generation, in: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th</article-title>
          <source>International Joint Conference on Natural Language Processing: Student Research Workshop</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>160</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Papucci</surname>
          </string-name>
          ,
          <string-name>
            <surname>C. De Nigris</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Miaschi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <article-title>Dell'Orletta, Evaluating text-to-text framework for topic and style classification of italian texts</article-title>
          ,
          <source>in: Proceedings of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI</source>
          <year>2022</year>
          )
          <article-title>co-located with 21th International Conference of the Italian Association for Artificial Intelligence (AI* IA</article-title>
          <year>2022</year>
          ),
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sarti</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Nissim, IT5: Text-to-text pretraining for Italian language understanding and generation</article-title>
          , in: N.
          <string-name>
            <surname>Calzolari</surname>
            , M.-
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Hoste</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lenci</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Sakti</surname>
          </string-name>
          , N. Xue (Eds.),
          <source>Proceedings of the 2024 Joint International Conference on Computational Linguistics</source>
          ,
          <article-title>Language Resources and Evaluation (LREC-COLING 2024), ELRA</article-title>
          and
          <string-name>
            <given-names>ICCL</given-names>
            ,
            <surname>Torino</surname>
          </string-name>
          , Italia,
          <year>2024</year>
          , pp.
          <fpage>9422</fpage>
          -
          <lpage>9433</lpage>
          . URL: https://aclanthology.org/
          <year>2024</year>
          .lrec-main.
          <volume>823</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Papucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miaschi</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.</surname>
          </string-name>
          <article-title>Dell'Orletta, Lost in labels: An ongoing quest to optimize text-to-text label selection for classification</article-title>
          , in: F. Boschetti,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Lebani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          , N. Novielli (Eds.),
          <source>Proceedings of the 9th Italian Conference on Computational Linguistics</source>
          , Venice, Italy,
          <source>November 30 - December 2</source>
          ,
          <year>2023</year>
          , volume
          <volume>3596</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2023</year>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3596</volume>
          /paper39.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <surname>X. Luo,</surname>
          </string-name>
          <article-title>AttentionRank: Unsupervised keyphrase extraction using self and cross attentions</article-title>
          , in: M.
          <article-title>-</article-title>
          <string-name>
            <surname>F. Moens</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Specia</surname>
          </string-name>
          , S. W.-t. Yih (Eds.),
          <source>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Online and
          <string-name>
            <given-names>Punta</given-names>
            <surname>Cana</surname>
          </string-name>
          , Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>1919</fpage>
          -
          <lpage>1928</lpage>
          . URL: https://aclanthology.org/
          <year>2021</year>
          .emnlp-main.
          <volume>146</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          .emnlp-main.
          <volume>146</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <article-title>SAMRank: Unsupervised keyphrase extraction using self-attention map in BERT and GPT-2</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>10188</fpage>
          -
          <lpage>10201</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          .emnlp-main.
          <volume>630</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .emnlp-main.
          <volume>630</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          , H. Kumar,
          <article-title>VeeAlign: Multifaceted context representation using dual attention for ontology alignment</article-title>
          , in: M.
          <article-title>-</article-title>
          <string-name>
            <surname>F. Moens</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Specia</surname>
          </string-name>
          , S. W.-t. Yih (Eds.),
          <source>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Online and
          <string-name>
            <given-names>Punta</given-names>
            <surname>Cana</surname>
          </string-name>
          , Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>10780</fpage>
          -
          <lpage>10792</lpage>
          . URL: https://aclanthology.org/
          <year>2021</year>
          .emnlp-main.
          <volume>842</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          .emnlp-main.
          <volume>842</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , FAA:
          <article-title>Fine-grained attention alignment for cascade document ranking</article-title>
          , in: A.
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Boyd-Graber</surname>
          </string-name>
          , N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational
          Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023,
          pp. 1688–1700. URL: https://aclanthology.org/2023.acl-long.94. doi:10.18653/v1/2023.acl-long.94.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] H. Yamagiwa, S. Yokoi, H. Shimodaira, Improving word mover’s distance by leveraging self-attention matrix, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 11160–11183. URL: https://aclanthology.org/2023.findings-emnlp.746. doi:10.18653/v1/2023.findings-emnlp.746.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Cimino, Dell’Orletta, Nissim, TAG-it – topic, age and gender prediction, EVALITA (2020).</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] H. Mohebbi, W. Zuidema, G. Chrupała, A. Alishahi, Quantifying context mixing in transformers, in: A. Vlachos, I. Augenstein (Eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 3378–3400. URL: https://aclanthology.org/2023.eacl-main.245. doi:10.18653/v1/2023.eacl-main.245.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] S. Mishra, S. Dutta, J. Long, D. Magazzeni, A survey on the robustness of feature importance and counterfactual explanations, 2023. URL: https://arxiv.org/abs/2111.00358. arXiv:2111.00358.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] V. Basile, M. Di Maro, D. Croce, L. Passaro, EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian, in: 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop, EVALITA 2020, volume 2765, CEUR-WS, 2020.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] A. Maslennikova, P. Labruna, A. Cimino, F. Dell’Orletta, Quanti anni hai? Age identification for Italian, in: CLiC-it, 2019.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A massively multilingual pre-trained text-to-text transformer, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 483–498. URL: https://aclanthology.org/2021.naacl-main.41. doi:10.18653/v1/2021.naacl-main.41.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] K. Clark, U. Khandelwal, O. Levy, C. D. Manning, What does BERT look at? An analysis of BERT’s attention, in: T. Linzen, G. Chrupała, Y. Belinkov, D. Hupkes (Eds.), Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Florence, Italy, 2019, pp. 276–286.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>