<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Fantastic Labels and Where to Find Them: Attention-Based Label Selection for Text-to-Text Classification</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Michele</forename><surname>Papucci</surname></persName>
							<email>michele.papucci97@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Università di Pisa</orgName>
							</affiliation>
							<affiliation key="aff1">
<orgName type="institution">Istituto di Linguistica Computazionale &quot;Antonio Zampolli&quot; (CNR-ILC)</orgName>
								<orgName type="laboratory">ItaliaNLP Lab</orgName>
								<address>
									<settlement>Pisa</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alessio</forename><surname>Miaschi</surname></persName>
							<email>alessio.miaschi@ilc.cnr.it</email>
							<affiliation key="aff1">
<orgName type="institution">Istituto di Linguistica Computazionale &quot;Antonio Zampolli&quot; (CNR-ILC)</orgName>
								<orgName type="laboratory">ItaliaNLP Lab</orgName>
								<address>
									<settlement>Pisa</settlement>
								</address>
							</affiliation>
						</author>
						<author>
<persName><forename type="first">Felice</forename><surname>Dell'Orletta</surname></persName>
							<email>felice.dellorletta@ilc.cnr.it</email>
							<affiliation key="aff1">
<orgName type="institution">Istituto di Linguistica Computazionale &quot;Antonio Zampolli&quot; (CNR-ILC)</orgName>
								<orgName type="laboratory">ItaliaNLP Lab</orgName>
								<address>
									<settlement>Pisa</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Fantastic Labels and Where to Find Them: Attention-Based Label Selection for Text-to-Text Classification</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">E60E1D5C283E6934E5A1D01FBEB1761A</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:37+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>label selection</term>
					<term>label representations</term>
					<term>encoder-decoder</term>
					<term>topic classification</term>
					<term>attention mechanism</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Generative language models, particularly those adopting text-to-text frameworks, have shown significant success in NLP tasks. While much research has focused on input representations via prompting techniques, less attention has been given to optimizing output representations. Previous studies found inconsistent effects of label representations on model performance in classification tasks using these models. In this work, we introduce a novel method for selecting well-performing label representations by leveraging the attention mechanisms of Transformer models. We used an Italian T5 model fine-tuned on a topic classification task, trained on posts extracted from online forums and categorized into 11 classes, to evaluate different label representation selection strategies. We employed a context-mixing score called Value Zeroing to assess each token's impact and to select candidate representations from the training set. Our results include a detailed qualitative analysis identifying which label choices most significantly affect classification outcomes, suggesting that using our approach to select label representations can enhance performance.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction and Background</head><p>In recent years, generative language models have become increasingly prevalent for solving a wide range of NLP tasks. Among these models, the text-to-text paradigm has demonstrated significant success across numerous applications <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>. The text-to-text paradigm creates a unifying framework where each task is transformed to accommodate a textual input and output, resulting in a single abstraction capable of handling any task. Recently, the adoption and refinement of pre-trained Large Language Models (LLMs) have made this paradigm popular even in zero- or few-shot settings <ref type="bibr" target="#b4">[5]</ref>. In these scenarios, most studies have focused on prompting techniques or verbalizers, i.e., how to better represent the input for the model by specifying instructions or tasks. Few works have instead focused on how to better represent the output of the models. Among these, <ref type="bibr" target="#b5">[6]</ref> designed different kinds of label representations and tested their impact on the T5 model across four classification tasks, showing that for most of these tasks the performance was unaffected by the representations. Similarly, <ref type="bibr" target="#b6">[7]</ref> showed that modifying the textual representation of the labels in a binary classification task (i.e. gender prediction) does not change the performance of the IT5 model <ref type="bibr" target="#b7">[8]</ref>. On the contrary, shuffling the labels for a topic classification task leads to worse performance. 
By training several IT5 models with different label representations, <ref type="bibr" target="#b8">[9]</ref> found that the textual representation of the label had a substantial impact on the model's discriminatory abilities for the same topic classification task, especially for lower-frequency classes. Nevertheless, an in-depth analysis focused on identifying correlations between model performance and several properties of the textual representations (e.g. the cosine distance between the encodings of the representation and the original label name, or the frequencies of the representations) yielded no significant insight into how to better choose these representations in order to maximize model performance.</p><p>Starting from these premises, in this work we propose a novel methodology for selecting label representations in a text-to-text classification scenario, exploiting the potential of the attention mechanism of Transformer models. In fact, previous work showed that attention can be successfully employed in several scenarios, such as the automatic identification of keyphrases from documents <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref>, ontology alignment <ref type="bibr" target="#b11">[12]</ref>, document ranking <ref type="bibr" target="#b12">[13]</ref> or semantic similarity <ref type="bibr" target="#b13">[14]</ref>. Our purpose is to understand whether it is possible to define an automated approach for identifying a well-performing set of candidate labels in a classification task relying on a text-to-text model.</p><p>To investigate this, we conducted our experiments by fine-tuning the IT5 model on the topic classification task <ref type="bibr" target="#b14">[15]</ref> using various label representations. 
Specifically, we tested different approaches for selecting candidate labels relying on Value Zeroing <ref type="bibr" target="#b15">[16]</ref>, a context-mixing score based on the attention mechanism that quantifies the contribution each context token makes in determining the final representation of a target token. Moreover, we performed a thorough qualitative analysis to determine which labels have the most substantial impact on the improvement or decline of classification results.</p><p>Contributions In this paper we: i) present a novel technique for label representation selection based on the attention mechanism of Transformer models. We tested three different configurations and found that one shows promising results in finding the best possible representations to maximize performance; ii) provide an in-depth qualitative analysis of the chosen representations, with the intent of finding usable correlations to improve the performance of our label representation selection technique.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Our Approach</head><p>When employing a text-to-text model for classification tasks, the class names must be represented as specific sequences of tokens (hereafter label representations) that the model outputs to assign an input to a particular class. We aim to find a set of suitable label representations that maximize the model's performance.</p><p>To do so, we hypothesize that we can use the attention mechanism of the model to find suitable representations for each class inside the training set of the target task. In particular, we use Value Zeroing to look at which tokens were the most salient for building the vectorial representations of important tokens in the post. We tested three different ways to select the important tokens inside the posts. First, we tried looking at the tokens that were used to build the representation of the End-of-Sentence special token of T5, &lt;/s&gt; (EOS). Then we also tried to append class-related tokens to the end of the posts. The idea was to inject class-related words into the posts to see which original tokens from the posts were useful in building them:</p><p>• In the Appended Label method, we define 𝑝 as the translation of the original class name; • in the Appended Label with Prompt method, we define 𝑝 as a predefined prompt completed with the translated class name.</p><p>Formally, let 𝑆 ∈ 𝐷 be one of the training posts in the dataset 𝐷. Each post 𝑆 is tagged with one of the classes 𝑐 ∈ 𝐶, where 𝐶 is the set of possible topics. The posts are tokenized using the provided IT5-trained tokenizer 𝑇 . For each post 𝑆, we injected a series of tokens 𝑝 tokenized with 𝑇 . 
The objective is to study which tokens from the original post 𝑆 are the most salient for the model when building the representation of the tokens in 𝑝.</p><p>As explained before, the difference between the three methods is how 𝑝 is defined: in the EOS method 𝑝 = &lt;/s&gt;, in the Appended Label method 𝑝 is the appended and translated class name, and in the Appended Label with Prompt method 𝑝 is the predefined prompt completed with the translated class name.</p><p>After injecting 𝑝 into each 𝑆 ∈ 𝐷, we pass each post in inference through our modified implementation of IT5, whose Encoder is able to calculate the Value-Zeroing matrix (see Section 3.1). Then, we define as a candidate label representation 𝑙 the token 𝑠 ∈ 𝑆 that obtained the highest Value-Zeroing score with respect to the tokens in 𝑝: 𝑙 = 𝑎𝑟𝑔𝑚𝑎𝑥 𝑠∈𝑆 (𝑣𝑎𝑙𝑢𝑒_𝑧𝑒𝑟𝑜𝑖𝑛𝑔_𝑠𝑐𝑜𝑟𝑒(𝑠, 𝑝))</p><p>By doing so we obtain, for each post, the most important token whose embedding vector is used to construct the representation of 𝑝<ref type="foot" target="#foot_1">2</ref> . After doing this for the whole dataset, we obtain, for each category 𝑐 ∈ 𝐶, a list of representations 𝑅 𝑐 . 𝑅 𝑐 contains 𝑛 tuples 𝑟 𝑖 , one for each post in the dataset 𝐷 tagged with 𝑐. Each tuple is composed of the candidate label representation 𝑙 and the Value-Zeroing score 𝑣𝑧 it obtained with respect to 𝑝: 𝑅 𝑐 = [(𝑙 1 , 𝑣𝑧 1 ), ..., (𝑙 𝑛 , 𝑣𝑧 𝑛 )].</p><p>Since some of these representations may be duplicates (i.e. the same representation 𝑙 has been chosen from multiple posts), we decided to aggregate those representations in a way that rewards a higher frequency count. We aggregate all the tuples that have the same representation 𝑙 and sum their Value-Zeroing scores 𝑣𝑧, creating a single element in the 𝑅 𝑐 list. 
After performing these aggregation steps for all categories ∀𝑅 𝑐 : 𝑐 ∈ 𝐶, we have, for each category 𝑐, a set of representations 𝑅 𝑐 that we sort based on the 𝑣𝑧 value of the tuples in descending order, obtaining a ranked Representation Set.</p><p>Finally, we define a set of representations 𝐸 𝑖 , called the Representation Set of rank 𝑖, where, for each category 𝑐, we have the 𝑖-th ranked representation 𝑟 𝑖 in 𝑅 𝑐 . E.g. in the set 𝐸 0 , for each category, we have the best-ranked representation, while in the set 𝐸 10 , for each category, we have the representation that ranked 11th.</p><p>An overview of our approach is illustrated in Figure <ref type="figure" target="#fig_0">1</ref>.</p></div>
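The selection and aggregation procedure described in this section can be sketched as follows. This is a minimal illustration, not the paper's released code: `value_zeroing_scores` is a hypothetical stand-in for the per-token scores produced by the modified IT5 encoder.

```python
from collections import defaultdict

def build_representation_sets(posts, value_zeroing_scores):
    """posts: list of (tokens, category) pairs from the training set.
    value_zeroing_scores: callable returning one score per token of a post,
    i.e. that token's contribution to the injected tokens p (hypothetical
    stand-in for the modified IT5 encoder).
    Returns {category: candidate representations ranked best-first}."""
    aggregated = defaultdict(lambda: defaultdict(float))
    for tokens, category in posts:
        vz = value_zeroing_scores(tokens)
        best = max(range(len(tokens)), key=lambda i: vz[i])  # l = argmax
        # Duplicate candidates are merged by summing their scores,
        # rewarding representations chosen from multiple posts.
        aggregated[category][tokens[best]] += vz[best]
    return {c: sorted(cands, key=cands.get, reverse=True)
            for c, cands in aggregated.items()}

def representation_set(ranked, i):
    """E_i: for every category, its i-th ranked representation."""
    return {c: reps[i] for c, reps in ranked.items() if i < len(reps)}
```

Calling `representation_set(ranked, 0)` then yields E_0, the set of best-ranked representations, one per category.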
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experimental Setting</head><p>We tested our approach on a topic classification task by training our models on forum posts categorized into 11 classes. We tested all three previously presented selection methods and evaluated their performance: we used as the target output the ten top-ranked Representation Sets 𝐸 0 , 𝐸 1 , ..., 𝐸 9 for each of the three methods, training ten models per method, for a total of 30 trained models. Then, having assessed that the most promising strategy was the EOS method, we trained 100 models using the Representation Sets 𝐸 0 , ..., 𝐸 99 extracted with the EOS method to study the effectiveness of this approach.</p><p>In the following sections, we detail how the Value Zeroing technique works (Sec. 3.1) and present the data, the model and the evaluation methods used in our experiments (Sec. 3.2 and 3.3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Value Zeroing</head><p>Value Zeroing <ref type="bibr" target="#b15">[16]</ref> draws inspiration from traditional interpretability techniques, where the influence of a feature (in this case, a token representation) on the model's output is extracted by removing that feature from the input, i.e. feature importance methods <ref type="bibr" target="#b16">[17]</ref>. Since deleting a word from a sentence without changing its semantics is challenging or even impossible, the method instead eliminates it during the Attention computation of the considered layer by zeroing its value vector, i.e. setting each element of the vector to 0. Inside the Self-Attention layer of a Transformer, for each Attention head ℎ, the input vector x 𝑖 for the 𝑖 𝑡ℎ token in the sequence is transformed into three distinct vectors through the use of different sets of weights: the Query vector q ℎ 𝑖 , the Key vector k ℎ 𝑖 and the Value vector v ℎ 𝑖 . The context vector z ℎ 𝑖 for the 𝑖 𝑡ℎ token of each Attention head is generated as a weighted sum over the Value vectors:</p><formula xml:id="formula_0">z ℎ 𝑖 = 𝑛 ∑︁ 𝑗=1 𝛼 ℎ 𝑖𝑗 v ℎ 𝑗<label>(1)</label></formula><p>where 𝛼 ℎ 𝑖𝑗 is the raw Attention weight assigned to the 𝑗 𝑡ℎ token, computed as a Softmax-normalized dot product between the corresponding Query and Key vectors. In Value Zeroing, Equation 1 is changed by replacing the Value vector associated with 𝑗 with a zero vector, v ℎ 𝑗 ← 0, ∀ℎ ∈ 𝐻, when the context vector for the 𝑖 𝑡ℎ token is computed. This provides a new representation x ¬𝑗 𝑖 that excludes 𝑗. By comparing the original representation x 𝑖 with this new one, usually by means of a pairwise distance metric, we obtain a measure of how much the output representation is affected by the exclusion of 𝑗. 
In our experiment, we chose the cosine distance as the distance metric:</p><formula xml:id="formula_1">C 𝑖𝑗 = 𝑐𝑠(x ¬𝑗 𝑖 , x 𝑖 )<label>(2)</label></formula><p>Computing Equation <ref type="formula" target="#formula_1">2</ref> for each token pair 𝑖, 𝑗 generates a Value-Zeroing Matrix C, where the value of cell C 𝑖𝑗 indicates the degree to which the 𝑖 𝑡ℎ token depends on the 𝑗 𝑡ℎ token to form its contextualized vectorial representation.</p><p>For our experiments, we modified the implementation of T5 in the Python transformers library<ref type="foot" target="#foot_2">3</ref> such that the model's encoder can calculate the Value-Zeroing Matrix C. In particular, we look at the section of the matrix C 𝑛 𝑠 :𝑛 𝑠 +𝑛 𝑝 ,0:𝑛 𝑠 , where 𝑛 𝑠 is the number of tokens in the original sentence and 𝑛 𝑝 is the number of appended tokens in 𝑝 (see Section 2 for how we chose these tokens). This section of C illustrates how each original token in the sentence contributes to the vectorial representation of the appended tokens.</p></div>
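A minimal NumPy sketch of the Value Zeroing computation for a single self-attention head (Equations 1 and 2), with toy random weight matrices. The actual experiments modify the T5 encoder in the transformers library; as a further simplification, the cosine distances here are computed on the context vectors z rather than the full layer outputs x.

```python
import numpy as np

def attention_context(X, Wq, Wk, Wv, zero_j=None):
    """Single-head self-attention context vectors z_i (Equation 1).
    If zero_j is given, the Value vector of token j is zeroed out."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    if zero_j is not None:
        V = V.copy()
        V[zero_j] = 0.0  # value zeroing: v_j <- 0
    scores = Q @ K.T / np.sqrt(K.shape[1])
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)  # softmax over j: the alpha_ij weights
    return A @ V

def value_zeroing_matrix(X, Wq, Wk, Wv):
    """C[i, j]: cosine distance between z_i computed with and without
    token j's Value vector (Equation 2)."""
    n = X.shape[0]
    Z = attention_context(X, Wq, Wk, Wv)
    C = np.zeros((n, n))
    for j in range(n):
        Zj = attention_context(X, Wq, Wk, Wv, zero_j=j)
        for i in range(n):
            cos = Zj[i] @ Z[i] / (np.linalg.norm(Zj[i]) * np.linalg.norm(Z[i]) + 1e-12)
            C[i, j] = 1.0 - cos
    return C
```

With n_s original tokens followed by n_p appended tokens, the section of interest is then the slice `C[n_s:n_s + n_p, 0:n_s]`.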
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Data</head><p>We relied on posts extracted from TAG-IT <ref type="bibr" target="#b14">[15]</ref>, the profiling shared task presented at EVALITA 2020 <ref type="bibr" target="#b17">[18]</ref>. The dataset, based on the corpus defined in <ref type="bibr" target="#b18">[19]</ref>, consists of more than 18,000 posts written in Italian and collected from different blogs. Each post is labeled with three different labels: the age (binned into 5 classes) and gender (male or female) of the writer, and the topic (11 classes). Since previous works have shown that tasks solved through the use of lexical and semantic information benefit the most from a well-chosen label representation <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b8">9]</ref>, we decided to focus only on the topic classification task. Moreover, to have results comparable with previous studies, we used the same dataset configuration as <ref type="bibr" target="#b8">[9]</ref>. This setting is different from how the original task was defined in <ref type="bibr" target="#b14">[15]</ref>: instead of predicting the label of a given collection of texts (multiple posts), we fine-tuned our model to predict the topic of each single post and, since a fair number of posts were quite short, we removed the posts shorter than 10 tokens. At the end of this process, we obtained a dataset consisting of 13,553 posts as the training set and 5,055 posts as the test set. The distribution of posts according to each label is reported in Table <ref type="table" target="#tab_1">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Model and Evaluation</head><p>We used the T5 base version pre-trained on the Italian language, i.e. IT5 <ref type="foot" target="#foot_3">4</ref> . In particular, the model was trained on the Italian sentences extracted from a cleaned version of the mC4 corpus <ref type="bibr" target="#b19">[20]</ref>, a multilingual version of the C4 corpus covering 107 languages. Model performance on the topic classification task was computed using the F-score on the test set. To evaluate the capability of our selection method to find suitable labels, we trained up to 100 models with 100 different Representation Sets. Each of these sets was composed of representations chosen by our method and was ranked based on its predicted quality, from the set predicted to be the best (Rank 0) to the set predicted to be the worst (Rank 99). We then calculated the Spearman correlation between each set's rank and the F-score obtained using that set. If our method can reliably predict the best representations to maximize performance, we expect a correlation between the rank and the model performance. As our baseline, we used the traditional approach of using translated class names for classification.</p></div>
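The rank-versus-performance evaluation can be sketched as follows; `spearman_rho` is a self-contained implementation without tie handling, and the F-scores below are made-up illustrative values, not results from the paper.

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation; minimal version without tie handling."""
    def to_ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        ranks = [0] * len(v)
        for rank, idx in enumerate(order):
            ranks[idx] = rank
        return ranks
    rx, ry = to_ranks(xs), to_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical F-scores for models trained on sets E_0..E_4, in rank order.
set_ranks = [0, 1, 2, 3, 4]
f_scores = [0.660, 0.655, 0.648, 0.641, 0.630]
rho = spearman_rho(set_ranks, f_scores)
# A negative rho means performance drops as the predicted rank worsens,
# i.e. the selection method's ranking is informative.
```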
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>As a first step, we evaluated the first ten Representation Sets 𝐸 0 , ..., 𝐸 9 from each of the three tested methods to assess their potential in predicting the most effective representations. Figure <ref type="figure" target="#fig_1">2</ref> reports the scatter plots showing the F-scores obtained on the test set by each model according to the 10 Representation Sets. As we can notice, the first two methods, Appended Label and Appended Label with Prompt, don't show any particularly interesting trends. The first one has a slightly negative coefficient and a Spearman correlation of 0.03 with a p-value of 0.934. With such a low correlation value and high p-value we can't reject the null hypothesis, and the observed trend is probably random. The same can be said for the second method, where we have a slightly positive trend, with a Spearman correlation of 0.151 and a p-value of 0.67. On the contrary, the third method, EOS, shows a more pronounced negative trend (𝑆𝑝𝑒𝑎𝑟𝑚𝑎𝑛 = −0.552), i.e. as the rank increases the performance of the models tends to decrease. Although the p-value of the correlation is above the standard cut-off threshold (𝑝 = 0.098), we decided to use the EOS method for testing with a total of 100 Representation Sets. Before proceeding, we removed from the original dataset the posts belonging to TECHNOLOGY. This was done since for this class we extracted only 23 sets, due to the small number of samples in the training set. 
After removing this class, we evaluated the method on the remaining categories, training 100 models with the first 100 ranked Representation Sets.</p><p>Correlation results are reported in Figure <ref type="figure" target="#fig_2">3</ref>, while Table <ref type="table" target="#tab_2">2</ref> shows the performances of the models obtained with the Representation Set of rank 0, the best-performing model (ranked 20th), and the worst-performing model (ranked 95th), along with the baseline (an IT5 model trained with the original class labels translated into Italian). As we can see, the negative trend between models' performance and Representation Sets observed previously can still be noted, although less pronounced (𝑆𝑝𝑒𝑎𝑟𝑚𝑎𝑛 = −0.314, 𝑝 = 0.001). In terms of classification scores, we obtained a difference of 0.05 in F-score between the best-performing model obtained at rank 20 (0.68) and the worst-performing one obtained at rank 95 (0.63). Although general conclusions about the method cannot be drawn, it appears that, in this setting, selecting labels from the training set using Attention attribution techniques, such as Value Zeroing, effectively identifies keywords with meaningful semantic connections that IT5 can leverage to achieve higher performance.</p><p>Interestingly, the lowest-performing model using the EOS method achieved the same F-score (0.63) as the baseline method, i.e. the standard approach of using translated class names. From this perspective, the EOS method demonstrates superiority over the standard approach: in fact, the model trained on the Representation Set ranked 0, identified by the EOS method as the best set, achieved an F-score of 0.656. While this is not the highest score produced by the EOS method, it still outperforms the standard approach. 
A possible explanation of the effectiveness of the EOS method could be that, to build the representation of the &lt;/s&gt; token, the Encoder of the model uses particularly informative words that we can leverage as label representations. The role of the EOS token, like that of other special tokens used for modeling purposes, such as the [CLS] token in BERT-like models, is to serve as input for the final Language Modeling classifier. That pushes the model, during the pre-training phase, to learn to construct a representation of such a token that summarizes all the relevant information in the sentence needed to complete the language modeling task <ref type="bibr" target="#b20">[21]</ref>. So, by taking the token with the highest Value-Zeroing score for constructing &lt;/s&gt;, we find tokens that are usually very contextually informative for the language modeling task and contain a lot of useful information. It is likely, then, that this information is also useful during the fine-tuning phase, to construct the lexical connection between input sentences and output classes. Moreover, when using the other two techniques, we focus on injected tokens that are often appended without sufficient context to justify their presence at the end of the post. Appending tokens to the end of the post may change the semantics of the sequence too much. The first method, which simply appends the label to the end of the post, often creates scenarios where the word appears to be out of place. The same applies to the second method, but thanks to the prompt, this effect is less noticeable. This effect may also be the reason why the first method is the worst-performing one, while the Appended Label with Prompt method achieves F-scores almost as high as the EOS one but without showing any useful correlation between the chosen representations and the model F-score, thus not being usable as a label representation selection method. 
Figure <ref type="figure" target="#fig_3">4</ref> shows the variation in F-scores obtained for each class. As we can observe, there is quite a low degree of variance between the classes, in contrast with the results obtained in <ref type="bibr" target="#b8">[9]</ref>, which used the same dataset and task but represented the classes using 10 human-selected representations and 90 randomly selected ones. This is especially pronounced for the lower-frequency categories, where high F-scores also correspond to lower variance. For instance, in contrast to the results reported by <ref type="bibr" target="#b8">[9]</ref>, where the MEDICINE-AESTHETICS class exhibited numerous outliers with F-scores dropping to as low as 0, our selection method does not encounter such extreme variations. Even when accounting for outliers, the performance of the class remains relatively stable, with F-scores that are acceptable even in the worst-case scenario. A similar trend is observed for the ENTERTAINMENT class.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Qualitative Representations Analysis</head><p>To gain a deeper understanding of the effectiveness of the approach, we performed a more qualitative analysis to determine which labels have the most substantial impact on the improvement or decline of classification accuracy. Table <ref type="table" target="#tab_3">3</ref> reports the representations for each class obtained with the best- and worst-performing models. As we can see, and in line with previous work <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b8">9]</ref>, there seem to be no clear patterns that could justify why certain words work better than others. Focusing on the best-performing set, the only two words that are somehow related to their class seem to be schedina for SPORTS, referring to the betting ticket used to bet on sports games, and the proper noun ilaria for CELEBRITIES. For the worst-performing set, the only representation that fits its corresponding domain (ENTERTAINMENT) is again a proper noun: dragonette, the name of a Canadian band. Another interesting case is piaciuto,provato, which was treated as a single word by the IT5 tokenizer given the missing space after the comma. While our aggregation method for tokens selected multiple times also rewarded frequency, among the best-performing representations only four had been chosen as the most salient token in the text multiple times: schedina (3 times), troverai (2 times), premuto (2 times), gippi (2 times). This could mean that we should re-evaluate how important frequency is, and perhaps change the aggregation method to one that does not reward frequency as much.</p><p>To better understand the role of the representations' frequencies in the training set, we computed both the raw frequency of each representation in the whole dataset and the TF-IDF of the representations. 
We then calculated the Spearman rank correlation of the frequencies, the TF-IDFs, and the number of subtokens of the representations against the obtained F-scores and the Representation Set rank each representation belongs to. The TF-IDF was calculated by considering all the documents of a single category as a single document, with document length calculated as the total number of tokens (document lengths are reported in Appendix A). As we can observe (Table <ref type="table" target="#tab_4">4</ref>), representation frequency does not correlate with the obtained F-score for any class. This, again, confirms that the absolute frequency of a given term in the training set doesn't seem to have any positive or negative effect on the ability of the model to use such a representation for its classes. However, owing to the aggregation method that rewards frequency mentioned in Sec. 2, we can see that for some classes the more frequent a word is, the more likely it is to be placed at a better rank (Rank × Frequency column in Table <ref type="table" target="#tab_4">4</ref>). In particular, for two classes (SPORTS and AUTO-MOTO) more frequent representations had a higher chance of being placed in the best ranks. This could mean that the most informative words are frequently the same in these particular categories. Focusing on the TF-IDF correlations, we notice two negative statistically significant correlations with the F-score: SPORTS and ANIME. This is probably due to the fact that in-domain words, which don't appear as often in the other categories, had a slightly positive impact on performance. Moreover, we noticed that these two categories are also those for which our model utilized several domain-specific words. 
In fact, the first ten ranked representations for the two categories are mostly domain-specific: • for SPORTS: campionato, gol, pareggio, centrocampo, milan, juventus, atalanta, tifosi, trequartista and derby; • for ANIME: streaming/download, graffio, manga, pokémon, pokemon, ko, morso, pokèmon, cmq, drago <ref type="foot" target="#foot_4">5</ref> .</p><p>Again, the correlation between the TF-IDF and the representations' rank was to be expected, given the method we used to aggregate the representations. We also empirically noticed that the method we used to extract the representations tended to choose domain-specific, low-frequency words. This is why it often chooses typos and similar error-laden words. This is probably because low-frequency words are usually full words carrying much more semantic value and, being domain-specific, they carry high contextual information, useful for constructing the other tokens' representations. This could explain why TF-IDF, a metric specifically built to find such words, correlates so highly and significantly with the extracted words' ranks.</p><p>Since Transformer models tokenize text by splitting it into subwords, we also tried to understand whether there is any correlation between either the F-score or the representation rank and the number of subwords of the representations. From our results, we can see that subword length doesn't seem to affect the model's performance, nor does our selection technique seem to prefer words that are split into more or fewer subwords. 
The only two exceptions are AUTO-MOTO, where a higher number of subwords leads to a decrease in performances, and SPORTS, where our model seemed to place words with a higher subword number in lower places in the ranking system.</p><p>Finally, we investigated the impact of the Part-of-Speech (PoS) associated with the representations, both globally (See Figure <ref type="figure" target="#fig_4">5</ref>) and on a per-class basis (Class-based distribution are reported in Appendix B). The PoS are extracted from an Italian Word Form Vocabulary developed by the Institute for Computational Linguistics (ILC) of the National Research Council of Italy (CNR), which contains all the word forms and their possible POs from the Italian language. As we can see from Figure <ref type="figure" target="#fig_4">5</ref>, the most frequent PoS are UNKNOWN, VERB, NOUN, and ADJ. The class UNKNOWN contains the words that are not found in the Word Form Vocabulary, and these usually consist of typos, English words, proper nouns, etc. and are going to be seen more in detail for each category. The categories with the highest number of UNKNOWNs are ENTERTAINMENT, CELEBERITIES, ANIME, and SPORTS:</p><p>• in ENTERTAINMENT, the majority of the UNKNOWNs are typos (e.g. cioe instead of cioè), abbreviations (e.g. nnt instead of niente), words with an increased vocal length in the last character (e.g. iniziaaaaaa instead of inizia) or english words (e.g. wish); • in CELEBRITIES, the majority are proper nouns (e.g. alessia, mirco, federica, etc.) and typos; • in ANIME, the majority are proper nouns of video games or tv shows characters (e.g. pokémon or charmender) or Japanese words (e.g. manga); • in SPORTS, the majority are proper nouns of soccer teams or players (e.g. milan, juventus, higuain, etc.) or match names composed by multiple teams or nation names (e.g. 
italia-uruguay or brasile-olanda) that our system did not split since they contain no spaces.</p><p>We also noted that for BIKES, NATURE, and AUTO-MOTO more VERBs are chosen than NOUNs, while for METAL-DETECTING, SPORTS, and SMOKE the opposite holds. That said, all the Parts-of-Speech seem reasonably distributed, and no particular one appears to be preferred by the method when choosing representations from the training set.</p></div>
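The rank correlations discussed above (as in Table 4) can be sketched with a minimal pure-Python Spearman correlation between the representations' TF-IDF values and their ranks; the data below are invented for illustration and are not taken from the paper:

```python
# Hypothetical sketch: correlating the TF-IDF of chosen representations
# with their Value-Zeroing rank, as in the analysis above.

def spearman(xs, ys):
    """Spearman rank correlation (no tie handling, for illustration only)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Illustrative toy data: higher TF-IDF comes with a better (lower) rank,
# giving a negative correlation as in the SPORTS and ANIME rows of Table 4.
tfidf = [0.91, 0.85, 0.60, 0.42, 0.30]
rank = [1, 2, 3, 4, 5]
print(round(spearman(tfidf, rank), 2))  # perfectly monotone toy data → -1.0
```
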
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>In this work, we presented a novel technique for reliably choosing label representations in text-to-text classification scenarios. This technique, based on an attention attribution method called Value Zeroing, provides a set of labels used to represent the class names for a text-to-text model. We tested the approach on a Topic Classification task using IT5, an Italian pre-trained T5 model, by training 100 different models with 100 sets of representations chosen this way. We found that selecting representations with Value Zeroing and ranking them by their scores leads to a useful correlation with the trained models' scores. Moreover, we noticed that choosing representations this way leads to better average performance and lower variance, compared with both human- and randomly chosen representations <ref type="bibr" target="#b8">[9]</ref>. Compared to the standard approach of using the class names directly as their representation (in this case, also translating them to Italian), our method performed better, and even the worst-performing Representation set matched the standard approach.</p><p>We also conducted an in-depth analysis to understand whether either the performance of the models or our rankings were related to some simple statistics (frequency, TF-IDF, and the number of subtokens of the representations). Results showed some statistically significant correlations, especially for the TF-IDFs of the representations. We also found no clear trend among the Parts-of-Speech of the representations chosen this way. 
While NOUNs and VERBs were the most frequent, no PoS stood out, and some distributions suggest that the chosen representations are usually low-frequency in-domain words for their class.</p><p>In conclusion, our findings confirm that the choice of label representations is not trivial and has an important impact on text-to-text classification performance, and our technique appears to offer a good solution to the label representation selection task.</p><p>Future research should focus on applying this technique to different kinds of tasks, primarily those where lexical and semantic clues from the text are essential to solving the task. Other aggregation methods should also be tested, reducing the impact of selection frequency, which turned out not to be an important factor in the fine-tuned models' performance.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Overview of the proposed approach. First, we extract the most salient token from each training instance using Value Zeroing. Then, according to their scores, we use the best ones as label representations.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure2: Scatter-plots with regression lines where each point is a model. On the y-axis we have the weighted F-Score on the test set and on the x-axis we have the rank of the Representation Set used to train it. On the top-left we have the Label Appended method, then on the top-right the Appended Label with Prompt method and on the bottom the EOS method.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Scatter plot for the first 100 Representation Set extracted with the EOS method from the training set where the low-frequency TECHNOLOGY class was removed.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Boxplot for each Category to show the variations in F-Score obtained using the various Representation Sets.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Distribution of the Parts-of-Speech across all the representations extracted using the EOS method.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Dataset statistics.</figDesc><table><row><cell>Categories</cell><cell cols="3"># Data # Training # Test</cell></row><row><cell>Anime</cell><cell>3,972</cell><cell>2,894</cell><cell>1,078</cell></row><row><cell>Auto-Moto</cell><cell>3,783</cell><cell>2,798</cell><cell>985</cell></row><row><cell>Bikes</cell><cell>520</cell><cell>365</cell><cell>155</cell></row><row><cell>Celebrities</cell><cell>1,115</cell><cell>754</cell><cell>361</cell></row><row><cell>Entertainment</cell><cell>469</cell><cell>354</cell><cell>115</cell></row><row><cell>Medicine-Aesthetics</cell><cell>447</cell><cell>310</cell><cell>137</cell></row><row><cell>Metal-Detecting</cell><cell>1,382</cell><cell>1,034</cell><cell>348</cell></row><row><cell>Nature</cell><cell>516</cell><cell>394</cell><cell>122</cell></row><row><cell>Smoke</cell><cell>1,478</cell><cell>1,101</cell><cell>377</cell></row><row><cell>Sports</cell><cell>4,790</cell><cell>3,498</cell><cell>1,292</cell></row><row><cell>Technology</cell><cell>136</cell><cell>51</cell><cell>85</cell></row><row><cell>All</cell><cell>18,608</cell><cell>13,553</cell><cell>5,055</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>F-scores for different representation sets.</figDesc><table><row><cell>Representation Set</cell><cell>F-Score</cell></row><row><cell>0 (First Set)</cell><cell>0.66</cell></row><row><cell>20 (Best Performing Set)</cell><cell>0.68</cell></row><row><cell>95 (Worst Performing Set)</cell><cell>0.63</cell></row><row><cell>Baseline (Trained with original class names)</cell><cell>0.63</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Table showing the representations for the best performing and worst performing set in the first 100 Representation Sets extracted from the training set where the posts of the TECHNOLOGY class were removed.</figDesc><table><row><cell></cell><cell>Best Performing</cell><cell>Worst Performing</cell></row><row><cell></cell><cell>Representation Set</cell><cell>Representation Set</cell></row><row><cell></cell><cell>(Rank 20) F-Score: 0.68</cell><cell>(Rank 95) F-Score: 0.63</cell></row><row><cell>BIKES</cell><cell>risolvo</cell><cell>temperatura</cell></row><row><cell>SPORTS</cell><cell>schedina</cell><cell>decidesse</cell></row><row><cell>ANIME</cell><cell>troverai</cell><cell>principiante</cell></row><row><cell>AUTO-MOTO</cell><cell>premuto</cell><cell>abbassato</cell></row><row><cell>NATURE</cell><cell>gippi</cell><cell>causarne</cell></row><row><cell>METAL-DETECTING</cell><cell>cosa</cell><cell>pistolina</cell></row><row><cell>MEDICINE-AESTHETICS</cell><cell>capelluto</cell><cell>soffermarmi</cell></row><row><cell>CELEBRITIES</cell><cell>ilaria</cell><cell>scherzo</cell></row><row><cell>SMOKE</cell><cell>eccola</cell><cell>piaciuto,proviamo</cell></row><row><cell>ENTERTAINMENT</cell><cell>origini</cell><cell>dragonette</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4</head><label>4</label><figDesc></figDesc><table><row><cell>Class</cell><cell>F-score x Frequency</cell><cell>Rank x Frequency</cell><cell>F-Score x TF-IDF</cell><cell>Rank x TF-IDF</cell><cell>F-Score x Subtoken Length</cell><cell>Rank x Subtoken Length</cell></row><row><cell>BIKES</cell><cell>-0.080</cell><cell>0.080</cell><cell>-0.118</cell><cell>0.007</cell><cell>-0.031</cell><cell>0.029</cell></row><row><cell>SPORTS</cell><cell>0.138</cell><cell>-0.307*</cell><cell>0.213*</cell><cell>-0.499*</cell><cell>-0.166</cell><cell>0.303*</cell></row><row><cell>ANIME</cell><cell>0.125</cell><cell>-0.101</cell><cell>0.283*</cell><cell>-0.408*</cell><cell>0.059</cell><cell>-0.071</cell></row><row><cell>AUTO-MOTO</cell><cell>0.147</cell><cell>-0.346*</cell><cell>0.081</cell><cell>-0.290*</cell><cell>-0.408*</cell><cell>0.126</cell></row><row><cell>NATURE</cell><cell>0.049</cell><cell>-0.026</cell><cell>0.148</cell><cell>-0.159</cell><cell>0.113</cell><cell>0.132</cell></row><row><cell>METAL-DETECTING</cell><cell>-0.053</cell><cell>-0.128</cell><cell>0.146</cell><cell>-0.204*</cell><cell>-0.130</cell><cell>0.181</cell></row><row><cell>MEDICINE-AESTHETICS</cell><cell>0.152</cell><cell>-0.149</cell><cell>0.025</cell><cell>-0.067</cell><cell>-0.116</cell><cell>0.074</cell></row><row><cell>CELEBRITIES</cell><cell>-0.182</cell><cell>-0.014</cell><cell>0.031</cell><cell>-0.329*</cell><cell>-0.071</cell><cell>0.210*</cell></row><row><cell>SMOKE</cell><cell>0.030</cell><cell>-0.034</cell><cell>0.080</cell><cell>-0.388*</cell><cell>-0.090</cell><cell>0.127</cell></row><row><cell>ENTERTAINMENT</cell><cell>-0.099</cell><cell>-0.005</cell><cell>-0.103</cell><cell>0.004</cell><cell>0.116</cell><cell>0.048</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">List of translated labels: anime, automobilismo, bicicletta, sport, natura, metal detector, medicina, celebrità, fumo, intrattenimento and tecnologia.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Since Transformers' tokenizers split tokens into multiple subtokens, to obtain the full word we reconstruct it by reconnecting all the subtokens belonging to the word that contains the token with the highest Value-Zeroing score. The Value-Zeroing score we consider for the full word is that of the selected token. We decided not to aggregate the scores over the full word in any way, because that could reward or penalize multi-token words.</note>
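The reconstruction described in footnote 2 can be sketched as follows, assuming SentencePiece-style subtokens where word-initial pieces carry a "▁" prefix (as in T5 tokenizers); the function name and example data are illustrative, not from the paper's implementation:

```python
# Minimal sketch: given subtokens and per-token Value-Zeroing scores,
# recover the full word containing the top-scoring subtoken, keeping
# only that subtoken's score (no aggregation over the word).

def top_word(tokens, scores):
    best = max(range(len(tokens)), key=lambda i: scores[i])
    # walk left to the start of the word the best token belongs to
    start = best
    while start > 0 and not tokens[start].startswith("▁"):
        start -= 1
    # walk right until the next word begins
    end = best + 1
    while end < len(tokens) and not tokens[end].startswith("▁"):
        end += 1
    word = "".join(t.lstrip("▁") for t in tokens[start:end])
    return word, scores[best]  # score of the selected subtoken only

tokens = ["▁cam", "pion", "ato", "▁vinto"]
scores = [0.1, 0.7, 0.2, 0.3]
print(top_word(tokens, scores))  # ('campionato', 0.7)
```
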
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">The modified class is T5ForConditionalGeneration available in https://github.com/huggingface/transformers/blob/main/src/ transformers/models/t5/modeling_t5.py. To do so, we adapted the original Value Zeroing implementation for the BERT transformer modelling class: https://github.com/hmohebbi/ValueZeroing.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://huggingface.co/gsarti/it5-base.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">morso, graffio and ko are all domain-specific words in the settings of the popular anime, cartoon and video-game Pokémon, with the first two being moves and the latter being a specific status.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work has been supported by: FAIR -Future AI Research (PE00000013) project under the NRRP MUR program funded by the NextGenerationEU. TEAMING-UP -Teaming up with Social Artificial Agents project under the PRIN grant no. 20177FX2A7 funded by the Italian Ministry of University and Research.</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>F. Dell'Orletta) https://michelepapucci.github.io/ (M. Papucci); https://alemiaschi.github.io/ (A. Miaschi); http://www.italianlp.it/people/felice-dellorletta/ (F. Dell'Orletta) 0000-0003-4251-7254 (M. Papucci); 0000-0002-0736-5411 (A. Miaschi); 0000-0003-3454-9387 (F. Dell'Orletta</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0" />			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Preface to the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI)</title>
		<author>
			<persName><forename type="first">G</forename><surname>Bonetta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Hromei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Siciliani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Stranisci</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with 23th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)</title>
				<meeting>the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with 23th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Ext5: Towards extreme multi-task scaling for transfer learning</title>
		<author>
			<persName><forename type="first">V</forename><surname>Aribandi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">S</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">V</forename><surname>Mehta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">Q</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bahri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Exploring the limits of transfer learning with a unified text-to-text transformer</title>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Matena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Mach. Learn. Res</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page" from="1" to="67" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Multitask prompted training enables zero-shot task generalization</title>
		<author>
			<persName><forename type="first">V</forename><surname>Sanh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Webson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sutawika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Alyafeai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chaffin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Stiegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">Le</forename><surname>Scao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Raja</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Tenth International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">W</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Longpre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fedus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Brahma</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2210.11416</idno>
		<title level="m">Scaling instruction-finetuned language models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Label representations in modeling classification as text generation</title>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop</title>
				<meeting>the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="160" to="164" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Evaluating text-to-text framework for topic and style classification of italian texts</title>
		<author>
			<persName><forename type="first">M</forename><surname>Papucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>De Nigris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Miaschi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dell'orletta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI 2022) co-located with 21th International Conference of the Italian Association for Artificial Intelligence (AI* IA 2022)</title>
				<meeting>the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI 2022) co-located with 21th International Conference of the Italian Association for Artificial Intelligence (AI* IA 2022)</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">IT5: Text-to-text pretraining for Italian language understanding and generation</title>
		<author>
			<persName><forename type="first">G</forename><surname>Sarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.lrec-main.823" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M.-Y</forename><surname>Kan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Hoste</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sakti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Xue</surname></persName>
		</editor>
		<meeting>the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)<address><addrLine>Torino, Italia</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA and ICCL</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="9422" to="9433" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Lost in labels: An ongoing quest to optimize text-to-text label selection for classification</title>
		<author>
			<persName><forename type="first">M</forename><surname>Papucci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Miaschi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dell'orletta</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-3596/paper39.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 9th Italian Conference on Computational Linguistics</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">F</forename><surname>Boschetti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Lebani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Magnini</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Novielli</surname></persName>
		</editor>
		<meeting>the 9th Italian Conference on Computational Linguistics<address><addrLine>Venice, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023-12-02">November 30 -December 2, 2023. 2023</date>
			<biblScope unit="volume">3596</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">AttentionRank: Unsupervised keyphrase extraction using self and cross attentions</title>
		<author>
			<persName><forename type="first">H</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Luo</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.emnlp-main.146</idno>
		<ptr target="https://aclanthology.org/2021.emnlp-main.146" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and</title>
				<editor>
			<persName><forename type="first">M.-F</forename><surname>Moens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">W.-T</forename><surname>Yih</surname></persName>
		</editor>
		<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and<address><addrLine>Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1919" to="1928" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">SAMRank: Unsupervised keyphrase extraction using self-attention map in BERT and GPT-2</title>
		<author>
			<persName><forename type="first">B</forename><surname>Kang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shin</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.emnlp-main.630</idno>
		<ptr target="https://aclanthology.org/2023.emnlp-main.630" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="10188" to="10201" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">VeeAlign: Multifaceted context representation using dual attention for ontology alignment</title>
		<author>
			<persName><forename type="first">V</forename><surname>Iyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kumar</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.emnlp-main.842</idno>
		<ptr target="https://aclanthology.org/2021.emnlp-main.842" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and</title>
				<editor>
			<persName><forename type="first">M.-F</forename><surname>Moens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">W.-T</forename><surname>Yih</surname></persName>
		</editor>
		<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and<address><addrLine>Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="10780" to="10792" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">FAA: Fine-grained attention alignment for cascade document ranking</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Geng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jiang</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.acl-long.94</idno>
		<ptr target="https://aclanthology.org/2023.acl-long.94" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<meeting>the 61st Annual Meeting of the Association for Computational Linguistics<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1688" to="1700" />
		</imprint>
	</monogr>
	<note>: Long Papers), Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Improving word mover&apos;s distance by leveraging selfattention matrix</title>
		<author>
			<persName><forename type="first">H</forename><surname>Yamagiwa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yokoi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Shimodaira</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.findings-emnlp.746</idno>
		<ptr target="https://aclanthology.org/2023.findings-emnlp.746" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2023</title>
		<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<meeting><address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="11160" to="11183" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Cimino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dell'Orletta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nissim</surname></persName>
		</author>
		<title level="m">TAG-it: Topic, age and gender prediction</title>
				<imprint>
			<publisher>EVALITA</publisher>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Quantifying context mixing in transformers</title>
		<author>
			<persName><forename type="first">H</forename><surname>Mohebbi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zuidema</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chrupała</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Alishahi</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.eacl-main.245</idno>
		<ptr target="https://aclanthology.org/2023.eacl-main.245" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics</title>
		<editor>
			<persName><forename type="first">A</forename><surname>Vlachos</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Augenstein</surname></persName>
		</editor>
		<meeting>the 17th Conference of the European Chapter of the Association for Computational Linguistics<address><addrLine>Dubrovnik, Croatia</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="3378" to="3400" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">A survey on the robustness of feature importance and counterfactual explanations</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dutta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Long</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Magazzeni</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2111.00358</idno>
		<ptr target="https://arxiv.org/abs/2111.00358" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian</title>
		<author>
			<persName><forename type="first">V</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Di Maro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Croce</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Passaro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop</title>
				<meeting><address><addrLine>EVALITA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">2765</biblScope>
		</imprint>
	</monogr>
	<note>CEUR-ws</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Quanti anni hai? Age identification for Italian</title>
		<author>
			<persName><forename type="first">A</forename><surname>Maslennikova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Labruna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cimino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dell'orletta</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLiC-it</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">mT5: A massively multilingual pre-trained text-to-text transformer</title>
		<author>
			<persName><forename type="first">L</forename><surname>Xue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Constant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Al-Rfou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Siddhant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.naacl-main.41</idno>
		<ptr target="https://aclanthology.org/2021.naacl-main.41" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<meeting>the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="483" to="498" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">What does BERT look at? an analysis of BERT&apos;s attention</title>
		<author>
			<persName><forename type="first">K</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</title>
		<editor>
			<persName><forename type="first">T</forename><surname>Linzen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Chrupała</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Belinkov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Hupkes</surname></persName>
		</editor>
		<meeting>the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="276" to="286" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
