<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SimilEx: the First Italian Dataset for Sentence Similarity with Natural Language Explanations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chiara Alzetta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chiara Fazzone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Venturi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ItaliaNLP Lab, CNR, Istituto di Linguistica Computazionale 'A. Zampolli'</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Large language models (LLMs) demonstrate great performance in natural language processing and understanding tasks. However, much work remains to enhance their interpretability. Annotated datasets with explanations could be key to addressing this issue, as they enable the development of models that provide human-like explanations for their decisions. In this paper, we introduce the SimilEx dataset, the first Italian dataset reporting human judgments of semantic similarity between pairs of sentences. For a subset of these pairs, the annotators also provided explanations in natural language for the scores assigned. The SimilEx dataset is valuable for exploring the variability in similarity perception between sentences and among human explanations of similarity judgments.</p>
      </abstract>
      <kwd-group>
        <kwd>Sentence similarity</kwd>
        <kwd>Italian dataset</kwd>
        <kwd>human judgements</kwd>
        <kwd>explanations</kwd>
        <kwd>annotation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>Large language models (LLMs) display impressive linguistic skills and demonstrate outstanding performances on a variety of tasks concerning natural language processing and understanding. This is particularly true for the most recent and ground-breaking models such as GPT-3.5/4 [<xref ref-type="bibr" rid="ref1">1</xref>], LLama-2 [<xref ref-type="bibr" rid="ref2">2</xref>] and Gemini [<xref ref-type="bibr" rid="ref3">3</xref>]. LLMs, however, also present risky limitations such as lack of factuality [<xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>], poor interpretability [<xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>] and hallucinations [<xref ref-type="bibr" rid="ref8">8</xref>]. Consequently, it has become important to verify whether these models are explainable, and specifically whether they can provide human-like explanations, in natural language, for the decisions they make [<xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>]. The ability of LLMs to explain the reasoning needed to solve a given task is fundamental, particularly for tasks where there is no established or shared evaluation protocol or benchmark.</p>
      <p>Annotated datasets with explanations are key to addressing this issue, as they enable the development of models that provide human-like explanations for their decisions. Therefore, multiple datasets have been created with free-form explanations to be incorporated into the model training process and used as benchmarks at test time, mostly focusing on English [<xref ref-type="bibr" rid="ref10">10</xref>]. Some examples are the e-SNLI dataset [<xref ref-type="bibr" rid="ref11">11</xref>], a version of the Stanford Natural Language Inference (SNLI) dataset [<xref ref-type="bibr" rid="ref12">12</xref>] enriched with human-annotated explanations, and the Common Sense Explanations (CoS-E) [<xref ref-type="bibr" rid="ref13">13</xref>] and Semi-Structured Explanations for COPA (COPA-SSE) [<xref ref-type="bibr" rid="ref14">14</xref>] datasets, which include natural language explanations for commonsense reasoning. To the best of our knowledge, the only existing dataset enriched with explanations for Italian is ‘e-RTE-3-it’ [<xref ref-type="bibr" rid="ref15">15</xref>], an Italian version of the RTE-3 dataset for textual entailment.</p>
      <p>In this paper, we introduce the SimilEx dataset: as far as we are aware, the first Italian dataset of 2,112 pairs of sentences manually annotated for semantic similarity. About half of the pairs are further enriched with free-form human-written explanations that justify the similarity score. The identification of textual similarity is a natural language understanding (NLU) task that involves determining the degree of semantic equivalence between two texts [<xref ref-type="bibr" rid="ref16">16, 17</xref>]. It is a foundational NLU problem relevant to many applications such as summarisation, question answering and conversational systems [18]. Despite its relevance, this task is highly challenging even for humans due to its subjective nature: human annotators often widely disagree on similarity scores [19], suggesting that the cues driving sentence similarity are neither well codified nor transparent and that their perceived relevance may vary among annotators. Possibly due to these challenges, and as far as we know, datasets including human explanations for the sentence similarity task are lacking. However, such datasets are invaluable, as they force annotators to reason about their choices and to identify the traits that most influence their annotations.</p>
      <p>Contributions. In this paper, we i) introduce SimilEx, the first Italian dataset featuring human annotations and explanations of sentence semantic similarity; ii) provide an extensive study of the degree of subjectivity in the perception of sentence semantic similarity; and iii) investigate the relationship between the stylistic variation of the paired sentences and the human ratings and natural language explanations of sentence semantic similarity.</p>
      <sec id="sec-1-1">
        <title>2. The SimilEx Dataset</title>
        <sec id="sec-1-1-1">
          <title>2.1. Data Collection</title>
          <p>The sentence pairs of SimilEx are acquired from a collection of novels from the late XIX century translated into Italian. We used Sentence-BERT (SBERT) [20] to combine pairs of sentences to present to annotators. SBERT is a modification of BERT [21] adapted to produce sentence embeddings that can be easily compared using cosine similarity, which ranges from 0 (no similarity) to 1 (identical sentences). We included in the SimilEx dataset only pairs obtaining a similarity score ≥ 0.65, for a total of 2,112 sentence pairs.</p>
          <p>The textual genre of the sentences (i.e., novels) introduces specific stylistic properties that cause potential differences from standard Italian. We assessed the linguistic style of SimilEx sentences using Profiling-UD [22], a web-based tool that captures multiple aspects of sentence structure (the complete set of linguistic characteristics used for the stylistic analysis can be found in Appendix B). The tool extracts around 130 properties representative of the underlying linguistic structure of a sentence, derived from raw, morphosyntactic, and syntactic levels of sentence annotation, all based on the Universal Dependencies (UD) formalism [23]. These properties have been shown to be highly predictive when used as features by learning models in various classification tasks, such as Automatic Readability and Linguistic Complexity Assessment or Native Language Identification. Among these characteristics, the average length computed on SimilEx sentences is 30.18 tokens (± 22.36), above the average length of standard Italian sentences, typically around 20 tokens. Interestingly, within pairs, the average length difference is 17.02 tokens (± 19.55). This value, combined with such a high standard deviation, suggests a large variability of style within the pairs. This notable variability extends, e.g., to the distribution of subordinate clauses and lexical overlap. Within pairs, the average difference in the number of subordinate clauses is 2.25 (± 1.81), and the overlap of content words is 12.60%, which are significant given that this variation occurs within individual sentence pairs. Having pairs with such stylistic differences provides an opportunity to investigate the impact of stylistic variation on the perception of similarity.</p>
        </sec>
        <sec id="sec-1-1-2">
          <title>2.2. Human Similarity Annotation</title>
          <p>Sentence pairs of SimilEx were annotated through the online crowdsourcing platform Prolific (https://www.prolific.com/). Annotators were recruited among native Italian speakers and presented with a questionnaire of 30 pairs plus 2 control pairs.</p>
          <p>Annotation Guidelines. The task consisted of scoring each sentence pair of the questionnaire for the perceived sentence similarity using a 5-point Likert scale, where 1 is described as “Completamente diverse” (Completely different) and 5 as “Pressoché identiche” (Almost identical). No formal definition of similarity is provided; only a few examples of highly similar and highly different pairs, along with motivations for the extreme similarity scores, are shown in the annotation instructions provided to the annotators, fully reported in Appendix C. This represents the main novelty of our approach compared to the methodology used to create datasets for Semantic Textual Similarity tasks, typically organized within the SemEval evaluation campaign (see among the others [24, 18]). These datasets are usually built with clear and specific instructions for annotators, who are explicitly asked to evaluate whether paired text portions refer to the same person, action, or event, or to focus their judgment on similarity types such as the same author, time period, or location. Some examples of annotation with similarity scores averaged across annotators are shown in Table 1.</p>
          <p>Demographics. Participants could share information about their age, gender and occupation and complete multiple questionnaires. Eventually, 317 distinct participants took part in the study. After a preliminary analysis, we excluded 34 annotators deemed unreliable because they either took too little time to complete the questionnaire, assigned systematically divergent scores compared to the rest of the participants, failed the control questions or submitted blank answers. The resulting dataset includes 2,112 sentence pairs annotated by the remaining 283 annotators, who took 18 minutes on average to complete a questionnaire (the compensation is fair according to the platform: 6.30£/hour). Each pair received a minimum of 5 and a maximum of 7 annotations from different participants. The set of annotators is quite balanced for gender (51% males) and the average age of annotators is 27.05 (± 6.56). Regarding occupation, 50% of participants indicated that they have a full- or part-time job, around 25% declared themselves unemployed, and the remaining 25% preferred not to disclose their occupational status.</p>
          <p>Table 1 (examples of annotated sentence pairs): “Sì, grazie a Dio non è male.” – “Io invece non ce l’ho: tante grazie!”; “Non hanno mandato a prendere il latte fresco?” – “E per me, chiedi almeno del latte.”; “Solo lo zar può far la grazia.” – “Voglio chiedere la grazia allo zar.”; “Accidenti a voi, mi fate perdere il filo!” – “Intanto voi però mi avete fatto perdere il filo.”</p>
        </sec>
        <sec id="sec-1-1-3">
          <title>2.3. Human Explanations of Similarity</title>
          <p>We recruited 2 native Italian speakers who volunteered to enrich the pairs of sentences with free-form explanations. These annotators are graduate students, one male and one female, aged 23 years. They were asked to score the similarity of a random subset of 907 sentence pairs on the same 5-point Likert scale as the other participants. Additionally, they were asked to provide a short explanation for their scores, in the form of a single concise sentence.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Human Similarity Perception</title>
      <p>fewer annotators in agreement, while 5 or more
annotators (up to 9) gave identical values in 20.08% of pairs.
Figure 2 displays the distribution of similarity scores within
these groups. Notably, when few annotators agree on
a pair, the scores are evenly distributed across the five
labels, indicating that disagreement can occur for pairs
seen as both similar and diferent. In contrast, when more
annotators agree, the most commonly assigned score is 1,
indicating that annotators converge more frequently on
dissimilarity judgments. This is supported by the
negative Pearson correlation between the number of agreeing
annotators and the average similarity score of the pair
( = − 0.344,  &lt; 0.001).</p>
      <p>Agreement, Style and Similarity. We explored the relationship between style and similarity judgments by comparing scores and stylistic traits of sentences. As a general remark, we found that style minimally affects pairs’ similarity: the Pearson correlation between the similarity scores and the distribution of stylistic properties is either non-significant (p &gt; 0.05) or extremely low (r &lt; 0.1). However, a more in-depth analysis of specific stylistic properties revealed a nuanced relationship between style and the consistency of human judgments. For example, contrary to our expectations, sentence length, a raw yet informative feature reflecting stylistic variation, did not impact the similarity scores assigned by annotators. In fact, when we computed the correlation between the length difference of paired sentences and the variance between similarity judgments, we observed a lack of correlation (0.05). To further investigate, we grouped pairs based on the difference between the length of their sentences, and specifically, based on whether their length difference was above or below the average value of 17 tokens. We noticed that also from this perspective of analysis sentence length did not affect the IAA of the scores, which is 0.265 for both groups. However, when focusing on different stylistic traits more closely related to sentence structure, we observed a substantial relationship with higher annotator agreement. For instance, the IAA is moderate (0.49) for pairs where neither sentence contains a subordinate clause, but drops to fair (0.25) when both sentences contain at least one subordinate clause. Similarly, the IAA is higher (0.37) when the syntactic tree depth difference between paired sentences is below the average value of 1.98, compared to 0.29 when the difference is greater. These results are extremely interesting as they indicate that while stylistic traits may not directly influence the semantic similarity between sentences, some of them play a role in the convergence of human judgments.</p>
      <sec id="sec-2-1">
        <title>4. Human Similarity Explanation</title>
        <p>In this section, we focus on the analysis of the subset of 907 sentence pairs of SimilEx annotated by the two students with both human similarity judgments and natural language explanations for the assigned scores.</p>
        <p>Comparison with Prolific annotators. The comparison between the similarity judgments of the graduate students and Prolific annotators reveals a strong alignment between the two groups. The Pearson correlation between the average similarity score of the Prolific annotators and the average score of the two graduate students is significantly high and positive (r = 0.779, p &lt; 0.001). This high correlation is also observed when computed separately for each of the two students, indicating that their perceptions of similarity closely match the judgements obtained from the crowdsourcing campaign. Additionally, the IAA between the two students is 0.49, suggesting an alignment higher than that reported among the Prolific annotators.</p>
        <p>Linguistic Style of Explanations. We explored the style of the explanations relying on the linguistic profiling method described in Section 2.1. We noted that the explanations written by the two students exhibit partial similarity, as can be seen by inspecting the results of the stylistic analysis distributed as supplementary materials (see Appendix A). For example, they both tend to write quite short sentences, i.e. on average 6.35 (± 3.93) and 7.67 (± 5.12) tokens long, characterized by a nominal style. This is evidenced by the low percentage distribution of verbal roots (i.e. sentences with a verb as the syntactic root), computed over the total number of roots represented by other morpho-syntactic categories (i.e. 58.21% (± 49.35) and 61.43% (± 48.70)). This percentage is notably low when compared to the distribution in the ISDT [26], the largest Italian treebank, where the distribution is 85.73%.</p>
        <p>Content of Explanations. The content analysis of the explanations reveals that both students share some arguments when justifying the similarity scores for SimilEx sentence pairs. Specifically, the average cosine similarity between their explanations, computed using SBERT, is 0.46, indicating a moderate level of similarity. Given that a qualitative analysis reveals several recurring arguments and templates in the explanations, such as Entrambe descrivono (‘Both describe’) and In entrambe le frasi si parla di un argomento militare (‘In both sentences a military topic is mentioned’), we further explored the possibility of identifying homogeneous content among them. To this end, we clustered the 907 explanations of each student (1,814 in total) based on their SBERT vectors. We initially configured the clustering algorithm to partition the data into 10 clusters (we employed agglomerative clustering using Euclidean distance and Ward variance minimization as the clustering method). However, only 4 of these clusters were found to be semantically homogeneous. Specifically, these homogeneous clusters contain explanations where either Student 1 or 2: i) writes that the evaluated sentences contain positive or negative emotions such as love or anger, ii) uses the phrase Pressoché identiche (‘Almost identical’), iii) uses the phrase Completamente diverse (‘Completely different’), and iv) notes that the evaluated sentences refer to a military topic. Since the explanations in the remaining 6 clusters were not semantically homogeneous, we reconfigured the clustering algorithm to partition the data into 5 clusters. This time, we included only the explanations that had not been previously clustered, representing 72.76% of all SimilEx explanations. However, we were still unable to isolate explanations with similar content. This suggests that the two students often focused on different aspects when evaluating sentence similarity. As proof, consider the examples reported in Table 2, where the students focused on diverse aspects of the paired sentences while they assigned either similar (see #5 and #6) or different (see #1 and #2) similarity scores. While this may result in underspecification and inconsistency in the collected explanations, it confirms the inherent subjectivity and expressivity involved in providing free-text natural language explanations for a highly subjective task such as evaluating semantic sentence similarity [<xref ref-type="bibr" rid="ref10">10</xref>].</p>
        <p>The content analyses above were enriched with an in-depth investigation into whether there is a correlation between the SBERT cosine similarity of the explanations of each student and their similarity judgments. The Pearson correlation between SBERT scores and the absolute difference in the students’ similarity judgments reveals a moderate negative relationship (r = −0.459, p &lt; 0.001). This indicates that the more semantically similar the explanations are, the smaller the difference in the students’ similarity judgments. Notably, students’ explanations tend to be more similar when the similarity scores assigned by both of them are lower (i.e. 1 or 2), as in example #3 of Table 2.</p>
        <p>Table 2 (sentence pairs with the explanations and similarity scores assigned by the two students, S1 and S2):</p>
        <p>(1) Sentence 1: “Vedeva lo scintillio degli occhi, tremulo e avvampante, e il riso di felicità e di eccitamento che senza volere le increspava le labbra; vedeva la grazia misurata, la sicurezza e la levità dei movimenti.” Sentence 2: “Era così bella, che non solo non appariva in lei ombra di civetteria, ma pareva al contrario che le rimordesse il forte ed immancabile effetto di una grazia trionfatrice, che avrebbe voluto temperare, se le fosse stato possibile.” S1: Completamente diverse. (1) S2: Parlano di donne che sono molto graziose. (4)</p>
        <p>(2) Sentence 1: “Ma che volete farci: questa è la vocazione dell’autore, ormai malato della propria imperfezione, e il suo talento è fatto apposta per rappresentare la povertà della nostra vita, scovando la gente in buchi sperduti, in angoletti remoti dell’impero!” Sentence 2: “Perché mettere in mostra la povertà della nostra vita e la nostra triste imperfezione, andando a scovare gli uomini in buchi sperduti, in angoletti remoti dell’impero?” S1: Completamente diverse anche se esprimono lo stesso concetto. (1) S2: Stessa frase impostata diversamente a livello sintattico. (4)</p>
        <p>(3) Sentence 1: “L’agente di polizia che l’accompagnava, discese e scosse il braccio intormentito; poi si tolse il berretto e si fece il segno della croce.” Sentence 2: “Nell’osteria entrò un agente di polizia.” S1: In entrambe le frasi si parla di un agente della polizia. (2) S2: Il soggetto è un agente di polizia. (2)</p>
        <p>(4) Sentence 1: “Napoleone si volse ad Alessandro, come per dire che quanto ora faceva era fatto per l’augusto e caro alleato.” Sentence 2: “Tutti gli alleati di Napoleone gli divennero nemici.” S1: In entrambe le frasi si parla di Napoleone e dei suoi alleati. (2) S2: Parlano degli alleati di Napoleone. (3)</p>
        <p>(5) Sentence 1: “Ma l’amore con un marito inquinato dalla gelosia e da ogni sorta di difetti non era più per lei.” Sentence 2: “Era forse, semplicemente, un sentimento di gelosia: egli era talmente avvezzo all’amore di lei, che non poteva ammettere che ella potesse amarne un altro.” S1: Nel primo caso il focus della frase è la moglie, nella seconda lo è il marito. (2) S2: Parlano di uomini gelosi. (2)</p>
        <p>(6) Sentence 1: “Tonfi, spruzzi, strida, ingiurie, lazzi, risate, un allegro pandemonio.” Sentence 2: “E fino a quel momento, chiasso, baccano, sghignazzi, ingiurie, rumore di catene, acido carbonico e fuliggine, teste rase, facce marchiate, vestiti a brandelli, tutto fatto oggetto di ludibrio e di infamia... sì, grande è la vitalità dell’uomo!” S1: Entrambe le frasi descrivono vitalità. (4) S2: Descrivono degli scenari di caos, disordine; sintassi frasi simile. (3)</p>
      </sec>
      <sec id="sec-2-2">
        <title>5. Conclusion and Future Work</title>
        <p>This paper presented SimilEx, the first Italian dataset on sentence similarity enriched with human judgments and free-form explanations. The analyses of the collected judgments confirmed that the perception of sentence similarity is inherently subjective, as evidenced by the fair agreement between the scores. Notably, annotators tend to agree less on similar sentence pairs, showing greater convergence when sentences are markedly different. The style of the paired sentences appears to influence this convergence: while most linguistic traits may not directly impact the similarity score, some of them affect the homogeneity of judgments assigned by different annotators. These features mostly concern properties of sentence structure rather than raw sentence features such as length, which does not play a role in homogeneity. Regarding explanations, we found a correlation between the similarity of the content of the explanations and the similarity scores assigned, indicating that annotators tend to write more similar explanations, using a similar writing style, when their scores align.</p>
        <p>The findings from this study open several prospects. Expanding SimilEx to include sentences from different textual genres could provide further insights into the factors affecting similarity judgments. Additionally, incorporating more annotators with varying linguistic backgrounds could foster a better understanding of the subjectivity in similarity perception. Lastly, our dataset could help develop automated tools to evaluate the explainability of LLMs. By leveraging SimilEx, researchers can create models that predict similarity scores and generate explanations, enhancing the interpretability of LLMs.</p>
      </sec>
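      <p>For reference, the clustering configuration used for the explanations (agglomerative clustering over SBERT vectors with Euclidean distance and Ward variance minimization) can be sketched in a self-contained way; this is an illustrative greedy re-implementation, not the library code actually used, and the toy 2-D points stand in for explanation embeddings.</p>
      <preformat>
```python
# Sketch of the explanation-clustering step: greedy agglomerative
# clustering with Euclidean distance and Ward's variance-minimization
# merge criterion, applied to toy 2-D vectors standing in for the
# SBERT embeddings of the explanations.

def ward_clusters(points, n_clusters):
    """Return a list of clusters, each a list of point indices."""
    clusters = [[i] for i in range(len(points))]
    dim = len(points[0])

    def centroid(c):
        return [sum(points[i][d] for i in c) / len(c) for d in range(dim)]

    def ward_cost(a, b):
        # Increase in within-cluster variance if a and b are merged.
        ca, cb = centroid(a), centroid(b)
        d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return (len(a) * len(b)) / (len(a) + len(b)) * d2

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest Ward cost.
        pairs = [(ward_cost(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```
      </preformat>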
    </sec>
    <sec id="sec-3">
      <title>6. Acknowledgments</title>
      <sec id="sec-3-1">
        <p>This paper is supported by the PRIN 2022 PNRR Project EKEEL - Empowering Knowledge Extraction to Empower Learners (P20227PEPK) funded by the European Union – NextGenerationEU and the project CHANGES - Cultural Heritage Innovation for Next-Gen Sustainable Society (PE00000020) under the NRRP MUR program funded by the NextGenerationEU.</p>
        <sec id="sec-3-1-1">
          <title>References</title>
          <p>a survey, Information 11 (2020) 421.</p>
          <p>[17] E. Agirre, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, *SEM 2013 shared task: Semantic textual similarity, in: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, 2013, pp. 32–43.</p>
          <p>[18] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, L. Specia, SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation, in: S. Bethard, M. Carpuat, M. Apidianaki, S. M. Mohammad, D. Cer, D. Jurgens (Eds.), Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 1–14.</p>
          <p>[19] Y. Wang, S. Tao, N. Xie, H. Yang, T. Baldwin, K. Verspoor, Collective Human Opinions in Semantic Textual Similarity, Transactions of the Association for Computational Linguistics 11 (2023) 997–1013.</p>
          <p>[20] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3982–3992.</p>
          <p>[21] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.</p>
          <p>[22] D. Brunato, A. Cimino, F. Dell’Orletta, G. Venturi, S. Montemagni, Profiling-UD: a tool for linguistic profiling of texts, in: Proceedings of the Conference on Language Resources and Evaluation (LREC), ELRA, 2020, pp. 7147–7153.</p>
          <p>[23] M.-C. de Marneffe, C. D. Manning, J. Nivre, D. Zeman, Universal Dependencies, Computational Linguistics 47 (2021) 255–308. doi:10.1162/coli_a_00402.</p>
          <p>[24] E. Agirre, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, *SEM 2013 shared task: Semantic textual similarity, in: M. Diab, T. Baldwin, M. Baroni (Eds.), Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, Association for Computational Linguistics, Atlanta, Georgia, USA, 2013, pp. 32–43. URL: https://aclanthology.org/S13-1004.</p>
          <p>[25] J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics (1977) 159–174.</p>
          <p>[26] C. Bosco, S. Montemagni, M. Simi, Converting Italian treebanks: Towards an Italian Stanford dependency treebank, in: Proceedings of the ACL Linguistic Annotation Workshop &amp; Interoperability with Discourse, 2013.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>A. Supplementary materials</title>
          <p>The complete SimilEx dataset is freely available at http://www.italianlp.it/resources/ along with the results of the stylistic analysis of both paired sentences and the natural language explanations provided by the two students. Specifically, on the dedicated page, you can find the following materials:</p>
          <p>SimilEx dataset. The dataset is organized in columns, each reporting the following information:
• Pair_ID: the unique identifier of the paired sentences;
• Sentence_1 and Sentence_2: the text of each of the two paired sentences;
• A1-A7: the similarity judgments of the Prolific annotators;
• Stud_1: the similarity judgment assigned by the first student;
• Explanation_Stud1: the natural language explanation provided by Stud_1;
• Stud_2: the similarity judgment assigned by the second student;
• Explanation_Stud2: the natural language explanation provided by Stud_2.</p>
          <p>Linguistic profiling of the paired sentences. The results of the stylistic analysis of each of the paired sentences included in SimilEx are contained in the “Sentence_profiling” sheet, reporting for each column the following information:
• Pair_ID: the unique identifier of the paired sentences in the SimilEx dataset;
• Sent_in_pair: the unique identifier of each individual sentence in the pair;
• all other columns report the value of the distribution of the complete set of linguistic characteristics derived with Profiling-UD for each individual sentence.</p>
          <p>Linguistic profiling of the explanations. The results of the stylistic analysis of each explanation provided by the two students are contained in the “Explanations_profiling” sheet, reporting for each column the following information:
• PairID_of_explained_pair: the unique identifier of each individual sentence in the pairs of the SimilEx dataset;
• Explanation_of_student: the identifier of the student;
• all other columns report the value of the distribution of the complete set of linguistic characteristics derived with Profiling-UD for each individual explanation.</p>
        </sec>
      </sec>
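      <p>A minimal sketch of reading the SimilEx columns listed above; the CSV serialization and the absence of a file name are assumptions made for illustration, since the official distribution on the resource page may be a spreadsheet with the named sheets.</p>
      <preformat>
```python
# Sketch of reading the SimilEx dataset using the column layout
# documented above. The CSV layout is an assumption for illustration;
# adapt the reader to the actual distribution format.
import csv
import io

PROLIFIC_COLS = ("A1", "A2", "A3", "A4", "A5", "A6", "A7")

def read_similex(fileobj):
    """Yield one dict per sentence pair; Prolific scores collected as ints.
    Pairs received between 5 and 7 annotations, so A6/A7 may be empty."""
    for row in csv.DictReader(fileobj):
        row["prolific_scores"] = [int(row[c]) for c in PROLIFIC_COLS
                                  if row.get(c)]
        yield row
```
      </preformat>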
    </sec>
    <sec id="sec-4">
      <title>B. Linguistic Features</title>
      <sec id="sec-4-1">
        <p>The set of linguistic features derived by Profiling-UD is extracted from different levels of linguistic annotation; it captures a wide range of linguistic phenomena and can be grouped as follows:</p>
        <p>• Raw text:
- Number of tokens in sentence;
- Average characters per token.</p>
        <p>• Morphosyntactic information:
- Distribution of UD POS;
- Lexical density.</p>
        <p>• Inflectional morphology:
- Distribution of lexical verbs and auxiliaries for inflectional categories (tense, mood, person, number).</p>
        <p>• Verbal Predicate Structure:
- Distribution of verbal heads and verbal roots;
- Average verb arity and distribution of verbs by arity.</p>
        <p>• Global and Local Parsed Tree Structures:
- Average depth of the whole syntactic trees;
- Average length of dependency links and of the longest link;
- Average length of prepositional chains and distribution by depth;
- Average clause length.</p>
        <p>• Relative order of elements:
- Distribution of subjects and objects in post- and pre-verbal position.</p>
        <p>• Syntactic Relations:
- Distribution of dependency relations.</p>
        <p>• Use of Subordination:
- Distribution of subordinate and principal clauses;
- Average length of subordination chains and distribution by depth;
- Distribution of subordinates in post- and pre-principal clause position.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>C. Annotation Instructions</title>
      <sec id="sec-5-1">
        <title>C.1. Original Instructions in Italian</title>
        <p>Stai per svolgere un questionario nel quale ti verrà chiesto di valutare se due frasi sono fra di loro simili o diverse. Per farlo, ti mostreremo delle coppie di frasi estratte da romanzi e ti chiederemo di assegnare ad ogni coppia un punteggio compreso fra 1 e 5.</p>
        <p>Usa 1 per dire che le due frasi sono fra loro completamente diverse; Usa 5 per dire che sono pressoché uguali. Gli altri punteggi ti serviranno per valutare i casi intermedi.</p>
        <p>Due frasi possono dirsi uguali o diverse sulla base di diversi elementi. Ecco alcuni esempi per aiutarti nella valutazione.</p>
        <p>Coppie di frasi diverse (punteggio 1).</p>
        <p>Esempio 1:
a) Io desidererei tanto non sentire così intensamente e non prendermi tanto a cuore tutto quello che succede.
b) Sì, non sono in me, sono tutta nell’aspettativa e vedo tutto un po’ troppo facile.</p>
        <p>Esempio 2:
a) Anche il vecchio principe t’è affezionato.
b) - Non mi sembra di averveli chiesti, - scattò il principe irritatissimo.</p>
        <p>Esempio 3:
a) Il the veramente era del color della birra, ma io ne bevvi un bicchiere.
b) Ma non passò neanche un minuto, che la birra gli diede alla testa e per la schiena gli corse un leggero e perfin piacevole brivido.</p>
        <p>Fai particolare attenzione agli esempi 2 e 3: anche se le frasi hanno delle parole in comune (come ’principe’ e ’birra’ negli esempi) non è detto che siano uguali!</p>
        <p>Coppie di frasi molto simili (punteggio 5).</p>
        <p>Esempio 1:
a) Signori della giuria, la psicologia è a doppio taglio e anche noi siamo in grado di comprenderla.
b) Vedete allora, signori della giuria, dal momento che la psicologia è un’arma a doppio taglio, permettetemi di occuparmi del secondo taglio e vediamo che cosa viene fuori.</p>
        <p>Esempio 2:
a) "Un rettile divorerà l’altro", aveva detto il giorno prima Ivan, parlando con rabbia del padre e del fratello.
b) "Un rettile divorerà l’altro, quella è la fine che faranno!".</p>
        <p>Esempio 3:
a) Ma una volta deciso, continuò con la sua voce stridula, senza timori, senza esitazioni e sottolineando alcune parole.
b) Parlava rapido, senza fermarsi un momento, senza la minima esitazione, quasi rimproverasse a sè stesso di aver tanto indugiato a mettere Marianna a parte di tutti i suoi segreti, quasi scusandosi presso di lei.</p>
        <p>Gli esempi 1 e 2 riportano frasi che non solo contengono molte parole in comune ma sono simili anche per quanto riguarda la scena descritta. Nel terzo esempio, entrambe le frasi descrivono una persona intenta a parlare in modo svelto e deciso. Possiamo dire che in questi esempi l’alta similarità fra le frasi è data dal fatto che, ad eccezione di alcuni dettagli, esse descrivono scene o immagini molto simili, anche se si svolgono in contesti diversi.</p>
      </sec>
      <sec id="sec-5-2">
        <title>C.2. Instructions Translations into English</title>
        <sec id="sec-5-2-1">
          <title>You are about to take a questionnaire in which you will</title>
          <p>be asked to assess whether two sentences are similar or
different to each other. To do this, we will show you pairs
of sentences extracted from novels and ask you to give
each pair a score between 1 and 5.</p>
          <p>Use 1 to say that the two sentences are completely
different from each other; Use 5 to say that they are almost
the same. The other scores will be used to evaluate the
intermediate cases.</p>
          <p>Two sentences can be equal or different based on
several elements. Here are some examples to help you in
your evaluation.</p>
          <p>Pairs of different sentences (score 1)</p>
          <p>Examples: Please refer to the above section to see the
original examples in Italian.</p>
          <p>Pay particular attention to examples 2 and 3: although
the sentences have words in common (like ‘prince’ and
‘beer’ in the examples) they are not necessarily the same!
Pairs of very similar sentences (score 5)</p>
          <p>Examples: Please refer to the above section to see the
original examples in Italian.</p>
          <p>Examples 1 and 2 show sentences that not only contain
many words in common but are also similar in terms of
the scene described. In the third example, both sentences
describe a person speaking quickly and decisively. We
can say that the high similarity between the sentences
in these examples is due to the fact that, except for a few
details, they describe very similar scenes or images, even
though they take place in different contexts.</p>
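          <p>As a side illustration (not part of the original instructions), a naive lexical-overlap score shows why shared words alone do not make two sentences similar, which is exactly the caveat given for examples 2 and 3 above:</p>

```python
# Naive word-overlap (Jaccard) score between two sentences. Two sentences
# may share content words ("birra" below) yet describe unrelated scenes,
# so a nonzero overlap can still deserve a human similarity score of 1.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# Fragments of the score-1 example 3 from the instructions above.
s1 = "il the era del color della birra"
s2 = "la birra gli diede alla testa"
print(round(jaccard(s1, s2), 3))  # 0.083
```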
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>D. Translations of Explanations</title>
      <sec id="sec-6-1">
        <p>English translations of the similarity explanations originally written in Italian by the two students and reported in Table 1:</p>
        <p>• Example (1) S1: Completely different. S2: They talk about women who are very pretty.</p>
        <p>• Example (2) S1: Completely different although they express the same concept. S2: Same sentences with different syntactic structures.</p>
        <p>• Example (3) S1: In both sentences, a police officer is mentioned. S2: The subject is a police officer.</p>
        <p>• Example (4) S1: In both sentences, Napoleon and his allies are mentioned. S2: They speak of Napoleon's allies.</p>
        <p>• Example (5) S1: In the first case the focus of the sentence is the wife, in the second it is the husband. S2: They talk about jealous men.</p>
        <p>• Example (6) S1: Both sentences describe vitality. S2: They describe scenarios of chaos, disorder; similar sentence syntax.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Aleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <article-title>GPT-4 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          , et al.,
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>
          ,
          <source>arXiv preprint arXiv:2307.09288</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <collab>Gemini Team</collab>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Anil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.- B. Alayrac</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Soricut</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Schalkwyk</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hauth</surname>
          </string-name>
          , et al.,
          <article-title>Gemini: a family of highly capable multimodal models</article-title>
          ,
          <source>arXiv preprint arXiv:2312.11805</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Maynez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bohnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <article-title>On faithfulness and factuality in abstractive summarization</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1906</fpage>
          -
          <lpage>1919</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Augenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Ciampaglia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Corney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>DiResta</surname>
          </string-name>
          , E. Ferrara,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          , et al.,
          <article-title>Factuality challenges in the era of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2310.05189</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belinkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <article-title>Analysis methods in neural language processing: A survey</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <fpage>49</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Doshi-Velez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Towards a rigorous science of interpretable machine learning</article-title>
          ,
          <source>arXiv preprint arXiv:1702.08608</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <article-title>Explainability for large language models: A survey</article-title>
          ,
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wiegrefe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marasovic</surname>
          </string-name>
          ,
          <article-title>Teach me to explain: A review of datasets for explainable natural language processing</article-title>
          ,
          <source>in: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>O.-M. Camburu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Rocktäschel</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Lukasiewicz</surname>
          </string-name>
          , P. Blunsom,
          <article-title>e-SNLI: Natural language inference with natural language explanations</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>31</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bowman</surname>
          </string-name>
          , G. Angeli,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>A large annotated corpus for learning natural language inference</article-title>
          ,
          <source>in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>632</fpage>
          -
          <lpage>642</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Rajani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <article-title>Explain yourself! leveraging language models for commonsense reasoning</article-title>
          ,
          <source>arXiv preprint arXiv:1906.02361</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Brassard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Heinzerling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kavumba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Inui</surname>
          </string-name>
          ,
          <article-title>COPA-SSE: Semi-structured explanations for commonsense reasoning</article-title>
          , in: N.
          <string-name>
            <surname>Calzolari</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Béchet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Blache</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Choukri</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Cieri</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Declerck</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Goggi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Isahara</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Maegaard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mariani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Mazo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Odijk</surname>
          </string-name>
          , S. Piperidis (Eds.),
          <source>Proceedings of the Thirteenth Language Resources and Evaluation Conference</source>
          , Marseille, France,
          <year>2022</year>
          , pp.
          <fpage>3994</fpage>
          -
          <lpage>4000</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaninello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brenna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <article-title>Textual entailment with natural language explanations: The italian e-RTE-3 Dataset</article-title>
          , in: F. Boschetti, et al. (Eds.),
          <source>Proceedings of the 9th Italian Conference on Computational Linguistics</source>
          (CLiC-it 2023), November 30 - December 2nd, Venice (Italy),
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <article-title>Measurement of text similarity: a survey</article-title>
          ,
          <source>Information</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>421</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics (1977) 159-174.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] C. Bosco, S. Montemagni, M. Simi, Converting italian treebanks: Towards an italian stanford dependency treebank, in: Proceedings of the ACL Linguistic Annotation Workshop &amp; Interoperability with Discourse, 2013.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>