<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SimilEx: the First Italian Dataset for Sentence Similarity with Natural Language Explanations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chiara Alzetta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chiara Fazzone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Venturi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ItaliaNLP Lab, CNR, Istituto di Linguistica Computazionale 'A. Zampolli'</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Large language models (LLMs) demonstrate great performance in natural language processing and understanding tasks. However, much work remains to enhance their interpretability. Annotated datasets with explanations could be key to addressing this issue, as they enable the development of models that provide human-like explanations for their decisions. In this paper, we introduce the SimilEx dataset, the first Italian dataset reporting human judgments of semantic similarity between pairs of sentences. For a subset of these pairs, the annotators also provided explanations in natural language for the scores assigned. The SimilEx dataset is valuable for exploring the variability in similarity perception between sentences and among human explanations of similarity judgments.</p>
      </abstract>
      <kwd-group>
        <kwd>Sentence similarity</kwd>
        <kwd>Italian dataset</kwd>
        <kwd>human judgements</kwd>
        <kwd>explanations</kwd>
        <kwd>annotation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>Large language models (LLMs) display impressive linguistic skills and demonstrate outstanding performances on a variety of tasks concerning natural language processing and understanding. This is particularly true for the most recent and ground-breaking models such as GPT-3.5/4 [<xref ref-type="bibr" rid="ref1">1</xref>], LLama-2 [<xref ref-type="bibr" rid="ref2">2</xref>] and Gemini [<xref ref-type="bibr" rid="ref3">3</xref>]. LLMs, however, also present risky limitations such as lack of factuality [<xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>], poor interpretability [<xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>] and hallucinations [<xref ref-type="bibr" rid="ref8">8</xref>]. Consequently, it has become important to verify whether these models are explainable, and specifically whether they can provide human-like explanations, in natural language, for the decisions they make [<xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>]. The ability of LLMs to explain the reasoning needed to solve a given task is fundamental, particularly for tasks where there is no established or shared evaluation protocol or benchmark.</p>
      <p>Annotated datasets with explanations are key to addressing this issue, as they enable the development of models that provide human-like explanations for their decisions. Therefore, multiple datasets have been created with free-form explanations to be incorporated into the model training process and used as benchmarks at test time, mostly focusing on English [<xref ref-type="bibr" rid="ref10">10</xref>]. Some examples are the e-SNLI dataset [<xref ref-type="bibr" rid="ref11">11</xref>], a version of the Stanford Natural Language Inference (SNLI) dataset [<xref ref-type="bibr" rid="ref12">12</xref>] enriched with human-annotated explanations, and the Common Sense Explanations (CoS-E) [<xref ref-type="bibr" rid="ref13">13</xref>] and Semi-Structured Explanations for COPA (COPA-SSE) [<xref ref-type="bibr" rid="ref14">14</xref>] datasets, which include natural language explanations for commonsense reasoning. To the best of our knowledge, the only existing dataset enriched with explanations for Italian is ‘e-RTE-3-it’ [<xref ref-type="bibr" rid="ref15">15</xref>], an Italian version of the RTE-3 dataset for textual entailment.</p>
      <p>In this paper, we introduce the SimilEx dataset: as far as we are aware, the first Italian dataset of 2,112 pairs of sentences manually annotated for semantic similarity. About half of the pairs are further enriched with free-form human-written explanations that justify the similarity score. The identification of textual similarity is a natural language understanding (NLU) task that involves determining the degree of semantic equivalence between two texts [<xref ref-type="bibr" rid="ref16">16, 17</xref>]. It is a foundational NLU problem relevant to many applications such as summarisation, question answering and conversational systems [18]. Despite its relevance, this task is highly challenging even for humans due to its subjective nature: human annotators often widely disagree on similarity scores [19], suggesting that the cues driving sentence similarity are neither well codified nor transparent and that their perceived relevance may vary among annotators. Possibly due to these challenges, and as far as we know, datasets including human explanations for the sentence similarity task are lacking. However, such datasets are invaluable, as they force annotators to reason about their choices and to identify the traits that most influence their annotations.</p>
      <p>Contributions. In this paper, we i) introduce SimilEx, the first Italian dataset featuring human annotations and explanations of sentence semantic similarity; ii) provide an extensive study of the degree of subjectivity in the perception of sentence semantic similarity; and iii) investigate the relationship between the stylistic variation of the paired sentences and the human ratings and natural language explanations of sentence semantic similarity.</p>
      <sec id="sec-1-1">
        <title>2. The SimilEx Dataset</title>
        <sec id="sec-1-1-1">
          <title>2.1. Data Collection</title>
          <p>The sentence pairs of SimilEx are acquired from a collection of novels from the late XIX century translated into Italian. We used Sentence-BERT (SBERT) [20] to combine pairs of sentences to present to annotators. SBERT is a modification of BERT [21] adapted to produce sentence embeddings that can be easily compared using cosine similarity, which ranges from 0 (no similarity) to 1 (identical sentences). We included in the SimilEx dataset only pairs obtaining a similarity score ≥ 0.65, for a total of 2,112 sentence pairs.</p>
          <p>The textual genre of the sentences (i.e., novels) introduces specific stylistic properties that cause potential differences from standard Italian. We assessed the linguistic style of SimilEx sentences using Profiling-UD [22], a web-based tool that captures multiple aspects of sentence structure (the complete set of linguistic characteristics used for the stylistic analysis can be found in Appendix B). The tool extracts around 130 properties representative of the underlying linguistic structure of a sentence, derived from raw, morphosyntactic, and syntactic levels of sentence annotation, all based on the Universal Dependencies (UD) formalism [23]. These properties have been shown to be highly predictive when used as features by learning models in various classification tasks, such as Automatic Readability and Linguistic Complexity Assessment or Native Language Identification. Among these characteristics, the average length computed on SimilEx sentences is 30.18 tokens (± 22.36), above the average length of standard Italian sentences, typically around 20 tokens. Interestingly, within pairs, the average length difference is 17.02 tokens (± 19.55). This value, combined with such a high standard deviation, suggests a large variability of style within the pairs. This notable variability extends, e.g., to the distribution of subordinate clauses and lexical overlap. Within pairs, the average difference in the number of subordinate clauses is 2.25 (± 1.81), and the overlap of content words is 12.60%, which are significant given that this variation occurs within individual sentence pairs. Having pairs with such stylistic differences provides an opportunity to investigate the impact of stylistic variation on the perception of similarity.</p>
        </sec>
        <sec id="sec-1-1-2">
          <title>2.2. Human Similarity Annotation</title>
          <p>Sentence pairs of SimilEx were annotated through the online crowdsourcing platform Prolific (https://www.prolific.com/). Annotators were recruited among native Italian speakers and presented with a questionnaire of 30 pairs plus 2 control pairs.</p>
          <p>Annotation Guidelines. The task consisted of scoring each sentence pair of the questionnaire for the perceived sentence similarity using a 5-point Likert scale, where 1 is described as “Completamente diverse” (Completely different) and 5 as “Pressoché identiche” (Almost identical). No formal definition of similarity is provided; only a few examples of highly similar and highly different pairs, along with motivations for the extreme similarity scores, are shown in the annotation instructions provided to the annotators, fully reported in Appendix C. This represents the main novelty of our approach compared to the methodology used to create datasets for Semantic Textual Similarity tasks, typically organized within the SemEval evaluation campaign (see among the others [24, 18]). These datasets are usually built with clear and specific instructions for annotators, who are explicitly asked to evaluate whether paired text portions refer to the same person, action, or event, or to focus their judgment on similarity types such as the same author, time period, or location. Some examples of annotation with similarity scores averaged across annotators are shown in Table 1.</p>
          <p>Demographics. Participants could share information about their age, gender and occupation and complete multiple questionnaires. Eventually, 317 distinct participants took part in the study. After a preliminary analysis, we excluded 34 annotators deemed unreliable because they either took too little time to complete the questionnaire, assigned systematically divergent scores compared to the rest of the participants, failed the control questions or submitted blank answers. The resulting dataset includes 2,112 sentence pairs annotated by the remaining 283 annotators, who took 18 minutes on average to complete a questionnaire (the compensation is fair according to the platform: 6.30£/hour). Each pair received a minimum of 5 and a maximum of 7 annotations from different participants. The set of annotators is quite balanced for gender (51% males) and the average age of annotators is 27.05 (± 6.56). Regarding occupation, 50% of participants indicated that they have a full- or part-time job, around 25% declared themselves unemployed, and the remaining 25% preferred not to disclose their occupational status.</p>
          <p>Table 1 (examples of annotated sentence pairs): “Sì, grazie a Dio non è male.” – “Io invece non ce l’ho: tante grazie!”; “Non hanno mandato a prendere il latte fresco?” – “E per me, chiedi almeno del latte.”; “Solo lo zar può far la grazia.” – “Voglio chiedere la grazia allo zar.”; “Accidenti a voi, mi fate perdere il filo!” – “Intanto voi però mi avete fatto perdere il filo.”</p>
        </sec>
        <sec id="sec-1-1-3">
          <title>2.3. Human Explanations of Similarity</title>
          <p>We recruited 2 native Italian speakers who volunteered to enrich the pairs of sentences with free-form explanations. These annotators are graduate students, one male and one female, aged 23 years. They were asked to score the similarity of a random subset of 907 sentence pairs on the same 5-point Likert scale as the other participants. Additionally, they were asked to provide a short explanation for their scores, in the form of a single concise sentence.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Human Similarity Perception</title>
      <p>fewer annotators in agreement, while 5 or more
annotators (up to 9) gave identical values in 20.08% of pairs.
Figure 2 displays the distribution of similarity scores within
these groups. Notably, when few annotators agree on
a pair, the scores are evenly distributed across the five
labels, indicating that disagreement can occur for pairs
seen as both similar and diferent. In contrast, when more
annotators agree, the most commonly assigned score is 1,
indicating that annotators converge more frequently on
dissimilarity judgments. This is supported by the
negative Pearson correlation between the number of agreeing
annotators and the average similarity score of the pair
( = − 0.344,  &lt; 0.001).</p>
      <p>Agreement, Style and Similarity. We explored the relationship between style and similarity judgments by comparing scores and stylistic traits of sentences. As a general remark, we found that style minimally affects pairs’ similarity: the Pearson correlation between the similarity scores and the distribution of stylistic properties is either non-significant (p &gt; 0.05) or extremely low (r &lt; 0.1). However, a more in-depth analysis of specific stylistic properties revealed a nuanced relationship between style and the consistency of human judgments. For example, contrary to our expectations, sentence length, a raw yet informative feature reflecting stylistic variation, did not impact the similarity scores assigned by annotators. In fact, when we computed the correlation between the length difference of paired sentences and the variance between similarity judgments, we observed a lack of correlation (0.05). To further investigate, we grouped pairs based on the difference between the length of their sentences, and specifically, based on whether their length difference was above or below the average value of 17 tokens. We noticed that also from this perspective of analysis sentence length did not affect the IAA of the scores, which is 0.265 for both groups. However, when focusing on different stylistic traits more closely related to sentence structure, we observed a substantial relationship with higher annotator agreement. For instance, the IAA is moderate (0.49) for pairs where neither sentence contains a subordinate clause, but drops to fair (0.25) when both sentences contain at least one subordinate clause. Similarly, the IAA is higher (0.37) when the syntactic tree depth difference between paired sentences is below the average value of 1.98, compared to 0.29 when the difference is greater. These results are extremely interesting as they indicate that while stylistic traits may not directly influence the semantic similarity between sentences, some of them play a role in the convergence of human judgments.</p>
      <sec id="sec-2-1">
        <title>4. Human Similarity Explanation</title>
        <p>In this section, we focus on the analysis of the subset of 907 sentence pairs of SimilEx annotated by the two students with both human similarity judgments and natural language explanations for the assigned scores.</p>
        <p>Comparison with Prolific annotators. The comparison between the similarity judgments of the graduate students and Prolific annotators reveals a strong alignment between the two groups. The Pearson correlation between the average similarity score of the Prolific annotators and the average score of the two graduate students is significantly high and positive (r = 0.779, p &lt; 0.001). This high correlation is also observed when computed separately for each of the two students, indicating that their perceptions of similarity closely match the judgements obtained from the crowdsourcing campaign. Additionally, the IAA between the two students is 0.49, suggesting an alignment higher than that reported among the Prolific annotators.</p>
        <p>Linguistic Style of Explanations. We explored the style of the explanations relying on the linguistic profiling method described in Section 2.1. We noted that the explanations written by the two students exhibit partial similarity, as can be seen by inspecting the results of the stylistic analysis distributed as supplementary materials (see Appendix A). For example, they both tend to write quite short sentences, i.e. on average 6.35 (± 3.93) and 7.67 (± 5.12) tokens long, characterized by a nominal style. This is evidenced by the low percentage distribution of verbal roots (i.e. sentences with a verb as the syntactic root), computed over the total number of roots represented by other morpho-syntactic categories (i.e. 58.21% (± 49.35) and 61.43% (± 48.70)). This percentage is notably low when compared to the distribution in the ISDT [26], the largest Italian treebank, where the distribution is 85.73%.</p>
        <p>Content of Explanations. The content analysis of the explanations reveals that both students share some arguments when justifying the similarity scores for SimilEx sentence pairs. Specifically, the average cosine similarity between their explanations, computed using SBERT, is 0.46, indicating a moderate level of similarity. Given that a qualitative analysis reveals several recurring arguments and templates in the explanations, such as Entrambe descrivono (‘Both describe’) and In entrambe le frasi si parla di un argomento militare (‘In both sentences a military topic is mentioned’), we further explored the possibility of identifying homogeneous content among them. To this end, we clustered the 907 explanations of each student (1,814 in total) based on their SBERT vectors. We initially configured the clustering algorithm to partition the data into 10 clusters (we employed agglomerative clustering using Euclidean distance and Ward variance minimization as the clustering method). However, only 4 of these clusters were found to be semantically homogeneous. Specifically, these homogeneous clusters contain explanations where either Student 1 or 2: i) writes that the evaluated sentences contain positive or negative emotions such as love or anger, ii) uses the phrase Pressoché identiche (‘Almost identical’), iii) uses the phrase Completamente diverse (‘Completely different’), and iv) notes that the evaluated sentences refer to a military topic. Since the explanations in the remaining 6 clusters were not semantically homogeneous, we reconfigured the clustering algorithm to partition the data into 5 clusters. This time, we included only the explanations that had not been previously clustered, representing 72.76% of all SimilEx explanations. However, we were still unable to isolate explanations with similar content. This suggests that the two students often focused on different aspects when evaluating sentence similarity. As proof, consider the examples reported in Table 2, where the students focused on diverse aspects of the paired sentences while they assigned either similar (see #5 and #6) or different (see #1 and #2) similarity scores. While this may result in underspecification and inconsistency in the collected explanations, it confirms the inherent subjectivity and expressivity involved in providing free-text natural language explanations for a highly subjective task such as evaluating semantic sentence similarity [<xref ref-type="bibr" rid="ref10">10</xref>].</p>
        <p>The content analyses above were enriched with an in-depth investigation into whether there is a correlation between the SBERT cosine similarity of the explanations of each student and their similarity judgments. The Pearson correlation between SBERT scores and the absolute difference in the students’ similarity judgments reveals a moderate negative relationship (r = −0.459, p &lt; 0.001). This indicates that the more semantically similar the explanations are, the smaller the difference in the students’ similarity judgments. Notably, students’ explanations tend to be more similar when the similarity scores assigned by both of them are lower (i.e. 1 or 2), as in example #3 of Table 2.</p>
        <p>Table 2 (sentence pairs with the explanations and similarity scores assigned by the two students, S1 and S2):</p>
        <p>(1) Sentence 1: “Vedeva lo scintillio degli occhi, tremulo e avvampante, e il riso di felicità e di eccitamento che senza volere le increspava le labbra; vedeva la grazia misurata, la sicurezza e la levità dei movimenti.” Sentence 2: “Era così bella, che non solo non appariva in lei ombra di civetteria, ma pareva al contrario che le rimordesse il forte ed immancabile effetto di una grazia trionfatrice, che avrebbe voluto temperare, se le fosse stato possibile.” S1: Completamente diverse. (1) S2: Parlano di donne che sono molto graziose. (4)</p>
        <p>(2) Sentence 1: “Ma che volete farci: questa è la vocazione dell’autore, ormai malato della propria imperfezione, e il suo talento è fatto apposta per rappresentare la povertà della nostra vita, scovando la gente in buchi sperduti, in angoletti remoti dell’impero!” Sentence 2: “Perché mettere in mostra la povertà della nostra vita e la nostra triste imperfezione, andando a scovare gli uomini in buchi sperduti, in angoletti remoti dell’impero?” S1: Completamente diverse anche se esprimono lo stesso concetto. (1) S2: Stessa frase impostata diversamente a livello sintattico. (4)</p>
        <p>(3) Sentence 1: “L’agente di polizia che l’accompagnava, discese e scosse il braccio intormentito; poi si tolse il berretto e si fece il segno della croce.” Sentence 2: “Nell’osteria entrò un agente di polizia.” S1: In entrambe le frasi si parla di un agente della polizia. (2) S2: Il soggetto è un agente di polizia. (2)</p>
        <p>(4) Sentence 1: “Napoleone si volse ad Alessandro, come per dire che quanto ora faceva era fatto per l’augusto e caro alleato.” Sentence 2: “Tutti gli alleati di Napoleone gli divennero nemici.” S1: In entrambe le frasi si parla di Napoleone e dei suoi alleati. (2) S2: Parlano degli alleati di Napoleone. (3)</p>
        <p>(5) Sentence 1: “Ma l’amore con un marito inquinato dalla gelosia e da ogni sorta di difetti non era più per lei.” Sentence 2: “Era forse, semplicemente, un sentimento di gelosia: egli era talmente avvezzo all’amore di lei, che non poteva ammettere che ella potesse amarne un altro.” S1: Nel primo caso il focus della frase è la moglie, nella seconda lo è il marito. (2) S2: Parlano di uomini gelosi. (2)</p>
        <p>(6) Sentence 1: “Tonfi, spruzzi, strida, ingiurie, lazzi, risate, un allegro pandemonio.” Sentence 2: “E fino a quel momento, chiasso, baccano, sghignazzi, ingiurie, rumore di catene, acido carbonico e fuliggine, teste rase, facce marchiate, vestiti a brandelli, tutto fatto oggetto di ludibrio e di infamia... sì, grande è la vitalità dell’uomo!” S1: Entrambe le frasi descrivono vitalità. (4) S2: Descrivono degli scenari di caos, disordine; sintassi frasi simile. (3)</p>
      </sec>
      <sec id="sec-2-2">
        <title>5. Conclusion and Future Work</title>
        <p>This paper presented SimilEx, the first Italian dataset on sentence similarity enriched with human judgments and free-form explanations. The analyses of the collected judgments confirmed that the perception of sentence similarity is inherently subjective, as evidenced by the fair agreement between the scores. Notably, annotators tend to agree less on similar sentence pairs, showing greater convergence when sentences are markedly different. The style of the paired sentences appears to influence this convergence: while most linguistic traits may not directly impact the similarity score, some of them affect the homogeneity of judgments assigned by different annotators. These features mostly concern properties of sentence structure rather than raw sentence features such as length, which does not play a role in homogeneity. Regarding explanations, we found a correlation between the similarity of the content of the explanations and the similarity scores assigned, indicating that annotators tend to write more similar explanations, using a similar writing style, when their scores align.</p>
        <p>The findings from this study open several prospects. Expanding SimilEx to include sentences from different textual genres could provide further insights into the factors affecting similarity judgments. Additionally, incorporating more annotators with varying linguistic backgrounds could foster a better understanding of the subjectivity in similarity perception. Lastly, our dataset could help develop automated tools to evaluate the explainability of LLMs. By leveraging SimilEx, researchers can create models that predict similarity scores and generate explanations, enhancing the interpretability of LLMs.</p>
      </sec>
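      <p>For reference, the clustering configuration used for the explanations (agglomerative clustering over SBERT vectors with Euclidean distance and Ward variance minimization) can be sketched in a self-contained way; this is an illustrative greedy re-implementation, not the library code actually used, and the toy 2-D points stand in for explanation embeddings.</p>
      <preformat>
```python
# Sketch of the explanation-clustering step: greedy agglomerative
# clustering with Euclidean distance and Ward's variance-minimization
# merge criterion, applied to toy 2-D vectors standing in for the
# SBERT embeddings of the explanations.

def ward_clusters(points, n_clusters):
    """Return a list of clusters, each a list of point indices."""
    clusters = [[i] for i in range(len(points))]
    dim = len(points[0])

    def centroid(c):
        return [sum(points[i][d] for i in c) / len(c) for d in range(dim)]

    def ward_cost(a, b):
        # Increase in within-cluster variance if a and b are merged.
        ca, cb = centroid(a), centroid(b)
        d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return (len(a) * len(b)) / (len(a) + len(b)) * d2

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest Ward cost.
        pairs = [(ward_cost(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```
      </preformat>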
    </sec>
    <sec id="sec-3">
      <title>6. Acknowledgments</title>
      <sec id="sec-3-1">
        <p>This paper is supported by the PRIN 2022 PNRR Project EKEEL - Empowering Knowledge Extraction to Empower Learners (P20227PEPK) funded by the European Union – NextGenerationEU and the project CHANGES - Cultural Heritage Innovation for Next-Gen Sustainable Society (PE00000020) under the NRRP MUR program funded by the NextGenerationEU.</p>
        <sec id="sec-3-1-1">
          <title>References</title>
          <p>a survey, Information 11 (2020) 421.</p>
          <p>[17] E. Agirre, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, *SEM 2013 shared task: Semantic textual similarity, in: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, 2013, pp. 32–43.</p>
          <p>[18] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, L. Specia, SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation, in: S. Bethard, M. Carpuat, M. Apidianaki, S. M. Mohammad, D. Cer, D. Jurgens (Eds.), Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 1–14.</p>
          <p>[19] Y. Wang, S. Tao, N. Xie, H. Yang, T. Baldwin, K. Verspoor, Collective Human Opinions in Semantic Textual Similarity, Transactions of the Association for Computational Linguistics 11 (2023) 997–1013.</p>
          <p>[20] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3982–3992.</p>
          <p>[21] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.</p>
          <p>[22] D. Brunato, A. Cimino, F. Dell’Orletta, G. Venturi, S. Montemagni, Profiling-UD: a tool for linguistic profiling of texts, in: Proceedings of the Conference on Language Resources and Evaluation (LREC), ELRA, 2020, pp. 7147–7153.</p>
          <p>[23] M.-C. de Marneffe, C. D. Manning, J. Nivre, D. Zeman, Universal Dependencies, Computational Linguistics 47 (2021) 255–308. doi:10.1162/coli_a_00402.</p>
          <p>[24] E. Agirre, D. Cer, M. Diab, A. Gonzalez-Agirre, W. Guo, *SEM 2013 shared task: Semantic textual similarity, in: M. Diab, T. Baldwin, M. Baroni (Eds.), Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, Association for Computational Linguistics, Atlanta, Georgia, USA, 2013, pp. 32–43. URL: https://aclanthology.org/S13-1004.</p>
          <p>[25] J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics (1977) 159–174.</p>
          <p>[26] C. Bosco, S. Montemagni, M. Simi, Converting Italian treebanks: Towards an Italian Stanford dependency treebank, in: Proceedings of the ACL Linguistic Annotation Workshop &amp; Interoperability with Discourse, 2013.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>A. Supplementary materials</title>
          <p>The complete SimilEx dataset is freely available at http://www.italianlp.it/resources/ along with the results of the stylistic analysis of both paired sentences and the natural language explanations provided by the two students. Specifically, on the dedicated page, you can find the following materials:</p>
          <p>SimilEx dataset. The dataset is organized in columns, each reporting the following information:
• Pair_ID: the unique identifier of the paired sentences;
• Sentence_1 and Sentence_2: the text of each of the two paired sentences;
• A1-A7: the similarity judgments of the Prolific annotators;
• Stud_1: the similarity judgment assigned by the first student;
• Explanation_Stud1: the natural language explanation provided by Stud_1;
• Stud_2: the similarity judgment assigned by the second student;
• Explanation_Stud2: the natural language explanation provided by Stud_2.</p>
          <p>Linguistic profiling of the paired sentences. The results of the stylistic analysis of each of the paired sentences included in SimilEx are contained in the “Sentence_profiling” sheet, reporting for each column the following information:
• Pair_ID: the unique identifier of the paired sentences in the SimilEx dataset;
• Sent_in_pair: the unique identifier of each individual sentence in the pair;
• all other columns report the value of the distribution of the complete set of linguistic characteristics derived with Profiling-UD for each individual sentence.</p>
          <p>Linguistic profiling of the explanations. The results of the stylistic analysis of each explanation provided by the two students are contained in the “Explanations_profiling” sheet, reporting for each column the following information:
• PairID_of_explained_pair: the unique identifier of each individual sentence in the pairs of the SimilEx dataset;
• Explanation_of_student: the identifier of the student;
• all other columns report the value of the distribution of the complete set of linguistic characteristics derived with Profiling-UD for each individual explanation.</p>
        </sec>
      </sec>
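      <p>A minimal sketch of reading the SimilEx columns listed above; the CSV serialization and the absence of a file name are assumptions made for illustration, since the official distribution on the resource page may be a spreadsheet with the named sheets.</p>
      <preformat>
```python
# Sketch of reading the SimilEx dataset using the column layout
# documented above. The CSV layout is an assumption for illustration;
# adapt the reader to the actual distribution format.
import csv
import io

PROLIFIC_COLS = ("A1", "A2", "A3", "A4", "A5", "A6", "A7")

def read_similex(fileobj):
    """Yield one dict per sentence pair; Prolific scores collected as ints.
    Pairs received between 5 and 7 annotations, so A6/A7 may be empty."""
    for row in csv.DictReader(fileobj):
        row["prolific_scores"] = [int(row[c]) for c in PROLIFIC_COLS
                                  if row.get(c)]
        yield row
```
      </preformat>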
    </sec>
    <sec id="sec-4">
      <title>B. Linguistic Features</title>
      <sec id="sec-4-1">
        <p>The set of linguistic features derived by Profiling-UD is extracted from different levels of linguistic annotation; it captures a wide range of linguistic phenomena and can be grouped as follows:</p>
        <p>• Raw text:
- Number of tokens in sentence;
- Average characters per token.</p>
        <p>• Morphosyntactic information:
- Distribution of UD POS;
- Lexical density.</p>
        <p>• Inflectional morphology:
- Distribution of lexical verbs and auxiliaries for inflectional categories (tense, mood, person, number).</p>
        <p>• Verbal Predicate Structure:
- Distribution of verbal heads and verbal roots;
- Average verb arity and distribution of verbs by arity.</p>
        <p>• Global and Local Parsed Tree Structures:
- Average depth of the whole syntactic trees;
- Average length of dependency links and of the longest link;
- Average length of prepositional chains and distribution by depth;
- Average clause length.</p>
        <p>• Relative order of elements:
- Distribution of subjects and objects in post- and pre-verbal position.</p>
        <p>• Syntactic Relations:
- Distribution of dependency relations.</p>
        <p>• Use of Subordination:
- Distribution of subordinate and principal clauses;
- Average length of subordination chains and distribution by depth;
- Distribution of subordinates in post- and pre-principal clause position.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>C. Annotation Instructions</title>
      <sec id="sec-5-1">
        <title>C.1. Original Instructions in Italian</title>
        <p>Stai per svolgere un questionario nel quale ti verrà chiesto di valutare se due frasi sono fra di loro simili o diverse. Per farlo, ti mostreremo delle coppie di frasi estratte da romanzi e ti chiederemo di assegnare ad ogni coppia un punteggio compreso fra 1 e 5.</p>
        <p>Usa 1 per dire che le due frasi sono fra loro completamente diverse; Usa 5 per dire che sono pressoché uguali. Gli altri punteggi ti serviranno per valutare i casi intermedi.</p>
        <p>Due frasi possono dirsi uguali o diverse sulla base di diversi elementi. Ecco alcuni esempi per aiutarti nella valutazione.</p>
        <p>Coppie di frasi diverse (punteggio 1).</p>
        <p>Esempio 1:
a) Io desidererei tanto non sentire così intensamente e non prendermi tanto a cuore tutto quello che succede.
b) Sì, non sono in me, sono tutta nell’aspettativa e vedo tutto un po’ troppo facile.</p>
        <p>Esempio 2:
a) Anche il vecchio principe t’è affezionato.
b) - Non mi sembra di averveli chiesti, - scattò il principe irritatissimo.</p>
        <p>Esempio 3:
a) Il the veramente era del color della birra, ma io ne bevvi un bicchiere.
b) Ma non passò neanche un minuto, che la birra gli diede alla testa e per la schiena gli corse un leggero e perfin piacevole brivido.</p>
        <p>Fai particolare attenzione agli esempi 2 e 3: anche se le frasi hanno delle parole in comune (come ’principe’ e ’birra’ negli esempi) non è detto che siano uguali!</p>
        <p>Coppie di frasi molto simili (punteggio 5).</p>
        <p>Esempio 1:
a) Signori della giuria, la psicologia è a doppio taglio e anche noi siamo in grado di comprenderla.
b) Vedete allora, signori della giuria, dal momento che la psicologia è un’arma a doppio taglio, permettetemi di occuparmi del secondo taglio e vediamo che cosa viene fuori.</p>
        <p>Esempio 2:
a) "Un rettile divorerà l’altro", aveva detto il giorno prima Ivan, parlando con rabbia del padre e del fratello.
b) "Un rettile divorerà l’altro, quella è la fine che faranno!".</p>
        <p>Esempio 3:
a) Ma una volta deciso, continuò con la sua voce stridula, senza timori, senza esitazioni e sottolineando alcune parole.
b) Parlava rapido, senza fermarsi un momento, senza la minima esitazione, quasi rimproverasse a sè stesso di aver tanto indugiato a mettere Marianna a parte di tutti i suoi segreti, quasi scusandosi presso di lei.</p>
        <p>Gli esempi 1 e 2 riportano frasi che non solo contengono molte parole in comune ma sono simili anche per quanto riguarda la scena descritta. Nel terzo esempio, entrambe le frasi descrivono una persona intenta a parlare in modo svelto e deciso. Possiamo dire che in questi esempi l’alta similarità fra le frasi è data dal fatto che, ad eccezione di alcuni dettagli, esse descrivono scene o immagini molto simili, anche se si svolgono in contesti diversi.</p>
      </sec>
      <sec id="sec-5-2">
        <title>C.2. Instructions Translations into English</title>
        <sec id="sec-5-2-1">
          <title>You are about to take a questionnaire in which you will</title>
          <p>be asked to assess whether two sentences are similar or
different to each other. To do this, we will show you pairs
of sentences extracted from novels and ask you to give
each pair a score between 1 and 5.</p>
          <p>Use 1 to say that the two sentences are completely
different from each other; Use 5 to say that they are almost
the same. The other scores will be used to evaluate the
intermediate cases.</p>
          <p>Two sentences can be equal or different based on
several elements. Here are some examples to help you in
your evaluation.</p>
          <p>Pairs of different sentences (score 1)</p>
          <p>Examples: Please refer to the above section to see the
original examples in Italian.</p>
          <p>Pay particular attention to examples 2 and 3: although
the sentences have words in common (like ‘prince’ and
‘beer’ in the examples) they are not necessarily the same!
Pairs of very similar sentences (score 5)</p>
          <p>Examples: Please refer to the above section to see the
original examples in Italian.</p>
          <p>Examples 1 and 2 show sentences that not only contain
many words in common but are also similar in terms of
the scene described. In the third example, both sentences
describe a person speaking quickly and decisively. We
can say that the high similarity between the sentences
in these examples is due to the fact that, except for a few
details, they describe very similar scenes or images, even
though they take place in different contexts.</p>
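          <p>As a side illustration (not part of the original instructions), a naive lexical-overlap score shows why shared words alone do not make two sentences similar, which is exactly the caveat given for examples 2 and 3 above:</p>

```python
# Naive word-overlap (Jaccard) score between two sentences. Two sentences
# may share content words ("birra" below) yet describe unrelated scenes,
# so a nonzero overlap can still deserve a human similarity score of 1.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# Fragments of the score-1 example 3 from the instructions above.
s1 = "il the era del color della birra"
s2 = "la birra gli diede alla testa"
print(round(jaccard(s1, s2), 3))  # 0.083
```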
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>D. Translations of Explanations</title>
      <sec id="sec-6-1">
        <p>English translations of the similarity explanations originally written in Italian by the two students and reported in Table 1:</p>
        <p>• Example (1) S1: Completely different. S2: They talk about women who are very pretty.</p>
        <p>• Example (2) S1: Completely different although they express the same concept. S2: Same sentences with different syntactic structures.</p>
        <p>• Example (3) S1: In both sentences, a police officer is mentioned. S2: The subject is a police officer.</p>
        <p>• Example (4) S1: In both sentences, Napoleon and his allies are mentioned. S2: They speak of Napoleon's allies.</p>
        <p>• Example (5) S1: In the first case the focus of the sentence is the wife, in the second it is the husband. S2: They talk about jealous men.</p>
        <p>• Example (6) S1: Both sentences describe vitality. S2: They describe scenarios of chaos, disorder; similar sentence syntax.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Aleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <article-title>GPT-4 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          , et al.,
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>
          ,
          <source>arXiv preprint arXiv:2307.09288</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <collab>Gemini Team</collab>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Anil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.- B. Alayrac</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Soricut</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Schalkwyk</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hauth</surname>
          </string-name>
          , et al.,
          <article-title>Gemini: a family of highly capable multimodal models</article-title>
          ,
          <source>arXiv preprint arXiv:2312.11805</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Maynez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bohnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <article-title>On faithfulness and factuality in abstractive summarization</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1906</fpage>
          -
          <lpage>1919</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Augenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Baldwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Ciampaglia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Corney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>DiResta</surname>
          </string-name>
          , E. Ferrara,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          , et al.,
          <article-title>Factuality challenges in the era of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2310.05189</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belinkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <article-title>Analysis methods in neural language processing: A survey</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <fpage>49</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Doshi-Velez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Towards a rigorous science of interpretable machine learning</article-title>
          ,
          <source>arXiv preprint arXiv:1702.08608</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <article-title>Explainability for large language models: A survey</article-title>
          ,
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wiegrefe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marasovic</surname>
          </string-name>
          ,
          <article-title>Teach me to explain: A review of datasets for explainable natural language processing</article-title>
          ,
          <source>in: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>O.-M. Camburu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Rocktäschel</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Lukasiewicz</surname>
          </string-name>
          , P. Blunsom,
          <article-title>e-SNLI: Natural language inference with natural language explanations</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>31</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bowman</surname>
          </string-name>
          , G. Angeli,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>A large annotated corpus for learning natural language inference</article-title>
          ,
          <source>in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>632</fpage>
          -
          <lpage>642</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Rajani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <article-title>Explain yourself! leveraging language models for commonsense reasoning</article-title>
          ,
          <source>arXiv preprint arXiv:1906.02361</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Brassard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Heinzerling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kavumba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Inui</surname>
          </string-name>
          ,
          <article-title>COPA-SSE: Semi-structured explanations for commonsense reasoning</article-title>
          , in: N.
          <string-name>
            <surname>Calzolari</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Béchet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Blache</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Choukri</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Cieri</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Declerck</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Goggi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Isahara</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Maegaard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mariani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Mazo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Odijk</surname>
          </string-name>
          , S. Piperidis (Eds.),
          <source>Proceedings of the Thirteenth Language Resources and Evaluation Conference</source>
          , Marseille, France,
          <year>2022</year>
          , pp.
          <fpage>3994</fpage>
          -
          <lpage>4000</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaninello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brenna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <article-title>Textual entailment with natural language explanations: The italian e-RTE-3 Dataset</article-title>
          , in: F. Boschetti, et al. (Eds.),
          <source>Proceedings of the 9th Italian Conference on Computational Linguistics</source>
          (CLiC-it 2023), November 30 - December 2nd, Venice (Italy),
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <article-title>Measurement of text similarity: a survey</article-title>
          ,
          <source>Information</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>421</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics (1977) 159-174.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] C. Bosco, S. Montemagni, M. Simi, Converting italian treebanks: Towards an italian stanford dependency treebank, in: Proceedings of the ACL Linguistic Annotation Workshop &amp; Interoperability with Discourse, 2013.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>