<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>ABRICOT - ABstRactness and Inclusiveness in COntexT: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marianna Marcella Bolognesi</string-name>
          <email>m.bolognesi@unibo.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Puccetti</string-name>
          <email>giovanni.puccetti@isti.cnr.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia Collacciani</string-name>
          <email>claudia.collacciani2@unibo.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Amelio Ravelli</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Esuli</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Keywords: Abstraction, Inclusiveness, Context, LLM evaluation, Italian Language Models. The ability to convey both specific information (about individuals or events) and generalisations (about categories) with the same lexical item is one of the key features of natural languages; natural languages do not have explicit markers for generic NPs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-3">
      <title>1. Challenge: Introduction and Motivation</title>
      <p>The ability to convey both specific information (about individuals or events) and generalisations (about
categories) with the same lexical item is one of the key features
of natural languages. Consider the examples in 1:</p>
      <sec id="sec-3-1">
        <title>Examples</title>
        <p>a) The lion escaped yesterday from the zoo.</p>
        <p>b) The lion is a predatory cat.</p>
        <p>The noun phrase (NP) the lion can describe either a
specific individual (1a) or the entire category of large
African felines (1b); it thus expresses a variable degree
of inclusiveness, i.e. of the number of individuals
to which the NP correctly applies in each sentence in which
it occurs. This demonstrates how human language follows
a principle of economy, enabling a one-to-many mapping
between lexical labels and meanings.</p>
      </sec>
      <sec id="sec-3-4">
        <title>The role of context</title>
        <p>Dec 04 – 06, 2024, Pisa, Italy.
∗Corresponding author.
https://www.unibo.it/sitoweb/andreaamelio.ravelli (A. A. Ravelli);
https://esuli.it/ (A. Esuli);
https://www.unibo.it/sitoweb/m.bolognesi (M. M. Bolognesi)</p>
        <p>
          The syntactic form of the NP (definite, indefinite, or plural)
does not provide sufficient information to discriminate
between the two meanings, and we need to
enlarge our focus to take into account the whole context
in which the NP occurs [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. This phenomenon can be
observed in all languages [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], affecting nearly all nouns
that can be used in referring expressions. Indeed, natural
languages do not have explicit markers for generic NPs
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]; the genericity/specificity of an NP is derived from
the meaning of the entire sentence. In other words, we
cannot interpret language one word at a time; we need
to consider the whole sentence or utterance as context
to disambiguate and decipher the meaning of each single
word composing it, and thus to understand the message
conveyed through language.
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>Generics</title>
        <p>
          Generalizations about kinds and categories, as in 1b,
are called generics and are fundamental to human
cognition, because they allow us to conceptualize properties
linked to categories, shaping how we perceive the world
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>Moreover, distinguishing between generic and
non-generic meanings for abstract entities is less
straightforward than for concrete ones, and for this reason evaluating
the inclusiveness of an abstract noun or NP is even
more challenging. Indeed, inclusiveness is not an
exclusive feature of concrete entities. Consider the
examples in 2:</p>
      </sec>
      <sec id="sec-3-6">
        <title>Examples</title>
        <p>a) Colorless green ideas sleep furiously.</p>
        <p>b) Be less curious about people and more curious about ideas.</p>
      </sec>
      <sec id="sec-3-8">
        <title>Abstract entities and inclusiveness</title>
        <p>The concept behind the word idea always refers
to an abstract entity, with slightly different grades of
abstractness, but it shows a greater variation in terms of
inclusiveness. The noun ideas in 2a includes only a
restricted number of elements with respect to the universe
of the ideas (namely, only colorless green ones), while the
reference in 2b shows a higher level of inclusiveness, not
distinguishing among them on the basis of their color.</p>
        <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).</p>
        <sec id="sec-3-8-1">
          <title>Figure 1: Examples of samples from the dataset</title>
          <p>(a) Token: Margherita. Text: "Le margherite di fronte alla mia casa saranno
in piena fioritura." ("The daisies in front of my house will be in full bloom.")
Abstractness: 0.177; Inclusiveness: 0.187.</p>
          <p>(b) Token: Ambizione. Text: "La sua ambizione lo rovinerà." ("His ambition will ruin him.")
Abstractness: 0.478; Inclusiveness: 0.083.</p>
          <p>(c) Token: Benzina (more concrete context). Text: "La benzina è nella bottiglia del latte."
("The gasoline is in the milk bottle.") Abstractness: 0.064; Inclusiveness: 0.063.</p>
          <p>(d) Token: Benzina (more abstract context). Text: "In Italia è disponibile la benzina a 95 ottani."
("95-octane gasoline is available in Italy.") Abstractness: 0.573; Inclusiveness: 0.653.</p>
        </sec>
        <sec id="sec-3-8-8">
          <p>The ability to distinguish, interpret and correctly use
the variability that natural language offers along these
two graduated semantic features, abstractness and
inclusiveness, is of paramount importance if we want to
build talking machines which not only simulate language, but
can also reason about natural language and the
knowledge of the world it depicts.</p>
          <p>
            The CALAMITA special event [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] offers the possibility to challenge Large Language Models on their ability
to understand the abstractness and inclusiveness of
words, and to compare their behaviour with that of humans in
judging Italian sentences. With this report we present
the ABRICOT Task: ABstRactness and Inclusiveness
in COntexT.
          </p>
          <p>
            This task has some similarities with the CONcreTEXT
Task1 [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], which was presented at the 2020 edition of
EVALITA.2 Both tasks focus on the
abstractness/concreteness of target words in natural Italian sentences, asking
for judgments by means of Likert scales, but the ABRICOT
Task goes beyond it by also including the inclusiveness
of the targets. Moreover, for the construction of
this dataset we considered exclusively nouns or NPs as
targets, and in order to minimize the impact of the
variability deriving from different semantic roles or
syntactic functions, all the sentences have been selected
with the target noun as subject of the main verb.</p>
          <p>2. Challenge: Description
The ABRICOT Task aims to challenge Italian
language models on their understanding of abstractness
and inclusiveness, features that we, as humans, naturally
express in everyday language. These features are not
discrete binary dichotomies like abstract/concrete or
inclusive/exclusive; instead, they shade along a
continuous spectrum, with the two extremes at opposite ends.
The collection of sentences in this Task shows the same
NP in a variety of different contexts, so that its meaning
can oscillate between the extremes of both the axis of
abstractness and that of inclusiveness.</p>
          <p>2.1. Tasks
We propose two separate tasks for this benchmark,
Task 1: abstractness and Task 2: inclusiveness. The two tasks are
formally identical: we use the same metric and the same
samples; however, they measure two different scores,
respectively abstractness_mean and inclusiveness_mean, the
first meant to measure the abstractness of the word in
context and the second its inclusiveness.
Since both these concepts are evident but fuzzy also
for humans, we do not expect language models to have
a perfect understanding of them, and we limit our
metrics to regression ones. Despite the tasks being very
similar from a formal perspective, we show that
models' performance on these two tasks varies and there is
a sensible difference between the results in the two tasks.</p>
          <p>We ask the participant models to express a judgment
on a 5-point Likert scale for both the
inclusiveness and the abstractness of the target noun or NP in each
sentence.</p>
        </sec>
      </sec>
      <sec id="sec-3-9">
        <title>Notes</title>
        <p>1: lablita.github.io/CONcreTEXT; 2: www.evalita.it</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Data description</title>
      <sec id="sec-4-1">
        <title>3.1. Origin of data</title>
        <p>
          The 20 target NPs of the dataset for the ABRICOT
Task are derived (and translated into Italian) from the set
of target nouns in the Situation Entities Corpus (SitEnt
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]), a collection of English sentences in which
specificity and genericity have been annotated with a binary
labelling scheme (i.e., GENERIC vs. NON-GENERIC).
Using those as seeds, representative Italian sentences have
been manually harvested from OpenSubtitles3 and
WikiHow.4 These are widely used sources: the first contains
the openly available subtitles of an extensive collection of
movies and TV series, while the second is a website
gathering articles on how to do a variety of different things.
        </p>
        <p>
          More specifically, the sentences have been extracted
from the Italian section of the multilingual The Human
Instruction Dataset [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], a structured collection of
WikiHow instructions pages, and from the Italian sub-corpus
of the OpenSubtitles2018 corpus [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>Our annotation protocol presents annotators with groups of
sentences (from a minimum of 4 to a maximum of 8), all
containing the same noun; each sentence is evaluated using a
continuous slider, from which values ranging from 0 to 1
are then extracted.</p>
        <p>After the annotation, the reliability of our data has
been computed using the Intraclass Correlation
Coefficient (ICC(k)). Human ratings have then been averaged,
and the resulting figures are used as the gold standard.</p>
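        <p>As a minimal illustration of this reliability check (a sketch, not the authors' code), the following derives an average-raters consistency coefficient, ICC(C,k), from the two-way ANOVA decomposition of a complete targets-by-raters matrix; the toy ratings are invented, and the paper's ICC(k) may refer to a different ICC variant.</p>

```python
# Sketch of an average-raters consistency ICC, ICC(C,k), for an
# n-targets x k-raters matrix of ratings in [0, 1]. Assumes a complete
# matrix (every rater scored every target); illustrative only.

def icc_k(ratings):
    """ratings: list of n rows (targets), each a list of k rater scores."""
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows

# Two raters that agree up to a constant offset are perfectly consistent:
print(icc_k([[0.1, 0.2], [0.5, 0.6], [0.9, 1.0]]))  # ~1.0
```

A consistency ICC ignores constant rater offsets; an agreement variant (ICC(A,k)) would also penalize the rater bias captured here by the between-raters sum of squares.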
        <p>An example of the samples present in the dataset
can be seen in Figure 1, where examples with the NPs
margherita (daisy), ambizione (ambition) and benzina
(gasoline) are reported. In particular, Figures 1c and 1d
show two examples containing the same token in
different contexts and report the effect of the context on
the abstractness and inclusiveness of the token.</p>
        <p>
          The data is stored on OSF [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].5
        </p>
        <p>3.2. Data format
The data is proposed in a tabular format, with 12 columns:
• ID: a unique identifier for the sample;
• target token: the focus of the dataset, to be
assigned an abstraction score in context;
• target lemma: the lemma of the target token;
• text: the sentence where the token appears;
• begin: the index of the first character of the token
in the sentence;
        </p>
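        <p>As a sketch of how the tabular format above could be consumed (the header spellings, the tab delimiter and the end-exclusive offset convention are assumptions for illustration; only the column list comes from the paper):</p>

```python
import csv
import io

# Hypothetical header matching the columns listed in Section 3.2.
HEADER = ["id", "target_token", "target_lemma", "text", "begin", "end",
          "domain", "inclusiveness_mean", "inclusiveness_std",
          "abstractness_mean", "abstractness_std"]

# One invented row in the assumed TSV layout.
SAMPLE_TSV = (
    "\t".join(HEADER) + "\n"
    + "1\tbenzina\tbenzina\tLa benzina è nella bottiglia del latte.\t3\t10"
      "\topensubtitles\t0.063\t0.050\t0.064\t0.050\n"
)

def load_samples(fh):
    """Read the tabular data into a list of dicts."""
    return list(csv.DictReader(fh, delimiter="\t"))

def check_span(row):
    # The character offsets should slice the target token out of the sentence
    # (assuming 0-based, end-exclusive offsets).
    return row["text"][int(row["begin"]):int(row["end"])] == row["target_token"]

rows = load_samples(io.StringIO(SAMPLE_TSV))
print(all(check_span(r) for r in rows))  # prints True
```

A consistency check like check_span is a cheap way to detect off-by-one mismatches between the offset convention of the file and the one assumed by downstream code.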
      </sec>
      <sec id="sec-4-2">
        <title>Notes: 3: https://www.opensubtitles.org; 4: https://www.wikihow.com; 5: https://osf.io/ja89x/?view_only=91d683c7399c45f9aa63f2b34cfe6617</title>
        <p>Abstractness Prompt:
Assegna un valore di astrazione da 1 a 5 alla parola
parola nel contesto della frase seguente: frase
Descrizione dei valori: 1 - La parola è estremamente
concreta (e.g. un cane specifico) 2 - La parola è
lievemente concreta (e.g. un cane di una certa razza) 3
- La parola è neutra (e.g. un cane tra tanti) 4 - La
parola è lievemente astratta (e.g. un cane è un
animale da compagnia) 5 - La parola è estremamente
astratta (e.g. il cane è un mammifero).</p>
        <p>(a) Prompt used for the Abstractness Task.</p>
        <p>Inclusiveness Prompt:
Assegna un valore di inclusività da 1 a 5 alla parola
parola nel contesto della frase seguente: frase
Descrizione dei valori: 1 - La parola è estremamente
specifica (e.g. un cane specifico) 2 - La parola è
lievemente specifica (e.g. un cane di una certa razza) 3
- La parola è neutra (e.g. un cane tra tanti) 4 - La
parola è lievemente inclusiva (e.g. un cane è un
animale da compagnia) 5 - La parola è estremamente
inclusiva (e.g. il cane è un mammifero)</p>
        <p>(b) Prompt used for the Inclusiveness Task.</p>
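        <p>The prompts above can be instantiated by filling in the "parola" (word) and "frase" (sentence) placeholders; the helper below is a hypothetical sketch of that step, including the assembly of in-context examples for the 3-shot setting (the function and the exact shot format are assumptions, not the authors' released code).</p>

```python
# Hypothetical instantiation of the Figure 2 prompts. The template here is
# abbreviated; the full value descriptions are given in Figure 2.
ABSTRACTNESS_TEMPLATE = (
    "Assegna un valore di astrazione da 1 a 5 alla parola {parola} "
    "nel contesto della frase seguente: {frase}"
)

def build_prompt(parola, frase, shots=()):
    """shots: (parola, frase, score) triples used as worked examples."""
    parts = [ABSTRACTNESS_TEMPLATE.format(parola=p, frase=f) + f"\nRisposta: {s}"
             for p, f, s in shots]
    parts.append(ABSTRACTNESS_TEMPLATE.format(parola=parola, frase=frase)
                 + "\nRisposta:")
    return "\n\n".join(parts)

prompt = build_prompt(
    "benzina", "La benzina è nella bottiglia del latte.",
    shots=[("margherita",
            "Le margherite di fronte alla mia casa saranno in piena fioritura.",
            1)],
)
print(prompt)
```

Ending the query with an open "Risposta:" cue lets the score be read directly off the model's next token, which is what the likelihood-based evaluation in Section 4 relies on.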
        <p>• end: the index of the last character of the token
in the sentence;
• domain: the source the token comes from;
• inclusiveness mean: the average inclusiveness
score assigned by the annotators;
• inclusiveness std: the standard deviation of the
inclusiveness scores;
• abstractness mean: the average abstractness score
assigned by the annotators;
• abstractness std: the standard deviation of the
abstractness scores.</p>
        <p>3.3. Example of prompts used for zero
and/or few shots
We use different prompts for the two tasks; they are
shown in Figure 2. We ask the model to directly output a
score from 1 to 5 specific to the task, and the prompt provides an
explanation for each point from 1 to 5, describing the
(approximate) meaning of assigning that score together with
a very high-level example. On top of the explanation,
we use 3-shot evaluation, as we found 0-shot to be difficult
for this dataset: without some reference examples, the
scoring becomes too variable.</p>
        <p>[Table 1: per-token mean and standard deviation of the abstractness and inclusiveness scores across contexts.]</p>
        <p>To investigate the relevance of the context in the
assessment of abstractness and inclusiveness, Table 1 shows
the mean and standard deviation of the abstractness and
inclusiveness of a token when varying the context, for all
the tokens in the dataset. The standard deviation is often
between 0.2 and 0.4 for a score bound between 0 and 1;
this shows significant sensitivity to context and
highlights how, even if tokens are repeated, each sample is
valuable on its own and provides different insights about
the token.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Metrics</title>
      <p>With a 3-shot approach and the prompts we used, all
the models we test appear able to understand the task,
and performance improves with these prompts when
compared to less specific ones.</p>
      <p>3.4. Detailed data statistics
The dataset contains 127 samples, each focused
on a token; the same token appears more than once in
the dataset, on average 6.35 times, in different contexts.
While the dataset contains 127 samples (a limited
amount), Figure 3 shows that both abstractness and
inclusiveness are well spread across the dataset and there
are samples for all values between 0 and 1. Interestingly,
while the two concepts under study are different,
the two scores are similarly distributed across the dataset,
but there is a higher number of samples with an abstractness
value around 0.8, while for inclusiveness the peak is
around 0.1, showing a partial anti-pattern between the
two scores and the concepts they are meant to distill.</p>
      <p>We measure Pearson correlation between the
abstractness and inclusiveness scores predicted by the model and
the gold human annotation. More specifically, since it
is challenging to have the models output a continuous
value for the abstractness or inclusiveness of a token in
context, we have them generate a discrete score from 1
to 5.</p>
      <p>The evaluation follows a likelihood-based approach:
after prompting the model to answer our question, we
pick the highest-likelihood token among 1, 2, 3, 4 and 5
as the model selection. After doing so for each sample,
we compute the Pearson correlation between these values
and a discretized version of the continuous scores
(discretization does not affect the results) assigned by
humans to the same samples.</p>
      <p>
        Table 2 shows our evaluation of three powerful,
English-first language models: mistral 7b [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], llama-3.1-8b and llama-3.1-70b [12]. Note that we use the
instruct version of all three models, and we omit it from
the names.</p>
      <p>Finally, we remark that we avoid testing models that
have been tuned for Italian, to let participants to the
Challenge measure the performance improvements provided
by Italian-focused training.</p>
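      <p>The likelihood-based scoring and the correlation metric described above can be sketched as follows (the log-likelihood values are invented for illustration; an actual run would read them from the model's output distribution over the candidate answer tokens):</p>

```python
import math

def pick_score(logprobs):
    """logprobs: dict mapping candidate tokens '1'..'5' to log-likelihoods.
    Returns the highest-likelihood score as an int."""
    return int(max(logprobs, key=logprobs.get))

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented per-sample log-likelihoods over the candidate tokens:
preds = [pick_score({"1": -4.0, "2": -1.2, "3": -0.7, "4": -2.5, "5": -5.0}),
         pick_score({"1": -0.3, "2": -1.9, "3": -3.0, "4": -4.1, "5": -6.0}),
         pick_score({"1": -5.5, "2": -3.3, "3": -2.0, "4": -0.9, "5": -1.1})]
gold = [3, 1, 5]  # discretized human scores for the same samples
print(preds, pearson(preds, gold))
```

Reading the score off the candidate-token likelihoods, rather than free-form generation, guarantees a valid answer in 1-5 for every sample, so the correlation is always computed over the full benchmark.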
      <p>These initial results show that the models are able to
capture both abstractness and inclusiveness, with the
exception of mistral 7b, which fails at understanding
inclusiveness (Pearson correlation is 0). At the same time, a
powerful LLM like llama-3.1-70b is not able to capture
the full complexity of the task, with a Pearson correlation
as low as 0.53 for abstractness and 0.41 for
inclusiveness. This shows that, while not alien to the concepts
of abstractness and inclusiveness, the models are still far
from fully understanding them.</p>
      <p>Assessing abstractness seems to be easier for LLMs,
since every model performs better in this task than in the
inclusiveness one. This is interesting, although hard to
interpret. One possible explanation is that abstractness is
a feature that is already made explicit by the choice of the
stimuli. Those words do show a variation between
different contexts of use, and this is one of the objectives of
such challenges with contextual information, but we can
also organize these nouns, out of context, discretely along
the axis of variation between abstract (e.g. ambizione –
ambition) and concrete (e.g. benzina – petrol). On the
contrary, inclusiveness cannot be resolved in any way
without considering a proper context; a word form by
itself does not convey any information about how
generic, and thus inclusive, the concept behind that lexical
label is. In light of this, we can hypothesize that when a
model has to deal with abstractness/concreteness, it may
not be able to rank two occurrences of the same word
in slightly different contexts, but it can surely judge as
more concrete or more abstract all the occurrences of one
target word with respect to those of another. But when it
comes to inclusiveness, i.e. evaluating whether one
occurrence is more specific or generic than another, the model
probably struggles more.</p>
      <p>Another possible interpretation of these unbalanced
results between abstractness and inclusiveness may depend
on the quantity of information about the two features:
while on abstractness/concreteness there are many
studies available online (on English and Italian, as well as on
other languages), inclusiveness (and also genericity/specificity,
which are the most used terms in the literature to refer
to this semantic feature) is an understudied topic. We
can thus hypothesize that knowledge about abstractness
is more formalised in training data, while inclusiveness
is not.</p>
      <p>Moreover, we confirm that also for this task larger
models perform better: llama-3.1-70b outperforms
llama-3.1-8b by a large margin, and training on more data
provides stronger models also in this case; indeed, llama-3.1
outperforms mistral 7b also by a large margin.</p>
      <p>5. Conclusions
We propose the ABRICOT benchmark, a dataset
composed of 127 humanly annotated samples to measure the
abstractness and inclusiveness of words in context. Each sample is
annotated by 5 - 7 raters, who rated it with a
continuous score from 0 to 1, from most concrete to most
abstract, and a second one, measured in the same way,
from least to most inclusive.</p>
      <p>We propose two Tasks, measuring abstractness and
inclusiveness, and we test three powerful language models
on our benchmark: mistral 7b, llama 3.1 8b and llama 3.1 70b.
We show that, when correlating their generations with the
human scores, the highest result on abstractness is 0.53,
achieved by the largest llama 3.1, while on inclusiveness the
correlation is bound by 0.41, showing that inclusiveness
is harder to understand than abstractness.</p>
      <p>We hope that the ABRICOT benchmark will foster
the development of new language models in Italian as
well as new benchmarks investigating phenomena with
a theoretical linguistic foundation such as abstractness
and inclusiveness.</p>
      <p>6. Limitations
The main limitation of the dataset is the low number
of samples it contains, in particular since samples can
repeat tokens and there are indeed only 20 unique ones.
This can limit the validity of the model assessment, since
the topics and vocabulary we cover are rather limited,
although we have shown that, in terms of both abstractness
and inclusiveness, the dataset is well spread and provides
a good coverage of both concepts.</p>
      <p>Acknowledgments
This work was partially supported by the Project PRIN
2022EPTPJ9 (WEMB – "Word EMBeddings: From
Cognitive Linguistics to Language Engineering, and Back"),
funded by the Italian Ministry of University and Research
(MUR), and the Project ERC-2021-STG-101039777
(ABSTRACTION), funded by the European Union. Views and
opinions expressed are however those of the author(s)
only and do not necessarily reflect those of the
European Union or the European Research Council Executive
Agency. Neither the European Union nor the granting
authority can be held responsible for them.</p>
      <sec id="sec-5-1">
        <title>References (continued)</title>
        <p>[11] A. Q. Jiang et al., Mistral 7B, 2023. URL: https://arxiv.org/abs/2310.06825. arXiv:2310.06825.</p>
        <p>[12] A. Dubey, A. Jauhri, et al., The Llama 3 Herd of Models, 2024. arXiv:2407.21783.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Krifka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Pelletier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Carlson</surname>
          </string-name>
          , A. ter Meulen, G. Chierchia, G. Link,
          <article-title>Genericity: An introduction</article-title>
          , in: G. N.
          <string-name>
            <surname>Carlson</surname>
            ,
            <given-names>F. J.</given-names>
          </string-name>
          <string-name>
            <surname>Pelletier</surname>
          </string-name>
          (Eds.),
          <source>The Generic Book</source>
          , University of Chicago Press,
          <year>1995</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Behrens</surname>
          </string-name>
          ,
          <article-title>Genericity from a cross-linguistic perspective</article-title>
          ,
          <source>Linguistics</source>
          (
          <year>2005</year>
          )
          <fpage>275</fpage>
          -
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Dahl</surname>
          </string-name>
          ,
          <article-title>The marking of the episodic/generic distinction in tense-aspect systems</article-title>
          , in: G. N.
          <string-name>
            <surname>Carlson</surname>
            ,
            <given-names>F. J.</given-names>
          </string-name>
          <string-name>
            <surname>Pelletier</surname>
          </string-name>
          (Eds.),
          <source>The Generic Book</source>
          , University of Chicago Press,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Chatzigoga</surname>
          </string-name>
          , Genericity,
          <source>in: The Oxford Handbook of Experimental Semantics and Pragmatics</source>
          , Oxford University Press,
          <year>2019</year>
          , pp.
          <fpage>156</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borazio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Francis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gili</surname>
          </string-name>
          , E. Musacchio,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scalena</surname>
          </string-name>
          ,
          <article-title>CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</article-title>
          ,
          <source>in: Proceedings of the 10th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ), Pisa, Italy, December 4 - December 6,
          <year>2024</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gregori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montefinese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Radicioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ravelli</surname>
          </string-name>
          , R. Varvara, CONcreTEXT@EVALITA2020:
          <article-title>The Concreteness in Context Task</article-title>
          ., in: EVALITA,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Palmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Sørensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pinkal</surname>
          </string-name>
          ,
          <article-title>Annotating genericity: a survey, a scheme, and a corpus</article-title>
          ,
          <source>in: Proceedings of the 9th Linguistic Annotation Workshop</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Chocron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pareti</surname>
          </string-name>
          ,
          <article-title>Vocabulary alignment for collaborative agents: a study with real-world multilingual how-to instructions</article-title>
          ,
          <source>in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, International Joint Conferences on Artificial Intelligence Organization</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>165</lpage>
          . URL: https://doi.org/10.24963/ijcai.2018/22. doi:10.24963/ijcai.2018/22.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kouylekov</surname>
          </string-name>
          ,
          <article-title>OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Calzolari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choukri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Declerck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hasida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Isahara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Maegaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mariani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mazo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Odijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Piperidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tokunaga</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          ),
          European Language Resources Association (ELRA), Miyazaki, Japan,
          <year>2018</year>
          . URL: https://aclanthology.org/L18-1275.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ravelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Puccetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bolognesi</surname>
          </string-name>
          ,
          <article-title>Abricot: Abstractness and inclusiveness in context</article-title>
          ,
          <year>2024</year>
          . URL: osf.io/ja89x. doi:10.17605/OSF.IO/JA89X.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>de las Casas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>