Paper: Vol-3878/128_calamita_long — ABRICOT - ABstRactness and Inclusiveness in COntexT: A CALAMITA Challenge
Authors: Giovanni Puccetti, Claudia Collacciani, Andrea Amelio Ravelli, Andrea Esuli, Marianna Bolognesi
PDF: https://ceur-ws.org/Vol-3878/128_calamita_long.pdf
dblp: https://dblp.org/rec/conf/clic-it/0002CREB24
ABRICOT - ABstRactness and Inclusiveness in COntexT: A CALAMITA Challenge

Giovanni Puccetti1,∗, Claudia Collacciani2, Andrea Amelio Ravelli3, Andrea Esuli1 and Marianna Marcella Bolognesi3

1 Istituto di Scienza e Tecnologia dell’Informazione “A. Faedo”
2 Independent researcher
3 ABSTRACTION Research Group – Università di Bologna


                                              Abstract
                                              The ABRICOT Task is designed to evaluate Italian language models on their ability to understand and assess the abstractness
                                              and inclusiveness of language, two nuanced features that humans naturally convey in everyday communication. Unlike
                                              binary categorizations such as abstract/concrete or inclusive/exclusive, these features exist on a continuous spectrum with
                                              varying degrees of intensity. The task is based on a manual collection of sentences that present the same noun phrase (NP) in
                                              different contexts, allowing its interpretation to vary between the extremes of abstractness and inclusiveness. This challenge
aims to verify how LLMs perceive subtle linguistic variations and their implications in natural language.

                                              Keywords
                                              Abstraction, Inclusiveness, Context, LLM evaluation, Italian Language Models



1. Challenge: Introduction and Motivation

The ability to convey both specific information (about individuals or events) and generalisations (about categories) with the same lexical item is one of the key features of natural languages. Consider the examples in 1:

   1. a) the lion escaped yesterday from the zoo.
      b) the lion is a predatory cat.

The noun phrase (NP) the lion can describe either a specific individual (1a) or the entire category of large African felines (1b); thus it expresses a variable degree of inclusiveness over the possible number of individuals to which the NP correctly applies in each sentence in which it occurs. This demonstrates how human language follows a principle of economy, enabling a one-to-many mapping between lexical labels and meanings.
   The syntactic form of the NP (definite, indefinite, or plural) does not provide sufficient information to discriminate between the two meanings, and we need to enlarge our focus to take into account the whole context in which the NP occurs [1]. This phenomenon can be observed in all languages [2], affecting nearly all nouns that can be used in referring expressions. Indeed, natural languages do not have explicit markers for generic NPs [3]; the genericity/specificity of an NP is derived from the meaning of the entire sentence. In other words, we cannot interpret language one word at a time; we need to consider the whole sentence or utterance as context to disambiguate and decipher the meaning of each single word composing it, and thus to understand the message conveyed through language.
   Generalizations about kinds and categories, as in 1b, are called generics and are fundamental to human cognition, because they allow us to conceptualize properties linked to categories, shaping how we perceive the world [4].

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
∗ Corresponding author.
Email: giovanni.puccetti@isti.cnr.it (G. Puccetti); claudia.collacciani2@unibo.it (C. Collacciani); andreamelio.ravelli@unibo.it (A. A. Ravelli); andrea.esuli@isti.cnr.it (A. Esuli); m.bolognesi@unibo.it (M. M. Bolognesi)
Web: https://gpucce.github.io/ (G. Puccetti); https://github.com/claudiacollacciani (C. Collacciani); https://www.unibo.it/sitoweb/andreaamelio.ravelli (A. A. Ravelli); https://esuli.it/ (A. Esuli); https://www.unibo.it/sitoweb/m.bolognesi (M. M. Bolognesi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

   Moreover, distinguishing between generic and non-generic meanings for abstract entities is less straightforward than for concrete ones, and for this reason evaluating the inclusiveness of an abstract noun or NP is even more challenging. Indeed, inclusiveness is not an exclusive feature of concrete-only entities. Consider the examples in 2:

   2. a) Colorless green ideas sleep furiously.
      b) Be less curious about people and more curious about ideas.

The concept behind the word idea always refers to an abstract entity, with slightly different grades of abstractness, but it shows a greater variation in terms of inclusiveness. The noun ideas in 2a includes only a restricted number of elements with respect to the universe




(a) Example of sample for the Margherita token.
    Token: Margherita
    Text: Le margherite di fronte alla mia casa saranno in piena fioritura.
    Abstractness: 0.177
    Inclusiveness: 0.187

(b) Example of sample for the Ambizione token.
    Token: Ambizione
    Text: La sua ambizione lo rovinerà.
    Abstractness: 0.478
    Inclusiveness: 0.083

(c) Example of sample for a more concrete Benzina token.
    Token: Benzina
    Text: La benzina è nella bottiglia del latte.
    Abstractness: 0.064
    Inclusiveness: 0.063

(d) Example of sample for a more abstract Benzina token.
    Token: Benzina
    Text: In Italia è disponibile la benzina a 95 ottani.
    Abstractness: 0.573
    Inclusiveness: 0.653

Figure 1: Examples from the abricot dataset.



of the ideas (namely, only colorless green ones), while the reference in 2b shows a higher level of inclusiveness, not distinguishing among them on the basis of their color.
   The ability to correctly distinguish, interpret and use the variability that natural language offers along these two graduated semantic features, abstractness and inclusiveness, is of paramount importance if we want to make talking machines which not only simulate language, but can also reason about natural language and the knowledge of the world it depicts.
   The CALAMITA special event [5] offers the possibility to challenge Large Language Models on their ability to understand the abstractness and inclusiveness of words, and to compare their behaviour with humans in judging Italian sentences. With this report we present the ABRICOT Task: ABstRactness and Inclusiveness in COntexT.

2. Challenge: Description

The ABRICOT Task aims to challenge Italian language models on their understanding of abstractness and inclusiveness, features that we, as humans, naturally express in everyday language. These features are not discrete binary dichotomies like abstract/concrete or inclusive/exclusive; instead, they shade along a continuous spectrum, with the two extremes at opposite ends. The collection of sentences in this Task shows the same NP in a variety of different contexts, so that its meaning can oscillate between the extremes of both the axes of abstractness and inclusiveness.
   We ask the participant models to express a judgment on a 5-point Likert scale for both the features of inclusiveness and abstractness of the target noun or NP in each sentence.
   This task has some similarities with the CONcreTEXT Task1 [6], which was presented at the 2020 edition of EVALITA.2 Both tasks focus on the abstractness/concreteness of target words in natural Italian sentences, asking for judgments by means of Likert scales, but the ABRICOT Task goes beyond it by also including the inclusiveness feature of the targets. Moreover, for the construction of this dataset we considered exclusively nouns or NPs as targets, and in order to limit to the minimum the impact of the variability deriving from different semantic roles or syntactic functions, all the sentences have been selected with the target noun as subject of the main verb.

1 lablita.github.io/CONcreTEXT
2 www.evalita.it

2.1. Tasks

We propose two separate tasks for this benchmark, Task 1: abstractness and Task 2: inclusiveness. The two tasks are formally identical: we use the same metric and the same samples. However, they measure two different scores, respectively abstractness_mean and inclusiveness_mean, the first meant to measure the abstractness of the word in context and the second its inclusiveness.
   Since both these concepts are evident but fuzzy also for humans, we do not expect language models to have a perfect understanding of them, and we limit our metrics to regression ones. Despite the tasks being very similar from a formal perspective, we show that models' performance on these two tasks varies and there is a sensible difference between the results in the two tasks.
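As a sketch, the task-specific Italian queries used to elicit the 5-point judgment can be built from simple templates; the wording below follows the prompts reported in Figure 2, while the function and variable names are our own illustration.

```python
# Illustrative sketch of prompt construction for the two tasks; the template
# wording follows the prompts in Figure 2, the helper names are hypothetical.
ABSTRACTNESS_TEMPLATE = (
    "Assegna un valore di astrazione da 1 a 5 alla parola {parola} "
    "nel contesto della frase seguente: {frase}"
)
INCLUSIVENESS_TEMPLATE = (
    "Assegna un valore di inclusività da 1 a 5 alla parola {parola} "
    "nel contesto della frase seguente: {frase}"
)

def build_prompt(task: str, token: str, sentence: str) -> str:
    """Return the task-specific query for one (token, sentence) sample."""
    template = (
        ABSTRACTNESS_TEMPLATE if task == "abstractness" else INCLUSIVENESS_TEMPLATE
    )
    return template.format(parola=token, frase=sentence)

prompt = build_prompt(
    "abstractness", "benzina", "La benzina è nella bottiglia del latte."
)
```

In the actual prompts the query is followed by a description of the five scale values (see Figure 2) and by the few-shot examples discussed in Section 3.3.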
3. Data description

3.1. Origin of data

The 20 target NPs of the dataset for the ABRICOT Task are derived (and translated into Italian) from the set of target nouns in the Situation Entities Corpus (SitEnt [7]), a collection of English sentences in which specificity and genericity have been annotated with a binary labelling scheme (i.e., GENERIC vs. NON-GENERIC). Using those as seeds, representative Italian sentences have been manually harvested from OpenSubtitles3 and WikiHow.4 These are widely used sources: the first contains the openly available subtitles of an extensive collection of movies and TV series, while the second is a website gathering articles on how to do a variety of different things.
   More specifically, the sentences have been extracted from the Italian section of the multilingual The Human Instruction Dataset [8], a structured collection of WikiHow instruction pages, and from the Italian sub-corpus of the OpenSubtitles2018 corpus [9].
   Our protocol proposes to the annotators groups of sentences (from a minimum of 4 to a maximum of 8), all containing the same noun, each to be evaluated using a continuous slider, from which values ranging from 0 to 1 are then extracted.
   After the annotation, the reliability of our data has been computed using the Intraclass Correlation Coefficient (ICC(k)). Human ratings have then been averaged, and the resulting figures are used as the gold standard.
   An example of the samples present in the dataset can be seen in Figure 1, where examples with the NPs margherita (daisy), ambizione (ambition) and benzina (gasoline) are reported. In particular, Figures 1c and 1d show two examples containing the same token but in different contexts, and report the effect of the context on the abstractness and inclusiveness of the token.
   The data is stored on OSF [10].5

3 https://www.opensubtitles.org
4 https://www.wikihow.com
5 https://osf.io/ja89x/?view_only=91d683c7399c45f9aa63f2b34cfe6617

3.2. Data format

The data is proposed in a tabular format, with 12 columns:

   • ID: a unique identifier for the sample;
   • target token: the focus of the dataset, to be assigned an abstraction score in context;
   • target lemma: the lemma of the target token;
   • text: the sentence where the token appears;
   • begin: the index of the first character of the token in the sentence;
   • end: the index of the last character of the token in the sentence;
   • domain: the source where the token comes from;
   • inclusiveness mean: the average inclusiveness score assigned by the annotators;
   • inclusiveness std: the standard deviation of the inclusiveness scores;
   • abstractness mean: the average abstractness score assigned by the annotators;
   • abstractness std: the standard deviation of the abstractness scores.

3.3. Example of prompts used for zero or/and few shots

We use different prompts for the two tasks; they are shown in Figure 2. We ask the model to directly output a score from 1 to 5 specific to the task. We then propose an explanation for each point from 1 to 5, explaining the (approximate) meaning of assigning that score together with a very high-level example. On top of the explanation, we use 3-shot evaluation; we found 0-shot to be too difficult for this dataset, as without some reference example the scoring becomes too variable.

Abstractness Prompt:
Assegna un valore di astrazione da 1 a 5 alla parola parola nel contesto della frase seguente: frase
Descrizione dei valori: 1 - La parola è estremamente concreta (e.g. un cane specifico) 2 - La parola è lievemente concreta (e.g. un cane di una certa razza) 3 - La parola è neutra (e.g. un cane tra tanti) 4 - La parola è lievemente astratta (e.g. un cane è un animale da compagnia) 5 - La parola è estremamente astratta (e.g. il cane è un mammifero).

(a) Prompt used for the Abstractness Task.

Inclusiveness Prompt:
Assegna un valore di inclusività da 1 a 5 alla parola parola nel contesto della frase seguente: frase
Descrizione dei valori: 1 - La parola è estremamente specifica (e.g. un cane specifico) 2 - La parola è lievemente specifica (e.g. un cane di una certa razza) 3 - La parola è neutra (e.g. un cane tra tanti) 4 - La parola è lievemente inclusiva (e.g. un cane è un animale da compagnia) 5 - La parola è estremamente inclusiva (e.g. il cane è un mammifero)

(b) Prompt used for the Inclusiveness Task.

Figure 2: Prompts used for the evaluation.
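The aggregation described in Section 3.1 (per-sample slider ratings averaged into gold scores, then per-token statistics across contexts as in Table 1) can be sketched as follows; the toy ratings and helper names are our own illustration, not the actual annotation data.

```python
from statistics import mean, stdev

# Sketch of the aggregation in Section 3.1: each sample collects several
# annotator ratings in [0, 1]; the per-sample average is the gold score,
# and the per-token mean/std across contexts gives statistics like Table 1.
# The ratings below are invented for illustration.
ratings = {
    # (target lemma, sample id) -> one rating per annotator
    ("benzina", 1): [0.05, 0.08, 0.06],
    ("benzina", 2): [0.55, 0.60, 0.58],
}

# Gold score per sample: the average of the annotators' ratings.
gold = {key: mean(vals) for key, vals in ratings.items()}

# Per-token statistics across contexts (cf. Table 1).
def token_stats(token: str) -> tuple:
    scores = [g for (lemma, _), g in gold.items() if lemma == token]
    return mean(scores), stdev(scores)

benzina_mean, benzina_std = token_stats("benzina")
```

The same per-token mean/std computed over all 20 targets yields the figures reported in Table 1; the inter-annotator reliability itself is assessed separately with ICC(k), which is not sketched here.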
With a 3-shot approach and the prompts we used, all models we test appear to be able to understand the task, and performance improves with these prompts when compared to less specific ones.

3.4. Detailed data statistics

The dataset contains 127 samples, each focused on a token; the same token appears more than once in the dataset, on average 6.35 times, in different contexts.
   While the dataset contains 127 samples (a limited amount), Figure 3 shows that both abstractness and inclusiveness are well spread across the dataset and there are samples for all values between 0 and 1. Interestingly, while the two concepts under study are different, the two scores are similarly distributed across the dataset, but there is a higher number of samples with an abstractness value around 0.8, while for inclusiveness the peak is around 0.1, showing a partial anti-pattern between the two scores and the concepts they are meant to distill.
   To investigate the relevance of the context in the assessment of abstraction and inclusiveness, Table 1 shows the mean and standard deviation of the abstractness and inclusiveness of a token when varying context, for all the tokens in the dataset. The standard deviation is often between 0.2 and 0.4 for a score bound between 0 and 1; this shows significant sensitivity to context and highlights how, even if tokens are repeated, each sample is valuable on its own and provides different insights about the token.

Figure 3: Distribution of the abstractness and inclusiveness scores in the dataset.

                      ambizione  benzina  bicchiere  bici   bottiglia  cameriere  coscienza  effetto  farina  giardino
 abstractness   mean    0.65      0.42      0.51     0.52     0.34       0.47       0.81      0.57     0.46     0.50
                std     0.18      0.26      0.19     0.27     0.26       0.22       0.06      0.24     0.26     0.29
 inclusiveness  mean    0.41      0.48      0.52     0.58     0.35       0.42       0.53      0.43     0.48     0.54
                std     0.35      0.34      0.26     0.30     0.32       0.30       0.28      0.29     0.32     0.34

                      ironia   margherita  mucca   orchestra  orologio  ospedale   patata    persona  saggezza  strategia
 abstractness   mean    0.77      0.38      0.43     0.43       0.44      0.63       0.47      0.55     0.72     0.66
                std     0.14      0.22      0.25     0.29       0.27      0.22       0.27      0.27     0.13     0.12
 inclusiveness  mean    0.38      0.36      0.45     0.32       0.47      0.71       0.56      0.41     0.49     0.51
                std     0.29      0.36      0.38     0.31       0.35      0.28       0.31      0.30     0.33     0.33

Table 1: Mean and standard deviation of the abstractness and inclusiveness for each token across all different possible contexts.

                  mistral 7b   llama-3.1-8b   llama-3.1-70b
 abstractness        0.22          0.30            0.53
 inclusiveness       0.00          0.30            0.41

Table 2: Pearson correlation between the model predictions and the human annotations for abstractness and inclusiveness scores, measured for three different models: mistral 7b, llama-3.1-8b and llama-3.1-70b.

4. Metrics

We measure the Pearson correlation between the abstractness and inclusiveness scores predicted by the model and the gold human annotation. More specifically, since it is challenging to have the models output a continuous value for the abstractness or inclusiveness of a token in context, we have them generate a discrete score from 1 to 5.
   The evaluation follows a likelihood-based approach: after prompting the model to answer our question, we pick the highest-likelihood token among 1, 2, 3, 4 and 5 and take that as the model's selection. After doing so for each sample, we compute the Pearson correlation between these values and a discretized version of the continuous scores (discretization does not affect the results) assigned by humans to the same samples.
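The likelihood-based scoring and the correlation metric can be sketched as follows; the log-likelihood values, the particular discretization, and the helper names are our own mock illustration (in practice the per-token log-likelihoods come from the evaluated model).

```python
# Sketch of the likelihood-based evaluation in Section 4: for each sample we
# read the log-likelihood the model assigns to the candidate answer tokens
# "1".."5" and keep the most likely one; the Pearson correlation is then
# computed against the discretized human scores. Values below are invented.

def select_score(logprobs: dict) -> int:
    """Pick the candidate token ('1'..'5') with the highest log-likelihood."""
    return int(max(logprobs, key=logprobs.get))

def discretize(score: float, bins: int = 5) -> int:
    """Map a continuous human score in [0, 1] to the 1..5 scale (one possible
    discretization; the paper notes discretization does not affect results)."""
    return min(int(score * bins) + 1, bins)

def pearson(xs, ys) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Mock per-sample log-likelihoods of the five answer tokens.
mock_logprobs = [
    {"1": -2.1, "2": -0.3, "3": -1.0, "4": -3.2, "5": -4.0},  # most likely: 2
    {"1": -3.0, "2": -2.5, "3": -0.2, "4": -1.1, "5": -2.9},  # most likely: 3
    {"1": -4.2, "2": -3.1, "3": -1.5, "4": -0.4, "5": -0.9},  # most likely: 4
]
gold_scores = [0.25, 0.50, 0.80]  # continuous human means in [0, 1]

preds = [select_score(lp) for lp in mock_logprobs]
golds = [discretize(s) for s in gold_scores]
r = pearson(preds, golds)
```

Restricting the comparison to the five candidate tokens makes the model's answer well defined even when it would otherwise generate free-form text.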
   Table 2 shows our evaluation of three powerful, English-first language models: mistral 7b [11], llama-3.1-8b and llama-3.1-70b [12]. Note that we use the instruct version of all three models and omit it from the names.
   These initial results show that the models are able to capture both abstractness and inclusiveness, with the exception of mistral 7b, which fails at understanding inclusiveness (Pearson correlation is 0). At the same time, a powerful LLM like llama-3.1-70b is not able to capture the full complexity of the task, with a Pearson correlation as low as 0.53 for abstractness and 0.41 for inclusiveness. This shows that while not alien to the concepts of abstractness and inclusiveness, the models are still far from fully understanding them.
   Assessing abstractness seems to be easier for LLMs, since every model performs better in this task than in the inclusiveness one. This is interesting although hard to interpret. One possible explanation is that abstractness is a feature that is already made explicit by the choice of the stimuli. Those words do show a variation between different contexts of use, and this is one of the objectives of such challenges with contextual information, but we can also organize these nouns, out of context, discretely along the axis of variation between abstract (e.g. ambizione – ambition) and concrete (e.g. benzina – petrol). On the contrary, inclusiveness cannot be resolved in any way without considering a proper context; a word form by itself does not convey any information about how generic, and thus inclusive, the concept behind that lexical label is. In light of this, we can hypothesize that when a model has to deal with abstractness/concreteness, it may not be able to rank two occurrences of the same word in slightly different contexts, but it can certainly judge all the occurrences of one target word as more concrete or more abstract with respect to those of another. When it comes to inclusiveness, that is, evaluating whether one occurrence is more specific or generic than another, the model probably struggles more.
   Another possible interpretation of these unbalanced results between abstractness and inclusiveness may depend on the quantity of information about the two features: while on abstractness/concreteness there are many studies available online (on English and Italian, as well as on other languages), inclusiveness (and also genericity/specificity, which are the most used terms in the literature to refer to this semantic feature) is an understudied topic. We can thus hypothesize that knowledge about abstractness is more formalised in training data, while inclusiveness is not.
   Moreover, we confirm that also for this task larger models perform better: llama-3.1-70b outperforms llama-3.1-8b by a large margin, and training on more data provides stronger models also in this case; indeed, llama-3.1 outperforms mistral 7b also by a large margin.
   Finally, we remark that we avoid testing models that have been tuned for Italian, to let participants in the Challenge measure the performance improvements provided by Italian-focused training.

5. Conclusions

We propose the ABRICOT benchmark, a dataset composed of 127 human-annotated samples to measure the abstractness and inclusiveness of words. Each sample is annotated by 5–7 raters, who ranked it with a continuous score from 0 to 1 from most concrete to most abstract, and a second score, measured in the same way, from least to most inclusive.
   We propose two Tasks, measuring abstractness and inclusiveness, and we test three powerful language models on our benchmark: mistral 7b, llama-3.1-8b and llama-3.1-70b. We show that when correlating their generations with the human scores, the highest result on abstractness is 0.53, achieved by the largest llama-3.1, while on inclusiveness the correlation is bounded by 0.41, showing that inclusiveness is harder to understand than abstractness.
   We hope that the ABRICOT benchmark will foster the development of new language models in Italian as well as new benchmarks investigating phenomena with a theoretical linguistic foundation, such as abstractness and inclusiveness.

6. Limitations

The main limitation of the dataset is the low number of samples it contains, in particular since samples can repeat tokens and there are indeed only 20 unique ones. This can limit the validity of the model assessment, since the topics and vocabulary we cover are rather limited, although we have shown that in terms of both abstractness and inclusiveness the dataset is well spread and provides good coverage of both concepts.

Acknowledgments

This work was partially supported by the Project PRIN 2022EPTPJ9 (WEMB – “Word EMBeddings: From Cognitive Linguistics to Language Engineering, and Back”), funded by the Italian Ministry of University and Research (MUR), and the Project ERC-2021-STG-101039777 (ABSTRACTION), funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
References

[1] M. Krifka, F. J. Pelletier, G. Carlson, A. ter Meulen, G. Chierchia, G. Link, Genericity: An introduction, in: G. N. Carlson, F. J. Pelletier (Eds.), The Generic Book, University of Chicago Press, 1995, pp. 1–124.
[2] L. Behrens, Genericity from a cross-linguistic perspective, Linguistics (2005) 275–344.
[3] O. Dahl, The marking of the episodic/generic distinction in tense-aspect systems, in: G. N. Carlson, F. J. Pelletier (Eds.), The Generic Book, University of Chicago Press, 1995.
[4] D. L. Chatzigoga, Genericity, in: The Oxford Handbook of Experimental Semantics and Pragmatics, Oxford University Press, 2019, pp. 156–177.
[5] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[6] L. Gregori, M. Montefinese, D. P. Radicioni, A. A. Ravelli, R. Varvara, CONcreTEXT@EVALITA2020: The Concreteness in Context Task, in: EVALITA, 2020.
[7] A. Friedrich, A. Palmer, M. P. Sørensen, M. Pinkal, Annotating genericity: a survey, a scheme, and a corpus, in: Proceedings of the 9th Linguistic Annotation Workshop, 2015, pp. 21–30.
[8] P. Chocron, P. Pareti, Vocabulary alignment for collaborative agents: a study with real-world multilingual how-to instructions, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, International Joint Conferences on Artificial Intelligence Organization, 2018, pp. 159–165. URL: https://doi.org/10.24963/ijcai.2018/22. doi:10.24963/ijcai.2018/22.
[9] P. Lison, J. Tiedemann, M. Kouylekov, OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora, in: N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018. URL: https://aclanthology.org/L18-1275.
[10] A. A. Ravelli, G. Puccetti, M. Bolognesi, Abricot: Abstractness and inclusiveness in context, 2024. URL: osf.io/ja89x. doi:10.17605/OSF.IO/JA89X.
[11] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7b, 2023. URL: https://arxiv.org/abs/2310.06825. arXiv:2310.06825.
[12] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, et al., The Llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.