=Paper=
{{Paper
|id=Vol-3878/128_calamita_long
|storemode=property
|title=ABRICOT - ABstRactness and Inclusiveness in COntexT: A CALAMITA Challenge
|pdfUrl=https://ceur-ws.org/Vol-3878/128_calamita_long.pdf
|volume=Vol-3878
|authors=Giovanni Puccetti,Claudia Collacciani,Andrea Amelio Ravelli,Andrea Esuli,Marianna Bolognesi
|dblpUrl=https://dblp.org/rec/conf/clic-it/0002CREB24
}}
==ABRICOT - ABstRactness and Inclusiveness in COntexT: A CALAMITA Challenge==
Giovanni Puccetti¹,∗, Claudia Collacciani², Andrea Amelio Ravelli³, Andrea Esuli¹ and Marianna Marcella Bolognesi³

¹ Istituto di Scienza e Tecnologia dell'Informazione "A. Faedo"
² Independent researcher
³ ABSTRACTION Research Group – Università di Bologna

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, December 4-6, 2024, Pisa, Italy
∗ Corresponding author.
Email: giovanni.puccetti@isti.cnr.it (G. Puccetti); claudia.collacciani2@unibo.it (C. Collacciani); andreamelio.ravelli@unibo.it (A. A. Ravelli); andrea.esuli@isti.cnr.it (A. Esuli); m.bolognesi@unibo.it (M. M. Bolognesi)
Web: https://gpucce.github.io/ (G. Puccetti); https://github.com/claudiacollacciani (C. Collacciani); https://www.unibo.it/sitoweb/andreaamelio.ravelli (A. A. Ravelli); https://esuli.it/ (A. Esuli); https://www.unibo.it/sitoweb/m.bolognesi (M. M. Bolognesi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Abstract
The ABRICOT Task is designed to evaluate Italian language models on their ability to understand and assess the abstractness
and inclusiveness of language, two nuanced features that humans naturally convey in everyday communication. Unlike
binary categorizations such as abstract/concrete or inclusive/exclusive, these features exist on a continuous spectrum with
varying degrees of intensity. The task is based on a manual collection of sentences that present the same noun phrase (NP) in
different contexts, allowing its interpretation to vary between the extremes of abstractness and inclusiveness. This challenge
aims to verify how LLMs perceive subtle linguistic variations and their implications in natural language.
Keywords
Abstraction, Inclusiveness, Context, LLM evaluation, Italian Language Models
1. Challenge: Introduction and Motivation

The ability to convey both specific information (about individuals or events) and generalisations (about categories) with the same lexical item is one of the key features of natural languages. Consider the examples in 1:

1. a) the lion escaped yesterday from the zoo.
   b) the lion is a predatory cat.

The noun phrase (NP) the lion can describe either a specific individual (1a) or the entire category of large African felines (1b); thus it expresses a variable degree of inclusiveness of the possible number of individuals to which the NP correctly applies in each sentence in which it occurs. This demonstrates how human language follows a principle of economy, enabling a one-to-many mapping between lexical labels and meanings.

The syntactic form of the NP (definite, indefinite, or plural) does not provide sufficient information to discriminate between the two meanings, and we need to enlarge our focus to take into account the whole context in which the NP occurs [1]. This phenomenon can be observed in all languages [2], affecting nearly all nouns that can be used in referring expressions. Indeed, natural languages do not have explicit markers for generic NPs [3]; the genericity/specificity of an NP is derived from the meaning of the entire sentence. In other words, we cannot interpret language one word at a time; we need to consider the whole sentence or utterance as context to disambiguate and decipher the meaning of each single word composing it, and thus to understand the message conveyed through language.

Generalizations about kinds and categories, as in 1b, are called generics and are fundamental to human cognition, because they allow us to conceptualize properties linked to categories, shaping how we perceive the world [4].

Moreover, distinguishing between generic and non-generic meanings is less straightforward for abstract entities than for concrete ones, and for this reason evaluating the inclusiveness of an abstract noun or NP is even more challenging. Indeed, inclusiveness is not an exclusive feature of concrete entities. Consider the examples in 2:

2. a) Colorless green ideas sleep furiously.
   b) Be less curious about people and more curious about ideas.

The concept behind the word idea always refers to an abstract entity, with slightly different degrees of abstractness, but it shows a greater variation in terms of inclusiveness. The noun ideas in 2a includes only a restricted number of elements with respect to the universe of the ideas (namely, only colorless green ones), while the reference in 2b shows a higher level of inclusiveness, not distinguishing among them on the basis of their color.
Figure 1: Examples from the abricot dataset.
(a) Token: Margherita. Text: "Le margherite di fronte alla mia casa saranno in piena fioritura." Abstractness: 0.177. Inclusiveness: 0.187.
(b) Token: Ambizione. Text: "La sua ambizione lo rovinerà." Abstractness: 0.478. Inclusiveness: 0.083.
(c) Token: Benzina (more concrete use). Text: "La benzina è nella bottiglia del latte." Abstractness: 0.064. Inclusiveness: 0.063.
(d) Token: Benzina (more abstract use). Text: "In Italia è disponibile la benzina a 95 ottani." Abstractness: 0.573. Inclusiveness: 0.653.
The ability to distinguish, interpret and correctly use the variability that natural language offers along these two graduated semantic features, abstractness and inclusiveness, is of paramount importance if we want to make talking machines which not only simulate language, but can also reason about natural language and the knowledge of the world it depicts.

The CALAMITA special event [5] offers the possibility to challenge Large Language Models on their ability to understand the abstractness and inclusiveness of words, and to compare their behaviour with that of humans in judging Italian sentences. With this report we present the ABRICOT Task: ABstRactness and Inclusiveness in COntexT.

2. Challenge: Description

The ABRICOT Task aims to challenge Italian language models on their understanding of abstractness and inclusiveness, features that we, as humans, naturally express in everyday language. These features are not discrete binary dichotomies like abstract/concrete or inclusive/exclusive; instead, they shade into one another on a continuous spectrum, with the two extremes at opposite ends. The collection of sentences in this Task shows the same NP in a variety of different contexts, so that its meaning can oscillate between the extremes of both the axis of abstractness and that of inclusiveness.

We ask the participant models to express a judgment on a 5-point Likert scale for both the inclusiveness and the abstractness of the target noun or NP in each sentence.

This task has some similarities with the CONcreTEXT Task¹ [6], which was presented at the 2020 edition of EVALITA.² Both tasks focus on the abstractness/concreteness of target words in natural Italian sentences, asking for judgments by means of Likert scales, but the ABRICOT Task goes beyond it by also including the inclusiveness feature of the targets. Moreover, for the construction of this dataset we considered exclusively nouns or NPs as targets, and in order to minimize the impact of the variability deriving from different semantic roles or syntactic functions, all the sentences have been selected with the target noun as subject of the main verb.

2.1. Tasks

We propose two separate tasks for this benchmark, Task 1: abstractness and Task 2: inclusiveness. The two tasks are formally identical: we use the same metric and the same samples, but they measure two different scores, respectively abstractness_mean and inclusiveness_mean, the first meant to measure the abstractness of the word in context and the second its inclusiveness.

Since both these concepts are evident but fuzzy also for humans, we do not expect language models to have a perfect understanding of them, and we limit our metrics to regression ones. Despite the tasks being very similar from a formal perspective, we show that models' performance varies between the two tasks, with a sensible difference between the results.

¹ lablita.github.io/CONcreTEXT
² www.evalita.it
3. Data description

3.1. Origin of data

The 20 target NPs of the dataset for the ABRICOT Task are derived (and translated into Italian) from the set of target nouns in the Situation Entities Corpus (SitEnt [7]), a collection of English sentences in which specificity and genericity have been annotated with a binary labelling scheme (i.e., GENERIC vs. NON-GENERIC). Using those as seeds, representative Italian sentences have been manually harvested from OpenSubtitles³ and WikiHow.⁴ These are widely used sources: the first contains the openly available subtitles of an extensive collection of movies and TV series, while the second is a website gathering articles on how to do a variety of different things. More specifically, the sentences have been extracted from the Italian section of the multilingual Human Instruction Dataset [8], a structured collection of WikiHow instruction pages, and from the Italian sub-corpus of the OpenSubtitles2018 corpus [9].

Our protocol proposes to the annotators groups of sentences (from a minimum of 4 to a maximum of 8), all containing the same noun, each to be evaluated using a continuous slider, from which values ranging from 0 to 1 are then extracted. After the annotation, the reliability of our data has been computed using the Intraclass Correlation Coefficient (ICC(k)). Human ratings have then been averaged, and the resulting figures are used as gold standard.

An example of the samples present in the dataset can be seen in Figure 1, where examples with the NPs margherita (daisy), ambizione (ambition) and benzina (gasoline) are reported. In particular, Figures 1c and 1d show two examples containing the same token in different contexts and report the effect of the context on the abstractness and inclusiveness of the token. The data is stored on OSF [10].⁵

³ https://www.opensubtitles.org
⁴ https://www.wikihow.com
⁵ https://osf.io/ja89x/?view_only=91d683c7399c45f9aa63f2b34cfe6617
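The reliability computation described above can be illustrated with a short sketch. This is a minimal example, not the authors' actual pipeline: it assumes ratings stored in long format with hypothetical column names, and uses the `intraclass_corr` function of the `pingouin` package, whose ICC1k/ICC2k/ICC3k rows correspond to the average-of-k-raters reliability usually denoted ICC(k).

```python
# Minimal sketch of the reliability check (hypothetical column names):
# each row is one slider rating of one sentence by one annotator.
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "sentence_id": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "annotator":   ["a1", "a2", "a3"] * 3,
    "score":       [0.10, 0.15, 0.20, 0.70, 0.65, 0.80, 0.40, 0.35, 0.45],
})

# Average-rater reliability: the ICC1k/ICC2k/ICC3k rows of pingouin's output.
icc = pg.intraclass_corr(data=ratings, targets="sentence_id",
                         raters="annotator", ratings="score")
print(icc[icc["Type"].str.endswith("k")])

# Gold standard: per-sentence average of the human ratings, as in the paper.
gold = ratings.groupby("sentence_id")["score"].mean()
print(gold)
```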
3.2. Data format

The data is proposed in a tabular format, with 12 columns (a loading sketch follows the list):

• ID: a unique identifier for the sample;
• target token: the focus of the dataset, to be assigned an abstractness score in context;
• target lemma: the lemma of the target token;
• text: the sentence where the token appears;
• begin: the index of the first character of the token in the sentence;
• end: the index of the last character of the token in the sentence;
• domain: the source the token comes from;
• inclusiveness mean: the average inclusiveness score assigned by the annotators;
• inclusiveness std: the standard deviation of the inclusiveness scores;
• abstractness mean: the average abstractness score assigned by the annotators;
• abstractness std: the standard deviation of the abstractness scores.
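A minimal sketch of how such a table can be loaded and the character offsets used; the file name, separator, and exact column spellings are assumptions based on the description above, not the released schema.

```python
# Sketch of loading the ABRICOT table; file name, separator and exact
# column spellings are assumptions based on the format described above.
import pandas as pd

df = pd.read_csv("abricot.tsv", sep="\t")

# The begin/end offsets should slice the target token out of the sentence.
row = df.iloc[0]
print(row["text"][row["begin"]:row["end"]], "==", row["target token"])

print(row[["target lemma", "domain",
           "abstractness mean", "abstractness std",
           "inclusiveness mean", "inclusiveness std"]])
```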
3.3. Example of prompts used for zero and/or few shots

We use different prompts for the two tasks; they are shown in Figure 2. We ask the model to directly output a score from 1 to 5 specific to the task, and we provide an explanation for each point from 1 to 5, describing the (approximate) meaning of assigning that score together with a very high-level example. On top of the explanation, we use 3-shot evaluation, as we found 0-shot to be difficult for this dataset: without some reference example, the scoring becomes too variable.

With a 3-shot approach and the prompts we used, all models we test appear to be able to understand the task, and performance improves with these prompts when compared to less specific ones.

Abstractness Prompt:
Assegna un valore di astrazione da 1 a 5 alla parola ⟨parola⟩ nel contesto della frase seguente: ⟨frase⟩
Descrizione dei valori: 1 - La parola è estremamente concreta (e.g. un cane specifico) 2 - La parola è lievemente concreta (e.g. un cane di una certa razza) 3 - La parola è neutra (e.g. un cane tra tanti) 4 - La parola è lievemente astratta (e.g. un cane è un animale da compagnia) 5 - La parola è estremamente astratta (e.g. il cane è un mammifero).

(a) Prompt used for the Abstractness Task.

Inclusiveness Prompt:
Assegna un valore di inclusività da 1 a 5 alla parola ⟨parola⟩ nel contesto della frase seguente: ⟨frase⟩
Descrizione dei valori: 1 - La parola è estremamente specifica (e.g. un cane specifico) 2 - La parola è lievemente specifica (e.g. un cane di una certa razza) 3 - La parola è neutra (e.g. un cane tra tanti) 4 - La parola è lievemente inclusiva (e.g. un cane è un animale da compagnia) 5 - La parola è estremamente inclusiva (e.g. il cane è un mammifero).

(b) Prompt used for the Inclusiveness Task.

Figure 2: Prompts used for the evaluation.
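To make the few-shot setup concrete, here is a sketch of how a 3-shot abstractness prompt could be assembled from the Figure 2a template; the template wording is abridged, and the three demonstrations are invented for illustration, not the ones used in the official runs.

```python
# Sketch of 3-shot prompt assembly for the abstractness task. TEMPLATE
# abridges Figure 2a; the demonstrations below are invented examples.
TEMPLATE = (
    "Assegna un valore di astrazione da 1 a 5 alla parola {parola} "
    "nel contesto della frase seguente: {frase}\n"
    "Descrizione dei valori: 1 - La parola è estremamente concreta ... "
    "5 - La parola è estremamente astratta."
)

SHOTS = [  # (word, sentence, Likert score); illustrative only
    ("benzina", "La benzina è nella bottiglia del latte.", 1),
    ("cane", "Un cane abbaiava in lontananza.", 3),
    ("saggezza", "La saggezza si conquista con gli anni.", 5),
]

def build_prompt(word: str, sentence: str) -> str:
    parts = [TEMPLATE.format(parola=w, frase=s) + f"\nRisposta: {k}"
             for w, s, k in SHOTS]
    parts.append(TEMPLATE.format(parola=word, frase=sentence) + "\nRisposta:")
    return "\n\n".join(parts)

print(build_prompt("ambizione", "La sua ambizione lo rovinerà."))
```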
Token        Abstractness       Inclusiveness
             mean    std        mean    std
ambizione    0.65    0.18       0.41    0.35
benzina      0.42    0.26       0.48    0.34
bicchiere    0.51    0.19       0.52    0.26
bici         0.52    0.27       0.58    0.30
bottiglia    0.34    0.26       0.35    0.32
cameriere    0.47    0.22       0.42    0.30
coscienza    0.81    0.06       0.53    0.28
effetto      0.57    0.24       0.43    0.29
farina       0.46    0.26       0.48    0.32
giardino     0.50    0.29       0.54    0.34
ironia       0.77    0.14       0.38    0.29
margherita   0.38    0.22       0.36    0.36
mucca        0.43    0.25       0.45    0.38
orchestra    0.43    0.29       0.32    0.31
orologio     0.44    0.27       0.47    0.35
ospedale     0.63    0.22       0.71    0.28
patata       0.47    0.27       0.56    0.31
persona      0.55    0.27       0.41    0.30
saggezza     0.72    0.13       0.49    0.33
strategia    0.66    0.12       0.51    0.33

Table 1: Mean and standard deviation of the abstractness and inclusiveness scores for each token across all different possible contexts.
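Per-token figures in the style of Table 1 could be derived from the released table with a short aggregation; this sketch assumes the Section 3.2 column names and that the table's standard deviation is taken over the per-sample human means of each token.

```python
# Sketch reproducing Table 1-style statistics: mean and std of the
# per-sample human means, grouped by target lemma across its contexts.
import pandas as pd

df = pd.read_csv("abricot.tsv", sep="\t")  # assumed file name and format
stats = (df.groupby("target lemma")[["abstractness mean", "inclusiveness mean"]]
           .agg(["mean", "std"])
           .round(2))
print(stats)
```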
                 mistral 7b   llama-3.1-8b   llama-3.1-70b
abstractness     0.22         0.30           0.53
inclusiveness    0.00         0.30           0.41

Table 2: Pearson correlation between the model predictions and the human annotations for abstractness and inclusiveness scores, measured for three different models: mistral 7b, llama-3.1-8b and llama-3.1-70b.
Figure 3: Distribution of the abstractness and inclusiveness scores in the dataset.

3.4. Detailed data statistics

The dataset contains 127 samples, each focused on a token; the same token appears more than once in the dataset, on average 6.35 times, in different contexts.

While the dataset contains 127 samples (a limited amount), Figure 3 shows that both abstractness and inclusiveness are well spread across the dataset and there are samples for all values between 0 and 1. Interestingly, while the two concepts under study are different, the two scores are similarly distributed across the dataset, but there is a higher number of samples with abstractness value around 0.8, while for inclusiveness the peak is around 0.1, showing a partial anti-pattern between the two scores and the concepts they are meant to distill.

To investigate the relevance of the context in the assessment of abstractness and inclusiveness, Table 1 shows the mean and standard deviation of the abstractness and inclusiveness of a token when varying context, for all the tokens in the dataset. The standard deviation is often between 0.2 and 0.4 for a score bound between 0 and 1; this shows significant sensitivity to context and highlights how, even if tokens are repeated, each sample is valuable on its own and provides different insights about the token.

4. Metrics

We measure the Pearson correlation between the abstractness and inclusiveness scores predicted by the model and the gold human annotation. More specifically, since it is challenging to have the models output a continuous value for the abstractness or inclusiveness of a token in context, we have them generate a discrete score from 1 to 5.

The evaluation follows a likelihood-based approach: after prompting the model to answer our question, we pick the highest-likelihood token among 1, 2, 3, 4 and 5 as the model selection. After doing so for each sample, we compute the Pearson correlation between these values and a discretized version of the continuous scores (discretization does not affect the results) assigned by humans to the same samples.
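To illustrate the likelihood-based selection, the following sketch compares the next-token logits of the candidate answers using Hugging Face transformers; the checkpoint, the abridged zero-shot prompt, and the demo samples (borrowed from Figure 1) are stand-ins, not the official evaluation harness.

```python
# Minimal sketch of the likelihood-based scoring (not the official harness):
# compare next-token logits of "1".."5" after the prompt, keep the argmax,
# then correlate with a discretized version of the human means.
import torch
from scipy.stats import pearsonr
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def make_prompt(word: str, sentence: str) -> str:
    # Placeholder wording; the real runs used the 3-shot prompts of Figure 2.
    return (f"Assegna un valore di astrazione da 1 a 5 alla parola {word} "
            f"nel contesto della frase seguente: {sentence}\nRisposta: ")

def predict_score(prompt: str) -> int:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]  # distribution over the next token
    # Last sub-token of each digit; SentencePiece may prepend a word marker.
    cand = [tok.encode(str(d), add_special_tokens=False)[-1] for d in range(1, 6)]
    return 1 + int(torch.argmax(logits[cand]))

# Demo samples with continuous gold means borrowed from Figure 1.
samples = [
    ("margherita", "Le margherite di fronte alla mia casa saranno in piena fioritura.", 0.177),
    ("ambizione", "La sua ambizione lo rovinerà.", 0.478),
    ("benzina", "La benzina è nella bottiglia del latte.", 0.064),
    ("benzina", "In Italia è disponibile la benzina a 95 ottani.", 0.573),
]
preds = [predict_score(make_prompt(w, s)) for w, s, _ in samples]
golds = [1 + min(4, int(g * 5)) for _, _, g in samples]  # [0,1] -> Likert bins
r, _ = pearsonr(preds, golds)
print(f"Pearson r = {r:.2f}")
```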
Table 2 shows our evaluation of three powerful, English-first language models: mistral 7b [11], llama-3.1-8b and llama-3.1-70b [12]. Note that we use the instruct version of all three models and omit it from the names.

These initial results show that the models are able to capture both abstractness and inclusiveness, with the exception of mistral 7b, which fails at understanding inclusiveness (Pearson correlation is 0). At the same time, a powerful LLM like llama-3.1-70b is not able to capture the full complexity of the task, with a Pearson correlation that is as low as 0.53 for abstractness and 0.41 for inclusiveness. This shows that, while not alien to the concepts of abstractness and inclusiveness, the models are still far from fully understanding them.

Assessing abstractness seems to be easier for LLMs, since every model performs better in this task than in the inclusiveness one. This is interesting although hard to interpret. One possible explanation is that abstractness is a feature that is already made explicit by the choice of the stimuli. Those words do show a variation between different contexts of use, and this is one of the objectives of such challenges with contextual information, but we can also organize these nouns, out of context, discretely along the axis of variation between abstract (e.g. ambizione – ambition) and concrete (e.g. benzina – petrol). On the contrary, inclusiveness cannot be resolved in any way without considering a proper context; a word form by itself does not convey any information about how generic, and thus inclusive, the concept behind that lexical label is. In light of this, we can hypothesize that when a model has to deal with abstractness/concreteness, it may not be able to rank two occurrences of the same word in slightly different contexts, but it can certainly judge all the occurrences of one target word as more concrete or more abstract with respect to those of another. But when it comes to inclusiveness, that is, evaluating whether one occurrence is more specific or generic than another, the model probably struggles more.

Another possible interpretation of these unbalanced results between abstractness and inclusiveness may depend on the quantity of information available about the two features: while on abstractness/concreteness there are many studies available online (on English and Italian, as well as on other languages), inclusiveness (and also genericity/specificity, which are the terms most used in the literature to refer to this semantic feature) is an understudied topic. We can thus hypothesize that knowledge about abstractness is more formalised in training data, while knowledge about inclusiveness is not.

Moreover, we confirm that also for this task larger models perform better: llama-3.1-70b outperforms llama-3.1-8b by a large margin. Training on more data also provides stronger models in this case; indeed, llama-3.1 outperforms mistral 7b by a large margin as well.

Finally, we remark that we avoid testing models that have been tuned for Italian, to let participants in the Challenge measure the performance improvements provided by Italian-focused training.

5. Conclusions

We propose the ABRICOT benchmark, a dataset composed of 127 humanly annotated samples to measure the abstractness and inclusiveness of words in context. Each sample is annotated by 5-7 raters, who ranked it with a continuous score from 0 to 1 from most concrete to most abstract, and with a second score, measured in the same way, from least to most inclusive.

We propose two Tasks, measuring abstractness and inclusiveness respectively, and we test three powerful language models on our benchmark: mistral 7b, llama-3.1-8b and llama-3.1-70b. We show that, when correlating their generations with the human scores, the highest result on abstractness is 0.53, achieved by the largest llama model, while on inclusiveness the correlation is bound by 0.41, showing that inclusiveness is harder to understand than abstractness.

We hope that the ABRICOT benchmark will foster the development of new language models for Italian, as well as new benchmarks investigating phenomena with a theoretical linguistic foundation such as abstractness and inclusiveness.

6. Limitations

The main limitation of the dataset is the low number of samples it contains, in particular since samples can repeat tokens and there are indeed only 20 unique ones. This can limit the validity of the model assessment, since the topics and vocabulary we cover are rather limited, although we have shown that, in terms of both abstractness and inclusiveness, the dataset is well spread and provides a good coverage of both concepts.

Acknowledgments

This work was partially supported by the Project PRIN 2022EPTPJ9 (WEMB – "Word EMBeddings: From Cognitive Linguistics to Language Engineering, and Back"), funded by the Italian Ministry of University and Research (MUR), and the Project ERC-2021-STG-101039777 (ABSTRACTION), funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
References

[1] M. Krifka, F. J. Pelletier, G. Carlson, A. ter Meulen, G. Chierchia, G. Link, Genericity: An introduction, in: G. N. Carlson, F. J. Pelletier (Eds.), The Generic Book, University of Chicago Press, 1995, pp. 1–124.
[2] L. Behrens, Genericity from a cross-linguistic perspective, Linguistics (2005) 275–344.
[3] O. Dahl, The marking of the episodic/generic distinction in tense-aspect systems, in: G. N. Carlson, F. J. Pelletier (Eds.), The Generic Book, University of Chicago Press, 1995.
[4] D. L. Chatzigoga, Genericity, in: The Oxford Handbook of Experimental Semantics and Pragmatics, Oxford University Press, 2019, pp. 156–177.
[5] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4-6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[6] L. Gregori, M. Montefinese, D. P. Radicioni, A. A. Ravelli, R. Varvara, CONcreTEXT@EVALITA2020: The Concreteness in Context Task, in: EVALITA, 2020.
[7] A. Friedrich, A. Palmer, M. P. Sørensen, M. Pinkal, Annotating genericity: a survey, a scheme, and a corpus, in: Proceedings of the 9th Linguistic Annotation Workshop, 2015, pp. 21–30.
[8] P. Chocron, P. Pareti, Vocabulary alignment for collaborative agents: a study with real-world multilingual how-to instructions, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, International Joint Conferences on Artificial Intelligence Organization, 2018, pp. 159–165. URL: https://doi.org/10.24963/ijcai.2018/22. doi:10.24963/ijcai.2018/22.
[9] P. Lison, J. Tiedemann, M. Kouylekov, OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora, in: N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018. URL: https://aclanthology.org/L18-1275.
[10] A. A. Ravelli, G. Puccetti, M. Bolognesi, Abricot: Abstractness and inclusiveness in context, 2024. URL: osf.io/ja89x. doi:10.17605/OSF.IO/JA89X.
[11] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7b, 2023. URL: https://arxiv.org/abs/2310.06825. arXiv:2310.06825.
[12] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.