<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>ABRICOT - ABstRactness and Inclusiveness in COntexT: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marianna Marcella Bolognesi</string-name>
          <email>m.bolognesi@unibo.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Puccetti</string-name>
          <email>giovanni.puccetti@isti.cnr.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia Collacciani</string-name>
          <email>claudia.collacciani2@unibo.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Amelio Ravelli</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Esuli</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Keywords: Abstraction, Inclusiveness, Context, LLM evaluation, Italian Language Models. The ability to convey both specific information (about individuals or events) and generalisations (about categories) with the same lexical item is one of the key features of natural languages; natural languages do not have explicit markers for generic NPs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-3">
      <title>1. Challenge: Introduction and Motivation</title>
      <p>The ability to convey both specific information (about individuals or events) and generalisations (about
categories) with the same lexical item is one of the key features
of natural languages. Consider the examples in 1:</p>
      <sec id="sec-3-1">
        <title>Examples</title>
        <p>a) The lion escaped yesterday from the zoo.</p>
        <p>b) The lion is a predatory cat.</p>
        <p>The noun phrase (NP) the lion can describe either a
specific individual (1a) or the entire category of large
African felines (1b); it thus expresses a variable degree
of inclusiveness, i.e. of the number of individuals
to which the NP correctly applies in each sentence in which
it occurs. This demonstrates how human language follows
a principle of economy, enabling a one-to-many mapping
between lexical labels and meanings.</p>
      </sec>
      <sec id="sec-3-4">
        <title>The role of context</title>
        <p>Dec 04 – 06, 2024, Pisa, Italy.
∗Corresponding author.
https://www.unibo.it/sitoweb/andreaamelio.ravelli (A. A. Ravelli);
https://esuli.it/ (A. Esuli);
https://www.unibo.it/sitoweb/m.bolognesi (M. M. Bolognesi)</p>
        <p>
          The syntactic form of the NP (definite, indefinite, or plural)
does not provide sufficient information to discriminate
between the two meanings, and we need to
enlarge our focus to take into account the whole context
in which the NP occurs [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. This phenomenon can be
observed in all languages [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], affecting nearly all nouns
that can be used in referring expressions. Indeed, natural
languages do not have explicit markers for generic NPs
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]; the genericity/specificity of an NP is derived from
the meaning of the entire sentence. In other words, we
cannot interpret language one word at a time; we need
to consider the whole sentence or utterance as context
to disambiguate and decipher the meaning of each single
word composing it, and thus to understand the message
conveyed through language.
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>Generics</title>
        <p>
          Generalizations about kinds and categories, as in 1b,
are called generics and are fundamental to human
cognition, because they allow us to conceptualize properties
linked to categories, shaping how we perceive the world
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>Moreover, distinguishing between generic and
non-generic meanings for abstract entities is less
straightforward than for concrete ones, and for this reason evaluating
the inclusiveness of an abstract noun or NP is even
more challenging. Indeed, inclusiveness is not an
exclusive feature of concrete entities. Consider the
examples in 2:</p>
      </sec>
      <sec id="sec-3-6">
        <title>Examples</title>
        <p>a) Colorless green ideas sleep furiously.</p>
        <p>b) Be less curious about people and more curious about ideas.</p>
      </sec>
      <sec id="sec-3-8">
        <title>Abstract entities and inclusiveness</title>
        <p>The concept behind the word idea always refers
to an abstract entity, with slightly different grades of
abstractness, but it shows a greater variation in terms of
inclusiveness. The noun ideas in 2a includes only a
restricted number of elements with respect to the universe
of the ideas (namely, only colorless green ones), while the
reference in 2b shows a higher level of inclusiveness, not
distinguishing among them on the basis of their color.</p>
        <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).</p>
        <sec id="sec-3-8-1">
          <title>Figure 1: Examples of samples from the dataset</title>
          <p>(a) Token: Margherita. Text: "Le margherite di fronte alla mia casa saranno
in piena fioritura." ("The daisies in front of my house will be in full bloom.")
Abstractness: 0.177; Inclusiveness: 0.187.</p>
          <p>(b) Token: Ambizione. Text: "La sua ambizione lo rovinerà." ("His ambition will ruin him.")
Abstractness: 0.478; Inclusiveness: 0.083.</p>
          <p>(c) Token: Benzina (more concrete context). Text: "La benzina è nella bottiglia del latte."
("The gasoline is in the milk bottle.") Abstractness: 0.064; Inclusiveness: 0.063.</p>
          <p>(d) Token: Benzina (more abstract context). Text: "In Italia è disponibile la benzina a 95 ottani."
("95-octane gasoline is available in Italy.") Abstractness: 0.573; Inclusiveness: 0.653.</p>
        </sec>
        <sec id="sec-3-8-8">
          <p>The ability to distinguish, interpret and correctly use
the variability that natural language offers along these
two graduated semantic features, abstractness and
inclusiveness, is of paramount importance if we want to
build talking machines which not only simulate language, but
can also reason about natural language and the
knowledge of the world it depicts.</p>
          <p>
            The CALAMITA special event [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] offers the possibility to challenge Large Language Models on their ability
to understand the abstractness and inclusiveness of
words, and to compare their behaviour with that of humans in
judging Italian sentences. With this report we present
the ABRICOT Task: ABstRactness and Inclusiveness
in COntexT.
          </p>
          <p>
            This task has some similarities with the CONcreTEXT
Task1 [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], which was presented at the 2020 edition of
EVALITA.2 Both tasks focus on the
abstractness/concreteness of target words in natural Italian sentences, asking
for judgments by means of Likert scales, but the ABRICOT
Task goes beyond it by also including the inclusiveness
of the targets. Moreover, for the construction of
this dataset we considered exclusively nouns or NPs as
targets, and in order to minimize the impact of the
variability deriving from different semantic roles or
syntactic functions, all the sentences have been selected
with the target noun as subject of the main verb.</p>
          <p>2. Challenge: Description
The ABRICOT Task aims to challenge Italian
language models on their understanding of abstractness
and inclusiveness, features that we, as humans, naturally
express in everyday language. These features are not
discrete binary dichotomies like abstract/concrete or
inclusive/exclusive; instead, they shade along a
continuous spectrum, with the two extremes at opposite ends.
The collection of sentences in this Task shows the same
NP in a variety of different contexts, so that its meaning
can oscillate between the extremes of both the axis of
abstractness and that of inclusiveness.</p>
          <p>2.1. Tasks
We propose two separate tasks for this benchmark,
Task 1: abstractness and Task 2: inclusiveness. The two tasks are
formally identical: we use the same metric and the same
samples; however, they measure two different scores,
respectively abstractness_mean and inclusiveness_mean, the
first meant to measure the abstractness of the word in
context and the second its inclusiveness.
Since both these concepts are evident but fuzzy also
for humans, we do not expect language models to have
a perfect understanding of them, and we limit our
metrics to regression ones. Despite the tasks being very
similar from a formal perspective, we show that
models' performance on these two tasks varies and there is
a sensible difference between the results in the two tasks.</p>
          <p>We ask the participant models to express a judgment
on a 5-point Likert scale for both the
inclusiveness and the abstractness of the target noun or NP in each
sentence.</p>
        </sec>
      </sec>
      <sec id="sec-3-9">
        <title>Notes</title>
        <p>1: lablita.github.io/CONcreTEXT; 2: www.evalita.it</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Data description</title>
      <sec id="sec-4-1">
        <title>3.1. Origin of data</title>
        <p>
          The 20 target NPs of the dataset for the ABRICOT
Task are derived (and translated into Italian) from the set
of target nouns in the Situation Entities Corpus (SitEnt
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]), a collection of English sentences in which
specificity and genericity have been annotated with a binary
labelling scheme (i.e., GENERIC vs. NON-GENERIC).
Using those as seeds, representative Italian sentences have
been manually harvested from OpenSubtitles3 and
WikiHow.4 These are widely used sources: the first contains
the openly available subtitles of an extensive collection of
movies and TV series, while the second is a website
gathering articles on how to do a variety of different things.
        </p>
        <p>
          More specifically, the sentences have been extracted
from the Italian section of the multilingual The Human
Instruction Dataset [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], a structured collection of
WikiHow instructions pages, and from the Italian sub-corpus
of the OpenSubtitles2018 corpus [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>Our annotation protocol presents annotators with groups of
sentences (from a minimum of 4 to a maximum of 8), all
containing the same noun; each sentence is evaluated using a
continuous slider, from which values ranging from 0 to 1
are then extracted.</p>
        <p>After the annotation, the reliability of our data has
been computed using the Intraclass Correlation
Coefficient (ICC(k)). Human ratings have then been averaged,
and the resulting figures are used as the gold standard.</p>
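        <p>As a minimal illustration of this reliability check (a sketch, not the authors' code), the following derives an average-raters consistency coefficient, ICC(C,k), from the two-way ANOVA decomposition of a complete targets-by-raters matrix; the toy ratings are invented, and the paper's ICC(k) may refer to a different ICC variant.</p>

```python
# Sketch of an average-raters consistency ICC, ICC(C,k), for an
# n-targets x k-raters matrix of ratings in [0, 1]. Assumes a complete
# matrix (every rater scored every target); illustrative only.

def icc_k(ratings):
    """ratings: list of n rows (targets), each a list of k rater scores."""
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / ms_rows

# Two raters that agree up to a constant offset are perfectly consistent:
print(icc_k([[0.1, 0.2], [0.5, 0.6], [0.9, 1.0]]))  # ~1.0
```

A consistency ICC ignores constant rater offsets; an agreement variant (ICC(A,k)) would also penalize the rater bias captured here by the between-raters sum of squares.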
        <p>An example of the samples present in the dataset
can be seen in Figure 1, where examples with the NPs
margherita (daisy), ambizione (ambition) and benzina
(gasoline) are reported. In particular, Figures 1c and 1d
show two examples containing the same token in
different contexts and report the effect of the context on
the abstractness and inclusiveness of the token.</p>
        <p>
          The data is stored on OSF [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].5
        </p>
        <p>3.2. Data format
The data is proposed in a tabular format, with 12 columns:
• ID: a unique identifier for the sample;
• target token: the focus of the dataset, to be
assigned an abstraction score in context;
• target lemma: the lemma of the target token;
• text: the sentence where the token appears;
• begin: the index of the first character of the token
in the sentence;
        </p>
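        <p>As a sketch of how the tabular format above could be consumed (the header spellings, the tab delimiter and the end-exclusive offset convention are assumptions for illustration; only the column list comes from the paper):</p>

```python
import csv
import io

# Hypothetical header matching the columns listed in Section 3.2.
HEADER = ["id", "target_token", "target_lemma", "text", "begin", "end",
          "domain", "inclusiveness_mean", "inclusiveness_std",
          "abstractness_mean", "abstractness_std"]

# One invented row in the assumed TSV layout.
SAMPLE_TSV = (
    "\t".join(HEADER) + "\n"
    + "1\tbenzina\tbenzina\tLa benzina è nella bottiglia del latte.\t3\t10"
      "\topensubtitles\t0.063\t0.050\t0.064\t0.050\n"
)

def load_samples(fh):
    """Read the tabular data into a list of dicts."""
    return list(csv.DictReader(fh, delimiter="\t"))

def check_span(row):
    # The character offsets should slice the target token out of the sentence
    # (assuming 0-based, end-exclusive offsets).
    return row["text"][int(row["begin"]):int(row["end"])] == row["target_token"]

rows = load_samples(io.StringIO(SAMPLE_TSV))
print(all(check_span(r) for r in rows))  # prints True
```

A consistency check like check_span is a cheap way to detect off-by-one mismatches between the offset convention of the file and the one assumed by downstream code.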
      </sec>
      <sec id="sec-4-2">
        <title>Notes: 3: https://www.opensubtitles.org; 4: https://www.wikihow.com; 5: https://osf.io/ja89x/?view_only=91d683c7399c45f9aa63f2b34cfe6617</title>
        <p>Abstractness Prompt:
Assegna un valore di astrazione da 1 a 5 alla parola
parola nel contesto della frase seguente: frase
Descrizione dei valori: 1 - La parola è estremamente
concreta (e.g. un cane specifico) 2 - La parola è
lievemente concreta (e.g. un cane di una certa razza) 3
- La parola è neutra (e.g. un cane tra tanti) 4 - La
parola è lievemente astratta (e.g. un cane è un
animale da compagnia) 5 - La parola è estremamente
astratta (e.g. il cane è un mammifero).</p>
        <p>(a) Prompt used for the Abstractness Task.</p>
        <p>Inclusiveness Prompt:
Assegna un valore di inclusività da 1 a 5 alla parola
parola nel contesto della frase seguente: frase
Descrizione dei valori: 1 - La parola è estremamente
specifica (e.g. un cane specifico) 2 - La parola è
lievemente specifica (e.g. un cane di una certa razza) 3
- La parola è neutra (e.g. un cane tra tanti) 4 - La
parola è lievemente inclusiva (e.g. un cane è un
animale da compagnia) 5 - La parola è estremamente
inclusiva (e.g. il cane è un mammifero)</p>
        <p>(b) Prompt used for the Inclusiveness Task.</p>
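        <p>The prompts above can be instantiated by filling in the "parola" (word) and "frase" (sentence) placeholders; the helper below is a hypothetical sketch of that step, including the assembly of in-context examples for the 3-shot setting (the function and the exact shot format are assumptions, not the authors' released code).</p>

```python
# Hypothetical instantiation of the Figure 2 prompts. The template here is
# abbreviated; the full value descriptions are given in Figure 2.
ABSTRACTNESS_TEMPLATE = (
    "Assegna un valore di astrazione da 1 a 5 alla parola {parola} "
    "nel contesto della frase seguente: {frase}"
)

def build_prompt(parola, frase, shots=()):
    """shots: (parola, frase, score) triples used as worked examples."""
    parts = [ABSTRACTNESS_TEMPLATE.format(parola=p, frase=f) + f"\nRisposta: {s}"
             for p, f, s in shots]
    parts.append(ABSTRACTNESS_TEMPLATE.format(parola=parola, frase=frase)
                 + "\nRisposta:")
    return "\n\n".join(parts)

prompt = build_prompt(
    "benzina", "La benzina è nella bottiglia del latte.",
    shots=[("margherita",
            "Le margherite di fronte alla mia casa saranno in piena fioritura.",
            1)],
)
print(prompt)
```

Ending the query with an open "Risposta:" cue lets the score be read directly off the model's next token, which is what the likelihood-based evaluation in Section 4 relies on.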
        <p>• end: the index of the last character of the token
in the sentence;
• domain: the source the token comes from;
• inclusiveness mean: the average inclusiveness
score assigned by the annotators;
• inclusiveness std: the standard deviation of the
inclusiveness scores;
• abstractness mean: the average abstractness score
assigned by the annotators;
• abstractness std: the standard deviation of the
abstractness scores.</p>
        <p>3.3. Example of prompts used for zero
and/or few shots
We use different prompts for the two tasks; they are
shown in Figure 2. We ask the model to directly output a
score from 1 to 5 specific to the task, and the prompt provides an
explanation for each point from 1 to 5, describing the
(approximate) meaning of assigning that score together with
a very high-level example. On top of the explanation,
we use 3-shot evaluation, as we found 0-shot to be difficult
for this dataset: without some reference examples, the
scoring becomes too variable.</p>
        <p>[Table 1: per-token mean and standard deviation of the abstractness and inclusiveness scores across contexts.]</p>
        <p>To investigate the relevance of the context in the
assessment of abstractness and inclusiveness, Table 1 shows
the mean and standard deviation of the abstractness and
inclusiveness of a token when varying the context, for all
the tokens in the dataset. The standard deviation is often
between 0.2 and 0.4 for a score bound between 0 and 1;
this shows significant sensitivity to context and
highlights how, even if tokens are repeated, each sample is
valuable on its own and provides different insights about
the token.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Metrics</title>
      <p>With a 3-shot approach and the prompts we used, all
the models we test appear able to understand the task,
and performance improves with these prompts when
compared to less specific ones.</p>
      <p>3.4. Detailed data statistics
The dataset contains 127 samples, each focused
on a token; the same token appears more than once in
the dataset, on average 6.35 times, in different contexts.
While the dataset contains 127 samples (a limited
amount), Figure 3 shows that both abstractness and
inclusiveness are well spread across the dataset and there
are samples for all values between 0 and 1. Interestingly,
while the two concepts under study are different,
the two scores are similarly distributed across the dataset,
but there is a higher number of samples with an abstractness
value around 0.8, while for inclusiveness the peak is
around 0.1, showing a partial anti-pattern between the
two scores and the concepts they are meant to distill.</p>
      <p>We measure Pearson correlation between the
abstractness and inclusiveness scores predicted by the model and
the gold human annotation. More specifically, since it
is challenging to have the models output a continuous
value for the abstractness or inclusiveness of a token in
context, we have them generate a discrete score from 1
to 5.</p>
      <p>The evaluation follows a likelihood-based approach:
after prompting the model to answer our question, we
pick the highest-likelihood token among 1, 2, 3, 4 and 5
as the model selection. After doing so for each sample,
we compute the Pearson correlation between these values
and a discretized version of the continuous scores
(discretization does not affect the results) assigned by
humans to the same samples.</p>
      <p>
        Table 2 shows our evaluation of three powerful,
English-first language models: mistral 7b [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], llama-3.1-8b and llama-3.1-70b [12]. Note that we use the
instruct version of all three models, and we omit it from
the names.</p>
      <p>Finally, we remark that we avoid testing models that
have been tuned for Italian, to let participants to the
Challenge measure the performance improvements provided
by Italian-focused training.</p>
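      <p>The likelihood-based scoring and the correlation metric described above can be sketched as follows (the log-likelihood values are invented for illustration; an actual run would read them from the model's output distribution over the candidate answer tokens):</p>

```python
import math

def pick_score(logprobs):
    """logprobs: dict mapping candidate tokens '1'..'5' to log-likelihoods.
    Returns the highest-likelihood score as an int."""
    return int(max(logprobs, key=logprobs.get))

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented per-sample log-likelihoods over the candidate tokens:
preds = [pick_score({"1": -4.0, "2": -1.2, "3": -0.7, "4": -2.5, "5": -5.0}),
         pick_score({"1": -0.3, "2": -1.9, "3": -3.0, "4": -4.1, "5": -6.0}),
         pick_score({"1": -5.5, "2": -3.3, "3": -2.0, "4": -0.9, "5": -1.1})]
gold = [3, 1, 5]  # discretized human scores for the same samples
print(preds, pearson(preds, gold))
```

Reading the score off the candidate-token likelihoods, rather than free-form generation, guarantees a valid answer in 1-5 for every sample, so the correlation is always computed over the full benchmark.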
      <p>These initial results show that the models are able to
capture both abstractness and inclusiveness, with the
exception of mistral 7b, which fails at understanding
inclusiveness (Pearson correlation is 0). At the same time, a
powerful LLM like llama-3.1-70b is not able to capture
the full complexity of the task, with a Pearson correlation
as low as 0.53 for abstractness and 0.41 for
inclusiveness. This shows that, while not alien to the concepts
of abstractness and inclusiveness, the models are still far
from fully understanding them.</p>
      <p>Assessing abstractness seems to be easier for LLMs,
since every model performs better in this task than in the
inclusiveness one. This is interesting, although hard to
interpret. One possible explanation is that abstractness is
a feature that is already made explicit by the choice of the
stimuli. Those words do show a variation between
different contexts of use, and this is one of the objectives of
such challenges with contextual information, but we can
also organize these nouns, out of context, discretely along
the axis of variation between abstract (e.g. ambizione –
ambition) and concrete (e.g. benzina – petrol). On the
contrary, inclusiveness cannot be resolved in any way
without considering a proper context; a word form by
itself does not convey any information about how
generic, and thus inclusive, the concept behind that lexical
label is. In light of this, we can hypothesize that when a
model has to deal with abstractness/concreteness, it may
not be able to rank two occurrences of the same word
in slightly different contexts, but it can surely judge as
more concrete or more abstract all the occurrences of one
target word with respect to those of another. But when it
comes to inclusiveness, i.e. evaluating whether one
occurrence is more specific or generic than another, the model
probably struggles more.</p>
      <p>Another possible interpretation of these unbalanced
results between abstractness and inclusiveness may depend
on the quantity of information about the two features:
while on abstractness/concreteness there are many
studies available online (on English and Italian, as well as on
other languages), inclusiveness (and also genericity/specificity,
which are the most used terms in the literature to refer
to this semantic feature) is an understudied topic. We
can thus hypothesize that knowledge about abstractness
is more formalised in training data, while inclusiveness
is not.</p>
      <p>Moreover, we confirm that also for this task larger
models perform better: llama-3.1-70b outperforms
llama-3.1-8b by a large margin, and training on more data
provides stronger models also in this case; indeed, llama-3.1
outperforms mistral 7b also by a large margin.</p>
      <p>5. Conclusions
We propose the ABRICOT benchmark, a dataset
composed of 127 humanly annotated samples to measure the
abstractness and inclusiveness of words in context. Each sample is
annotated by 5 - 7 raters, who rated it with a
continuous score from 0 to 1, from most concrete to most
abstract, and a second one, measured in the same way,
from least to most inclusive.</p>
      <p>We propose two Tasks, measuring abstractness and
inclusiveness, and we test three powerful language models
on our benchmark: mistral 7b, llama 3.1 8b and llama 3.1 70b.
We show that, when correlating their generations with the
human scores, the highest result on abstractness is 0.53,
achieved by the largest llama 3.1, while on inclusiveness the
correlation is bound by 0.41, showing that inclusiveness
is harder to understand than abstractness.</p>
      <p>We hope that the ABRICOT benchmark will foster
the development of new language models in Italian as
well as new benchmarks investigating phenomena with
a theoretical linguistic foundation such as abstractness
and inclusiveness.</p>
      <p>6. Limitations
The main limitation of the dataset is the low number
of samples it contains, in particular since samples can
repeat tokens and there are indeed only 20 unique ones.
This can limit the validity of the model assessment, since
the topics and vocabulary we cover are rather limited,
although we have shown that, in terms of both abstractness
and inclusiveness, the dataset is well spread and provides
a good coverage of both concepts.</p>
      <p>Acknowledgments
This work was partially supported by the Project PRIN
2022EPTPJ9 (WEMB – "Word EMBeddings: From
Cognitive Linguistics to Language Engineering, and Back"),
funded by the Italian Ministry of University and Research
(MUR), and the Project ERC-2021-STG-101039777
(ABSTRACTION), funded by the European Union. Views and
opinions expressed are however those of the author(s)
only and do not necessarily reflect those of the
European Union or the European Research Council Executive
Agency. Neither the European Union nor the granting
authority can be held responsible for them.</p>
      <sec id="sec-5-1">
        <title>References (continued)</title>
        <p>[11] A. Q. Jiang et al., Mistral 7B, 2023. URL: https://arxiv.org/abs/2310.06825. arXiv:2310.06825.</p>
        <p>[12] A. Dubey, A. Jauhri, et al., The Llama 3 Herd of Models, 2024. arXiv:2407.21783.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Krifka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Pelletier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Carlson</surname>
          </string-name>
          , A. ter Meulen, G. Chierchia, G. Link,
          <article-title>Genericity: An introduction</article-title>
          , in: G. N.
          <string-name>
            <surname>Carlson</surname>
            ,
            <given-names>F. J.</given-names>
          </string-name>
          <string-name>
            <surname>Pelletier</surname>
          </string-name>
          (Eds.),
          <source>The Generic Book</source>
          , University of Chicago Press,
          <year>1995</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Behrens</surname>
          </string-name>
          ,
          <article-title>Genericity from a cross-linguistic perspective</article-title>
          ,
          <source>Linguistics</source>
          (
          <year>2005</year>
          )
          <fpage>275</fpage>
          -
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Dahl</surname>
          </string-name>
          ,
          <article-title>The marking of the episodic/generic distinction in tense-aspect systems</article-title>
          , in: G. N.
          <string-name>
            <surname>Carlson</surname>
            ,
            <given-names>F. J.</given-names>
          </string-name>
          <string-name>
            <surname>Pelletier</surname>
          </string-name>
          (Eds.),
          <source>The Generic Book</source>
          , University of Chicago Press,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Chatzigoga</surname>
          </string-name>
          , Genericity,
          <source>in: The Oxford Handbook of Experimental Semantics and Pragmatics</source>
          , Oxford University Press,
          <year>2019</year>
          , pp.
          <fpage>156</fpage>
          -
          <lpage>177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borazio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Francis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gili</surname>
          </string-name>
          , E. Musacchio,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scalena</surname>
          </string-name>
          ,
          <article-title>CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</article-title>
          ,
          <source>in: Proceedings of the 10th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ), Pisa, Italy, December 4 - December 6,
          <year>2024</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gregori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montefinese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Radicioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ravelli</surname>
          </string-name>
          , R. Varvara, CONcreTEXT@EVALITA2020:
          <article-title>The Concreteness in Context Task</article-title>
          ., in: EVALITA,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Palmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Sørensen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pinkal</surname>
          </string-name>
          ,
          <article-title>Annotating genericity: a survey, a scheme, and a corpus</article-title>
          ,
          <source>in: Proceedings of the 9th Linguistic Annotation Workshop</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Chocron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pareti</surname>
          </string-name>
          ,
          <article-title>Vocabulary alignment for collaborative agents: a study with real-world multilingual how-to instructions</article-title>
          ,
          <source>in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, International Joint Conferences on Artificial Intelligence Organization</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>165</lpage>
          . URL: https://doi.org/10.24963/ijcai.2018/22. doi:10.24963/ijcai.2018/22.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kouylekov</surname>
          </string-name>
          ,
          <article-title>OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Calzolari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choukri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Declerck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hasida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Isahara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Maegaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mariani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mazo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Odijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Piperidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tokunaga</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          ),
          European Language Resources Association (ELRA), Miyazaki, Japan,
          <year>2018</year>
          . URL: https://aclanthology.org/L18-1275.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ravelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Puccetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bolognesi</surname>
          </string-name>
          ,
          <article-title>Abricot: Abstractness and inclusiveness in context</article-title>
          ,
          <year>2024</year>
          . URL: osf.io/ja89x. doi:10.17605/OSF.IO/JA89X.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>de las Casas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>