<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Self-Contained Italian Negation Test (SCIN)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Viola Gullace</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Kletz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thierry Poibeau</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pascal Amsili</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica, Università di Pisa</institution>
          ,
          <addr-line>Via Santa Maria, Pisa, 56126</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LLF, CNRS &amp; Université Paris Cité</institution>
          ,
          <addr-line>8 Rue Albert Einstein 75013 Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Lattice, CNRS &amp; ENS-PSL &amp; U. Sorbonne-Nouvelle</institution>
          ,
          <addr-line>1 rue Maurice Arnoux F-92120 Montrouge</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Scuola Normale Superiore</institution>
          ,
          <addr-line>Piazza dei Cavalieri 7, Pisa, 56126</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recent research has focused extensively on state-of-the-art pretrained language models, particularly those based on Transformer architectures, and on how well they account for negation and other linguistic phenomena in various tasks. This study evaluates the understanding of negation in Italian BERT- and RoBERTa-based models, in contrast with the predominantly English-focused prior research. We develop the SCIN Set, an Italian dataset designed to model the influence of polarity constraints on models in a masked-prediction task. Applying the SCIN Set reveals that these models do not adjust their behaviour based on sentence polarity, even when the resulting sentence is contradictory. We conclude that the tested models lack a clear understanding of how negation alters sentence meaning.</p>
      </abstract>
      <kwd-group>
        <kwd>negation</kwd>
        <kwd>Italian PLMs</kwd>
        <kwd>testing</kwd>
        <kwd>self-contained</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In this study we test the following Italian PLMs:
        • bert-base for Italian, both in its basic and its XXL versions (bert-base-italian-cased, bert-base-italian-xxl-cased)¹ [<xref ref-type="bibr" rid="ref3">3</xref>],
        • m-bert (multilingual bert)² [<xref ref-type="bibr" rid="ref4">4</xref>],
        • alb3rt0³ [<xref ref-type="bibr" rid="ref5">5</xref>], and
        • UmBERTo⁴ [<xref ref-type="bibr" rid="ref6">6</xref>].
      </p>
      <p>Section 5 will discuss the results, followed by a final section containing our general conclusions and ideas for further research.</p>
      <sec id="sec-1-1">
        <title>2.2. The Self-Contained Neg Test</title>
        <p>The Self-Contained Neg Test, developed by Kletz et al. [<xref ref-type="bibr" rid="ref2">2</xref>], is a set of pairs of sentences consisting of a context (C) and a target (T) sentence, either positive (p) or negative (n). The target sentence contains a masked position, syntactically constrained to be filled by a verb, as in (2).</p>
        <p>(2) Jessica is an architect who likes to dance. She isn’t happy to [MASK].</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. Effect of negation on the model’s prediction</title>
        <p>Although negation plays an essential role in human communication, it appears to present challenges for PLMs. In recent years, much research has focused on this topic. Kassner and Schütze [<xref ref-type="bibr" rid="ref7">7</xref>] and Ettinger [<xref ref-type="bibr" rid="ref8">8</xref>] analyzed to what extent Transformer-based language models’ predictions are sensitive to the presence or absence of negation in sentences involving factual knowledge, such as (1):</p>
        <p>(1) a. Birds can [MASK]. b. Birds cannot [MASK].</p>
        <p>The instances are designed in such a way that a model that predicts (in the masked position of T) the last verb of C will produce a semantically well-formed paragraph only if C and T have the same polarity. For instance, in (2), the context is positive (Cp), the target is negative (Tn), and as a consequence a model predicting dance in the masked position produces an ill-formed paragraph:</p>
        <p>(3) #Jessica is an architect who likes to dance. She isn’t happy to dance.</p>
        <p>In contrast, a CnTn version of (3) would accept the verb dance in the same position:</p>
        <p>(4) Jessica is an architect who doesn’t like to dance. She isn’t happy to dance.</p>
        <sec id="sec-2-1-2">
          <p>They found that in such pairs the top-1 predictions are unchanged most of the time: models do not seem to take into account the polarity of the environment (presence or absence of a negation in the surrounding sentence) to adapt their predictions. They concluded that models do not deal correctly with negation.</p>
          <p>Gubelmann and Handschuh [<xref ref-type="bibr" rid="ref9">9</xref>] criticized such studies, noting in particular that the pragmatic component was overlooked in Ettinger’s experiments. They noted that a statement containing a negation may state a false fact (for example, by falsely denying a true fact, as in Birds cannot fly). In fact, a vast number of words can follow Birds could, many of them with little association with the rest of the sentence. This makes it challenging for any single word to become the top prediction in the negative case.</p>
          <p>Gubelmann and Handschuh [<xref ref-type="bibr" rid="ref9">9</xref>] developed a more pragmatically informed test set, in which each instance is (in [<xref ref-type="bibr" rid="ref2">2</xref>]’s terms) self-contained. This means that each item in the set includes some context information, allowing direct evaluation of the model’s completion. Building on this work, [<xref ref-type="bibr" rid="ref2">2</xref>] developed the Self-Contained Neg Test, which aimed to address some issues in the test set from [<xref ref-type="bibr" rid="ref9">9</xref>] and more accurately determine the model’s handling of negation without interference of world knowledge.</p>
          <p>To produce the sentences of the set, the pattern (5) is taken as a starting point, where NAME and PRON are substituted with a proper noun and a compatible third person pronoun, PROF is substituted with a profession name, and ACT is substituted with an action verb.</p>
          <p>(5) NAME is a PROF who likes/doesn’t like to ACT. PRON is/isn’t happy to [MASK].</p>
          <p>A large number of triplets (NAME, PROF, ACT) are tested with each model, and a triplet is retained only when the model’s top prediction for the CpTp instance repeats the ACT verb, as in (6). Hence, the triplet (Jessica, architect, dance) would be retained while the triplet (Luke, janitor, swim) would not.</p>
          <p>(6) a. Jessica is an architect who likes to dance. She is happy to dance. b. Luke is a janitor who likes to swim. He is happy to ski.</p>
          <p>Once triplets have been selected (the set of all triplets such that the ACT verb is repeated in CpTp instances), CpTn and CnTp instances can be formed, and the expectation is that a model that “understands” negation should not predict the ACT verb in those cases, since it would lead to contradictory instances. As a control, two additional configurations are considered: CnTn, where it is expected that the repetition of ACT is possible (though not required), and CpTv, in which an adverb (very) is inserted in the positive target, which should not change the preferred prediction of ACT since both sentences are positive. The different configurations are illustrated below.</p>
          <p>(7)
CpTp: Jessica is an architect who likes to dance. She is happy to [MASK].
CpTn: Jessica is an architect who likes to dance. She isn’t happy to [MASK].
CnTp: Jessica is an architect who doesn’t like to dance. She is happy to [MASK].
CnTn: Jessica is an architect who doesn’t like to dance. She isn’t happy to [MASK].
CpTv: Jessica is an architect who likes to dance. She is very happy to [MASK].</p>
          <p>1: https://huggingface.co/dbmdz/bert-base-italian-xxl-cased
2: https://huggingface.co/bert-base-multilingual-cased
3: https://github.com/marcopoli/AlBERTo-it
4: https://github.com/musixmatchresearch/umberto</p>
        </sec>
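        <p>The five configurations in (7) can be generated mechanically from a (NAME, PROF, ACT) triplet. The following Python sketch is our own illustration of pattern (5), not the authors’ released code; the article helper (a naive a/an choice) and the function names are ours.</p>
        <preformat>
```python
# Our own reconstruction of pattern (5): build the five C/T polarity
# configurations in (7) for one (NAME, PROF, ACT) triplet.

def article(noun):
    # naive a/an choice, added only for this illustration
    return "an" if noun[0].lower() in "aeiou" else "a"

def build_configs(name, prof, act, pron):
    """Return a dict mapping configuration labels to test sentences."""
    c_pos = f"{name} is {article(prof)} {prof} who likes to {act}."
    c_neg = f"{name} is {article(prof)} {prof} who doesn't like to {act}."
    t_pos = f"{pron} is happy to [MASK]."
    t_neg = f"{pron} isn't happy to [MASK]."
    t_adv = f"{pron} is very happy to [MASK]."  # CpTv control: adverb, no negation
    return {
        "CpTp": f"{c_pos} {t_pos}",
        "CpTn": f"{c_pos} {t_neg}",
        "CnTp": f"{c_neg} {t_pos}",
        "CnTn": f"{c_neg} {t_neg}",
        "CpTv": f"{c_pos} {t_adv}",
    }
```
        </preformat>
        <p>For instance, build_configs("Jessica", "architect", "dance", "She") reproduces the five example sentences of (7).</p>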
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. SCIN construction</title>
      <sec id="sec-3-1">
        <p>In Italian, negation is most commonly expressed by the invariable negative proclitic non (not) [<xref ref-type="bibr" rid="ref10">10</xref>].</p>
        <p>It is this expression of negation that we use for the Italian adaptation of the Self-Contained Neg Test that we present in this section: the SCIN set.</p>
      </sec>
      <sec id="sec-3-2">
        <p>We choose instead to rely on the pair (9), involving a semantic inference relation.</p>
        <p>(9) ha l’abitudine di / molto spesso (is used to / very often)</p>
      </sec>
      <sec id="sec-3-3">
        <p>The final form of the SCIN set is available in Table 1. The shape of the contexts is given in row 1, that of the targets in row 2, and the test target Tv is added in row 3.</p>
        <p>Our assumption is that, if the model repeats the ACT token in the CpTp configuration, it is proof that the model has resolved the ha l’abitudine di / molto spesso inference. When confronted with the CpTn or CnTp configuration, the model should have the addition of the negation as the only element that can explain the modification of its predictions. Finally, the CpTv control allows us to check the extent to which the addition of a different, non-negative adverb in the sequence modifies the model’s predictions; we can assume that any modification of greater magnitude than that associated with CpTv is due to the influence of negation.</p>
        <p>The complete list of new patterns is available in Table 1.</p>
        <sec id="sec-3-3-1">
          <title>3.1. Italian patterns</title>
          <p>Following the preparation of the Self-Contained Neg Test, we collect a list of Italian verbs, professions and names that will be used to create the triplets to be tested. The verbs are taken from the Dizionario Italiano Sabatini Coletti 2022 (online version); only the intransitive verbs (3138) are retained; among these, for each of the tested models we further exclude the verbs that are not tokenized as a single token. The selected names are the 100 most popular in Italy in 2024⁵. Lastly, the professions are taken from a site specializing in job searches in Italy⁶; of those present on the site, only those consisting of a single word have been selected.</p>
          <p>The patterns cannot simply be a direct translation of the English patterns into Italian. In fact, for the test to be adequate for evaluating models, we need the masked position to be syntactically constrained to be a verb. This would not be the case if we used a direct translation of the original sentences: for example, the sequence (8) can be completed with the token “questo” (= PRON is happy to do this).</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.2. Pattern selection</title>
          <p>The triplets (name, profession, verb) used for testing are selected by testing them on the CpTp configuration: only triplets leading to a repetition of the ACT token are retained (see Table 2). This ensures that only patterns for which the model is already biased towards repetition are tested, and the model has to understand the influence of negation on sentence semantics to reverse this tendency.</p>
          <p>All available triplets are tested, i.e. all combinations of the verbs monotokenized by the model and the first names and occupations selected in subsection 3.1. As tokenization is model-dependent, the number of verbs tested is not the same for each model: details are available in the first row of Table 3.</p>
          <p>The results of this test are available in Table 3. They are highly model-dependent: while the bert-base-italian-cased model predicts the ACT token in almost 25% of cases, this is the case in only 0.03% of cases for alb3rt0.</p>
        </sec>
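        <p>The selection step illustrated in Table 2 can be sketched as follows. The stub predictor and the helper names are ours (a real run would query a PLM’s fill-mask head instead), and matching is a strict string comparison here, whereas Table 2 also counts the conjugated form fuma as a repetition of fumare.</p>
        <preformat>
```python
# Illustrative sketch of the triplet-selection step (cf. Table 2): a triplet
# is retained only if the model's top-1 prediction for its CpTp instance
# repeats the ACT verb. The predictor below is a stand-in stub.

def cptp_instance(name, prof, act):
    # CpTp pattern from Table 1, simplified to the feminine article "una"
    # used in the worked example (Jessica / ballerina).
    return f"{name} è una {prof} che ha l'abitudine di {act}. Lei [MASK] molto spesso."

def stub_top1(text):
    # Stand-in for a fill-mask model: echoes "fumare" when the context
    # mentions it, otherwise predicts an unrelated verb.
    return "fumare" if "fumare" in text else "ballare"

def retained(triplets, top1):
    """Keep the triplets whose top-1 CpTp prediction repeats the ACT verb."""
    kept = []
    for name, prof, act in triplets:
        if top1(cptp_instance(name, prof, act)) == act:
            kept.append((name, prof, act))
    return kept
```
        </preformat>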
      </sec>
      <sec id="sec-3-4">
        <p>(8) NAME è un PROF che ama ACT. È felice di MASK. (NAME is a PROF who loves to ACT. (PRON) is happy to MASK.)</p>
      </sec>
      <sec id="sec-3-5">
        <title>Notes</title>
        <p>5: https://www.nostrofiglio.it/gravidanza/nomi-per-bambini/i-100-nomi-per-bambini-piu-amati-dai-genitori-di-nostrofiglio-it</p>
        <p>6: https://www.wecanjob.it/pagina9_elenco-professioni.html</p>
      </sec>
      <sec id="sec-3-6">
        <title>4. Testing</title>
        <p>4.1. Setup</p>
        <p>Tests are performed as in Kletz et al. [<xref ref-type="bibr" rid="ref11">11</xref>]. Contexts (C) and targets (T) are combined to create two test patterns, CpTn and CnTp; in addition to these two, the test includes two control patterns, CnTn and CpTv, where the repetition of the ACT verb is not contradictory.</p>
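        <p>A saturated pattern can be queried as in the following sketch, which assumes the Hugging Face fill-mask pipeline; the checkpoint name is the one from footnote 1, and the helper act_repeated is our own convenience function, not part of the paper’s released code.</p>
        <preformat>
```python
# Sketch of querying one saturated SCIN pattern with a fill-mask pipeline.
# Only the top-1 token at the masked position is needed for the test.

def act_repeated(top_tokens, act):
    """True if the first (top-1) predicted token repeats the ACT verb."""
    return bool(top_tokens) and top_tokens[0] == act

if __name__ == "__main__":
    # The model download happens only when run as a script.
    from transformers import pipeline  # pip install transformers torch
    fill = pipeline("fill-mask", model="dbmdz/bert-base-italian-cased")
    preds = fill("Jessica è una ballerina che ha l'abitudine di fumare. "
                 "Lei [MASK] molto spesso.")
    top_tokens = [p["token_str"].strip() for p in preds]
    print(top_tokens, act_repeated(top_tokens, "fumare"))
```
        </preformat>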
        <p>All selected triplets are then used to saturate the patterns, and the resulting patterns are provided as inputs to the models. Predictions at masked positions are collected.</p>
        <p>Table 1. The patterns of the SCIN set.
Row 1, C(ontext), polarity p: NAME è un(a) PROF che ha l’abitudine di ACT. (NAME is a PROF who is used to ACT-ing.)
Row 1, C(ontext), polarity n: NAME è un(a) PROF che non ha l’abitudine di ACT. (NAME is a PROF who is not used to ACT-ing.)
Row 2, T(arget), polarity p: PRON [MASK] molto spesso. (PRON [MASK] often.)
Row 2, T(arget), polarity n: PRON non [MASK] molto spesso. (PRON doesn’t [MASK] often.)
Row 3, T(arget), polarity v: PRON [MASK] davvero molto spesso. (PRON [MASK] really often.)</p>
        <p>Table 2. An example of selecting a triplet for testing. A NAME/PROF/VERB triplet is used to saturate the CpTp pattern of SCIN. The sequence contains a mask and is used as input to a PLM. If the model prediction is the ACT token, the triplet is retained (indicated by the ✓ symbol). In the names of the models given as examples, “b-b” means bert-base, “it” stands for italian and “c” for cased.
Instantiated NAME/PROF: Jessica / Ballerina (Dancer). Tested verb: Fumare (To smoke). Tested example: Jessica è una ballerina che ha l’abitudine di fumare. Lei [MASK] spesso.
Model / Top-1 pred. / Retained?
b-b-italian-c / fuma / ✓
b-b-italian-xxl-c / fuma / ✓
m-bert / balla / no
alb3rt0 / parla / no</p>
        <p>We use drop as a measure of the models’ performance: for each pattern, given the rate r of repetitions of the ACT token in the predictions, the drop is defined as 100 - r. The higher the drop for the CpTn and CnTp patterns, and the lower for the CnTn and CpTv controls, the better the model has understood the negation.</p>
        <sec id="sec-3-6-1">
          <title>4.2. Results and Discussion</title>
          <p>Results are shown in Table 4. In contrast with the observations made by [<xref ref-type="bibr" rid="ref8">8</xref>] and [<xref ref-type="bibr" rid="ref7">7</xref>], the models are not insensitive to the presence of negation in a sentence: all the models show a drop in both configurations CpTn and CnTp, showing an adaptation of their predictions to the presence of a negation cue.</p>
          <p>This observation is confirmed by the fact that the drops in the CpTv control are always lower than those observed in CpTn or CnTp. This shows that simply adding an adverb is not sufficient to change the model’s predictions. While we cannot definitively attribute this to its logical function, the negation marker does exert a distinct influence.</p>
          <p>Nevertheless, it is important to emphasize the very clear limitations of these results. Firstly, the drops never exceed 25%, meaning that 75% of the time the model predicts a semantically prohibited token. On the other hand, with the exception of m-bert, all the models have a higher drop for the CnTn control than for the CnTp configuration, thus indicating that even though the models have acquired a certain understanding of negation, this remains superficial and does not, for example, clearly include an understanding of the positive value of a double negation.</p>
          <p>A broader examination of the results reveals that while the drops in the CpTn and CnTp configurations increase together, the CnTn controls also show a corresponding increase.</p>
          <p>Finally, the training corpus of the models seems to have an influence on their performance. For example, the alb3rt0 model obtains the results least in line with our expectations, while bert-base-italian-xxl-cased and bert-base-italian-cased had better drop values, with the former performing better than the latter. However, these three models have identical numbers of layers, attention heads and hidden sizes, the difference between them consisting only in their training data. The alb3rt0 model was trained exclusively on tweets, which likely limits the diversity of its data, particularly with respect to negation. In contrast, the bert-base-italian-cased and bert-base-italian-xxl-cased models were trained on more varied corpora, with the latter featuring a larger dataset.</p>
          <p>In the future, this should lead us to study the correlation between the performance of the models and the fine-grained distribution of negative and affirmative contexts in their training corpus.</p>
        </sec>
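        <p>The drop measure can be transcribed directly from its definition (given the repetition rate r in percent, drop = 100 - r); the function names below are ours.</p>
        <preformat>
```python
# Drop metric from the setup section: for a pattern, given the percentage r
# of instances whose top-1 prediction repeats the ACT token, drop = 100 - r.

def repetition_rate(pairs):
    """pairs: list of (top1_prediction, act_verb); returns r in percent."""
    hits = sum(1 for pred, act in pairs if pred == act)
    return 100.0 * hits / len(pairs)

def drop(pairs):
    return 100.0 - repetition_rate(pairs)
```
        </preformat>
        <p>A high drop on CpTn and CnTp together with a low drop on CnTn and CpTv is the profile expected of a model that handles negation well.</p>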
      </sec>
      <sec id="sec-3-7">
        <title>5. Comparison with English</title>
        <p>In this section we compare the results obtained with the SCIN Set with those observed by [<xref ref-type="bibr" rid="ref2">2</xref>] in English.</p>
        <p>[Tables 3 and 4: number of tested contexts, repetitions (%), and number of retained contexts per model; drops per pattern (CpTn, CnTp, CnTn, CpTv) per model.]</p>
        <p>The scale of the drops in the two articles is notably very different: the maximum drop observed in Italian is 23% (CpTn, m-bert), while in English it is 82.8%. Similarly, the CpTv drops of the Italian models hardly exceed 15%, while those of the English models are never less than 25%.</p>
        <p>On the other hand, model architecture and type of training do not seem to have a major influence: UmBERTo has the same architecture as roberta-base, but while the latter is the best performing model in [<xref ref-type="bibr" rid="ref2">2</xref>], the former’s drops are the lowest for all configurations of the SCIN Set. Conversely, the other Italian models are built with the same architecture as bert-base-cased, i.e. the worst performing model for English; however, even the worst performing Italian model, namely alb3rt0, features higher drops than bert-base-cased. This confirms the observation from the previous section: while architecture is indeed a limiting criterion, training data probably plays a significant role.</p>
        <p>In general, we note that none of these models, neither for Italian nor for English, shows definitive drops compatible with a full understanding of the semantic constraints of negation.</p>
      </sec>
      <sec id="sec-3-8">
        <title>6. Conclusion</title>
        <p>In this paper, we investigated the ability of several Italian PLMs to take negation into account in their predictions. To do this, we adapted to Italian the Self-Contained Neg Test proposed by Kletz et al. [<xref ref-type="bibr" rid="ref2">2</xref>], which is based on minimal pairs of aligned sentences.</p>
        <p>Applying this test to six models enabled us to show that negation modifies their predictions, but that this does not happen consistently or in a way that is always coherent with the semantic effect that we expect negation to have on sentences. These results suggest a strong need to adapt these models to make them more sensitive to negation and its semantic consequences.</p>
        <p>Nevertheless, we also noted a fairly marked difference in performance from one model to another, correlated with the different corpora used to train them. We thus suggest that a lexical and statistical study of these corpora could shed further light on the behavior of the models. Lastly, it would be interesting to compare these results with the performance of generative models, in order to study the relative importance of the number of model parameters in relation to their architecture.</p>
      </sec>
      <sec id="sec-3-9">
        <title>Acknowledgments</title>
        <p>We would like to express our gratitude to Marie Candito for her valuable assistance and guidance throughout the course of this study.</p>
        <p>This work was funded in part by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA0001 (PRAIRIE 3IA Institute). This research was also partially funded by the Labex EFL (ANR-10-LABX-0083) and by PNRR–M4C2–Investimento 1.3, Partenariato Esteso PE00000013–“FAIR—Future Artificial Intelligence Research”–Spoke 1 “Human-centered AI,” funded by the European Commission under the NextGeneration EU programme.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>A. Verb statistics by PLM</title>
      <p>Details of the number of monotokenised intransitive verbs available for each PLM tested are given in Table 5.</p>
      <p>Table 5. Number of Italian intransitive verbs tokenised as a single token for each of the models tested.
Model / monotokenized verbs:
bert-base-italian-cased / 294
bert-base-italian-xxl-cased / 294
m-bert / 39
alb3rt0 / 940
UmBERTo / 14</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] L. R. Horn, H. Wansing, Negation, in: E. N. Zalta, U. Nodelman (Eds.), The Stanford Encyclopedia of Philosophy, Winter 2022 ed., Metaphysics Research Lab, Stanford University, 2022.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] D. Kletz, P. Amsili, M. Candito, The self-contained negation test set, in: Y. Belinkov, S. Hao, J. Jumelet, N. Kim, A. McCarthy, H. Mohebbi (Eds.), Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Singapore, 2023, pp. 212-221. URL: https://aclanthology.org/2023.blackboxnlp-1.16. doi:10.18653/v1/2023.blackboxnlp-1.16.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] S. Schweter, Italian BERT and ELECTRA models, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] M. Polignano, V. Basile, P. Basile, M. de Gemmis, G. Semeraro, AlBERTo: Modeling Italian social media language with BERT, IJCoL, pp. 11-31. URL: https://doi.org/10.4000/ijcol.472.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] L. Parisi, S. Francia, P. Magnani, UmBERTo: An Italian language model trained with whole word masking, https://github.com/musixmatchresearch/umberto, 2020.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] N. Kassner, H. Schütze, Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly (2020). URL: https://aclanthology.org/2020.acl-main.698.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] A. Ettinger, What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models, Transactions of the Association for Computational Linguistics 8 (2019) 34-48. URL: https://doi.org/10.1162/tacl_a_00298.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] R. Gubelmann, S. Handschuh, Context matters: A pragmatic study of PLMs' negation understanding, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 4602-4621. URL: https://aclanthology.org/2022.acl-long.315.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] L. Renzi, G. Salvi, A. Cardinaletti, Grande grammatica italiana di consultazione, volume 2, Il Mulino, Bologna, 2001.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] D. Kletz, M. Candito, P. Amsili, Probing structural constraints of negation in pretrained language models, in: The 24th Nordic Conference on Computational Linguistics, 2023. URL: https://openreview.net/forum?id=_7VPETQwnPX.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>