=Paper=
{{Paper
|id=Vol-3878/50_main_long
|storemode=property
|title=The Self-Contained Italian Negation Test (SCIN)
|pdfUrl=https://ceur-ws.org/Vol-3878/50_main_long.pdf
|volume=Vol-3878
|authors=Viola Gullace,David Kletz,Thierry Poibeau,Alessandro Lenci,Pascal Amsili
|dblpUrl=https://dblp.org/rec/conf/clic-it/GullaceKPLA24
}}
==The Self-Contained Italian Negation Test (SCIN)==
Viola Gullace (1,2,3,†), David Kletz (1,4,*,†), Thierry Poibeau (1), Alessandro Lenci (2), Pascal Amsili (1)

1 Lattice, CNRS & ENS-PSL & U. Sorbonne-Nouvelle, 1 rue Maurice Arnoux, F-92120 Montrouge, France
2 CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica, Università di Pisa, Via Santa Maria, Pisa, 56126, Italy
3 Scuola Normale Superiore, Piazza dei Cavalieri 7, Pisa, 56126, Italy
4 LLF, CNRS & Université Paris Cité, 8 Rue Albert Einstein, 75013 Paris, France
Abstract
Recent research has focused extensively on state-of-the-art pretrained language models, particularly those based on Transformer architectures, and how well they account for negation and other linguistic phenomena in various tasks. This study aims to evaluate the understanding of negation in Italian bert- and roberta-based models, in contrast with the predominantly English-focused prior research. We develop the SCIN Set, an Italian dataset designed to model the influence of polarity constraints on models in a masked prediction task. Applying the SCIN Set reveals that these models do not adjust their behaviour based on sentence polarity, even when the resulting sentence is contradictory. We conclude that the tested models lack a clear understanding of how negation alters sentence meaning.
Keywords
negation, Italian PLMs, testing, self-contained
* Corresponding author.
† These authors contributed equally.
viola.gullace@sns.it (V. Gullace); david.kletz@sorbonne-nouvelle.fr (D. Kletz); thierry.poibeau@ens.psl.eu (T. Poibeau); alessandro.lenci@unipi.it (A. Lenci); Pascal.Amsili@ens.fr (P. Amsili)
https://people.unipi.it/alessandro_lenci/ (A. Lenci); https://lattice.cnrs.fr/amsili/ (P. Amsili)
ORCID: 0000-0003-3669-4051 (T. Poibeau); 0000-0001-5790-4308 (A. Lenci); 0000-0002-5901-5050 (P. Amsili)
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Compositionality is a fundamental feature of human language, based on the principle that the meaning of a complex expression derives from its parts and their respective arrangements.

One notable compositional phenomenon is negation, formally defined as a semantic operator (or function) that reverses the truth-value of a sentence [1].

Given its importance, understanding how well pretrained language models (PLMs) can grasp and apply this principle is crucial. These models achieve impressive performance across a wide array of language modeling tasks. Nonetheless, they often turn out to rely on shallow heuristics or exhibit other issues in handling specific aspects of language.

A prominent bias in the body of research is that the vast majority of work on language models has concentrated on English. This focus raises concerns about the generalizability of findings to other languages, which may be structurally different from English. Conducting similar experiments in other languages could provide valuable context and material for comparison, potentially highlighting language-specific effects or revealing new generalizations. We therefore decided to undertake a new experiment focusing on Italian negation.

Thus, in this article, we aim to explore whether the behavior of PLMs accurately models the polarity of sentences. We will investigate how the addition of negation to a sentence can alter its overall meaning (demonstrating the models' capability to handle shifts in meaning due to structural changes).

Given the limitations explained above, our work deliberately concentrates on Italian. This choice not only addresses the need to explore how these models perform with languages other than English but also serves as a critical test for PLMs dedicated to Italian. We suspect that these models may not be as advanced or effective as their English counterparts, highlighting the need for further developments outside English.

We adapt the test set developed for English by Kletz et al. [2] to Italian, creating the Self-Contained Italian Neg Set (SCIN Set). Using the dataset to evaluate bert- and roberta-based models for Italian, we find that these models are unable to adjust their predictions in response to constraints posed by negation, often generating contradictory text.

The article is structured as follows. The rest of Section 1 will introduce compositional phenomena and Italian negation in particular. Section 2 will briefly review related work. Section 3 will detail the composition of the SCIN Set. Section 4 will present the tests conducted on several bert-based Italian models using the SCIN Set; in particular, we tested the following bert-base-cased models:

• bert-base for Italian, both in its basic and its XXL versions (bert-base-italian-cased,
bert-base-italian-xxl-cased)¹ [3],
• m-bert (multilingual bert)² [4],
• alb3rt0³ [5], and
• UmBERTo⁴ [6].

¹ https://huggingface.co/dbmdz/bert-base-italian-xxl-cased
² https://huggingface.co/bert-base-multilingual-cased
³ https://github.com/marcopoli/AlBERTo-it
⁴ https://github.com/musixmatchresearch/umberto

Section 5 will discuss the results, followed by a final section containing our general conclusions and ideas for further research.

2. Related work

Although negation plays an essential role in human communication, it appears to present challenges for PLMs. In recent years, much research has focused on this topic.

2.1. Effect of negation on the model's prediction

Kassner and Schütze [7] and Ettinger [8] analyzed to what extent Transformer-based language models' predictions are sensitive to the presence or absence of negation in sentences involving factual knowledge, such as (1-a-b):

(1) a. Birds can [MASK].
    b. Birds cannot [MASK].

They found that in such pairs the top-1 predictions are unchanged most of the time: models do not seem to take into account the polarity of the environment (presence or absence of a negation in the surrounding sentence) to adapt their predictions. They concluded that models do not deal correctly with negation.

Gubelmann and Handschuh [9] criticized such studies, noting in particular that the pragmatic component was overlooked in Ettinger's experiments. They noted that a statement containing a negation stating a false fact (for example, Birds cannot fly) can be more plausible than a formally true but unusual statement (say, Birds cannot breastfeed). In fact, a vast number of words could potentially fit the negative statement, making it true, many of them with little association with the rest of the sentence. This makes it challenging for any single word to become the top prediction in the negative case.

Gubelmann and Handschuh [9] developed a more pragmatically informed test set, in which each instance is (in [2]'s terms) self-contained. This means that each item in the set includes some context information, allowing direct evaluation of the model's completion. Building on this work, [2] developed the Self-Contained Neg Test, which aimed to address some issues in the test set from [9] and more accurately determine the model's handling of negation without interference of world knowledge.

2.2. The Self-Contained Neg Test

The Self-Contained Neg Test, developed by Kletz et al. [2], is a set of pairs of sentences consisting of a context (C) and a target (T) sentence, either positive (p) or negative (n). The target sentence contains a masked position, syntactically constrained to be filled by a verb (2).

(2) Jessica is an architect who likes to dance. She isn't happy to [MASK].

The instances are designed in such a way that a model that predicts (in the masked position of T) the last verb of C will produce a semantically well-formed paragraph only if C and T have the same polarity. For instance, in (2), the context is positive (Cp), the target is negative (Tn), and as a consequence a model predicting dance in the masked position produces an ill-formed paragraph:

(3) #Jessica is an architect who likes to dance. She isn't happy to dance.

In contrast, a CnTn version of (3) would accept the verb dance in the same position:

(4) Jessica is an architect who doesn't like to dance. She isn't happy to dance.

To produce the sentences of the set, the pattern (5) is taken as a starting point, where NAME and PRON are substituted with a proper noun and a compatible third person pronoun, PROF is substituted with a profession name, and ACT is substituted with an action verb.

(5) NAME is a PROF who likes/doesn't like to ACT. PRON is/isn't happy to [MASK].

A large number of triplets (NAME, PROF, ACT) are tested with each model, and the ones that are retained are the ones such that the model's top-1 prediction is the ACT verb itself when C and T are both positive (CpTp). Here for instance, assuming that (6) are a model's predictions, the triplet (Jessica, architect, dance) would be retained while the triplet (Luke, janitor, swim) would not.

(6) a. Jessica is an architect who likes to dance. She is happy to dance.
    b. Luke is a janitor who likes to swim. He is happy to ski.

Once triplets have been selected (the set of all triplets such that the ACT verb is repeated in CpTp instances), CpTn and CnTp instances can be formed, and the expectation is that a model that "understands" negation should not predict the ACT verb in those cases since it would lead to contradictory instances. As a control, two additional configurations are considered: CnTn, where it is expected that the repetition of ACT is possible (though
not required), and CpTv, in which an adverb (very) is inserted in the positive target, which should not change the preferred prediction of ACT since both sentences are positive. The different configurations are illustrated below.

(7) CpTp Jessica is an architect who likes to dance. She is happy to [MASK].
    CpTn Jessica is an architect who likes to dance. She isn't happy to [MASK].
    CnTp Jessica is an architect who doesn't like to dance. She is happy to [MASK].
    CnTn Jessica is an architect who doesn't like to dance. She isn't happy to [MASK].
    CpTv Jessica is an architect who likes to dance. She is very happy to [MASK].

3. SCIN construction

In Italian, negation is most commonly expressed by the negative invariable proclitic non (not) [10]. It is this expression of negation that we use for the Italian adaptation of the Self-Contained Neg Test that we present in this section: the SCIN set.

3.1. Italian patterns

Following the preparation of the Self-Contained Neg Test, we collect a list of Italian verbs, professions and names that will be used to create the triplets to be tested. The verbs are taken from the Dizionario Italiano Sabatini Coletti 2022 (online version); only the intransitive verbs (3138) are retained; among these, for each of the tested models we further exclude the verbs that are not tokenized as a single token. The selected names are the 100 most popular in Italy in 2024⁵. Lastly, the professions are taken from a site specializing in job searches in Italy⁶; of those present on the site, only those consisting of a single word have been selected.

⁵ https://www.nostrofiglio.it/gravidanza/nomi-per-bambini/i-100-nomi-per-bambini-piu-amati-dai-genitori-di-nostrofiglio-it
⁶ https://www.wecanjob.it/pagina9_elenco-professioni.html

The patterns cannot simply be a direct translation of the English patterns into Italian. In fact, for the test to be adequate for evaluating models, we need the masked position to be syntactically constrained to be a verb. This would not be the case if we used a direct translation of the original sentences: for example, the sequence (8) can be completed with the token "questo" (= PRON is happy to do this).

(8) NAME è un PROF che ama ACT. È felice di [MASK].
    NAME is a PROF who loves to ACT. (PRON) is happy to [MASK].

We choose instead to rely on the pair (9), involving a semantic inference relation.

(9) ha l'abitudine di / molto spesso
    is used to / very often

The final form of the SCIN set is shown in Table 1. The shape of the contexts is given in row 1, that of the targets in row 2, and the test target Tv is added in row 3.

Our assumption is that, if the model repeats the ACT token in the CpTp configuration, it is proof that the model has resolved the ha l'abitudine di / molto spesso inference. When confronted with the CpTn or CnTp configuration, the model should have the addition of the negation as the only element that can explain the modification of its predictions. Finally, the CpTv control allows us to check the extent to which the addition of a different, non-negative adverb in the sequence modifies the model's predictions; we can assume that any modification of greater magnitude than that associated with CpTv is due to the influence of negation.

3.2. Pattern selection

The triplets (name, profession, verb) used for testing are selected by testing them on the CpTp configuration: only triplets leading to a repetition of the ACT token are retained (see Table 2). This ensures that only patterns for which the model is already biased towards repetition are tested, and the model has to understand the influence of negation on sentence semantics to reverse this tendency.

All available triplets are tested, i.e. all combinations of the verbs monotokenized by the model, first names and occupations selected in Subsection 3.1. As tokenization is model-dependent, the number of verbs tested is not the same for each model: details are available in the first row of Table 3.

The results of this test are available in Table 3. The results are highly model-dependent: while the bert-base-italian-cased model predicts the ACT token in almost 25% of cases, this happens in only 0.03% of cases for alb3rt0.

4. Testing

4.1. Setup

Tests are performed as in Kletz et al. [11]. Contexts (C) and targets (T) are combined to create two test patterns, CpTn and CnTp; in addition to these two, the test includes two control patterns, CnTn and CpTv, where the repetition of the ACT verb is not contradictory.

All selected triplets are then used to saturate the patterns, and the resulting patterns are provided as inputs to
    pol. | C(ontext)                                         | T(arget)
1   p    | NAME è un(a) PROF che ha l'abitudine di ACT.      | PRON [MASK] molto spesso.
         | (NAME is a PROF who is used to ACT-ing.)          | (PRON [MASK] often.)
2   n    | NAME è un(a) PROF che non ha l'abitudine di ACT.  | PRON non [MASK] molto spesso.
         | (NAME is a PROF who is not used to ACT-ing.)      | (PRON doesn't [MASK] often.)
3   v    | -                                                 | PRON [MASK] davvero molto spesso.
         |                                                   | (PRON [MASK] really often.)

Table 1
Complete list of contexts and targets used to build masked sequences in the SCIN dataset. Masks are always in the target. Contexts and targets can be either positive or negative, and the target can also have an adverb added which is not a negation cue. Patterns are made up of a context and a target, i.e. 5 possible patterns.
the models. Predictions at masked positions are collected. We use drop as a measure of the models' performance: for each pattern, given the rate t_r of repetitions of the ACT token in the predictions, the drop is defined as 100 − t_r. The higher the drop for the CpTn and CnTp patterns and the lower for the CnTn and CpTv controls, the better the model has understood the negation.

Instantiated NAME/PROF: Jessica / Ballerina (Dancer)
Tested verb: Fumare (To smoke)
Tested example: Jessica è una ballerina che ha l'abitudine di fumare. Lei [MASK] spesso.

Model              | Top-1 pred. | Retained?
b-b-italian-c      | fuma        | ✓
b-b-italian-xxl-c  | fuma        | ✓
m-bert             | balla       | no
alb3rt0            | parla       | no

Table 2
An example of selecting a triplet for testing. A NAME/PROF/VERB triplet is used to saturate the CpTp pattern of SCIN. The sequence contains a mask and is used as input to a PLM. If the model prediction is the ACT token, the triplet is retained (indicated by the ✓ symbol). In the names of the models given as examples, "b-b" means bert-base, "it" stands for italian and "c" for cased.

4.2. Results and Discussion

Results are shown in Table 4. In contrast with the observations made by [8] and [7], the models are not insensitive to the presence of negation in a sentence: all the models show a drop in both configurations CpTn and CnTp, showing an adaptation of their predictions to the presence of a negation cue. This observation is confirmed by the fact that the drops in the CpTv control are always lower than those observed in CpTn or CnTp. This shows that simply adding an adverb is not sufficient to change the model's predictions. While we cannot definitively attribute this to its logical function, the negation marker does exert a distinct influence.

Nevertheless, it is important to emphasize the very clear limitations of these results. Firstly, the drops never exceed 25%, meaning that 75% of the time the model predicts a semantically prohibited token. On the other hand, with the exception of m-bert, all the models have a higher drop for the CnTn control than for the CnTp configuration, thus indicating that even though the models have acquired a certain understanding of negation, this remains superficial and does not, for example, clearly include an understanding of the positive value of a double negation.

A broader examination of the results reveals that while the drops in the CpTn and CnTp configurations increase together, the CnTn controls also show a corresponding increase.

Finally, the training corpus of the models seems to have an influence on their performance. For example, note that the alb3rt0 model is the model obtaining the results least in line with our expectations, while bert-base-italian-xxl-cased and bert-base-italian-cased had better drop values, with the former performing better than the latter. However, these three models have identical numbers of layers, attention heads and hidden sizes, the difference between them consisting only in their training data. The alb3rt0 model was trained exclusively on tweets, which likely limits the diversity of its data, particularly with respect to negation. In contrast, the bert-base-italian-cased and bert-base-italian-xxl-cased models were trained on more varied corpora, with the latter featuring a larger dataset.

In the future, this should lead us to study the correlation between the performance of the models and the fine-grained distribution of negative and affirmative contexts in their training corpus.

5. Comparison with English

In this section we compare the results obtained with the SCIN Set with those observed by [2] in English.
Model               | b-b-it-c | b-b-it-xxl | m-bert | alb3rt0  | UmBERTo
# tested contexts   | 5880000  | 5880000    | 780000 | 18800000 | 280000
Repetitions         | 1498456  | 1236899    | 141609 | 5464     | 93284
%                   | 25.48    | 21.03      | 18.16  | 0.03     | 33.31
# retained contexts | 20000    | 20000      | 19973  | 2088     | 20000

Table 3
Details of the verb sets created for each model. The first line shows the number of triplets available per model, the second the number of these triplets which, in a CpTp configuration, led to a repetition (prediction of the ACT token by the model), and the third line the percentage of triplets this represents. The last line shows how many of the triplets leading to a repetition were retained, the maximum for one model being 20,000. In the column titles, "b-b" means bert-base, "it" stands for italian and "c" for cased.
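The retention step summarized in Table 3 amounts to keeping only the triplets whose CpTp top-1 prediction repeats the (inflected) ACT verb. A minimal sketch follows; here `top1_prediction` is a hypothetical stand-in for a real masked-LM query, and `build_cptp` for the pattern instantiation, neither being the authors' actual code:

```python
# Sketch of the CpTp filtering step: a triplet is retained only if the model's
# top-1 prediction at the masked position is the (inflected) ACT verb itself.

def top1_prediction(sequence):
    # Hypothetical stub: a real implementation would query a masked LM here.
    return "fuma" if "fumare" in sequence else "parla"

def build_cptp(name, prof, act):
    # Hypothetical helper: instantiate the positive-context/positive-target pattern.
    return f"{name} è una {prof} che ha l'abitudine di {act}. Lei [MASK] molto spesso."

def retain_triplets(triplets):
    """Keep (name, prof, act) triplets whose CpTp top-1 prediction is the inflected verb."""
    retained = []
    for name, prof, act, inflected in triplets:
        if top1_prediction(build_cptp(name, prof, act)) == inflected:
            retained.append((name, prof, act))
    return retained

triplets = [("Jessica", "ballerina", "fumare", "fuma"),
            ("Luca", "ballerina", "nuotare", "nuota")]
print(retain_triplets(triplets))  # [('Jessica', 'ballerina', 'fumare')]
```

With a real model behind `top1_prediction`, this loop produces the repetition counts reported in the second line of Table 3.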
Pattern | b-b-it-c | b-b-it-xxl | m-bert | alb3rt0 | UmBERTo
CpTn    | 16.5     | 22.1       | 23.0*  | 9.7     | 9.9
CnTp    | 11.0     | 14.5       | 19.7*  | 4.4     | 11.9
CnTn    | 11.6     | 14.6       | 18.6   | 9.3     | 20.6
CpTv    | 1.3      | 14.3       | 1.0    | 0.2     | 1.7

Table 4
Drops of Italian pretrained language models on the SCIN Set, for each pattern type. In the first two rows a high number is expected (the highest number of each row is starred); in the last two rows a lower number is expected. In the column titles, "b-b" means bert-base, "it" stands for italian and "c" for cased.
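The drop values in Table 4 follow the definition given in Section 4.1 (100 minus the percentage of ACT-token repetitions). As a minimal illustration over hypothetical prediction lists:

```python
# Sketch of the drop metric: given the model's top-1 predictions for one pattern
# type and the expected ACT tokens, drop = 100 - repetition rate (in percent).

def drop(predictions, act_tokens):
    """Percentage of instances where the ACT token is NOT repeated."""
    assert len(predictions) == len(act_tokens) and predictions
    repeats = sum(p == a for p, a in zip(predictions, act_tokens))
    t_r = 100.0 * repeats / len(predictions)  # repetition rate
    return 100.0 - t_r

# Hypothetical CpTn predictions for four instances: the ACT token is repeated
# three times out of four, so the drop is 25.0.
preds = ["fuma", "fuma", "parla", "fuma"]
acts = ["fuma", "fuma", "fuma", "fuma"]
print(drop(preds, acts))  # 25.0
```

A model that fully adjusted to negation would score a drop near 100 on CpTn and CnTp while staying near 0 on CpTv.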
The scale of the drops in the two articles is notably different: the maximum drop observed in Italian is 23% (CpTn, m-bert), while in English it is 82.8%. Similarly, the CpTv drops of the Italian models hardly exceed 15%, while those of the English models are never less than 25%.

On the other hand, model architecture and type of training do not seem to have a major influence: UmBERTo has the same architecture as roberta-base, but while the latter is the best performing model in [2], the former's drops are the lowest for all configurations of the SCIN Set. Conversely, the other Italian models are built with the same architecture as bert-base-cased, i.e. the worst performing model for English; however, even the worst performing Italian model, namely alb3rt0, features higher drops than bert-base-cased. This confirms the observation from the previous section that, while architecture is indeed a limiting criterion, training data probably plays a significant role.

In general, we note that none of these models, neither for Italian nor for English, shows drops compatible with a full understanding of the semantic constraints of negation.

6. Conclusion

In this paper, we investigated the ability of several Italian PLMs to take negation into account in their predictions. To do this, we adapted to Italian the Self-Contained Neg Test proposed by Kletz et al. [2], which is based on minimal pairs of aligned sentences.

Applying this test to six models enabled us to show that negation modifies their predictions, but that this does not happen consistently or in a way that is always coherent with the semantic effect that we expect negation to have on sentences. These results suggest a strong need to adapt these models to make them more sensitive to negation and its semantic consequences.

Nevertheless, we also noted a fairly marked difference in performance from one model to another, correlated with the different corpora used to train them. We thus suggest that a lexical and statistical study of these corpora could shed further light on the behavior of the models.

Lastly, it would be interesting to compare these results with the performance of generative models, in order to study the relative importance of the number of model parameters in relation to their architecture.

Acknowledgments

We would like to express our gratitude to Marie Candito for her valuable assistance and guidance throughout the course of this study.

This work was funded in part by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA0001 (PRAIRIE 3IA Institute). This research was also partially funded by the Labex EFL (ANR-10-LABX-0083) and by PNRR–M4C2–Investimento 1.3, Partenariato Esteso PE00000013–"FAIR—Future Artificial Intelligence Research"–Spoke 1 "Human-centered AI," funded by the European Commission under the NextGeneration EU programme.
References

[1] L. R. Horn, H. Wansing, Negation, in: E. N. Zalta, U. Nodelman (Eds.), The Stanford Encyclopedia of Philosophy, Winter 2022 ed., Metaphysics Research Lab, Stanford University, 2022.
[2] D. Kletz, P. Amsili, M. Candito, The self-contained negation test set, in: Y. Belinkov, S. Hao, J. Jumelet, N. Kim, A. McCarthy, H. Mohebbi (Eds.), Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, Association for Computational Linguistics, Singapore, 2023, pp. 212–221. URL: https://aclanthology.org/2023.blackboxnlp-1.16. doi:10.18653/v1/2023.blackboxnlp-1.16.
[3] S. Schweter, Italian bert and electra models, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[5] M. Polignano, V. Basile, P. Basile, M. de Gemmis, G. Semeraro, AlBERTo: Modeling italian social media language with bert, IJCoL 25 (1984) 11–31. URL: https://doi.org/10.4000/ijcol.472.
[6] L. Parisi, S. Francia, P. Magnani, Umberto: an italian language model trained with whole word masking, https://github.com/musixmatchresearch/umberto, 2020.
[7] N. Kassner, H. Schütze, Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly (2020). URL: https://aclanthology.org/2020.acl-main.698.
[8] A. Ettinger, What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models, Transactions of the Association for Computational Linguistics 8 (2019) 34–48. URL: https://doi.org/10.1162/tacl_a_00298.
[9] R. Gubelmann, S. Handschuh, Context matters: A pragmatic study of PLMs' negation understanding, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 4602–4621. URL: https://aclanthology.org/2022.acl-long.315.
[10] L. Renzi, L. G. Salvi, A. Cardinaletti, Grande grammatica italiana di consultazione, volume 2, Il Mulino, Bologna, 2001.
[11] D. Kletz, M. Candito, P. Amsili, Probing structural constraints of negation in pretrained language models, in: The 24th Nordic Conference on Computational Linguistics, 2023. URL: https://openreview.net/forum?id=_7VPETQwnPX.

A. Verb statistics by PLM

Details of the number of monotokenized intransitive verbs available for each PLM tested are given in Table 5.

model                       | monotokenized verbs
bert-base-italian-cased     | 294
bert-base-italian-xxl-cased | 294
m-bert                      | 39
alb3rt0                     | 940
UmBERTo                     | 14

Table 5
Number of Italian intransitive verbs tokenised as a single token for each of the Italian models tested.